SONAR 200 Text Decoder (HuggingFace Port)

This is a port of Meta's SONAR text decoder from fairseq2 to HuggingFace Transformers format.

Model Description

The SONAR text decoder converts 1024-dimensional SONAR sentence embeddings back into text. It supports the same 202 languages as NLLB-200.

Usage

With the sonar_transformers library (recommended; see the SonarTransformers repository on GitHub)

pip install torch transformers sentencepiece

from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Translation
result = pipeline.translate(
    ["Hello, how are you?"],
    source_lang="eng_Latn",
    target_lang="rus_Cyrl"
)
print(result)  # ['Здравствуйте, как дела?']

# Encode text to embeddings
embeddings = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
print(embeddings.shape)  # torch.Size([1, 1024])

# Decode embeddings back to text
texts = pipeline.decode(embeddings, target_lang="eng_Latn")
print(texts)  # ['Hello world!']

Direct usage with transformers

import torch
from transformers import M2M100ForConditionalGeneration, NllbTokenizer
from transformers.modeling_outputs import BaseModelOutput

# Load model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("raxtemur/SONAR_200_text_decoder")
tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")

# Your embeddings from SONAR encoder (1024-dim vectors)
embeddings = torch.randn(1, 1024)  # Replace with actual embeddings

# Prepare encoder outputs
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))

# Generate text
target_lang = "eng_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang)

generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
    num_beams=5
)

text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
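
The random tensor above is only a placeholder. If the sonar_transformers pipeline from the previous section is available, its encode output can be fed straight into the same generation call. A minimal sketch combining the two snippets (SonarPipeline and its encode signature are taken from the section above; model and tokenizer are the objects already loaded):

from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Real 1024-dim sentence embeddings instead of torch.randn
embeddings = pipeline.encode(["Machine learning is powerful."], source_lang="eng_Latn")

# Reuse the model/tokenizer loaded above to decode into German
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))
generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    max_length=128,
    num_beams=5,
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))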

Compatibility

Tested against the original fairseq2 SONAR implementation (a minimal round-trip check is sketched after the examples below):

Test                          Result
Encoder cosine similarity     1.000000
Decoder output match          Identical
Round-trip (encode→decode)    Works
Translation                   Works

Example outputs:

  • "Hello world!" → "Hello world!" ✓
  • "This is a test sentence." → "This is a test sentence." ✓
  • eng→rus: "Hello, how are you?" → "Здравствуйте, как дела?" ✓
  • eng→deu: "Machine learning is powerful." → "Maschinelles Lernen ist mächtig." ✓
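
The round-trip row in the table can be spot-checked with nothing but the pipeline from the usage section. A minimal sketch, assuming pipeline.encode and pipeline.decode behave as documented above:

from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

sentences = ["Hello world!", "This is a test sentence."]
# Encode to 1024-dim embeddings, then decode back to text
embeddings = pipeline.encode(sentences, source_lang="eng_Latn")
round_trip = pipeline.decode(embeddings, target_lang="eng_Latn")

for src, out in zip(sentences, round_trip):
    print(f"{src!r} -> {out!r}")  # expected to match, per the table above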

Conversion Details

This model was converted from the original fairseq2 checkpoint using the following key mappings (a state-dict renaming sketch follows the table):

fairseq2                                            HuggingFace
decoder.decoder.layers.N.encoder_decoder_attn.*     model.decoder.layers.N.encoder_attn.*
decoder.decoder.layers.N.ffn.inner_proj.*           model.decoder.layers.N.fc1.*
decoder.decoder.layers.N.ffn.output_proj.*          model.decoder.layers.N.fc2.*
decoder.decoder.layers.N.ffn_layer_norm.*           model.decoder.layers.N.final_layer_norm.*
decoder.decoder_frontend.embed.weight               model.decoder.embed_tokens.weight
decoder.final_proj.weight                           lm_head.weight
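
For reference, a state-dict renaming along these lines could look as follows. This is a simplified sketch, not the actual conversion script; the regex patterns are assumptions derived from the table above:

import re

# Ordered (fairseq2 pattern -> HuggingFace template) pairs from the table above
KEY_MAP = [
    (r"^decoder\.decoder\.layers\.(\d+)\.encoder_decoder_attn\.(.*)$", r"model.decoder.layers.\1.encoder_attn.\2"),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.inner_proj\.(.*)$",      r"model.decoder.layers.\1.fc1.\2"),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.output_proj\.(.*)$",     r"model.decoder.layers.\1.fc2.\2"),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn_layer_norm\.(.*)$",       r"model.decoder.layers.\1.final_layer_norm.\2"),
    (r"^decoder\.decoder_frontend\.embed\.weight$",                    r"model.decoder.embed_tokens.weight"),
    (r"^decoder\.final_proj\.weight$",                                 r"lm_head.weight"),
]

def rename_key(key: str) -> str:
    for pattern, template in KEY_MAP:
        if re.match(pattern, key):
            return re.sub(pattern, template, key)
    return key  # keys not covered by the table are left unchanged

def convert_state_dict(fairseq2_state: dict) -> dict:
    return {rename_key(k): v for k, v in fairseq2_state.items()}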

Special tokens were reordered; a sketch of the corresponding embedding-row permutation follows the list:

  • fairseq2: [pad=0, unk=1, bos=2, eos=3]
  • HuggingFace: [bos=0, pad=1, eos=2, unk=3]
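
In practice this means the first four rows of the embedding matrix (and of lm_head) have to be permuted. A minimal sketch under the index assignments listed above, not the actual conversion code:

import torch

def reorder_special_token_rows(weight: torch.Tensor) -> torch.Tensor:
    # fairseq2 rows:     [pad=0, unk=1, bos=2, eos=3]
    # HuggingFace rows:  [bos=0, pad=1, eos=2, unk=3]
    # new index -> old index: bos<-2, pad<-0, eos<-3, unk<-1
    perm = torch.tensor([2, 0, 3, 1])
    out = weight.clone()
    out[:4] = weight[perm]
    return out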

Language Codes (FLORES-200)

Common codes:

  • eng_Latn - English
  • rus_Cyrl - Russian
  • deu_Latn - German
  • fra_Latn - French
  • spa_Latn - Spanish
  • zho_Hans - Chinese (Simplified)
  • jpn_Jpan - Japanese
  • kor_Hang - Korean
  • arb_Arab - Arabic

Full list: 202 languages from FLORES-200.
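
Any of these codes can be passed as source_lang / target_lang in the pipeline, or looked up on the tokenizer for the forced BOS token, as in the direct-usage snippet above:

from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")

# FLORES-200 codes are regular tokens in the vocabulary, e.g. French:
forced_bos_token_id = tokenizer.convert_tokens_to_ids("fra_Latn")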

Citation

@article{Duquenne:2023:sonar_arxiv,
  author = {Duquenne, Paul-Ambroise and Schwenk, Holger and Balikas, Georgios and others},
  title = {SONAR: Sentence-Level Multimodal and Language-Agnostic Representations},
  journal = {arXiv preprint arXiv:2308.11466},
  year = {2023},
}

License

CC-BY-NC-4.0 (inherited from original SONAR)

The model weights are derived from Meta's SONAR and are licensed under CC-BY-NC-4.0. Commercial use is not permitted.
