SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

Paper: [arXiv:2308.11466](https://arxiv.org/abs/2308.11466)
This is a port of Meta's SONAR text decoder from fairseq2 to the HuggingFace Transformers format. The decoder converts 1024-dimensional SONAR sentence embeddings back into text and supports the same 202 languages as NLLB-200.
```bash
pip install torch transformers sentencepiece
```
```python
from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Translation
result = pipeline.translate(
    ["Hello, how are you?"],
    source_lang="eng_Latn",
    target_lang="rus_Cyrl",
)
print(result)  # ['Здравствуйте, как дела?']

# Encode text to embeddings
embeddings = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
print(embeddings.shape)  # torch.Size([1, 1024])

# Decode embeddings back to text
texts = pipeline.decode(embeddings, target_lang="eng_Latn")
print(texts)  # ['Hello world!']
```
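Since SONAR embeddings are language-agnostic, a sentence and its translation should map to nearby vectors. A quick check along those lines, using the `SonarPipeline` API shown above:

```python
import torch
from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Encode an English sentence and its Russian translation
en = pipeline.encode(["Hello, how are you?"], source_lang="eng_Latn")
ru = pipeline.encode(["Здравствуйте, как дела?"], source_lang="rus_Cyrl")

# Translations should land close together in the shared embedding space
print(torch.nn.functional.cosine_similarity(en, ru))  # expect a value close to 1
```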
```python
import torch
from transformers import M2M100ForConditionalGeneration, NllbTokenizer
from transformers.modeling_outputs import BaseModelOutput

# Load model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("raxtemur/SONAR_200_text_decoder")
tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")

# Your embeddings from a SONAR encoder (1024-dim vectors)
embeddings = torch.randn(1, 1024)  # Replace with actual embeddings

# Wrap each embedding as a length-1 "encoder output" sequence: (batch, 1, 1024)
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))

# Generate text, forcing the target-language token as the first decoder token
target_lang = "eng_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang)
generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
    num_beams=5,
)

text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
```
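For batches it is convenient to wrap these steps in a small helper. Continuing from the snippet above, `decode_embeddings` is a hypothetical convenience function, not part of this repo's API:

```python
def decode_embeddings(model, tokenizer, embeddings, target_lang, max_length=128, num_beams=5):
    """Decode a batch of SONAR embeddings of shape (batch, 1024) into text.

    Hypothetical helper wrapping the generation call above; not part of the repo.
    """
    # Each embedding becomes a length-1 encoder output sequence: (batch, 1, 1024)
    encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))
    generated_ids = model.generate(
        encoder_outputs=encoder_outputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_length=max_length,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(decode_embeddings(model, tokenizer, torch.randn(2, 1024), "eng_Latn"))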
Tested against the original fairseq2 SONAR implementation:
| Test | Result |
|---|---|
| Encoder cosine similarity | 1.000000 |
| Decoder output match | Identical |
| Round-trip (encode→decode) | Works |
| Translation | Works |
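A sketch of how the encoder comparison can be reproduced, assuming Meta's `sonar` package (the fairseq2-based reference, published on PyPI as `sonar-space`) alongside this port; the reference API may differ between versions:

```python
import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar_transformers import SonarPipeline

sentences = ["Hello, how are you?"]

# Reference embeddings from the original fairseq2 implementation
reference = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
).predict(sentences, source_lang="eng_Latn")

# Embeddings from this port
ported = SonarPipeline().encode(sentences, source_lang="eng_Latn")

print(torch.nn.functional.cosine_similarity(reference, ported))  # expect ~1.0
```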
This model was converted from the original fairseq2 checkpoint using the following key mappings:
| fairseq2 | HuggingFace |
|---|---|
| `decoder.decoder.layers.N.encoder_decoder_attn.*` | `model.decoder.layers.N.encoder_attn.*` |
| `decoder.decoder.layers.N.ffn.inner_proj.*` | `model.decoder.layers.N.fc1.*` |
| `decoder.decoder.layers.N.ffn.output_proj.*` | `model.decoder.layers.N.fc2.*` |
| `decoder.decoder.layers.N.ffn_layer_norm.*` | `model.decoder.layers.N.final_layer_norm.*` |
| `decoder.decoder_frontend.embed.weight` | `model.decoder.embed_tokens.weight` |
| `decoder.final_proj.weight` | `lm_head.weight` |
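For reference, a simplified sketch of how such a renaming can be applied to a fairseq2 state dict. The rules mirror the table above, but this is an illustration, not the exact conversion script used for this checkpoint:

```python
import re

# Rename rules as (regex, replacement) pairs; the captured group is the layer index N
RULES = [
    (r"^decoder\.decoder\.layers\.(\d+)\.encoder_decoder_attn\.", r"model.decoder.layers.\1.encoder_attn."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.inner_proj\.", r"model.decoder.layers.\1.fc1."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.output_proj\.", r"model.decoder.layers.\1.fc2."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn_layer_norm\.", r"model.decoder.layers.\1.final_layer_norm."),
    (r"^decoder\.decoder_frontend\.embed\.weight$", "model.decoder.embed_tokens.weight"),
    (r"^decoder\.final_proj\.weight$", "lm_head.weight"),
]

def remap_key(key: str) -> str:
    """Map a fairseq2 parameter name to its HuggingFace equivalent."""
    for pattern, replacement in RULES:
        new_key, n = re.subn(pattern, replacement, key)
        if n:
            return new_key
    return key

print(remap_key("decoder.decoder.layers.0.ffn.inner_proj.weight"))  # model.decoder.layers.0.fc1.weight
print(remap_key("decoder.final_proj.weight"))                       # lm_head.weight
```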
Special tokens were reordered from the fairseq2 order `[pad=0, unk=1, bos=2, eos=3]` to the HuggingFace/NLLB order `[bos=0, pad=1, eos=2, unk=3]`, as sketched below.
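Because the first four token ids change, the corresponding rows of the embedding matrix (and of the tied output projection) have to be permuted during conversion. A minimal sketch of that step, assuming nothing beyond the two orderings above:

```python
import torch

# fairseq2 order: pad=0, unk=1, bos=2, eos=3
# HF/NLLB order:  bos=0, pad=1, eos=2, unk=3
# Row i of the converted matrix takes the fairseq2 row holding the same token,
# e.g. the new bos row 0 comes from old row 2.
PERM = torch.tensor([2, 0, 3, 1])

def reorder_special_rows(weight: torch.Tensor) -> torch.Tensor:
    out = weight.clone()
    out[:4] = weight[PERM]
    return out
```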
Common language codes:

- `eng_Latn` - English
- `rus_Cyrl` - Russian
- `deu_Latn` - German
- `fra_Latn` - French
- `spa_Latn` - Spanish
- `zho_Hans` - Chinese (Simplified)
- `jpn_Jpan` - Japanese
- `kor_Hang` - Korean
- `arb_Arab` - Arabic

Full list: the 202 languages of FLORES-200.
```bibtex
@article{Duquenne:2023:sonar_arxiv,
  author  = {Duquenne, Paul-Ambroise and Schwenk, Holger and Sagot, Beno{\^i}t},
  title   = {{SONAR}: Sentence-Level Multimodal and Language-Agnostic Representations},
  journal = {arXiv preprint arXiv:2308.11466},
  year    = {2023},
}
```
License: CC-BY-NC-4.0 (inherited from the original SONAR release). The model weights are derived from Meta's SONAR, so commercial use is not permitted.
Base model: [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)