BartPho-Syllable - Vietnamese Diacritic Restoration (Full Fine-tune)
Model Details
Model Description
This model is a full fine-tuned version of vinai/bartpho-syllable specifically designed for Vietnamese Diacritic Restoration.
The model focuses exclusively on restoring missing Vietnamese diacritics in text written without tone marks (e.g., "trang phuc" → "trang phục"). It does not handle teencode, slang, spelling mistakes, or grammatical errors beyond diacritic restoration.
- Developed by: Thanh-Dan Bui, Thien-Duc Le
- Model type: Seq2Seq (Encoder-Decoder) - TFMBartForConditionalGeneration
- Language(s): Vietnamese
- License: MIT
- Finetuned from model: vinai/bartpho-syllable
Key differences from LoRA/PEFT versions:
- Full parameter fine-tuning (all model weights updated)
- Trained using TensorFlow with TPU acceleration
- Larger dataset: ~10M training samples
Uses
Direct Use
The model takes Vietnamese text without diacritics as input and outputs text with correct tone marks restored.
Example:
- Input: "toi dang xu ly mot bai toan them dau cho tieng Viet"
- Output: "tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"
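For programmatic use without the pipeline API, a minimal sketch is shown below (generation settings are illustrative; the pipeline-based recipe is in "How to Get Started"):

```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

path = "yammdd/vietnamese-diacritic-restoration-v2"
tokenizer = AutoTokenizer.from_pretrained(path)
model = TFAutoModelForSeq2SeqLM.from_pretrained(path)

# Tokenize the undiacritized sentence and restore diacritics with beam search.
inputs = tokenizer("toi dang xu ly mot bai toan them dau cho tieng Viet", return_tensors="tf")
output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expected: "tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"
```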
Out-of-Scope Use
- General spelling correction beyond diacritics
- Teencode/slang normalization
- Grammar correction
- Translation
- Open-ended text generation
Bias, Risks, and Limitations
- Context Length: Optimized for sentences up to 128 tokens; split long paragraphs into sentences before restoration (see the chunking sketch after this list).
- Lexical Ambiguity: "ban" → may predict "bàn"/"bạn"/"bán" based on context.
- Proper Nouns: Foreign names/abbreviations may be incorrectly altered.
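Because the model is tuned for inputs of at most 128 tokens, long documents are best restored chunk by chunk. A minimal sketch, assuming a `restore_fn` callable that wraps the pipeline from the next section (the sentence splitter and helper name are illustrative, not part of the released code):

```python
import re

def restore_long_text(text, restore_fn):
    """Split a long paragraph on sentence-ending punctuation and restore each piece."""
    chunks = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(restore_fn(chunk) for chunk in chunks if chunk)

# Example usage with the pipeline from "How to Get Started":
# restored = restore_long_text(long_paragraph,
#                              lambda s: pipe(s, max_new_tokens=256)[0]["generated_text"])
```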
How to Get Started
```python
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, pipeline

path = "yammdd/vietnamese-diacritic-restoration-v2"
tokenizer = AutoTokenizer.from_pretrained(path)
model = TFAutoModelForSeq2SeqLM.from_pretrained(path, from_pt=False)  # native TF weights

pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    framework="tf",
)

text = "hom nay toi rat vui khi hoc xu ly ngon ngu tu nhien"
out = pipe(text, max_new_tokens=256)
print(out[0]["generated_text"])
# Output: hôm nay tôi rất vui khi học xử lý ngôn ngữ tự nhiên
```
Training Details
Training Data
- Source: ViDiacritics dataset from Kaggle
- Size:
| Split | Samples |
|---|---|
| Train | 10,039,717 |
| Validation | 1,254,965 |
| Test | 1,254,965 |

- Format: no_diacritics → with_diacritics pairs (see the stripping sketch below)
- Max Length: 128 tokens
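Inputs without diacritics can be derived from diacritized sentences by dropping combining marks and mapping đ/Đ to d/D. A minimal sketch of this kind of preprocessing, offered as an assumption (the actual ViDiacritics pipeline may differ):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, keeping the base Latin letters."""
    # Decompose so accents become separate combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # đ/Đ are standalone letters (not combining marks), so map them explicitly.
    return unicodedata.normalize("NFC", no_marks).replace("đ", "d").replace("Đ", "D")

print(strip_diacritics("tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"))
# -> toi dang xu ly mot bai toan them dau cho tieng Viet
```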
Training Procedure
- Full Fine-tuning (not PEFT/LoRA)
- Framework: TensorFlow + Transformers
- Hardware: TPU v5e-8 (8 replicas)
- Precision: Mixed Float32
Training Hyperparameters
| Parameter | Value |
|---|---|
| Batch Size | 64 per replica (512 total) |
| Learning Rate | 2e-5 |
| Epochs | 1 |
| Max Length | 128 tokens |
| Optimizer | AdamW with linear warmup + decay |
| Warmup Steps | 10% of total steps |
| Weight Decay | 0.01 |
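The schedule in the table maps onto the TensorFlow optimizer helper in transformers. A hedged sketch, with step counts derived from the dataset size and global batch size rather than taken from the original training script:

```python
from transformers import create_optimizer

# ~10M samples at a global batch size of 512 ≈ 19,600 steps for one epoch.
num_train_steps = 10_039_717 // 512
num_warmup_steps = int(0.1 * num_train_steps)  # 10% warmup, as in the table

optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    weight_decay_rate=0.01,
)
# model.compile(optimizer=optimizer)  # seq2seq loss is computed from labels internally
```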
Callbacks
- ModelCheckpoint (save best val_loss)
- EarlyStopping (patience=3)
- Custom GCCallback for memory management
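The custom GCCallback mentioned above can be as simple as forcing Python garbage collection at epoch boundaries; a minimal illustrative sketch (the actual callback is not published):

```python
import gc
import tensorflow as tf

class GCCallback(tf.keras.callbacks.Callback):
    """Free Python-side memory between epochs to reduce host memory pressure on TPU."""

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
```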
Evaluation
Testing Setup
Evaluated on 5,000 test samples using beam search (num_beams=4) with TensorFlow.
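A hedged sketch of how the metrics below can be computed from parallel lists of predictions and references, using jiwer and sacrebleu (the exact evaluation script is not included with the model):

```python
import jiwer
import sacrebleu

def score(predictions, references):
    """Compute exact match, WER, CER, and BLEU for restored sentences."""
    exact_match = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return {
        "exact_match": exact_match,
        "wer": jiwer.wer(references, predictions),
        "cer": jiwer.cer(references, predictions),
        "bleu": sacrebleu.corpus_bleu(predictions, [references]).score,
    }
```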
Results
Overall Performance
| Metric | Score | Note |
|---|---|---|
| Exact Match (Accuracy) | 82.86% | Perfect sentence restoration |
| BLEU | 95.29 | Excellent n-gram overlap |
| WER | 0.0535 | 5.35% word error rate |
| CER | 0.0433 | 4.33% character error rate |
Environmental Impact
- Hardware: TPU v5e-8 (8 cores)
- Training Duration: ~1 hour
- Cloud Provider: Kaggle
Framework Versions
- transformers: 4.39.0
- tensorflow: 2.18.0 (TPU)
Model Files
- tf_model.h5 - Full fine-tuned weights
- config.json - Model configuration
- tokenizer.json - BARTpho syllable tokenizer
Note
This model achieves state-of-the-art results for Vietnamese diacritic restoration through full parameter fine-tuning on TPU with a large-scale (~10M sample) training set.