BartPho-Syllable - Vietnamese Diacritic Restoration (Full Fine-tune)

Model Details

Model Description

This model is a fully fine-tuned version of vinai/bartpho-syllable, built specifically for Vietnamese diacritic restoration.

The model focuses exclusively on restoring missing Vietnamese diacritics in text written without tone marks (e.g., "trang phuc" → "trang phục"). It does not handle teencode, slang, spelling mistakes, or grammatical errors beyond diacritic restoration.

  • Developed by: Thanh-Dan Bui, Thien-Duc Le
  • Model type: Seq2Seq (Encoder-Decoder) - TFMBartForConditionalGeneration
  • Language(s): Vietnamese
  • License: MIT
  • Finetuned from model: vinai/bartpho-syllable

Key differences from LoRA/PEFT versions:

  • Full parameter fine-tuning (all model weights updated)
  • Trained using TensorFlow with TPU acceleration
  • Larger dataset: ~10M training samples

Uses

Direct Use

The model takes Vietnamese text without diacritics as input and outputs text with correct tone marks restored.

Example:

  • Input: "toi dang xu ly mot bai toan them dau cho tieng Viet"
  • Output: "tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"
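
If you need diacritic-free inputs for testing, they can be produced with Unicode normalization. A minimal sketch (not part of this model) that also handles đ/Đ, which have no combining-mark decomposition:

import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD separates base letters from combining marks (tones, hooks, horns).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode category "Mn").
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # đ/Đ are standalone letters with no decomposition, so map them by hand.
    return no_marks.replace("đ", "d").replace("Đ", "D")

print(strip_diacritics("tôi đang xử lý"))  # -> toi dang xu ly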

Out-of-Scope Use

  • General spelling correction beyond diacritics
  • Teencode/slang normalization
  • Grammar correction
  • Translation
  • Open-ended text generation

Bias, Risks, and Limitations

  • Context Length: Optimized for sentences up to 128 tokens. Split longer paragraphs into chunks (see the sketch after this list).
  • Lexical Ambiguity: A bare form such as "ban" may be restored as "bàn", "bạn", or "bán" depending on context.
  • Proper Nouns: Foreign names and abbreviations may be incorrectly altered.
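
A minimal sketch (not part of the model's tooling) of a word-count-based splitter; the max_words threshold is an assumption chosen to stay safely under the 128-token limit. Each chunk can then be passed through the pipeline shown in the quick-start below.

import re

def split_into_chunks(paragraph: str, max_words: int = 80) -> list[str]:
    # Break on sentence-final punctuation, then pack sentences into chunks
    # that stay comfortably below the 128-token training limit.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks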

How to Get Started

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, pipeline

path = "yammdd/vietnamese-diacritic-restoration-v2"

# Load the BARTpho syllable tokenizer and the native TensorFlow weights
# (from_pt=False: no PyTorch-to-TF conversion is needed).
tokenizer = AutoTokenizer.from_pretrained(path)
model = TFAutoModelForSeq2SeqLM.from_pretrained(path, from_pt=False)

# Wrap both in a text2text pipeline running on the TF backend.
pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    framework="tf",
)

text = "hom nay toi rat vui khi hoc xu ly ngon ngu tu nhien"
out = pipe(text, max_new_tokens=256)

print(out[0]["generated_text"])
# Output: hôm nay tôi rất vui khi học xử lý ngôn ngữ tự nhiên
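
The pipeline also accepts a list of sentences and processes them in batches; batch_size is a standard pipeline argument (the sentences below are illustrative):

sentences = [
    "chuc mung nam moi",
    "ha noi la thu do cua viet nam",
]
results = pipe(sentences, max_new_tokens=256, batch_size=8)
for r in results:
    print(r["generated_text"])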

Training Details

Training Data

  • Source: ViDiacritics dataset from Kaggle
  • Size:
    Split        Samples
    Train        10,039,717
    Validation    1,254,965
    Test          1,254,965
  • Format: parallel text pairs (no_diacritics → with_diacritics); see the tokenization sketch below
  • Max Length: 128 tokens
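
A hedged sketch of tokenizing such pairs for seq2seq training; the column names follow the format above, but the actual preprocessing code is not published:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

def preprocess(batch):
    # Inputs are the undiacritized sentences; labels come from text_target.
    return tokenizer(
        batch["no_diacritics"],
        text_target=batch["with_diacritics"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )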

Training Procedure

  • Full Fine-tuning (not PEFT/LoRA)
  • Framework: TensorFlow + Transformers
  • Hardware: TPU v5e-8 (8 replicas)
  • Precision: Mixed Float32
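
The training script itself is not published; the sketch below shows a standard TensorFlow TPU setup consistent with the hardware listed above:

import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM

# Connect to the TPU and build a distribution strategy (8 replicas on v5e-8).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created inside the scope are mirrored across all 8 cores,
    # and every weight is trainable (full fine-tuning, no adapters).
    model = TFAutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")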

Training Hyperparameters

Parameter      Value
Batch Size     64 per replica (512 total)
Learning Rate  2e-5
Epochs         1
Max Length     128 tokens
Optimizer      AdamW with linear warmup + decay
Warmup Steps   10% of total steps
Weight Decay   0.01
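
A sketch of building this schedule with the create_optimizer helper from transformers' TensorFlow utilities; the step counts are derived from the dataset and batch sizes above, not taken from the authors' script:

from transformers import create_optimizer

# ~10,039,717 samples / 512 global batch ≈ 19,608 update steps for 1 epoch.
num_train_steps = 10_039_717 // 512
num_warmup_steps = int(0.1 * num_train_steps)  # 10% linear warmup

optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    weight_decay_rate=0.01,  # AdamW-style decoupled weight decay
)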

Callbacks

  • ModelCheckpoint (save best val_loss)
  • EarlyStopping (patience=3)
  • Custom GCCallback for memory management
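
A plausible reconstruction of this setup in Keras; the actual GCCallback implementation is not published, so the version below is an assumption:

import gc
import tensorflow as tf

class GCCallback(tf.keras.callbacks.Callback):
    # Free Python-side memory between epochs to avoid host-RAM creep.
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        "best.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True,
    ),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
    GCCallback(),
]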

Evaluation

Testing Setup

Evaluated on 5,000 held-out test samples using beam search (num_beams=4) in TensorFlow.
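
As a sketch, the same decoding setup with the tokenizer and model from the quick-start section (the input sentence is illustrative):

inputs = tokenizer("toi yeu tieng viet", return_tensors="tf")
output_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))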

Results

Overall Performance

Metric                   Score    Note
Exact Match (Accuracy)   82.86%   Share of sentences restored exactly
BLEU                     95.29    High n-gram overlap with references
WER                      0.0535   5.35% word error rate
CER                      0.0433   4.33% character error rate
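
A sketch of computing these metrics with the evaluate library (WER and CER additionally require the jiwer package); preds and refs below are placeholders:

import evaluate

bleu = evaluate.load("sacrebleu")
wer = evaluate.load("wer")
cer = evaluate.load("cer")

preds = ["..."]  # model outputs for the test sentences
refs = ["..."]   # gold sentences with diacritics

# Exact match: fraction of sentences restored perfectly.
exact = sum(p == r for p, r in zip(preds, refs)) / len(preds)

print("Exact Match:", exact)
print("BLEU:", bleu.compute(predictions=preds, references=[[r] for r in refs])["score"])
print("WER:", wer.compute(predictions=preds, references=refs))
print("CER:", cer.compute(predictions=preds, references=refs))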

Environmental Impact

  • Hardware: TPU v5e-8 (8 cores)
  • Training Duration: ~1 hour
  • Cloud Provider: Kaggle

Framework Versions

  • transformers: 4.39.0
  • tensorflow: 2.18.0 (TPU)

Model Files

  • tf_model.h5 - Full fine-tuned weights
  • config.json - Model configuration
  • tokenizer.json - BARTpho syllable tokenizer

Note

Full fine-tuning on TPU with a ~10M-sample dataset yields strong results for Vietnamese diacritic restoration: 82.86% exact match and 95.29 BLEU on the held-out test set.
