BartPho-Syllable - Vietnamese Diacritic Restoration (Full Fine-tune)

Model Details

Model Description

This model is a fully fine-tuned version of vinai/bartpho-syllable, built specifically for Vietnamese diacritic restoration.

The model focuses exclusively on restoring missing Vietnamese diacritics in text written without tone marks (e.g., "trang phuc" → "trang phục"). It does not handle teencode, slang, spelling mistakes, or grammatical errors beyond diacritic restoration.

  • Developed by: Thanh-Dan Bui, Thien-Duc Le
  • Model type: Seq2Seq (Encoder-Decoder) - TFMBartForConditionalGeneration
  • Language(s): Vietnamese
  • License: MIT
  • Finetuned from model: vinai/bartpho-syllable

Key differences from LoRA/PEFT versions:

  • Full parameter fine-tuning (all model weights updated)
  • Trained using TensorFlow with TPU acceleration
  • Larger dataset: ~10M training samples

Uses

Direct Use

The model takes Vietnamese text without diacritics as input and outputs text with correct tone marks restored.

Example:

  • Input: "toi dang xu ly mot bai toan them dau cho tieng Viet"
  • Output: "tôi đang xử lý một bài toán thêm dấu cho tiếng Việt"
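
If you need diacritic-free inputs for testing, they can be produced with Unicode normalization. A minimal sketch (not part of this model) that also handles đ/Đ, which have no combining-mark decomposition:

import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD separates base letters from combining marks (tones, hooks, horns).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode category "Mn").
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # đ/Đ are standalone letters with no decomposition, so map them by hand.
    return no_marks.replace("đ", "d").replace("Đ", "D")

print(strip_diacritics("tôi đang xử lý"))  # -> toi dang xu ly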

Out-of-Scope Use

  • General spelling correction beyond diacritics
  • Teencode/slang normalization
  • Grammar correction
  • Translation
  • Open-ended text generation

Bias, Risks, and Limitations

  • Context Length: Optimized for sentences up to 128 tokens. Split longer paragraphs into chunks (see the sketch after this list).
  • Lexical Ambiguity: A bare form such as "ban" may be restored as "bàn", "bạn", or "bán" depending on context.
  • Proper Nouns: Foreign names and abbreviations may be incorrectly altered.
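
A minimal sketch (not part of the model's tooling) of a word-count-based splitter; the max_words threshold is an assumption chosen to stay safely under the 128-token limit. Each chunk can then be passed through the pipeline shown in the quick-start below.

import re

def split_into_chunks(paragraph: str, max_words: int = 80) -> list[str]:
    # Break on sentence-final punctuation, then pack sentences into chunks
    # that stay comfortably below the 128-token training limit.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks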

How to Get Started

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, pipeline

path = "yammdd/vietnamese-diacritic-restoration-v2"

# Load the BARTpho syllable tokenizer and the native TensorFlow weights
# (from_pt=False: no PyTorch-to-TF conversion is needed).
tokenizer = AutoTokenizer.from_pretrained(path)
model = TFAutoModelForSeq2SeqLM.from_pretrained(path, from_pt=False)

# Wrap both in a text2text pipeline running on the TF backend.
pipe = pipeline(
    task="text2text-generation",
    model=model,
    tokenizer=tokenizer,
    framework="tf",
)

text = "hom nay toi rat vui khi hoc xu ly ngon ngu tu nhien"
out = pipe(text, max_new_tokens=256)

print(out[0]["generated_text"])
# Output: hôm nay tôi rất vui khi học xử lý ngôn ngữ tự nhiên
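
The pipeline also accepts a list of sentences and processes them in batches; batch_size is a standard pipeline argument (the sentences below are illustrative):

sentences = [
    "chuc mung nam moi",
    "ha noi la thu do cua viet nam",
]
results = pipe(sentences, max_new_tokens=256, batch_size=8)
for r in results:
    print(r["generated_text"])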

Training Details

Training Data

  • Source: ViDiacritics dataset from Kaggle
  • Size:
    Split        Samples
    Train        10,039,717
    Validation    1,254,965
    Test          1,254,965
  • Format: parallel text pairs (no_diacritics → with_diacritics); see the tokenization sketch below
  • Max Length: 128 tokens
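
A hedged sketch of tokenizing such pairs for seq2seq training; the column names follow the format above, but the actual preprocessing code is not published:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

def preprocess(batch):
    # Inputs are the undiacritized sentences; labels come from text_target.
    return tokenizer(
        batch["no_diacritics"],
        text_target=batch["with_diacritics"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )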

Training Procedure

  • Full Fine-tuning (not PEFT/LoRA)
  • Framework: TensorFlow + Transformers
  • Hardware: TPU v5e-8 (8 replicas)
  • Precision: Mixed Float32
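
The training script itself is not published; the sketch below shows a standard TensorFlow TPU setup consistent with the hardware listed above:

import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM

# Connect to the TPU and build a distribution strategy (8 replicas on v5e-8).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created inside the scope are mirrored across all 8 cores,
    # and every weight is trainable (full fine-tuning, no adapters).
    model = TFAutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")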

Training Hyperparameters

Parameter      Value
Batch Size     64 per replica (512 total)
Learning Rate  2e-5
Epochs         1
Max Length     128 tokens
Optimizer      AdamW with linear warmup + decay
Warmup Steps   10% of total steps
Weight Decay   0.01
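
A sketch of building this schedule with the create_optimizer helper from transformers' TensorFlow utilities; the step counts are derived from the dataset and batch sizes above, not taken from the authors' script:

from transformers import create_optimizer

# ~10,039,717 samples / 512 global batch ≈ 19,608 update steps for 1 epoch.
num_train_steps = 10_039_717 // 512
num_warmup_steps = int(0.1 * num_train_steps)  # 10% linear warmup

optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    weight_decay_rate=0.01,  # AdamW-style decoupled weight decay
)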

Callbacks

  • ModelCheckpoint (save best val_loss)
  • EarlyStopping (patience=3)
  • Custom GCCallback for memory management
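
A plausible reconstruction of this setup in Keras; the actual GCCallback implementation is not published, so the version below is an assumption:

import gc
import tensorflow as tf

class GCCallback(tf.keras.callbacks.Callback):
    # Free Python-side memory between epochs to avoid host-RAM creep.
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        "best.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True,
    ),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
    GCCallback(),
]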

Evaluation

Testing Setup

Evaluated on 5,000 held-out test samples using beam search (num_beams=4) in TensorFlow.
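
As a sketch, the same decoding setup with the tokenizer and model from the quick-start section (the input sentence is illustrative):

inputs = tokenizer("toi yeu tieng viet", return_tensors="tf")
output_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))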

Results

Overall Performance

Metric                   Score    Note
Exact Match (Accuracy)   82.86%   Share of sentences restored exactly
BLEU                     95.29    High n-gram overlap with references
WER                      0.0535   5.35% word error rate
CER                      0.0433   4.33% character error rate
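
A sketch of computing these metrics with the evaluate library (WER and CER additionally require the jiwer package); preds and refs below are placeholders:

import evaluate

bleu = evaluate.load("sacrebleu")
wer = evaluate.load("wer")
cer = evaluate.load("cer")

preds = ["..."]  # model outputs for the test sentences
refs = ["..."]   # gold sentences with diacritics

# Exact match: fraction of sentences restored perfectly.
exact = sum(p == r for p, r in zip(preds, refs)) / len(preds)

print("Exact Match:", exact)
print("BLEU:", bleu.compute(predictions=preds, references=[[r] for r in refs])["score"])
print("WER:", wer.compute(predictions=preds, references=refs))
print("CER:", cer.compute(predictions=preds, references=refs))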

Environmental Impact

  • Hardware: TPU v5e-8 (8 cores)
  • Training Duration: ~1 hour
  • Cloud Provider: Kaggle

Framework Versions

  • transformers: 4.39.0
  • tensorflow: 2.18.0 (TPU)

Model Files

  • tf_model.h5 - Full fine-tuned weights
  • config.json - Model configuration
  • tokenizer.json - BARTpho syllable tokenizer

Note

Full fine-tuning on TPU with a ~10M-sample dataset yields strong results for Vietnamese diacritic restoration: 82.86% exact match and 95.29 BLEU on the held-out test set.
