Tokenizer space-merging behavior creates challenges for fine-tuning with Latin-alphabet languages

#85
by ljcamargo - opened

Summary

When fine-tuning PaddleOCR-VL on Latin-alphabet languages, the tokenizer's space-merging behavior makes it difficult to cleanly separate prompt tokens from response tokens during training. This appears to be specific to Latin or other alphabetic scripts: the tokenizer vocabulary contains many space+LatinChar merged tokens (e.g., ' Z', ' j', ' Jo'), but I did not observe equivalent merged tokens for Chinese or Japanese characters.

Of course this does not seem to be a bug, but rather an undesired edge case for fine-tuning implementations that need precise masking boundaries between prompts and responses. A workaround is therefore needed to fine-tune successfully, at least for Latin alphabets.

Environment

  • Model: PaddleOCR-VL (latest version)
  • Task: fine-tuning with completion-only loss (masking prompt tokens)

The Issue

When formatting training data with the model's chat template:

<|begin_of_sentence|>User: <|IMAGE_START|>...<|IMAGE_END|>OCR:\nAssistant: Response text here</s>

The tokenizer produces:

[..., 92267, 93963, 1276, 3269, ...]
#    'Assistant'  ':'    ' R'   'es'
#                         ↑
#               Space merged with 'R' as a single token

(Token IDs above are illustrative, not actual values.)
For instruction fine-tuning, we want to mask (not train on) everything up to and including "Assistant:", and only train on the response text (a code sketch of this masking follows the list below). However:

  • Token 1276 = ' R' (space + R as one indivisible unit)
  • We cannot mask the space while keeping the 'R'
  • We must either:
    • Train while accepting this spurious leading space (the current workaround, which leads to EOS at the first position, i.e. empty inferences, ~90% of the time), or
    • Lose the first character, or prepend an arbitrary padding character at the beginning, which leads to a hallucinated first character at inference.
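
For concreteness, here is a minimal sketch of completion-only masking and where it breaks. The repo id, AutoTokenizer usage, and token strings are illustrative assumptions; the actual PaddleOCR-VL loading path may differ:

from transformers import AutoTokenizer

# Hypothetical loading; adjust repo id / trust_remote_code as needed
tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/PaddleOCR-VL")

prompt = "User: <|IMAGE_START|>...<|IMAGE_END|>OCR:\nAssistant:"
full = prompt + " Response text here"

prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
full_ids = tokenizer(full, add_special_tokens=False).input_ids

# Completion-only labels: ignore (-100) every prompt token
labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]

# Pitfall: the first response token is the merged ' R', so the separator
# space is unavoidably trained on as part of the response
print(tokenizer.convert_ids_to_tokens(full_ids[len(prompt_ids):])[:3])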

Visualization

Here's what the tokenization looks like at the prompt/response boundary:

Token:    [..., 'Assistant', ':', ' Z', 'uh', 'tΔ“e', ...]
Mask:     [...,    MASK    , MASK,  βœ“  ,  βœ“  ,  βœ“  , ...]
                                   ↑
                        Boundary falls in middle of token

The first trainable token ' Z' includes the space separator that should ideally belong to the prompt formatting.

Impact on Fine-tuning

This hurts fine-tuning: the model learns that responses start with a space character, which results in fine-tunes that deliver empty responses (EOS at the first position) about 90% of the time.

Hypothesis: Latin-Specific Issue

I inspected the tokenizer vocabulary and found:

  • βœ… Many space+LatinChar tokens: ' a', ' b', ' Z', ' j', etc.
  • ❌ No equivalent space+ChineseChar or space+JapaneseChar tokens

Additionally, the tokenizer does not seem to implement any special boundary token to prevent this merge, or at least this is not documented as far as I could find.
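
For reference, here is roughly how the vocabulary can be inspected. This is a sketch assuming a loaded tokenizer object and the common byte-level BPE convention where 'Δ ' marks a leading space; this tokenizer's internal space marker may differ:

import re

vocab = tokenizer.get_vocab()  # maps token string -> id

# Count tokens that fuse a leading space with Latin letters vs. CJK characters
space_latin = [t for t in vocab if re.fullmatch(r"Δ [A-Za-z]+", t)]
space_cjk = [t for t in vocab if re.fullmatch(r"Δ [\u4e00-\u9fff\u3040-\u30ff]+", t)]

print(len(space_latin), "space+Latin tokens, e.g.", space_latin[:5])
print(len(space_cjk), "space+CJK tokens")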

Dirty Quickfixes

  • The fine-tuned model may still work by tweaking inference parameters (setting min_new_tokens > 0 and increasing temperature), but this leads to a hallucinated first character in the same ~90% of cases, which makes it unacceptable to simply strip that character in general, since the remaining 10% would lose actual content (see the sketch after this list).
  • Modify the tokenizer to add custom boundary tokens. This complicates model distribution and is currently problematic, as the most recent deployment seems to have a bug that manifests when the tokenizer is edited.
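
For the first quickfix, the inference-side mitigation looks roughly like this with a standard transformers generate() call (the exact PaddleOCR-VL inference entry point may differ):

# Force at least one real token and soften the EOS-at-position-0 failure mode
output = model.generate(
    **inputs,
    min_new_tokens=1,
    do_sample=True,  # temperature only takes effect with sampling enabled
    temperature=0.8,
)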

Workarounds

Add a custom boundary token

# Add a dedicated boundary token to the tokenizer
# (assumes `tokenizer` and `model` are already loaded)
tokenizer.add_special_tokens({'additional_special_tokens': ['<|BOUNDARY|>']})
model.resize_token_embeddings(len(tokenizer))

# Insert the boundary token into the formatted chat text so that masking
# can end exactly at "Assistant:<|BOUNDARY|>"
formatted_text = formatted_text.replace("Assistant:", "Assistant:<|BOUNDARY|>")
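
Once the boundary token exists, masking reduces to an index lookup. A usage sketch, using the usual -100 ignore index for the cross-entropy loss:

input_ids = tokenizer(formatted_text, add_special_tokens=False).input_ids

# Everything up to and including <|BOUNDARY|> belongs to the prompt
boundary_id = tokenizer.convert_tokens_to_ids('<|BOUNDARY|>')
cut = input_ids.index(boundary_id) + 1
labels = [-100] * cut + input_ids[cut:]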

Issues with this approach:

  • Modifies the canonical tokenizer, effectively creating an unnecessary "branch"
  • May cause compatibility issues when sharing models
  • Adds complexity to the fine-tuning pipeline

Questions for the Team

  1. Is this a known consideration for fine-tuning with PaddleOCR-VL on Latin-alphabet languages?

  2. Are there recommended practices for handling the prompt/response boundary in instruction-tuning scenarios?

  3. Would it be feasible to include an official boundary token (like <|BOUNDARY|> or <|RESPONSE_START|>) in future model releases to support clean masking for instruction-tuning?

  4. Are there existing special tokens in the vocabulary that could serve this purpose? I noticed tokens like:

    • <|LOC_SEP|> (location separator)
    • <|IMAGE_SEP|> (image separator)
    • <nl> (newline?)

    Could any of these be repurposed safely for prompt/response separation? (A quick empirical check is sketched below.)
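
Whether any of these candidates would actually give a clean cut can be checked empirically; a small sketch (assuming the loaded tokenizer from above):

for cand in ['<|LOC_SEP|>', '<|IMAGE_SEP|>', '<nl>']:
    ids = tokenizer("Assistant:" + cand + " Response",
                    add_special_tokens=False).input_ids
    # Shows whether cand survives as a single token and how the following
    # ' R' is segmented around it
    print(cand, tokenizer.convert_ids_to_tokens(ids))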

Possible Future Enhancements

Option 1: Add official boundary token

# In future model releases, include:
<|RESPONSE_START|>  # Guaranteed separate token, never merges

Option 2: Document the need to split the prompt and response before tokenization, since this issue affects third-party fine-tuning tools (such as Transformers and Unsloth) that implement completion-only training

Option 3: Extend chat template

  • Add optional boundary markers to the chat template

Appreciation

Thank you for developing and maintaining PaddleOCR-VL! This is an excellent model for OCR tasks. This issue is not a criticism but rather a request for guidance on best practices for fine-tuning with Latin-alphabet languages, where this edge case becomes relevant.

Any suggestions or workarounds from the team would be greatly appreciated by the community working on extending PaddleOCR-VL for diverse language applications.

ljcamargo changed discussion title from Tokenizer space-merging behavior creates challenges for instruction-tuning with Latin-alphabet languages to Tokenizer space-merging behavior creates challenges for fine-tuning with Latin-alphabet languages

Thank you for your detailed and insightful report regarding the tokenizer's space-merging behavior for Latin-alphabet languages.

We have reviewed your feedback and confirm that this is indeed a valid issue that affects the precision of masking boundaries during instruction fine-tuning. To address this, we plan to update and correct our official chat template in the next release.

Our current plan is to transition the format from Assistant: [Response] to Assistant:\n[Response]. As you pointed out, this is a widely adopted practice in the industry because the newline character (\n) is typically treated as a standalone token by the tokenizer, which effectively prevents merging with subsequent Latin characters and provides a clean boundary for loss calculation.
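
In concrete terms, masking under the planned template would look like this (a sketch, assuming '\n' indeed remains a standalone token in this tokenizer, as described):

prompt = "User: <|IMAGE_START|>...<|IMAGE_END|>OCR:\nAssistant:\n"
full_ids = tokenizer(prompt + "Response text", add_special_tokens=False).input_ids
prompt_len = len(tokenizer(prompt, add_special_tokens=False).input_ids)

# The cut now falls after the standalone '\n' token, so the first trainable
# token starts with 'R' rather than a merged leading space
labels = [-100] * prompt_len + full_ids[prompt_len:]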

We truly appreciate your valuable contribution and technical suggestions.