# TReconLM
TReconLM is a decoder-only transformer model for trace reconstruction of noisy DNA sequences. It is trained to reconstruct a ground-truth sequence from multiple noisy copies (traces), each independently corrupted by insertions, deletions, and substitutions.
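The corruption process can be thought of as an insertion-deletion-substitution (IDS) channel applied independently to each copy. The sketch below is illustrative only (not code from this repository) and shows how a single trace might be generated from a ground-truth sequence:

```python
import random

ALPHABET = "ACGT"

def ids_channel(seq, p_ins, p_del, p_sub, rng=random):
    """Corrupt a DNA sequence with independent insertions, deletions,
    and substitutions at each position (illustrative sketch)."""
    out = []
    for base in seq:
        # Insertion: a uniform random base is inserted before this position.
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        # Deletion: the current base is dropped.
        if rng.random() < p_del:
            continue
        # Substitution: the base is replaced by a different base.
        if rng.random() < p_sub:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

# A cluster of noisy traces of one ground-truth sequence:
truth = "".join(random.choice(ALPHABET) for _ in range(60))
traces = [ids_channel(truth, 0.05, 0.05, 0.05) for _ in range(5)]
```

TReconLM takes such a cluster of traces as input and autoregressively predicts the ground-truth sequence.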
## Model Variants
### Pretrained Models (Fixed Length)
- `model_seq_len_60.pt` (60nt)
- `model_seq_len_110.pt` (110nt)
- `model_seq_len_180.pt` (180nt)
### Pretrained Models (Variable Length)
- `model_var_len_50_120.pt` (50-120nt)
### Fine-tuned Models
- `finetuned_noisy_dna_len60.pt` (60nt, [Noisy-DNA dataset](https://www.nature.com/articles/s41467-020-19148-3))
- `finetuned_microsoft_dna_len110.pt` (110nt, [Microsoft DNA dataset](https://ieeexplore.ieee.org/abstract/document/9517821))
- `finetuned_chandak_len117.pt` (117nt, [Chandak dataset](https://doi.org/10.1109/ICASSP40776.2020.9053441))
All models support reconstruction from cluster sizes between 2 and 10.
## How to Use
Tutorial notebooks are available in our [GitHub repository](https://github.com/MLI-lab/TReconLM) under `tutorial/`:
- `quick_start.ipynb`: Run inference on synthetic datasets from Hugging Face
- `custom_data.ipynb`: Run inference on your own data or real-world datasets (Microsoft DNA, Noisy-DNA, Chandak)
The test datasets used in the notebooks can be downloaded from [Hugging Face](https://huggingface.co/datasets/mli-lab/TReconLM_datasets).
## Training Details
- Models are pretrained on synthetic data generated by sampling ground-truth sequences uniformly at random over the quaternary alphabet, and independently introducing insertions, deletions, and substitutions at each position.
- Error probabilities for insertions, deletions, and substitutions are drawn uniformly from the interval [0.01, 0.1], and cluster sizes are sampled uniformly from [2, 10].
- Models are fine-tuned on real-world sequencing data (Noisy-DNA, Microsoft, and Chandak datasets).
For full experimental details, see [our paper](http://arxiv.org/abs/2507.12927).
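The sampling procedure above can be sketched as follows. This is an illustrative stand-in, not the repository's data pipeline; a compact IDS corruption step is inlined so the snippet is self-contained:

```python
import random

def sample_training_cluster(length=60, rng=random):
    """Sample one synthetic training example: a uniform-random quaternary
    sequence, per-type error rates drawn from U[0.01, 0.1], and a cluster
    size drawn uniformly from {2, ..., 10} (illustrative sketch)."""
    bases = "ACGT"
    truth = "".join(rng.choice(bases) for _ in range(length))
    p_ins = rng.uniform(0.01, 0.1)
    p_del = rng.uniform(0.01, 0.1)
    p_sub = rng.uniform(0.01, 0.1)
    cluster_size = rng.randint(2, 10)

    def corrupt(seq):
        out = []
        for b in seq:
            if rng.random() < p_ins:   # insertion before this base
                out.append(rng.choice(bases))
            if rng.random() < p_del:   # deletion of this base
                continue
            if rng.random() < p_sub:   # substitution by a different base
                out.append(rng.choice(bases.replace(b, "")))
            else:
                out.append(b)
        return "".join(out)

    traces = [corrupt(truth) for _ in range(cluster_size)]
    return truth, traces
```

Each sampled pair `(truth, traces)` corresponds to one training example: the traces form the model input and the ground-truth sequence is the prediction target.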
## Limitations
Models trained for a fixed sequence length may perform worse on other lengths, or when the test data distribution differs significantly from the training distribution. The variable-length model (`model_var_len_50_120.pt`) is trained with the same compute budget as our fixed-length models, so it sees less data per sequence length and may perform slightly worse at any specific fixed length.