This model was trained on data provided by turkish-nlp-suite/temiz-OSCAR, which was chunked into smaller pieces so that each word could be lemmatized accurately. In total, 300k words were pulled from this dataset; entries unfit for lemmatization or morpheme segmentation (such as non-spesifik, baba-oğul, <!--, müslime-i...) were discarded, leaving 232k words.

Google's Gemini 2.5 Flash was heavily utilized in creating the training dataset. A custom prompt incorporating in-depth Turkish linguistic knowledge was written for a deeper understanding and better segmentation of the orthographic complexity of Turkish words.

dbmdz/bert-base-turkish-uncased is the foundational model for our fine-tuning. Character-based models such as the language-agnostic CANINE-s were also experimented with, but phonological variation in some word roots proved a major issue (such as burun → burnu, af etmek → affetmek...). Given these problems, we settled on a model with stronger prior knowledge of Turkish. The model was fine-tuned in a Seq2Seq setup, and its exact match accuracy is reported as 0.885.

The datasets and code used will later be released under LiProject. Made by Sarp Yüceyılmaz and Ali Emre Atan.
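As a minimal usage sketch: the card describes Seq2Seq fine-tuning on a BERT base, so the snippet below assumes the checkpoint loads as a Hugging Face `EncoderDecoderModel` (a common setup for BERT-based Seq2Seq). The actual class, generation settings, and example word are assumptions, not confirmed details of this release.

```python
# Minimal inference sketch, assuming a BERT2BERT-style EncoderDecoderModel
# checkpoint; the exact architecture of this release may differ.
from transformers import AutoTokenizer, EncoderDecoderModel

model_id = "LiProject/BERT-Turkish-Lemmatization-V2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = EncoderDecoderModel.from_pretrained(model_id)

word = "kitaplarımdan"  # "from my books"; expected lemma: "kitap" (assumed example)
inputs = tokenizer(word, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
lemma = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(lemma)
```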

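For reference, exact match accuracy here means a prediction counts only if it equals the gold lemma character for character. A short sketch with hypothetical predictions and gold lemmas (the 0.885 above is the reported figure on the authors' evaluation set, not this toy data):

```python
# Exact-match accuracy: fraction of predictions identical to the gold lemma.
def exact_match_accuracy(predictions, references):
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical prediction/gold pairs for illustration only.
preds = ["kitap", "burun", "gel"]
golds = ["kitap", "burun", "gelmek"]
print(exact_match_accuracy(preds, golds))  # 0.666...
```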