This model was trained on data provided by turkish-nlp-suite/temiz-OSCAR, which was chunked into smaller pieces so that each word could be lemmatized accurately. In total, 300k words were pulled from this dataset; entries unfit for lemmatization or morpheme segmentation (such as non-spesifik, baba-oğul, <!--, müslime-i...) were discarded, leaving 232k words.

Google's Gemini 2.5 Flash was heavily utilized in creating the training dataset. A custom prompt incorporating in-depth Turkish linguistic knowledge was written for a deeper understanding and better segmentation of the orthographic complexity of Turkish words.

dbmdz/bert-base-turkish-uncased is the foundational model for our fine-tuning. Character-based models such as the language-agnostic CANINE-s were also experimented with, but phonological variation in some word roots proved a major issue (such as burun → burnu, af etmek → affetmek...). Given these problems, we settled on a model with stronger prior knowledge of Turkish. The model was fine-tuned in a Seq2Seq setup, and its exact match accuracy is reported as 0.885.

The datasets and code used will later be released under LiProject. Made by Sarp Yüceyılmaz and Ali Emre Atan.
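As a minimal usage sketch: the card describes Seq2Seq fine-tuning on a BERT base, so the snippet below assumes the checkpoint loads as a Hugging Face `EncoderDecoderModel` (a common setup for BERT-based Seq2Seq). The actual class, generation settings, and example word are assumptions, not confirmed details of this release.

```python
# Minimal inference sketch, assuming a BERT2BERT-style EncoderDecoderModel
# checkpoint; the exact architecture of this release may differ.
from transformers import AutoTokenizer, EncoderDecoderModel

model_id = "LiProject/BERT-Turkish-Lemmatization-V2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = EncoderDecoderModel.from_pretrained(model_id)

word = "kitaplarımdan"  # "from my books"; expected lemma: "kitap" (assumed example)
inputs = tokenizer(word, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
lemma = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(lemma)
```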

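For reference, exact match accuracy here means a prediction counts only if it equals the gold lemma character for character. A short sketch with hypothetical predictions and gold lemmas (the 0.885 above is the reported figure on the authors' evaluation set, not this toy data):

```python
# Exact-match accuracy: fraction of predictions identical to the gold lemma.
def exact_match_accuracy(predictions, references):
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical prediction/gold pairs for illustration only.
preds = ["kitap", "burun", "gel"]
golds = ["kitap", "burun", "gelmek"]
print(exact_match_accuracy(preds, golds))  # 0.666...
```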