embeddingmagibu-200m
embeddingmagibu-200m is a Turkish-focused, multilingual sentence embedding model developed through cross-lingual tokenizer surgery, teacher-model cloning, and offline embedding distillation.
This model is released as part of the artifacts for the paper:
Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
M. Ali Bayram, Banu Diri, Savaş Yıldırım
arXiv: https://arxiv.org/abs/2605.29992
The model produces 768-dimensional L2-normalized sentence embeddings and supports an 8,192-token context window, making it suitable for Turkish semantic search, retrieval-augmented generation, long-document representation, clustering, classification, and semantic textual similarity tasks.
Although the model is optimized for Turkish, it was adapted from a multilingual teacher and trained with a multilingual distillation setup, making it useful for Turkish-centered multilingual NLP scenarios.
Key Features
- Turkish-focused sentence embedding model
- Multilingual support through a 40-language adaptation setup
- 8,192-token context length
- 768-dimensional normalized embeddings
- Approximately 200M parameters
- SentenceTransformers-compatible architecture
- Trained through offline embedding distillation
- Built with a Turkish-optimized multilingual tokenizer
- Designed for semantic search, RAG, clustering, classification, and STS
Associated Paper
The paper introduces an efficient adaptation pipeline for building Turkish-focused embedding models without expensive full pretraining. The pipeline consists of:
- Constructing a Turkish-optimized multilingual tokenizer
- Cloning the teacher embedding model while preserving transformer backbone weights
- Distilling from precomputed teacher embeddings using an offline embedding dataset
Intended Use
This model is suitable for:
- Turkish semantic search
- Retrieval-augmented generation
- Long-context retrieval
- Semantic textual similarity
- Sentence and document embedding
- Clustering
- Classification using embeddings
- Multilingual retrieval with Turkish-centered applications
- Low-resource and morphologically rich language NLP research
Model Performance
1. Detailed TR-MTEB Results
The model was evaluated on Turkish embedding benchmark tasks. The table below summarizes the detailed task-level results.
| Category | Task | Score |
|---|---|---|
| STS | STSbTR | 77.5 |
| NLI | SnliTr | 60.8 |
| NLI | XNLI | 76.0 |
| Retrieval | SquadTRRetrieval | 62.3 |
| Retrieval | MSMarcoTRRetrieval | 57.4 |
| Retrieval | TQuadRetrieval | 79.5 |
| Classification | THYSentimentClassification | 59.5 |
| Classification | TSTimelineNewsCategoryClassification | 58.7 |
| Classification | Turkish75NewsClassification | 90.7 |
| Classification | TurkishIronyClassification | 52.6 |
| Classification | TurkishMovieSentimentClassification | 71.9 |
| Classification | TurkishNewsCategoryClassification | 88.8 |
| Classification | TurkishOffensiveLanguageClassification | 63.9 |
| Classification | TurkishProductSentimentClassification | 60.9 |
| Clustering | TurkishAbstractCorpusClustering | 58.9 |
| Clustering | TurkishColumnWritingClustering | 63.6 |
| Bitext Mining | WMT16BitextMining | 97.1 |
| Other | ArguAnaTR | 45.3 |
| Other | NFCorpusTR | 10.7 |
| Overall | Average | 69.5 |
2. Version Comparison: embeddingmagibu-200m vs embeddingmagibu-152m
The table below compares embeddingmagibu-200m with the previous embeddingmagibu-152m version on shared evaluation tasks.
| Task | embeddingmagibu-200m | embeddingmagibu-152m | Difference |
|---|---|---|---|
| Average | 69.5 | 67.0 | +2.5 |
| STSbTR | 77.5 | 75.1 | +2.4 |
| SnliTr | 60.8 | 55.4 | +5.4 |
| SquadTRRetrieval | 62.3 | 68.7 | -6.4 |
| THYSentimentClassification | 59.5 | 51.0 | +8.5 |
| TSTimelineNewsCategoryClassification | 58.7 | 60.8 | -2.1 |
| Turkish75NewsClassification | 90.7 | 92.7 | -2.0 |
| TurkishAbstractCorpusClustering | 58.9 | 61.8 | -2.9 |
| TurkishColumnWritingClustering | 63.6 | 61.8 | +1.8 |
| TurkishIronyClassification | 52.6 | 48.4 | +4.2 |
| TurkishMovieSentimentClassification | 71.9 | 67.3 | +4.6 |
| TurkishNewsCategoryClassification | 88.8 | 90.8 | -2.0 |
| TurkishOffensiveLanguageClassification | 63.9 | 59.6 | +4.3 |
| TurkishProductSentimentClassification | 60.9 | 59.1 | +1.8 |
| WMT16BitextMining | 97.1 | 91.9 | +5.2 |
| XNLI | 76.0 | 60.8 | +15.2 |
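The Average row is consistent with an unweighted mean over the 15 tasks listed in this table, although the card does not state how it was computed. A minimal sketch that recomputes the averages and the Difference column from the scores above (the variable names are illustrative):

```python
# Scores copied from the version comparison table: (task, 200m, 152m).
SCORES = [
    ("STSbTR", 77.5, 75.1),
    ("SnliTr", 60.8, 55.4),
    ("SquadTRRetrieval", 62.3, 68.7),
    ("THYSentimentClassification", 59.5, 51.0),
    ("TSTimelineNewsCategoryClassification", 58.7, 60.8),
    ("Turkish75NewsClassification", 90.7, 92.7),
    ("TurkishAbstractCorpusClustering", 58.9, 61.8),
    ("TurkishColumnWritingClustering", 63.6, 61.8),
    ("TurkishIronyClassification", 52.6, 48.4),
    ("TurkishMovieSentimentClassification", 71.9, 67.3),
    ("TurkishNewsCategoryClassification", 88.8, 90.8),
    ("TurkishOffensiveLanguageClassification", 63.9, 59.6),
    ("TurkishProductSentimentClassification", 60.9, 59.1),
    ("WMT16BitextMining", 97.1, 91.9),
    ("XNLI", 76.0, 60.8),
]


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_200m = mean(new for _, new, _ in SCORES)
avg_152m = mean(old for _, _, old in SCORES)

# Difference = 200m score minus 152m score, rounded to one decimal.
differences = {task: round(new - old, 1) for task, new, old in SCORES}
improved = [task for task, delta in differences.items() if delta > 0]

print(f"200m average: {avg_200m:.1f}, 152m average: {avg_152m:.1f}")
print(f"{len(improved)} of {len(SCORES)} tasks improved")
```

The gains are concentrated in NLI, sentiment, and bitext mining, while retrieval on SquadTRRetrieval and the news classification and abstract clustering tasks score lower than in the 152m version.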
Model Architecture
This model uses the SentenceTransformers format with the following pipeline. The max_seq_length value is set to 8192.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: Gemma3TextModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
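To make the pooling and normalization stages concrete, here is a toy sketch of modules (1) and (4) in plain Python. It is an illustration on three-dimensional vectors, not the library's implementation; the two bias-free Dense layers in between are omitted, and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Two real tokens followed by one padding token.
tokens = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)            # mean of the first two vectors
sentence_embedding = l2_normalize(pooled)   # unit-length output
print(pooled, sentence_embedding)
```

In the real model the same steps run on 768-dimensional token states, so the final embedding has unit length and 768 components.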
Training and Adaptation Pipeline
The model was not trained from scratch. Instead, it was developed using an efficient three-stage adaptation pipeline.
1. Tokenizer Surgery
A Turkish-optimized multilingual tokenizer was constructed with a vocabulary size of 131,072 tokens.
The goal of this step was to improve Turkish tokenization efficiency while keeping multilingual capacity. This was done by pruning redundant tokens from the teacher vocabulary and incorporating multilingual tokens using frequency analysis over a balanced 40-language corpus.
2. Teacher Model Cloning
Instead of randomly initializing the full student model, the transformer backbone weights of the teacher model were preserved. A compatible embedding table was initialized for the new vocabulary through mean-composition token mapping.
This allows the student model to inherit the teacher model's representational capacity while adapting its vocabulary to Turkish and multilingual usage.
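One plausible reading of mean-composition token mapping is that each token in the new vocabulary is split into pieces the teacher vocabulary already knows, and its embedding is initialized as the mean of those pieces' embeddings. The sketch below illustrates that idea under this assumption; the greedy splitter stands in for the teacher tokenizer, and all names and the toy vectors are hypothetical rather than taken from the paper's code.

```python
def split_with_old_vocab(token, old_vocab):
    """Greedy longest-match split; a stand-in for running the teacher tokenizer."""
    pieces, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in old_vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            raise KeyError(f"cannot cover {token[start:]!r} with the old vocabulary")
    return pieces


def init_new_embeddings(new_vocab, old_embeddings):
    """Copy rows for shared tokens; average teacher pieces for new tokens."""
    new_embeddings = {}
    for token in new_vocab:
        if token in old_embeddings:
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = split_with_old_vocab(token, old_embeddings)
        dim = len(old_embeddings[pieces[0]])
        new_embeddings[token] = [
            sum(old_embeddings[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return new_embeddings


# Toy teacher table with 2-dimensional embeddings.
old_embeddings = {"ev": [1.0, 0.0], "ler": [0.0, 1.0], "de": [1.0, 1.0]}
new_embeddings = init_new_embeddings(["ev", "evler", "evlerde"], old_embeddings)
print(new_embeddings)
```

Here "evler" is initialized from "ev" and "ler", and "evlerde" from "ev", "ler", and "de", so new Turkish tokens start near the region of embedding space the backbone already associates with their parts.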
3. Offline Embedding Distillation
The student model was trained to approximate the teacher embedding space using precomputed teacher vectors.
This avoids online teacher inference during training, making the process faster, cheaper, and easier to reproduce. The distillation objective uses embedding-space similarity, such as cosine similarity, between student outputs and stored teacher embeddings.
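As a rough illustration of such an objective, the sketch below computes a mean (1 - cosine similarity) loss between student outputs and cached teacher vectors. It assumes a plain cosine loss and uses toy two-dimensional vectors; the exact loss, batching, and storage format used in training are not specified here, and the names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distillation_loss(student_batch, teacher_batch):
    """Mean of (1 - cosine similarity) over aligned student/teacher pairs."""
    if len(student_batch) != len(teacher_batch):
        raise ValueError("student and teacher batches must have the same size")
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)


# Teacher vectors are looked up from a precomputed cache, not recomputed.
teacher_cache = {
    "Bugün hava çok güzel.": [0.6, 0.8],
    "Dışarısı güneşli.": [1.0, 0.0],
}
# Stand-in for the student's current outputs on the same texts.
student_outputs = {
    "Bugün hava çok güzel.": [0.6, 0.8],   # matches the teacher
    "Dışarısı güneşli.": [0.0, 1.0],       # orthogonal to the teacher
}

texts = list(teacher_cache)
loss = cosine_distillation_loss(
    [student_outputs[t] for t in texts],
    [teacher_cache[t] for t in texts],
)
print(f"distillation loss: {loss:.3f}")
```

Because the teacher side is a lookup, each training step only needs a forward and backward pass through the student.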
The associated precomputed embedding dataset is available here:
Evaluation
STSbTR Results
The model was evaluated on STSbTR, a Turkish semantic textual similarity benchmark.
| Model | Pearson | Spearman |
|---|---|---|
| intfloat/multilingual-e5-large-instruct | 0.8275 | 0.8129 |
| trmteb/turkish-embedding-model-fine-tuned | 0.8215 | 0.8061 |
| embeddingmagibu-200m | 0.8199 | 0.7980 |
| ytu-ce-cosmos/turkish-e5-large | 0.8090 | 0.7906 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 0.7884 | 0.7659 |
| google/embeddinggemma-300m | 0.7391 | 0.7194 |
These results place embeddingmagibu-200m third of the six models compared on both Pearson and Spearman correlation, and show an improvement over the teacher model on this benchmark.
TR-MTEB Leaderboard Comparison
| Rank | Model | Avg | STS | NLI | Retrieval | Classification | Clustering | Bitext | Other |
|---|---|---|---|---|---|---|---|---|---|
| 1 | intfloat/multilingual-e5-large-instruct | 72.8 | 81.2 | 52.5 | 72.7 | 73.0 | 51.3 | 56.8 | 84.7 |
| 2 | intfloat/multilingual-e5-large | 72.3 | 81.2 | 55.8 | 72.6 | 80.1 | 61.1 | 58.1 | 88.6 |
| 3 | ytu-ce-cosmos/turkish-e5-large | 72.2 | 80.0 | 54.8 | 70.9 | 76.4 | 50.8 | 58.7 | 84.1 |
| 4 | newmindai/TurkEmbed4STS | 71.4 | 85.5 | 63.7 | 81.0 | 69.9 | 53.7 | 56.0 | 84.6 |
| 5 | google/embeddinggemma-300m | 71.0 | 72.9 | 54.7 | 67.6 | 73.3 | - | - | - |
| 6 | selmanbaysan/turkish embedding model fine tuned | 70.5 | 78.4 | 63.2 | 80.0 | 58.1 | 51.7 | 57.2 | 80.4 |
| 7 | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 69.8 | 82.2 | 60.7 | 82.8 | 58.0 | 46.2 | 51.5 | 65.9 |
| 8 | alibaba-NLP/gte-multilingual-base | 69.8 | 80.7 | 60.3 | 75.7 | 68.6 | 56.3 | 56.8 | 81.9 |
| 9 | alibayram/embeddingmagibu-200m | 69.5 | 77.5 | 60.8 | 76.0 | 62.3 | - | 57.4 | 79.5 |
| 10 | intfloat/multilingual-e5-base | 69.5 | 78.4 | 54.0 | 68.8 | 76.9 | 56.0 | 57.1 | 86.9 |
Usage
Installation
pip install -U sentence-transformers
Basic Usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "alibayram/embeddingmagibu-200m",
    trust_remote_code=True,
)

sentences = [
    "Bugün hava çok güzel.",
    "Dışarısı güneşli.",
    "Uzun bağlam gerektiren çok detaylı bir hukuki veya teknik metin...",
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 768)
Similarity Computation
similarities = embeddings @ embeddings.T
print(similarities)
Because the embeddings are L2-normalized, the dot product of two embeddings equals their cosine similarity.
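A small self-contained check of this equivalence on toy vectors (plain Python, no model required; the helper names are illustrative):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(a):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(raw_a), l2_normalize(raw_b)

# After normalization the denominator of the cosine formula is 1,
# so the plain dot product gives the same value.
same = math.isclose(dot(unit_a, unit_b), cosine(raw_a, raw_b))
print(same)
```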
Query and Document Encoding
The model supports query/document-style encoding through SentenceTransformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "alibayram/embeddingmagibu-200m",
    trust_remote_code=True,
)

query = "Yapay zeka modellerinde distillation nedir?"
documents = [
    "Distillation, büyük bir öğretmen modelin bilgisinin daha küçük bir öğrenci modele aktarılmasıdır.",
    "Yapay zeka günümüzde çok popüler.",
]

query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)

scores = model.similarity(query_embedding, document_embeddings)
print(scores)
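For retrieval, the scores are usually turned into a ranking. The helper below is a minimal sketch of that step; `rank_documents` is a hypothetical name, and the score values are made up for illustration rather than produced by the model. With the snippet above, the scores for a single query could be obtained as a plain list with `scores[0].tolist()`.

```python
def rank_documents(documents, scores, top_k=None):
    """Return (document, score) pairs sorted from most to least similar."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


documents = [
    "Yapay zeka günümüzde çok popüler.",
    "Distillation, büyük bir öğretmen modelin bilgisinin daha küçük bir öğrenci modele aktarılmasıdır.",
]
example_scores = [0.32, 0.71]  # illustrative values, not real model output

best = rank_documents(documents, example_scores, top_k=1)
for document, score in best:
    print(f"{score:.2f}  {document}")
```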
Limitations
- Although the model supports an 8,192-token context window, very long inputs may require more GPU memory.
- For long documents, chunking may still be useful depending on the downstream retrieval setup.
- The model is optimized for Turkish-centered use cases, but multilingual behavior may vary across languages.
- The quality of the distilled model depends on the teacher model and the multilingual distillation corpus.
- Benchmark results may vary depending on evaluation settings, pooling behavior, precision, and prompt/query formatting.
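The chunking mentioned above can be sketched as a sliding window with overlap. This version counts whitespace-separated words as a rough proxy for tokens, which is an assumption: real token counts depend on the tokenizer, and the window sizes here are illustrative defaults rather than recommended settings.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
print(chunks)
```

Each chunk can then be encoded separately, and retrieval can score a document by its best-matching chunk.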
Related Resources
Paper:
Dataset:
Project page:
GitHub repository:
Citation
If you use this model in academic work, please cite the associated paper:
@article{bayram2026embeddingmagibu,
title={Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation},
author={Bayram, M. Ali and Diri, Banu and Yıldırım, Savaş},
journal={arXiv preprint arXiv:2605.29992},
year={2026},
url={https://arxiv.org/abs/2605.29992}
}
You may also cite the model repository:
@misc{embeddingmagibu_200m_2026,
title={embeddingmagibu-200m: Long-Context Turkish Sentence Embeddings},
author={Bayram, M. Ali},
year={2026},
url={https://huggingface.co/alibayram/embeddingmagibu-200m}
}
Model Card Authors and Contact
- M. Ali Bayram
- Hugging Face: https://huggingface.co/alibayram
- GitHub: https://github.com/malibayram
