embeddingmagibu-200m

embeddingmagibu-200m is a Turkish-focused, multilingual sentence embedding model developed through cross-lingual tokenizer surgery, teacher-model cloning, and offline embedding distillation.

This model is released as part of the artifacts for the paper:

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
M. Ali Bayram, Banu Diri, Savaş Yıldırım
arXiv: https://arxiv.org/abs/2605.29992

The model produces 768-dimensional L2-normalized sentence embeddings and supports an 8,192-token context window, making it suitable for Turkish semantic search, retrieval-augmented generation, long-document representation, clustering, classification, and semantic textual similarity tasks.

Although the model is optimized for Turkish, it was adapted from a multilingual teacher and trained with a multilingual distillation setup, making it useful for Turkish-centered multilingual NLP scenarios.

Key Features

  • Turkish-focused sentence embedding model
  • Multilingual support through a 40-language adaptation setup
  • 8,192-token context length
  • 768-dimensional normalized embeddings
  • Approximately 200M parameters
  • SentenceTransformers-compatible architecture
  • Trained through offline embedding distillation
  • Built with a Turkish-optimized multilingual tokenizer
  • Designed for semantic search, RAG, clustering, classification, and STS

Associated Paper

This model is one of the released artifacts of the following paper:

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
arXiv: https://arxiv.org/abs/2605.29992

The paper introduces an efficient adaptation pipeline for building Turkish-focused embedding models without expensive full pretraining. The pipeline consists of:

  1. Constructing a Turkish-optimized multilingual tokenizer
  2. Cloning the teacher embedding model while preserving transformer backbone weights
  3. Distilling from precomputed teacher embeddings using an offline embedding dataset

Intended Use

This model is suitable for:

  • Turkish semantic search
  • Retrieval-augmented generation
  • Long-context retrieval
  • Semantic textual similarity
  • Sentence and document embedding
  • Clustering
  • Classification using embeddings
  • Multilingual retrieval with Turkish-centered applications
  • Low-resource and morphologically rich language NLP research

Model Performance

1. Detailed TR-MTEB Results

The model was evaluated on Turkish embedding benchmark tasks. The table below summarizes the detailed task-level results.

| Category | Task | Score |
|---|---|---|
| STS | STSbTR | 77.5 |
| NLI | SnliTr | 60.8 |
| NLI | XNLI | 76.0 |
| Retrieval | SquadTRRetrieval | 62.3 |
| Retrieval | MSMarcoTRRetrieval | 57.4 |
| Retrieval | TQuadRetrieval | 79.5 |
| Classification | THYSentimentClassification | 59.5 |
| Classification | TSTimelineNewsCategoryClassification | 58.7 |
| Classification | Turkish75NewsClassification | 90.7 |
| Classification | TurkishIronyClassification | 52.6 |
| Classification | TurkishMovieSentimentClassification | 71.9 |
| Classification | TurkishNewsCategoryClassification | 88.8 |
| Classification | TurkishOffensiveLanguageClassification | 63.9 |
| Classification | TurkishProductSentimentClassification | 60.9 |
| Clustering | TurkishAbstractCorpusClustering | 58.9 |
| Clustering | TurkishColumnWritingClustering | 63.6 |
| Bitext Mining | WMT16BitextMining | 97.1 |
| Other | ArguAnaTR | 45.3 |
| Other | NFCorpusTR | 10.7 |
| Overall | Average | 69.5 |

2. Version Comparison: embeddingmagibu-200m vs embeddingmagibu-152m

The table below compares embeddingmagibu-200m with the previous embeddingmagibu-152m version on shared evaluation tasks.

| Task | embeddingmagibu-200m | embeddingmagibu-152m | Difference |
|---|---|---|---|
| Average | 69.5 | 67.0 | +2.5 |
| STSbTR | 77.5 | 75.1 | +2.4 |
| SnliTr | 60.8 | 55.4 | +5.4 |
| SquadTRRetrieval | 62.3 | 68.7 | -6.4 |
| THYSentimentClassification | 59.5 | 51.0 | +8.5 |
| TSTimelineNewsCategoryClassification | 58.7 | 60.8 | -2.1 |
| Turkish75NewsClassification | 90.7 | 92.7 | -2.0 |
| TurkishAbstractCorpusClustering | 58.9 | 61.8 | -2.9 |
| TurkishColumnWritingClustering | 63.6 | 61.8 | +1.8 |
| TurkishIronyClassification | 52.6 | 48.4 | +4.2 |
| TurkishMovieSentimentClassification | 71.9 | 67.3 | +4.6 |
| TurkishNewsCategoryClassification | 88.8 | 90.8 | -2.0 |
| TurkishOffensiveLanguageClassification | 63.9 | 59.6 | +4.3 |
| TurkishProductSentimentClassification | 60.9 | 59.1 | +1.8 |
| WMT16BitextMining | 97.1 | 91.9 | +5.2 |
| XNLI | 76.0 | 60.8 | +15.2 |
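
As a quick check on the comparison, the Average row can be reproduced as the unweighted mean of the 15 task rows. The sketch below copies the scores from the table; the variable names are illustrative only.

```python
# (embeddingmagibu-200m, embeddingmagibu-152m) scores copied from the comparison table.
scores = {
    "STSbTR": (77.5, 75.1),
    "SnliTr": (60.8, 55.4),
    "SquadTRRetrieval": (62.3, 68.7),
    "THYSentimentClassification": (59.5, 51.0),
    "TSTimelineNewsCategoryClassification": (58.7, 60.8),
    "Turkish75NewsClassification": (90.7, 92.7),
    "TurkishAbstractCorpusClustering": (58.9, 61.8),
    "TurkishColumnWritingClustering": (63.6, 61.8),
    "TurkishIronyClassification": (52.6, 48.4),
    "TurkishMovieSentimentClassification": (71.9, 67.3),
    "TurkishNewsCategoryClassification": (88.8, 90.8),
    "TurkishOffensiveLanguageClassification": (63.9, 59.6),
    "TurkishProductSentimentClassification": (60.9, 59.1),
    "WMT16BitextMining": (97.1, 91.9),
    "XNLI": (76.0, 60.8),
}

# Unweighted mean over the shared tasks for each model.
avg_200m = sum(new for new, _ in scores.values()) / len(scores)
avg_152m = sum(old for _, old in scores.values()) / len(scores)

# Per-task difference, rounded to one decimal as in the table.
diffs = {task: round(new - old, 1) for task, (new, old) in scores.items()}

print(round(avg_200m, 1), round(avg_152m, 1))  # 69.5 67.0
print(max(diffs, key=diffs.get), min(diffs, key=diffs.get))  # XNLI SquadTRRetrieval
```

The largest gain is on XNLI (+15.2) and the largest regression is on SquadTRRetrieval (-6.4), so the +2.5 average improvement is not uniform across tasks.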

Model Architecture

This model uses the SentenceTransformers format with the following pipeline. The max_seq_length value is set to 8192.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: Gemma3TextModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
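
To make the data flow through modules (1) to (4) concrete, here is a minimal pure-Python sketch of mean pooling, the two bias-free linear projections, and L2 normalization. It uses toy sizes (8 and 32 instead of 768 and 3072) and random weights as stand-ins for the real transformer output and trained parameters, so it illustrates the shapes and operations only, not the model's actual values.

```python
import math
import random


def mean_pool(token_vectors):
    """Average token vectors over the sequence axis (mean-token pooling)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def linear(x, weight):
    """Bias-free linear layer; weight has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def l2_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def random_matrix(rows, cols, rng):
    """Random stand-in for trained weights or transformer outputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden, inner, seq_len = 8, 32, 5  # toy stand-ins for 768, 3072, and a short input

token_vectors = random_matrix(seq_len, hidden, rng)  # stand-in for module (0) output
w_up = random_matrix(inner, hidden, rng)             # module (2): hidden -> inner
w_down = random_matrix(hidden, inner, rng)           # module (3): inner -> hidden

pooled = mean_pool(token_vectors)                    # module (1)
projected = linear(linear(pooled, w_up), w_down)     # identity activations
embedding = l2_normalize(projected)                  # module (4)

print(len(embedding), math.sqrt(sum(v * v for v in embedding)))
```

Because both Dense layers use an identity activation and no bias, the two projections compose into a single linear map from the pooled vector to the output space.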

Training and Adaptation Pipeline

The model was not trained from scratch. Instead, it was developed using an efficient three-stage adaptation pipeline.

1. Tokenizer Surgery

A Turkish-optimized multilingual tokenizer was constructed with a vocabulary size of 131,072 tokens.

The goal of this step was to improve Turkish tokenization efficiency while keeping multilingual capacity. This was done by pruning redundant tokens from the teacher vocabulary and incorporating multilingual tokens using frequency analysis over a balanced 40-language corpus.

2. Teacher Model Cloning

Instead of randomly initializing the full student model, the transformer backbone weights of the teacher model were preserved. A compatible embedding table was initialized for the new vocabulary through mean-composition token mapping.

This allows the student model to inherit the teacher model's representational capacity while adapting its vocabulary to Turkish and multilingual usage.
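
One common way to realize mean-composition mapping is to split each new token with the old tokenizer and average the old embedding vectors of the resulting pieces. The sketch below illustrates that idea on a toy vocabulary. The function names, the greedy toy tokenizer, and the choice to copy vectors for tokens shared by both vocabularies are assumptions for illustration, not details confirmed by the paper.

```python
def mean_composition_init(new_vocab, old_tokenize, old_embeddings):
    """Build an embedding table for new_vocab from an old embedding table.

    Tokens present in the old vocabulary keep their vector (assumption);
    every other token gets the mean of the vectors of its old-tokenizer pieces.
    """
    new_embeddings = {}
    for token in new_vocab:
        if token in old_embeddings:
            new_embeddings[token] = list(old_embeddings[token])
            continue
        vectors = [old_embeddings[piece] for piece in old_tokenize(token)]
        new_embeddings[token] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return new_embeddings


# Toy old vocabulary with 2-dimensional vectors (hypothetical values).
old_embeddings = {"ev": [1.0, 0.0], "ler": [0.0, 1.0], "de": [1.0, 1.0]}


def toy_old_tokenize(text):
    """Greedy longest-match split over the toy old vocabulary (hypothetical)."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old_embeddings:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return pieces


new_vocab = ["ev", "evler", "evlerde"]
new_embeddings = mean_composition_init(new_vocab, toy_old_tokenize, old_embeddings)
print(new_embeddings["evler"])  # [0.5, 0.5]
```

Here "evler" splits into "ev" + "ler", so its initial vector is the mean of those two old vectors. The new token starts near the region of embedding space the backbone already associates with its pieces, rather than from random noise.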

3. Offline Embedding Distillation

The student model was trained to approximate the teacher embedding space using precomputed teacher vectors.

This avoids online teacher inference during training, making the process faster, cheaper, and easier to reproduce. The distillation objective uses embedding-space similarity, such as cosine similarity, between student outputs and stored teacher embeddings.
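
A minimal sketch of such an objective, assuming a loss of one minus cosine similarity averaged over the batch, is shown below. The teacher vectors are looked up from a precomputed store instead of being produced by a teacher forward pass. The store layout, example ids, and exact loss form are assumptions for illustration; the paper's objective may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distillation_loss(student_batch, teacher_batch):
    """Mean of (1 - cosine similarity) over paired student/teacher vectors."""
    losses = [1.0 - cosine_similarity(s, t) for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)


# Hypothetical precomputed teacher embeddings, keyed by example id.
teacher_store = {"ex-0": [1.0, 0.0, 0.0], "ex-1": [0.0, 1.0, 0.0]}

# Hypothetical student outputs for the same examples.
student_outputs = {"ex-0": [2.0, 0.0, 0.0], "ex-1": [1.0, 0.0, 0.0]}

batch_ids = ["ex-0", "ex-1"]
loss = cosine_distillation_loss(
    [student_outputs[i] for i in batch_ids],
    [teacher_store[i] for i in batch_ids],  # lookup only, no teacher inference
)
print(loss)  # 0.5
```

The first pair points in the same direction (loss 0) and the second pair is orthogonal (loss 1), giving a batch loss of 0.5. Cosine-based objectives ignore vector length, which fits a model whose outputs are L2-normalized anyway.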

The associated precomputed embedding dataset is available here:

Evaluation

STSbTR Results

The model was evaluated on STSbTR, a Turkish semantic textual similarity benchmark.

| Model | Pearson | Spearman |
|---|---|---|
| intfloat/multilingual-e5-large-instruct | 0.8275 | 0.8129 |
| trmteb/turkish-embedding-model-fine-tuned | 0.8215 | 0.8061 |
| embeddingmagibu-200m | 0.8199 | 0.7980 |
| ytu-ce-cosmos/turkish-e5-large | 0.8090 | 0.7906 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 0.7884 | 0.7659 |
| google/embeddinggemma-300m | 0.7391 | 0.7194 |

On this benchmark, embeddingmagibu-200m ranks third among the six models listed by both Pearson (0.8199) and Spearman (0.7980) correlation, and it improves over the teacher model.
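
For readers unfamiliar with the two metrics, the sketch below computes Pearson and Spearman correlation between predicted similarities and gold similarity scores. The five score pairs are hypothetical and are not taken from STSbTR. Spearman is computed as the Pearson correlation of the ranks, with ties sharing their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gold = [0.0, 1.0, 2.5, 4.0, 5.0]            # hypothetical human similarity scores
predicted = [0.10, 0.20, 0.55, 0.60, 0.95]  # hypothetical cosine similarities

print(round(pearson(gold, predicted), 4))
print(spearman(gold, predicted))
```

In this toy case the predictions preserve the gold ordering exactly, so Spearman is 1 while Pearson is lower because the relationship is not perfectly linear. This is why the two columns in the table can differ for the same model.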

TR-MTEB Leaderboard Comparison

| Rank | Model | Avg | STS | NLI | Retrieval | Classification | Clustering | Bitext | Other |
|---|---|---|---|---|---|---|---|---|---|
| 1 | intfloat/multilingual-e5-large-instruct | 72.8 | 81.2 | 52.5 | 72.7 | 73.0 | 51.3 | 56.8 | 84.7 |
| 2 | intfloat/multilingual-e5-large | 72.3 | 81.2 | 55.8 | 72.6 | 80.1 | 61.1 | 58.1 | 88.6 |
| 3 | ytu-ce-cosmos/turkish-e5-large | 72.2 | 80.0 | 54.8 | 70.9 | 76.4 | 50.8 | 58.7 | 84.1 |
| 4 | newmindai/TurkEmbed4STS | 71.4 | 85.5 | 63.7 | 81.0 | 69.9 | 53.7 | 56.0 | 84.6 |
| 5 | google/embeddinggemma-300m | 71.0 | 72.9 | 54.7 | 67.6 | 73.3 | - | - | - |
| 6 | selmanbaysan/turkish_embedding_model_fine_tuned | 70.5 | 78.4 | 63.2 | 80.0 | 58.1 | 51.7 | 57.2 | 80.4 |
| 7 | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 69.8 | 82.2 | 60.7 | 82.8 | 58.0 | 46.2 | 51.5 | 65.9 |
| 8 | Alibaba-NLP/gte-multilingual-base | 69.8 | 80.7 | 60.3 | 75.7 | 68.6 | 56.3 | 56.8 | 81.9 |
| 9 | alibayram/embeddingmagibu-200m | 69.5 | 77.5 | 60.8 | 76.0 | 62.3 | - | 57.4 | 79.5 |
| 10 | intfloat/multilingual-e5-base | 69.5 | 78.4 | 54.0 | 68.8 | 76.9 | 56.0 | 57.1 | 86.9 |

In the alibayram/embeddingmagibu-200m row, the values match the task-level scores reported in the detailed results (77.5 STSbTR, 60.8 SnliTr, 76.0 XNLI, 62.3 SquadTRRetrieval, 57.4 MSMarcoTRRetrieval, 79.5 TQuadRetrieval), so for that row the column labels do not correspond to category averages.

Usage

Installation

pip install -U sentence-transformers

Basic Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "alibayram/embeddingmagibu-200m",
    trust_remote_code=True
)

sentences = [
    "Bugün hava çok güzel.",
    "Dışarısı güneşli.",
    "Uzun bağlam gerektiren çok detaylı bir hukuki veya teknik metin..."
]

embeddings = model.encode(sentences, normalize_embeddings=True)

print(embeddings.shape)  # (3, 768)

Similarity Computation

similarities = embeddings @ embeddings.T
print(similarities)

Because the embeddings are L2-normalized, their dot product equals their cosine similarity.
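
The equivalence follows from the definition cos(a, b) = (a · b) / (‖a‖ ‖b‖): once both vectors have unit length, the denominator is 1. A small self-contained check with two arbitrary 2-dimensional vectors (not model outputs):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# Both values are 24/25 up to floating-point rounding.
print(cosine(a, b), dot(a_hat, b_hat))
```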

Query and Document Encoding

The model supports query/document-style encoding through the encode_query and encode_document methods of SentenceTransformers (available from sentence-transformers 5.0).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "alibayram/embeddingmagibu-200m",
    trust_remote_code=True
)

query = "Yapay zeka modellerinde distillation nedir?"

documents = [
    "Distillation, büyük bir öğretmen modelin bilgisinin daha küçük bir öğrenci modele aktarılmasıdır.",
    "Yapay zeka günümüzde çok popüler.",
]

query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)

scores = model.similarity(query_embedding, document_embeddings)
print(scores)

Limitations

  • Although the model supports an 8,192-token context window, very long inputs may require more GPU memory.
  • For long documents, chunking may still be useful depending on the downstream retrieval setup.
  • The model is optimized for Turkish-centered use cases, but multilingual behavior may vary across languages.
  • The quality of the distilled model depends on the teacher model and the multilingual distillation corpus.
  • Benchmark results may vary depending on evaluation settings, pooling behavior, precision, and prompt/query formatting.
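
For the chunking point above, one simple option is a sliding window over token ids with a small overlap, so that text near a chunk boundary appears in both neighbouring chunks. The sketch below is one possible approach, not a recommendation from the authors. The function name and the overlap of 256 tokens are illustrative assumptions; only the 8,192-token limit comes from this card.

```python
def chunk_tokens(token_ids, max_len=8192, overlap=256):
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens. The overlap default is an
    arbitrary illustrative value.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end of the input
    return chunks


tokens = list(range(20000))  # stand-in for the token ids of a long document
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [8192, 8192, 4128]
```

Each chunk can then be embedded separately and indexed as its own retrieval unit. Smaller windows reduce peak GPU memory at the cost of more vectors per document.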

Related Resources

Paper: https://arxiv.org/abs/2605.29992

Dataset:

Project page:

GitHub repository:

Citation

If you use this model in academic work, please cite the associated paper:

@article{bayram2026embeddingmagibu,
  title={Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation},
  author={Bayram, M. Ali and Diri, Banu and Yıldırım, Savaş},
  journal={arXiv preprint arXiv:2605.29992},
  year={2026},
  url={https://arxiv.org/abs/2605.29992}
}

You may also cite the model repository:

@misc{embeddingmagibu_200m_2026,
  title={embeddingmagibu-200m: Long-Context Turkish Sentence Embeddings},
  author={Bayram, M. Ali},
  year={2026},
  url={https://huggingface.co/alibayram/embeddingmagibu-200m}
}

Model Card Authors and Contact
