embeddingmagibu-200m

embeddingmagibu-200m is a Turkish-focused, multilingual sentence embedding model developed through cross-lingual tokenizer surgery, teacher-model cloning, and offline embedding distillation.

This model is released as part of the artifacts for the paper:

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
M. Ali Bayram, Banu Diri, Savaş Yıldırım
arXiv: https://arxiv.org/abs/2605.29992

The model produces 768-dimensional L2-normalized sentence embeddings and supports an 8,192-token context window, making it suitable for Turkish semantic search, retrieval-augmented generation, long-document representation, clustering, classification, and semantic textual similarity tasks.

Although the model is optimized for Turkish, it was adapted from a multilingual teacher and trained with a multilingual distillation setup, making it useful for Turkish-centered multilingual NLP scenarios.

Key Features

Turkish-focused sentence embedding model
Multilingual support through a 40-language adaptation setup
8,192-token context length
768-dimensional normalized embeddings
Approximately 200M parameters
SentenceTransformers-compatible architecture
Trained through offline embedding distillation
Built with a Turkish-optimized multilingual tokenizer
Designed for semantic search, RAG, clustering, classification, and STS

Associated Paper

The paper introduces an efficient adaptation pipeline for building Turkish-focused embedding models without expensive full pretraining. The pipeline consists of:

Constructing a Turkish-optimized multilingual tokenizer
Cloning the teacher embedding model while preserving transformer backbone weights
Distilling from precomputed teacher embeddings using an offline embedding dataset

Intended Use

This model is suitable for:

Turkish semantic search
Retrieval-augmented generation
Long-context retrieval
Semantic textual similarity
Sentence and document embedding
Clustering
Classification using embeddings
Multilingual retrieval with Turkish-centered applications
Low-resource and morphologically rich language NLP research

Model Performance

1. Detailed TR-MTEB Results

The model was evaluated on Turkish embedding benchmark tasks. The table below summarizes the detailed task-level results.

Category	Task	Score
STS	STSbTR	77.5
NLI	SnliTr	60.8
NLI	XNLI	76.0
Retrieval	SquadTRRetrieval	62.3
Retrieval	MSMarcoTRRetrieval	57.4
Retrieval	TQuadRetrieval	79.5
Classification	THYSentimentClassification	59.5
Classification	TSTimelineNewsCategoryClassification	58.7
Classification	Turkish75NewsClassification	90.7
Classification	TurkishIronyClassification	52.6
Classification	TurkishMovieSentimentClassification	71.9
Classification	TurkishNewsCategoryClassification	88.8
Classification	TurkishOffensiveLanguageClassification	63.9
Classification	TurkishProductSentimentClassification	60.9
Clustering	TurkishAbstractCorpusClustering	58.9
Clustering	TurkishColumnWritingClustering	63.6
Bitext Mining	WMT16BitextMining	97.1
Other	ArguAnaTR	45.3
Other	NFCorpusTR	10.7
Overall	Average	69.5

2. Version Comparison: embeddingmagibu-200m vs embeddingmagibu-152m

The table below compares embeddingmagibu-200m with the previous embeddingmagibu-152m version on shared evaluation tasks.

Task	embeddingmagibu-200m	embeddingmagibu-152m	Difference
Average	69.5	67.0	+2.5
STSbTR	77.5	75.1	+2.4
SnliTr	60.8	55.4	+5.4
SquadTRRetrieval	62.3	68.7	-6.4
THYSentimentClassification	59.5	51.0	+8.5
TSTimelineNewsCategoryClassification	58.7	60.8	-2.1
Turkish75NewsClassification	90.7	92.7	-2.0
TurkishAbstractCorpusClustering	58.9	61.8	-2.9
TurkishColumnWritingClustering	63.6	61.8	+1.8
TurkishIronyClassification	52.6	48.4	+4.2
TurkishMovieSentimentClassification	71.9	67.3	+4.6
TurkishNewsCategoryClassification	88.8	90.8	-2.0
TurkishOffensiveLanguageClassification	63.9	59.6	+4.3
TurkishProductSentimentClassification	60.9	59.1	+1.8
WMT16BitextMining	97.1	91.9	+5.2
XNLI	76.0	60.8	+15.2
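
The Difference column and the Average row can be rechecked with a few lines of Python. The sketch below copies the 15 per-task scores from the table and assumes that the Average row is the unweighted mean of those tasks. That assumption reproduces the reported 69.5 and 67.0, but the averaging procedure itself is not stated in this card.

```python
# Per-task scores copied from the comparison table: (200m, 152m).
scores = {
    "STSbTR": (77.5, 75.1),
    "SnliTr": (60.8, 55.4),
    "SquadTRRetrieval": (62.3, 68.7),
    "THYSentimentClassification": (59.5, 51.0),
    "TSTimelineNewsCategoryClassification": (58.7, 60.8),
    "Turkish75NewsClassification": (90.7, 92.7),
    "TurkishAbstractCorpusClustering": (58.9, 61.8),
    "TurkishColumnWritingClustering": (63.6, 61.8),
    "TurkishIronyClassification": (52.6, 48.4),
    "TurkishMovieSentimentClassification": (71.9, 67.3),
    "TurkishNewsCategoryClassification": (88.8, 90.8),
    "TurkishOffensiveLanguageClassification": (63.9, 59.6),
    "TurkishProductSentimentClassification": (60.9, 59.1),
    "WMT16BitextMining": (97.1, 91.9),
    "XNLI": (76.0, 60.8),
}

def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

avg_200m = mean(new for new, _ in scores.values())
avg_152m = mean(old for _, old in scores.values())
differences = {task: round(new - old, 1) for task, (new, old) in scores.items()}
improved = [task for task, diff in differences.items() if diff > 0]

print(f"200m average: {avg_200m:.1f}")
print(f"152m average: {avg_152m:.1f}")
print(f"tasks improved: {len(improved)} of {len(scores)}")
```

Under this reading, the newer model improves on 10 of the 15 shared tasks and regresses on 5, with the largest regression on SquadTRRetrieval.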

Model Architecture

This model uses the SentenceTransformers format with the following pipeline. The max_seq_length value is set to 8192.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: Gemma3TextModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
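
To make the data flow of modules (1) to (4) concrete, here is a minimal NumPy sketch of mean pooling, the two bias-free linear projections, and L2 normalization. The weights are random placeholders and `embed_head` is a hypothetical helper name, not part of the released code; the real projections are loaded from the checkpoint.

```python
import numpy as np

def embed_head(token_states, attention_mask, w_up, w_down):
    """Mean-pool token states, apply two bias-free linear maps, then L2-normalize.

    token_states:   (batch, seq_len, 768) outputs of the transformer backbone
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    w_up:           (768, 3072) placeholder weight for Dense module (2)
    w_down:         (3072, 768) placeholder weight for Dense module (3)
    """
    mask = attention_mask[..., None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    pooled = summed / counts                           # module (1): mean pooling
    projected = pooled @ w_up @ w_down                 # modules (2), (3): identity activations
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.clip(norms, 1e-12, None)     # module (4): Normalize

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 5, 768))                  # stand-in backbone outputs
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])    # first sequence is padded
w_up = rng.normal(scale=0.02, size=(768, 3072))
w_down = rng.normal(scale=0.02, size=(3072, 768))

vectors = embed_head(states, mask, w_up, w_down)
print(vectors.shape, np.linalg.norm(vectors, axis=1))
```

Because both Dense modules use an identity activation and no bias, their composition is mathematically a single 768-by-768 linear map applied to the pooled vector.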

Training and Adaptation Pipeline

The model was not trained from scratch. Instead, it was developed using an efficient three-stage adaptation pipeline.

1. Tokenizer Surgery

A Turkish-optimized multilingual tokenizer was constructed with a vocabulary size of 131,072 tokens.

The goal of this step was to improve Turkish tokenization efficiency while keeping multilingual capacity. This was done by pruning redundant tokens from the teacher vocabulary and incorporating multilingual tokens using frequency analysis over a balanced 40-language corpus.
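
One way to read "frequency analysis over a balanced 40-language corpus" is that each language contributes equally to the token ranking, regardless of how much text it has. The sketch below illustrates that idea on toy data. The `select_tokens` helper, the whitespace pre-tokenization, and the per-language normalization are illustrative assumptions, not the procedure used in the paper.

```python
from collections import Counter

def select_tokens(corpus_by_language, budget):
    """Rank candidate tokens by relative frequency averaged over languages."""
    combined = Counter()
    for language, texts in corpus_by_language.items():
        counts = Counter(token for text in texts for token in text.split())
        total = sum(counts.values())
        for token, count in counts.items():
            # Relative frequency gives every language the same total weight.
            combined[token] += count / total / len(corpus_by_language)
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:budget]]

toy_corpus = {
    "tr": ["ev ev ev kitap", "ev kitap"],   # larger toy corpus
    "en": ["house book"],                   # smaller toy corpus
}
print(select_tokens(toy_corpus, budget=3))
```

With raw counts the smaller English sample would be outvoted by the Turkish one; with per-language normalization its tokens still compete for a place in the vocabulary budget.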

2. Teacher Model Cloning

Instead of randomly initializing the full student model, the transformer backbone weights of the teacher model were preserved. A compatible embedding table was initialized for the new vocabulary through mean-composition token mapping.

This allows the student model to inherit the teacher model's representational capacity while adapting its vocabulary to Turkish and multilingual usage.
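
Mean-composition token mapping can be sketched as follows: each token in the new vocabulary is split into pieces that exist in the teacher vocabulary, and its embedding row is initialized to the average of those pieces' rows. The greedy longest-match splitter, the helper names, and the toy two-dimensional vectors below are stand-ins chosen for illustration; this card does not specify how the actual mapping is computed.

```python
def split_into_teacher_pieces(token, teacher_vocab):
    """Greedy longest-match split of `token` into teacher-vocabulary pieces."""
    pieces, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in teacher_vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot cover {token!r} with teacher pieces")
    return pieces

def init_embedding(token, teacher_vocab, teacher_embeddings):
    """Average the teacher embedding rows of the pieces that spell `token`."""
    pieces = split_into_teacher_pieces(token, teacher_vocab)
    rows = [teacher_embeddings[teacher_vocab[piece]] for piece in pieces]
    return [sum(column) / len(rows) for column in zip(*rows)]

# Toy teacher vocabulary (piece -> row index) and embedding table.
teacher_vocab = {"kitap": 0, "lar": 1, "ı": 2}
teacher_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# "kitapları" is a single token in the hypothetical new vocabulary.
print(init_embedding("kitapları", teacher_vocab, teacher_embeddings))
```

A token that already exists in the teacher vocabulary maps to exactly one piece, so it simply keeps its original row.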

3. Offline Embedding Distillation

The student model was trained to approximate the teacher embedding space using precomputed teacher vectors.

This avoids online teacher inference during training, making the process faster, cheaper, and easier to reproduce. The distillation objective uses embedding-space similarity, such as cosine similarity, between student outputs and stored teacher embeddings.
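
As an illustration of such an objective, the sketch below computes a mean `1 - cosine similarity` loss between a batch of student vectors and teacher vectors that were stored ahead of time. It is written in plain Python for clarity. The function name and the exact loss form are assumptions, and a real training loop would compute this on tensors with automatic differentiation.

```python
import math

def cosine_distillation_loss(student_batch, teacher_batch):
    """Mean of (1 - cosine similarity) over paired student/teacher vectors."""
    if len(student_batch) != len(teacher_batch):
        raise ValueError("batches must have the same number of vectors")
    losses = []
    for student, teacher in zip(student_batch, teacher_batch):
        dot = sum(s * t for s, t in zip(student, teacher))
        norm_s = math.sqrt(sum(s * s for s in student))
        norm_t = math.sqrt(sum(t * t for t in teacher))
        losses.append(1.0 - dot / (norm_s * norm_t))
    return sum(losses) / len(losses)

# In the offline setup teacher vectors are read from storage; these are toy values.
teacher_vectors = [[1.0, 0.0], [0.0, 1.0]]
student_vectors = [[2.0, 0.0], [1.0, 0.0]]   # first pair aligned, second orthogonal

print(cosine_distillation_loss(student_vectors, teacher_vectors))
```

The loss is 0 when a student vector points in the same direction as its teacher vector and grows as the two diverge, independent of vector length.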

The associated precomputed embedding dataset is available here:

https://huggingface.co/datasets/alibayram/wikipedia-40-langs-with-embeddings

Evaluation

STSbTR Results

The model was evaluated on STSbTR, a Turkish semantic textual similarity benchmark.

Model	Pearson	Spearman
intfloat/multilingual-e5-large-instruct	0.8275	0.8129
trmteb/turkish-embedding-model-fine-tuned	0.8215	0.8061
embeddingmagibu-200m	0.8199	0.7980
ytu-ce-cosmos/turkish-e5-large	0.8090	0.7906
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2	0.7884	0.7659
google/embeddinggemma-300m	0.7391	0.7194

On this benchmark, embeddingmagibu-200m ranks third of the six models listed on both metrics, 0.0076 Pearson and 0.0149 Spearman behind intfloat/multilingual-e5-large-instruct, and it improves over the teacher model.

TR-MTEB Leaderboard Comparison

A dash marks a score that is not reported. The column labels should be read with care: for alibayram/embeddingmagibu-200m, the values after Avg match the task-level scores listed under Detailed TR-MTEB Results (STSbTR 77.5, SnliTr 60.8, XNLI 76.0, SquadTRRetrieval 62.3, MSMarcoTRRetrieval 57.4, TQuadRetrieval 79.5) rather than category averages.

Rank	Model	Avg	STS	NLI	Retrieval	Classification	Clustering	Bitext	Other
1	intfloat/multilingual-e5-large-instruct	72.8	81.2	52.5	72.7	73.0	51.3	56.8	84.7
2	intfloat/multilingual-e5-large	72.3	81.2	55.8	72.6	80.1	61.1	58.1	88.6
3	ytu-ce-cosmos/turkish-e5-large	72.2	80.0	54.8	70.9	76.4	50.8	58.7	84.1
4	newmindai/TurkEmbed4STS	71.4	85.5	63.7	81.0	69.9	53.7	56.0	84.6
5	google/embeddinggemma-300m	71.0	72.9	54.7	67.6	73.3	-	-	-
6	selmanbaysan/turkish_embedding_model_fine_tuned	70.5	78.4	63.2	80.0	58.1	51.7	57.2	80.4
7	sentence-transformers/paraphrase-multilingual-mpnet-base-v2	69.8	82.2	60.7	82.8	58.0	46.2	51.5	65.9
8	Alibaba-NLP/gte-multilingual-base	69.8	80.7	60.3	75.7	68.6	56.3	56.8	81.9
9	alibayram/embeddingmagibu-200m	69.5	77.5	60.8	76.0	62.3	-	57.4	79.5
10	intfloat/multilingual-e5-base	69.5	78.4	54.0	68.8	76.9	56.0	57.1	86.9

Usage

Installation

pip install -U sentence-transformers

Basic Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "alibayram/embeddingmagibu-200m",
    trust_remote_code=True
)

sentences = [
    "Bugün hava çok güzel.",
    "Dışarısı güneşli.",
    "Uzun bağlam gerektiren çok detaylı bir hukuki veya teknik metin..."
]

embeddings = model.encode(sentences, normalize_embeddings=True)

print(embeddings.shape)  # (3, 768)

Similarity Computation

similarities = embeddings @ embeddings.T
print(similarities)

Because the embeddings are L2-normalized, the dot product of two embeddings equals their cosine similarity.
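
The equivalence follows from the definition cos(a, b) = (a · b) / (|a| |b|): when both norms equal 1, the denominator disappears. A small check in plain Python, using made-up vectors rather than model outputs:

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both expressions give the same value up to floating-point rounding.
print(cosine(a, b), dot(a_unit, b_unit))
```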

Query and Document Encoding

The model supports query/document-style encoding through SentenceTransformers.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "alibayram/embeddingmagibu-200m",
    trust_remote_code=True
)

query = "Yapay zeka modellerinde distillation nedir?"

documents = [
    "Distillation, büyük bir öğretmen modelin bilgisinin daha küçük bir öğrenci modele aktarılmasıdır.",
    "Yapay zeka günümüzde çok popüler.",
]

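# encode_query and encode_document require sentence-transformers v5.0 or newer.
# They apply the query/document prompts configured for the model, if any.
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)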

scores = model.similarity(query_embedding, document_embeddings)
print(scores)
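
Once scores are available, retrieval reduces to sorting documents by score. The sketch below uses hard-coded scores as stand-ins for the model output; the numbers and the `top_k` helper are hypothetical and are not results from this model.

```python
def top_k(scores, documents, k=1):
    """Return the k (score, document) pairs with the highest scores."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

example_documents = ["document about distillation", "document about popularity"]
example_scores = [0.71, 0.32]   # placeholder values, not measured

for score, document in top_k(example_scores, example_documents, k=1):
    print(f"{score:.2f}  {document}")
```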

Limitations

Although the model supports an 8,192-token context window, memory use and latency grow with input length, so inputs near the limit need considerably more GPU memory than short sentences.
For long documents, chunking may still be useful depending on the downstream retrieval setup.
The model is optimized for Turkish-centered use cases, but multilingual behavior may vary across languages.
The quality of the distilled model depends on the teacher model and the multilingual distillation corpus.
Benchmark results may vary depending on evaluation settings, pooling behavior, precision, and prompt/query formatting.
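
For the chunking mentioned above, a common approach is to split a tokenized document into overlapping windows and embed each window separately. The sketch below works on a list of token IDs. The `chunk_token_ids` helper, the window size, and the overlap are illustrative choices rather than recommendations from the model authors.

```python
def chunk_token_ids(token_ids, window=8192, overlap=256):
    """Split `token_ids` into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens so that text near a boundary
    appears in both neighbours.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end
    return chunks

document = list(range(20))   # stand-in for real token IDs
for chunk in chunk_token_ids(document, window=8, overlap=2):
    print(chunk[0], chunk[-1], len(chunk))
```

Each chunk can then be embedded on its own and stored alongside an identifier of the source document, so that a retrieved chunk can be traced back to it.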

Related Resources

Paper:

https://arxiv.org/abs/2605.29992

Dataset:

https://huggingface.co/datasets/alibayram/wikipedia-40-langs-with-embeddings

Project page:

https://huggingface.co/spaces/magibu/embeddingmagibu-200m

GitHub repository:

https://github.com/malibayram/embedding-trainer

Citation

If you use this model in academic work, please cite the associated paper:

@article{bayram2026embeddingmagibu,
  title={Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation},
  author={Bayram, M. Ali and Diri, Banu and Yıldırım, Savaş},
  journal={arXiv preprint arXiv:2605.29992},
  year={2026},
  url={https://arxiv.org/abs/2605.29992}
}

You may also cite the model repository:

@misc{embeddingmagibu_200m_2026,
  title={embeddingmagibu-200m: Long-Context Turkish Sentence Embeddings},
  author={Bayram, M. Ali},
  year={2026},
  url={https://huggingface.co/alibayram/embeddingmagibu-200m}
}

Model Card Authors and Contact

M. Ali Bayram
Hugging Face: https://huggingface.co/alibayram
GitHub: https://github.com/malibayram
