SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, soup_30)

sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset

Overview

Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact — for on-device deployment on Apple Silicon, use the 5-bit MLX variant.

What's new (model soup release)

This model is a weight-space average of two strong checkpoints from the same fine-tuning lineage:

0.3 × v55 (latest: 2-epoch refinement at lr 2e-6) — strongest on number-accuracy and filler-stripping
0.7 × v51 (the prior production model) — strongest on adversarial sampling benchmark

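Linear interpolation in weight space, θ_soup = α·θ_v55 + (1-α)·θ_v51 with α = 0.3, is commonly called "model souping". It works here because v55 was chained from v51, so the two checkpoints share an architecture and sit in related minima. The soup keeps v55's number accuracy (96.5%), beats both parents on the adversarial benchmark in greedy and sampling modes, and shows no loops on the 264 sampling-mode probes. The one small regression is the filler-free rate on long inputs (71.8% vs. 72.6% for v55). The full recipe sweep is in the research journal (2026-05-06 loop).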
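The interpolation can be sketched in a few lines. This is a minimal illustration, not the script used for this release: `soup_state_dicts` is a hypothetical helper name, and the toy dictionaries stand in for real checkpoints. With actual models the values would be the tensors from each checkpoint's `state_dict()`, and any integer buffers would need to be copied rather than averaged.

```python
from typing import Any, Dict


def soup_state_dicts(sd_new: Dict[str, Any], sd_old: Dict[str, Any], alpha: float) -> Dict[str, Any]:
    """Return alpha * sd_new + (1 - alpha) * sd_old, parameter by parameter.

    Works for any values that support scalar multiplication and addition
    (floats here; tensors or arrays in practice).
    """
    if sd_new.keys() != sd_old.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: alpha * sd_new[name] + (1.0 - alpha) * sd_old[name]
        for name in sd_new
    }


# Toy stand-ins for the two checkpoints (assumed values, for illustration only).
theta_v55 = {"layer.weight": 2.0, "layer.bias": -1.0}
theta_v51 = {"layer.weight": 4.0, "layer.bias": 1.0}

# alpha = 0.3 puts 30% of the weight on v55 and 70% on v51.
theta_soup = soup_state_dicts(theta_v55, theta_v51, alpha=0.3)
print(theta_soup)
```

Because the two checkpoints have identical parameter names and shapes, the average is a valid model of the same architecture and needs no further training to load.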

Headline numbers (production-mode eval: `max_new_tokens=900`, `repetition_penalty=1.05`)

Capability	v36	v45	v51	v55	soup (this)
Number accuracy (171-sample stratified val)	12.9%	95.9%	95.3%	96.5%	96.5%
66-case adversarial benchmark (greedy)	n/a	76%	84.8%	84.8%	86.4%
66-case adversarial benchmark (temp 0.7 × 4)	n/a	77%	84.5%	82.6%	86.0%
Loops on 264 sampling-mode probes	n/a	0	1	2	0
Filler-free on 241 long inputs	67.2%	68.0%	72.2%	72.6%	71.8%
Sub-deletion >15% on 241 long inputs	13.3%	13.7%	4.6%	5.0%	5.0%
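The "Loops" row counts outputs that fall into degenerate repetition. The card does not specify the detector, so the following is only one assumed way to flag such outputs: it looks for a word 5-gram repeated back to back. The function name `has_ngram_loop` and the `min_repeats` threshold are hypothetical choices for illustration.

```python
def has_ngram_loop(text: str, n: int = 5, min_repeats: int = 3) -> bool:
    """Return True if some word n-gram occurs min_repeats times consecutively.

    Only exact repetition with period n is detected; near-duplicates and
    loops with a different period are not.
    """
    words = text.split()
    span = n * min_repeats
    for start in range(len(words) - span + 1):
        gram = words[start:start + n]
        if all(
            words[start + k * n:start + (k + 1) * n] == gram
            for k in range(1, min_repeats)
        ):
            return True
    return False


looping = "please call me back soon " * 4
clean = "Talk about server 360."
print(has_ngram_loop(looping), has_ngram_loop(clean))
```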

Composite score (0.35×num + 0.30×bench_greedy + 0.15×bench_sample + 0.10×filler_long + 0.05×(1-sub15) + 0.05×(1-loops/N)): 89.51 at full production settings.
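As a worked check, the formula can be evaluated on the soup column of the table. This sketch assumes every rate is expressed in percent and that N is the 264 sampling-mode probes; neither is stated explicitly above. With the rounded table values it gives about 89.5, close to the reported 89.51. The small gap presumably comes from rounding in the table.

```python
def composite_score(
    num: float,
    bench_greedy: float,
    bench_sample: float,
    filler_long: float,
    sub15: float,
    loops: int,
    n_probes: int,
) -> float:
    """Weighted composite; rate arguments are percentages in [0, 100]."""
    return (
        0.35 * num
        + 0.30 * bench_greedy
        + 0.15 * bench_sample
        + 0.10 * filler_long
        + 0.05 * (100.0 - sub15)                     # 1 - sub15, in percent
        + 0.05 * 100.0 * (1.0 - loops / n_probes)    # 1 - loops/N, in percent
    )


# Soup column from the headline table (rounded values).
soup = composite_score(
    num=96.5,
    bench_greedy=86.4,
    bench_sample=86.0,
    filler_long=71.8,
    sub15=5.0,
    loops=0,
    n_probes=264,
)
print(f"{soup:.1f}")
```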

Training pipeline

LiquidAI/LFM2.5-350M-Base
  → SFT v23 → GRPO v23 (paragraph emission)
  → GRPO v36: full FT with substantive-deletion-aware reward
  → SFT v39: + 12.7K augmented number examples (ITN)
  → GRPO v40–v45: chained refinement, fixed reward + amplified filler penalty
  → GRPO v50 + v51: anti-loop n-gram penalty
  → GRPO v55: 2-epoch refinement at lr 2e-6 (best chained checkpoint)
  → soup: 0.3·θ_v55 + 0.7·θ_v51 (weight-space average — this model)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=max(900, int(len(text.split()) * 1.5)),  # ≥1.5× input word count
        do_sample=False,
        repetition_penalty=1.05,                                # LFM2.5 official default
    )
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())

Inference recommendations

The headline numbers above use the settings below. They match the LFM2.5 model card's defaults and are the settings used in production for sottoasr.app:

repetition_penalty=1.05: LFM2.5's official default. Critical for long inputs, where it prevents the rare voicemail-style 5-gram loops that can occur with repetition_penalty=1.0.
max_new_tokens = max(900, 1.5 × input_word_count): long inputs (>200 words) need headroom, because an output truncated mid-sentence looks like content deletion.
do_sample=False (greedy) for deterministic output. If sampling is needed, use temperature=0.1, top_k=50.
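These recommendations can be collected into one small helper so every call site uses the same settings. This is a sketch: `generation_kwargs` is a hypothetical name, while the values are the ones listed above.

```python
def generation_kwargs(text: str, sample: bool = False) -> dict:
    """Build keyword arguments for model.generate() from the input transcript."""
    n_words = len(text.split())
    kwargs = {
        # Floor of 900 tokens, or 1.5 tokens per input word if that is larger.
        "max_new_tokens": max(900, int(n_words * 1.5)),
        "repetition_penalty": 1.05,
        "do_sample": sample,
    }
    if sample:
        # Low-temperature sampling, as recommended when sampling is needed.
        kwargs.update(temperature=0.1, top_k=50)
    return kwargs


short_kwargs = generation_kwargs("talk about server three sixty")
long_kwargs = generation_kwargs("word " * 1000)
print(short_kwargs["max_new_tokens"], long_kwargs["max_new_tokens"])
```

With the model and inputs from the Usage section, the call would then read `model.generate(**inputs, **generation_kwargs(text))`.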

All Variants

Variant	Size	Use Case
Full precision (this)	676 MB	Training, GPU inference
MLX 5-bit	~237 MB	Recommended for Apple Silicon
MLX 4-bit	~195 MB	Smallest

License

MIT

