SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, soup_30)
sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset
Overview
Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact — for on-device deployment on Apple Silicon, use the 5-bit MLX variant.
What's new (model soup release)
This model is a weight-space average of two strong checkpoints from the same fine-tuning lineage:
- 0.3 × v55 (latest: 2-epoch refinement at lr 2e-6) — strongest on number-accuracy and filler-stripping
- 0.7 × v51 (the prior production model) — strongest on adversarial sampling benchmark
Linear interpolation in weight space (θ = α·θ_v55 + (1-α)·θ_v51, here with α = 0.3) is sometimes called "model souping". It works here because v55 was chained from v51 (same architecture, related minima). The soup recovers v51's bench-sample strength and matches v55's number accuracy; the filler-free rate on long inputs dips slightly (71.8% vs. 72.6% for v55). The full recipe sweep is in the research journal (2026-05-06 loop).
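The interpolation formula above can be sketched in a few lines. This is a minimal illustration, not the script used for this release: `soup_state_dicts` is a hypothetical helper, and the toy dictionaries stand in for real checkpoints. It works on any values that support `*` and `+`, so the same function should apply to PyTorch tensors from `model.state_dict()`.

```python
def soup_state_dicts(sd_a, sd_b, alpha):
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter.

    Hypothetical helper: sd_a and sd_b map parameter names to values that
    support scalar multiplication and addition (floats, arrays, tensors).
    """
    if sd_a.keys() != sd_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name]
        for name in sd_a
    }


# Toy stand-ins for the two checkpoints (assumed values, for illustration only).
theta_v55 = {"layer.weight": 1.0, "layer.bias": -2.0}
theta_v51 = {"layer.weight": 3.0, "layer.bias": 0.0}

# alpha = 0.3 matches the 0.3 x v55 + 0.7 x v51 recipe described above.
soup = soup_state_dicts(theta_v55, theta_v51, alpha=0.3)
print(soup)
```

With real checkpoints, the result would be loaded back with `model.load_state_dict(...)`; integer buffers, if any, would need to be copied rather than averaged.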
Headline numbers (production-mode eval: max_new_tokens=900, repetition_penalty=1.05)
| Capability | v36 | v45 | v51 | v55 | soup (this) |
|---|---|---|---|---|---|
| Number accuracy (171-sample stratified val) | 12.9% | 95.9% | 95.3% | 96.5% | 96.5% |
| 66-case adversarial benchmark (greedy) | n/a | 76% | 84.8% | 84.8% | 86.4% |
| 66-case adversarial benchmark (temp 0.7 × 4) | n/a | 77% | 84.5% | 82.6% | 86.0% |
| Loops on 264 sampling-mode probes | n/a | 0 | 1 | 2 | 0 |
| Filler-free on 241 long inputs | 67.2% | 68.0% | 72.2% | 72.6% | 71.8% |
| Sub-deletion >15% on 241 long inputs | 13.3% | 13.7% | 4.6% | 5.0% | 5.0% |
Composite score (0.35×num + 0.30×bench_greedy + 0.15×bench_sample + 0.10×filler_long + 0.05×(1-sub15) + 0.05×(1-loops/N)): 89.51 at full production settings.
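As a sanity check, the composite can be recomputed from the soup column of the table. The sketch below assumes every term is expressed on a 0-100 scale, so `(1-sub15)` becomes `100 - sub15` and the loop term becomes the percentage of loop-free probes; `composite_score` is a hypothetical name. Because the table values are rounded, the result lands near the reported 89.51 rather than reproducing it exactly.

```python
def composite_score(num, bench_greedy, bench_sample, filler_long,
                    sub15, loops, n_probes):
    """Weighted composite; all rate arguments are percentages in [0, 100]."""
    loop_free = 100.0 * (1 - loops / n_probes)  # share of probes without a loop
    return (
        0.35 * num
        + 0.30 * bench_greedy
        + 0.15 * bench_sample
        + 0.10 * filler_long
        + 0.05 * (100.0 - sub15)
        + 0.05 * loop_free
    )


# Soup column from the table above (rounded values).
soup_score = composite_score(
    num=96.5, bench_greedy=86.4, bench_sample=86.0,
    filler_long=71.8, sub15=5.0, loops=0, n_probes=264,
)
print(f"{soup_score:.3f}")
```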
Training pipeline
LiquidAI/LFM2.5-350M-Base
→ SFT v23 → GRPO v23 (paragraph emission)
→ GRPO v36: full FT with substantive-deletion-aware reward
→ SFT v39: + 12.7K augmented number examples (ITN)
→ GRPO v40–v45: chained refinement, fixed reward + amplified filler penalty
→ GRPO v50 + v51: anti-loop n-gram penalty
→ GRPO v55: 2-epoch refinement at lr 2e-6 (best chained checkpoint)
→ soup: 0.3·θ_v55 + 0.7·θ_v51 (weight-space average — this model)
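The pipeline mentions an anti-loop n-gram penalty without giving its form. As a rough illustration of the underlying idea only, the sketch below flags an output when any word 5-gram repeats several times. The function name `has_ngram_loop`, the whitespace tokenization, and the threshold of 3 repeats are all assumptions, not the reward used in training.

```python
from collections import Counter


def has_ngram_loop(text, n=5, min_repeats=3):
    """Return True if any word n-gram occurs at least min_repeats times.

    Hypothetical detector: n and min_repeats are illustrative defaults.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return max(counts.values(), default=0) >= min_repeats


looping = "please call me back at your earliest " * 4
clean = "the quick brown fox jumps over the lazy dog"
print(has_ngram_loop(looping), has_ngram_loop(clean))
```

A penalty term could then subtract a fixed amount from the reward whenever such a check fires, though the actual reward shaping may differ.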
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "talk about server three sixty"
prompt = f"### Input:\n{text}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        # At least 900 new tokens, or 1.5x the input word count if that is larger
        max_new_tokens=max(900, int(len(text.split()) * 1.5)),
        do_sample=False,          # greedy decoding for deterministic output
        repetition_penalty=1.05,  # LFM2.5 official default
    )

# Decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Cut at the next "###" marker in case the model starts another section
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
Inference recommendations
The headline numbers above use these settings — they match the LFM2.5 model card's defaults and are the production deployment for sottoasr.app:
- repetition_penalty=1.05: LFM2.5's official default. Critical for long inputs: prevents the rare voicemail-style 5-gram loops that can occur with repetition_penalty=1.0.
- max_new_tokens >= 1.5 × input_word_count (or 900 minimum): long inputs (>200 words) need headroom; truncating mid-output looks like content deletion.
- do_sample=False (greedy) for deterministic output. If sampling is needed, use temperature=0.1, top_k=50.
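The recommendations above can be collected into one helper that builds the keyword arguments for `model.generate`. This is a sketch: `generation_kwargs` is a hypothetical name, and word count is approximated with whitespace splitting, as in the usage example.

```python
def generation_kwargs(text, sample=False):
    """Build generate() settings following the recommendations above."""
    word_count = len(text.split())
    kwargs = {
        # 900-token floor, or 1.5x the input word count for long inputs
        "max_new_tokens": max(900, int(word_count * 1.5)),
        "repetition_penalty": 1.05,
        "do_sample": sample,
    }
    if sample:
        # Low-temperature sampling, only when sampling is explicitly requested
        kwargs.update(temperature=0.1, top_k=50)
    return kwargs


short_cfg = generation_kwargs("talk about server three sixty")
long_cfg = generation_kwargs(" ".join(["word"] * 1000), sample=True)
print(short_cfg)
print(long_cfg)
```

The dictionary can be passed straight through, for example `model.generate(**inputs, **generation_kwargs(text))`.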
All Variants
| Variant | Size | Use Case |
|---|---|---|
| Full precision (this) | 676 MB | Training, GPU inference |
| MLX 5-bit | ~237 MB | Recommended for Apple Silicon |
| MLX 4-bit | ~195 MB | Smallest |
License
MIT