- MyPO Training - Python Type Hint Fine-Tuning
- Goal
- Dataset
- Base Model
- Training Approaches
- Improvements Over Previous Attempt
- Research Basis
- Running Training
- Monitoring
- Evaluation
- Example outputs
- Single-prompt validation (n=30)
- HumanEval+ external benchmark (n=164)
- Alpaca-stripped validation (n=30)
- Compute & environmental impact
- Emissions tracking
- License
- Citations
- Goal
MyPO Training - Python Type Hint Fine-Tuning
This repository contains training scripts for fine-tuning code language models to generate Python code with modern type hints by default.
Goal
Train otherwise-good Python coding LLMs to write code that is by-default compliant with:
- ruff (linting)
- black (formatting)
- mypy --strict (type checking)
Dataset
Source: joshuasundance/mypo-4k-rfc
- Train: 4,000 examples
- Validation: 2,361 examples
- Combined: 6,361 examples (
4,000 train + 2,361 validation)
Format: DPO (Direct Preference Optimization)
prompt: Coding instructionchosen: Type-hinted Python code (target)rejected: Non-type-hinted Python code (baseline)
Base Model
Qwen/Qwen2.5-Coder-1.5B-Instruct
- 1.5B parameters
- Excellent code generation capabilities
- Apache 2.0 license
- Already instruction-tuned
Training Approaches
1. SFT (Supervised Fine-Tuning)
Script: mypo_sft_train.py
Converts DPO format to conversational SFT format and trains on the "chosen" (type-hinted) responses.
Key Hyperparameters:
| Parameter | Value | Rationale |
|---|---|---|
| LoRA rank (r) | 256 | Higher rank needed for code tasks |
| LoRA alpha | 16 | Standard 1/16 ratio |
| Target modules | all-linear | Critical for matching full fine-tuning |
| Learning rate | 2e-4 | 10x higher for LoRA |
| Epochs | 3 | More epochs for style adaptation |
| Batch size | 1 (micro) × 8 (accumulation) | Effective batch = 8 |
| Packing | True | Efficient token usage |
| Max length | 2048 | Handle longer code |
Output: joshuasundance/mypo-qwen2.5-coder-1.5b-sft
2. DPO (Direct Preference Optimization)
DPO has gone through several iterations. v3 is the current production recipe; v4 is a draft with added observability / early stopping for future runs.
| Script | Status | Output Repo | Notes |
|---|---|---|---|
mypo_dpo_train.py |
Legacy | joshuasundance/mypo-qwen2.5-coder-1.5b-dpo |
First attempt, superseded |
mypo_dpo_train_v2.py |
Failed (no-op) | joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v2 |
LoRA α=16 / r=256 (scale 0.0625) × lr=1e-6 produced infinitesimal weight deltas. Ranking objective satisfied without moving argmax decoding → indistinguishable from base model in characterization. See repo README for failure-mode writeup. |
mypo_dpo_train_v3.py |
Production | joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3 |
Fixes v2: warm-start from SFT (merge before LoRA), α = r = 256 (scale 1.0), lr = 5e-5, β = 0.3, 2 epochs, bf16 full precision, adamw_torch. Published as a fully merged model, not an adapter. Characterized here — matches SFT on quality gates, edges gold chosen at 52.7% preference win-rate (first model to exceed 50% vs gold). |
mypo_dpo_train_v4.py |
Draft (unverified) | joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v4 |
v3 recipe + Trackio dashboard + CodeCarbon→Trackio bridge + held-out eval split + EarlyStoppingCallback + load_best_model_at_end. Quieter CodeCarbon logs. Not yet run; preserved as the next-iteration design. |
Key v3 / v4 hyperparameters:
| Parameter | Value | Rationale |
|---|---|---|
| LoRA rank (r) | 256 | Higher rank needed for code tasks |
| LoRA alpha | 256 | Matched to r → scale = 1.0 (v2's 16 → scale = 0.0625 was too small) |
| Target modules | all-linear | Critical for matching full fine-tuning |
| Learning rate | 5e-5 | 50× v2; LoRA with matched scale needs more lr |
| Beta | 0.3 | Stronger preference margins (v2 used 0.1) |
| Epochs | 2 (cap) | Warm-start + higher scale/lr → converges fast; v4 adds early stopping |
| Warm-start | SFT merged into base | DPO then optimizes beyond SFT, not from scratch |
| Precision | bf16 full | 1.5B fits on A10G 24 GB; avoids 4-bit merge artifacts |
| Optimizer | adamw_torch |
NOT a paged_adamw_* variant (would require bitsandbytes) |
v4-only additions:
report_to=["trackio"]+trackio.init(space_id="joshuasundance/mypo-trackio")→ live dashboardCarbonToTrackioTrainerCallback forwards CodeCarbon cumulative energy/CO₂ to the same dashboard viatracker.flush()+final_emissions_data- Manual
EmissionsTracker(log_level="error", measure_power_secs=30)→ far less log noise - 2% eval split +
eval_steps=50+EarlyStoppingCallback(patience=3)oneval_rewards/margins load_best_model_at_end=True→ ships the best checkpoint, not the last
Improvements Over Previous Attempt
The previous attempt (phi3-mini-4k-qlora-python-code-20k-mypo-4k-rfc-pipe) showed limited improvement. This version addresses:
- Higher LoRA rank: r=256 instead of r=64-128
- All-linear targeting: Instead of just q_proj, v_proj, etc.
- Higher learning rate: 2e-4 for LoRA (was too low)
- Better base model: Qwen2.5-Coder vs Phi-3
- More epochs: 3 instead of 1
- Packing enabled: More efficient training
Research Basis
These hyperparameters are based on:
- "LoRA Without Regret" (Schulman 2025): Higher rank (r=256) + all-linear targeting matches full fine-tuning
- InverseCoder (2407.05700): Code-to-NL training strategies
- MFTCoder (2311.02303): Multi-task learning for code
Running Training
# SFT training
python mypo_sft_train.py
# DPO training (current production recipe)
python mypo_dpo_train_v3.py
# DPO training (v4 draft — adds Trackio + eval/early-stop; unverified)
python mypo_dpo_train_v4.py
Or via Hugging Face Jobs (recommended — A10G-large, 24 GB):
hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py
Monitoring
- v3 and earlier: CodeCarbon emissions tracking (
report_to=["codecarbon"]) →emissions.csvuploaded to the model repo alongside weights. - v4 (draft): Trackio live dashboard at
joshuasundance/mypo-trackio(auto-created on first run). Training metrics and CodeCarbon energy/CO₂ readings on the same time axis via a custom TrainerCallback. CodeCarbon still writesemissions.csvfor the README, but atlog_level="error"so the Jobs log isn't dominated by 15-second energy updates.
Evaluation
Characterization pipeline under eval/ — three separable stages so each can be run on the cheapest appropriate HF Jobs flavor:
| Stage | Script | Flavor | What it does |
|---|---|---|---|
| Generate | eval/mypo_generate.py |
a10g-large |
Loads base / SFT / DPO models, generates responses for a sample of prompts, writes samples.jsonl + metadata.json + emissions.csv to generations/<run-id>/ |
| Analyze | eval/mypo_analyze.py |
cpu-upgrade (per-subject, fan out in parallel) |
Runs ruff / black / mypy --strict / coverage on each subject's outputs, writes per-subject JSON + CSV under analysis/<run-id>/ |
| Report | eval/mypo_report.py |
cpu-basic |
Rolls analyses into reports/<run-id>/{CHARACTERIZATION.md, summary.json, summary.csv} |
The --include-dpo-v3 flag on mypo_generate.py adds the v3 merged model as an additional subject column. Future v4 characterization will use the same flag pattern (--include-dpo-v4) once v4 training is run.
Example outputs
We published a runnable side-by-side demo at examples/reproduce_v3.py and executed it on HF Jobs (69e959a92aa1660eaffa8ca6) to confirm the script really works.
For the prompt Write a function that returns the nth Fibonacci number. the observed behavior was:
- Base already returned a typed iterative solution with a small driver snippet.
- SFT matched the base almost exactly on this prompt.
- DPO-v2 returned a fenced recursive implementation plus a natural-language explanation, and the function argument was still untyped.
- DPO-v3 returned a concise typed recursive implementation (
def fibonacci(n: int) -> Union[int, float]: ...).
That prompt is a useful smoke test, but it is not the main evidence because the base model already behaves fairly well there. The stronger evidence is the 150-prompt batched characterization (base and v2 pass mypy --strict only 6 % of the time, while SFT reaches 92.7 % and v3 reaches 92.0 %, with v3 also achieving the best annotation-slot coverage and the only >50 % win-rate vs gold) combined with the 30-prompt single-prompt validation below (base/v2 0 %, SFT/v3 73.3 % \u2014 same direction, same magnitude, different decoding regime).
If you want to reproduce a specific row from the published eval artifacts, use examples/reproduce_eval_row.py instead. That script replays the target row inside its original batch window from samples.jsonl, which matters because the original generation stage decoded prompts in batches of 8 with left padding. We validated this distinction on row 13: a one-prompt replay did not match the stored sample, but replaying the exact 8-prompt batch did.
Single-prompt validation (n=30)
The published batched characterization decodes with batch_size=8 and padding_side='left'. To confirm the pipeline's headline claim — that training moves the model from "never annotates" to "almost always annotates" — is a real-world effect and not a batching artifact, we re-decoded 30 stratified validation prompts with batch_size=1 and no padding across base / SFT / DPO v2 / DPO v3. Artifacts: single-prompt-validation/single-prompt-2026-04-23T002137Z/.
| metric | base | dpo-v2 | SFT | dpo-v3 |
|---|---|---|---|---|
mypy --strict pass |
0.0 % | 0.0 % | 73.3 % | 73.3 % |
| annotation slot coverage | 0.000 | 0.000 | 0.971 | 0.976 |
black pass |
6.7 % | 6.7 % | 100 % | 96.7 % |
Three takeaways:
- The core claim survives real-world inference. 0 % → 73 %
mypy --strictis the direction and magnitude the batched number implies, not a batching artifact. - The batched n=150 and single-prompt n=30 validations are different measurement regimes. They produce different absolute scores, but we no longer attribute that gap specifically to left-padding or batching as a general causal explanation.
- SFT and v3 are statistically indistinguishable at n=30 single-prompt. v3's clearer advantage over SFT remains the 52.7 % preference win-rate vs gold in the n=150 batched eval (first model in the pipeline to exceed 50 % vs gold).
HumanEval+ external benchmark (n=164)
We ran a full evalplus HumanEval+ benchmark across base / SFT / DPO v2 / DPO v3. This is a stronger out-of-domain code-generation benchmark than the MyPO internal validation slices above.
| subject | pass@1 base tests | pass@1 plus tests |
|---|---|---|
base |
111 / 164 (67.7%) | 98 / 164 (59.8%) |
dpo-v2 |
111 / 164 (67.7%) | 98 / 164 (59.8%) |
sft |
99 / 164 (60.4%) | 90 / 164 (54.9%) |
dpo-v3 |
92 / 164 (56.1%) | 80 / 164 (48.8%) |
Interpretation:
- The MyPO tuning pipeline changes the model's type-hinting behavior on its own prompt distribution, but it does not improve general HumanEval+ correctness.
- DPO v2 is effectively a no-op on HumanEval+ as well as on the earlier MyPO validation slices.
- DPO v3 should not be described as a generally stronger coding model than the Qwen base. Its gains are narrow and in-domain.
Alpaca-stripped validation (n=30)
The training prompts are in Stanford-Alpaca format (### Instruction: / ### Input: / ### Output:). To rule out "the models just learned to respond to Alpaca scaffolding" — rather than learning a prompt-shape-robust coding behavior — we re-ran the 30-prompt single-prompt validation with the scaffold removed: the model sees only the bare instruction, delivered as a normal user turn through Qwen's chat template. Artifacts: alpaca-stripped-validation/alpaca-stripped-2026-04-23T005911Z/. Eval job: 69e96eba2aa1660eaffa8d00.
| Metric | base | DPO v2 | SFT | DPO v3 |
|---|---|---|---|---|
mypy --strict pass (wrapped) |
0.0 % | 0.0 % | 73.3 % | 73.3 % |
mypy --strict pass (stripped) |
3.3 % | 3.3 % | 76.7 % | 73.3 % |
| Annotation slot coverage (stripped) | 0.00 | 0.00 | 0.97 | 0.94 |
The +70 pt base→v3 gap survives scaffold removal essentially intact. This is strong evidence that training taught a transferable coding-behavior change (emit bare Python, annotate parameters, satisfy mypy-strict) rather than a scaffold-specific response pattern.
See also SIDE-BY-SIDE.md for concrete example-by-example output comparisons from this run.
Dataset separability note
Independent inspection (dataset-inspection/2026-04-22/) shows the training dataset's chosen/rejected preference signal is trivially separable by surface features:
- 99.8 % of
chosenrows have a type annotation vs 2.8 % ofrejected - 69.6 % of
chosenopen withfrom typing import …vs 0.0 % ofrejected - 0 % of either column uses markdown fences (so unfencing is learned from imitation of chosen outputs only)
This explains why DPO's rewards/accuracies hit ≈1.0 by epoch 0.3: the preference objective was shallow. Whether the resulting model behavior generalizes is a separate question, and the Alpaca-stripped result above suggests it does at least within this prompt distribution.
Compute & environmental impact
Every training and characterization job in this pipeline runs on Hugging Face Jobs and logs energy + CO₂ with CodeCarbon v3.2.6. Each model repo ships its own emissions.csv; the numbers below are the project-wide totals across all of SFT + DPO v2 + DPO v3 + characterization on AWS us-east-1 (Virginia, USA, PUE 1.0) on a single NVIDIA A10G per job.
| Stage | Flavor | Wall-clock | Energy | CO₂e | Approx cost |
|---|---|---|---|---|---|
| SFT training | a10g-large |
2.32 h (8 340 s) | 0.472 kWh | 0.174 kg | ~$3.50 |
| DPO v2 training (failed no-op) | a10g-large |
3.04 h (10 938 s) | 0.646 kWh | 0.238 kg | ~$4.60 |
| DPO v3 training (production) | a10g-large |
1.67 h (6 005 s) | 0.363 kWh | 0.134 kg | ~$2.50 |
| Characterization — generate (4 subjects × 150 prompts) | a10g-large |
0.26 h (937 s) | 0.052 kWh | 0.019 kg | ~$0.40 |
| Characterization — 6 analysis jobs | cpu-upgrade ×6 parallel |
~3 min each | ~0.01 kWh | ~0.004 kg | <$0.05 |
| Characterization — rollup report | cpu-basic |
<1 min | negligible | negligible | ~$0 |
| Project cumulative | — | ~7.3 h GPU | ~1.55 kWh | ~0.57 kg CO₂e | ~$11 |
Cost estimates are approximate, derived from HF Jobs published per-flavor rates at the time of the runs. Energy and CO₂ are measured, not estimated.
Emissions tracking
We use CodeCarbon v3.2.6 to measure the energy and carbon footprint of every training and evaluation job in this repo. Each training script configures DPOConfig(..., report_to=["codecarbon"]) (TRL then manages the tracker) so that emissions.csv is written alongside model weights and uploaded to the Hub. The v4 draft additionally bridges CodeCarbon into the Trackio dashboard via a custom TrainerCallback so that cumulative energy / CO₂ is visible on the same time axis as loss and reward margins.
If you use artifacts or findings from this repo, please also cite CodeCarbon (see citation below).
License
Apache 2.0 (same as base model)
Citations
This project
@software{mypo_training,
title = {{MyPO Training: Python Type Hint Fine-Tuning}},
author = {Bailey, Joshua Sundance},
year = 2026,
url = {https://huggingface.co/joshuasundance/mypo-training}
}
CodeCarbon (emissions tracking)
@software{codecarbon,
author = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
title = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
year = 2024,
doi = {10.5281/zenodo.11171501},
url = {https://github.com/mlco2/codecarbon}
}
DPO
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
year = 2023
}
TRL
@software{vonwerra2020trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
license = {Apache-2.0},
url = {https://github.com/huggingface/trl},
year = 2020
}