MyPO Training - Python Type Hint Fine-Tuning

This repository contains training scripts for fine-tuning code language models to generate Python code with modern type hints by default.

Goal

Train otherwise-good Python coding LLMs to write code that is by-default compliant with:

  • ruff (linting)
  • black (formatting)
  • mypy --strict (type checking)

Dataset

Source: joshuasundance/mypo-4k-rfc

  • Train: 4,000 examples
  • Validation: 2,361 examples
  • Combined: 6,361 examples (4,000 train + 2,361 validation)

Format: DPO (Direct Preference Optimization)

  • prompt: Coding instruction
  • chosen: Type-hinted Python code (target)
  • rejected: Non-type-hinted Python code (baseline)

Base Model

Qwen/Qwen2.5-Coder-1.5B-Instruct

  • 1.5B parameters
  • Excellent code generation capabilities
  • Apache 2.0 license
  • Already instruction-tuned

Training Approaches

1. SFT (Supervised Fine-Tuning)

Script: mypo_sft_train.py

Converts DPO format to conversational SFT format and trains on the "chosen" (type-hinted) responses.

Key Hyperparameters:

Parameter Value Rationale
LoRA rank (r) 256 Higher rank needed for code tasks
LoRA alpha 16 Standard 1/16 ratio
Target modules all-linear Critical for matching full fine-tuning
Learning rate 2e-4 10x higher for LoRA
Epochs 3 More epochs for style adaptation
Batch size 1 (micro) × 8 (accumulation) Effective batch = 8
Packing True Efficient token usage
Max length 2048 Handle longer code

Output: joshuasundance/mypo-qwen2.5-coder-1.5b-sft

2. DPO (Direct Preference Optimization)

DPO has gone through several iterations. v3 is the current production recipe; v4 is a draft with added observability / early stopping for future runs.

Script Status Output Repo Notes
mypo_dpo_train.py Legacy joshuasundance/mypo-qwen2.5-coder-1.5b-dpo First attempt, superseded
mypo_dpo_train_v2.py Failed (no-op) joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v2 LoRA α=16 / r=256 (scale 0.0625) × lr=1e-6 produced infinitesimal weight deltas. Ranking objective satisfied without moving argmax decoding → indistinguishable from base model in characterization. See repo README for failure-mode writeup.
mypo_dpo_train_v3.py Production joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3 Fixes v2: warm-start from SFT (merge before LoRA), α = r = 256 (scale 1.0), lr = 5e-5, β = 0.3, 2 epochs, bf16 full precision, adamw_torch. Published as a fully merged model, not an adapter. Characterized here — matches SFT on quality gates, edges gold chosen at 52.7% preference win-rate (first model to exceed 50% vs gold).
mypo_dpo_train_v4.py Draft (unverified) joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v4 v3 recipe + Trackio dashboard + CodeCarbon→Trackio bridge + held-out eval split + EarlyStoppingCallback + load_best_model_at_end. Quieter CodeCarbon logs. Not yet run; preserved as the next-iteration design.

Key v3 / v4 hyperparameters:

Parameter Value Rationale
LoRA rank (r) 256 Higher rank needed for code tasks
LoRA alpha 256 Matched to r → scale = 1.0 (v2's 16 → scale = 0.0625 was too small)
Target modules all-linear Critical for matching full fine-tuning
Learning rate 5e-5 50× v2; LoRA with matched scale needs more lr
Beta 0.3 Stronger preference margins (v2 used 0.1)
Epochs 2 (cap) Warm-start + higher scale/lr → converges fast; v4 adds early stopping
Warm-start SFT merged into base DPO then optimizes beyond SFT, not from scratch
Precision bf16 full 1.5B fits on A10G 24 GB; avoids 4-bit merge artifacts
Optimizer adamw_torch NOT a paged_adamw_* variant (would require bitsandbytes)

v4-only additions:

  • report_to=["trackio"] + trackio.init(space_id="joshuasundance/mypo-trackio") → live dashboard
  • CarbonToTrackio TrainerCallback forwards CodeCarbon cumulative energy/CO₂ to the same dashboard via tracker.flush() + final_emissions_data
  • Manual EmissionsTracker(log_level="error", measure_power_secs=30) → far less log noise
  • 2% eval split + eval_steps=50 + EarlyStoppingCallback(patience=3) on eval_rewards/margins
  • load_best_model_at_end=True → ships the best checkpoint, not the last

Improvements Over Previous Attempt

The previous attempt (phi3-mini-4k-qlora-python-code-20k-mypo-4k-rfc-pipe) showed limited improvement. This version addresses:

  1. Higher LoRA rank: r=256 instead of r=64-128
  2. All-linear targeting: Instead of just q_proj, v_proj, etc.
  3. Higher learning rate: 2e-4 for LoRA (was too low)
  4. Better base model: Qwen2.5-Coder vs Phi-3
  5. More epochs: 3 instead of 1
  6. Packing enabled: More efficient training

Research Basis

These hyperparameters are based on:

  • "LoRA Without Regret" (Schulman 2025): Higher rank (r=256) + all-linear targeting matches full fine-tuning
  • InverseCoder (2407.05700): Code-to-NL training strategies
  • MFTCoder (2311.02303): Multi-task learning for code

Running Training

# SFT training
python mypo_sft_train.py

# DPO training (current production recipe)
python mypo_dpo_train_v3.py

# DPO training (v4 draft — adds Trackio + eval/early-stop; unverified)
python mypo_dpo_train_v4.py

Or via Hugging Face Jobs (recommended — A10G-large, 24 GB):

hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
  https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py

Monitoring

  • v3 and earlier: CodeCarbon emissions tracking (report_to=["codecarbon"]) → emissions.csv uploaded to the model repo alongside weights.
  • v4 (draft): Trackio live dashboard at joshuasundance/mypo-trackio (auto-created on first run). Training metrics and CodeCarbon energy/CO₂ readings on the same time axis via a custom TrainerCallback. CodeCarbon still writes emissions.csv for the README, but at log_level="error" so the Jobs log isn't dominated by 15-second energy updates.

Evaluation

Characterization pipeline under eval/ — three separable stages so each can be run on the cheapest appropriate HF Jobs flavor:

Stage Script Flavor What it does
Generate eval/mypo_generate.py a10g-large Loads base / SFT / DPO models, generates responses for a sample of prompts, writes samples.jsonl + metadata.json + emissions.csv to generations/<run-id>/
Analyze eval/mypo_analyze.py cpu-upgrade (per-subject, fan out in parallel) Runs ruff / black / mypy --strict / coverage on each subject's outputs, writes per-subject JSON + CSV under analysis/<run-id>/
Report eval/mypo_report.py cpu-basic Rolls analyses into reports/<run-id>/{CHARACTERIZATION.md, summary.json, summary.csv}

The --include-dpo-v3 flag on mypo_generate.py adds the v3 merged model as an additional subject column. Future v4 characterization will use the same flag pattern (--include-dpo-v4) once v4 training is run.

Example outputs

We published a runnable side-by-side demo at examples/reproduce_v3.py and executed it on HF Jobs (69e959a92aa1660eaffa8ca6) to confirm the script really works.

For the prompt Write a function that returns the nth Fibonacci number. the observed behavior was:

  • Base already returned a typed iterative solution with a small driver snippet.
  • SFT matched the base almost exactly on this prompt.
  • DPO-v2 returned a fenced recursive implementation plus a natural-language explanation, and the function argument was still untyped.
  • DPO-v3 returned a concise typed recursive implementation (def fibonacci(n: int) -> Union[int, float]: ...).

That prompt is a useful smoke test, but it is not the main evidence because the base model already behaves fairly well there. The stronger evidence is the 150-prompt batched characterization (base and v2 pass mypy --strict only 6 % of the time, while SFT reaches 92.7 % and v3 reaches 92.0 %, with v3 also achieving the best annotation-slot coverage and the only >50 % win-rate vs gold) combined with the 30-prompt single-prompt validation below (base/v2 0 %, SFT/v3 73.3 % \u2014 same direction, same magnitude, different decoding regime).

If you want to reproduce a specific row from the published eval artifacts, use examples/reproduce_eval_row.py instead. That script replays the target row inside its original batch window from samples.jsonl, which matters because the original generation stage decoded prompts in batches of 8 with left padding. We validated this distinction on row 13: a one-prompt replay did not match the stored sample, but replaying the exact 8-prompt batch did.

Single-prompt validation (n=30)

The published batched characterization decodes with batch_size=8 and padding_side='left'. To confirm the pipeline's headline claim — that training moves the model from "never annotates" to "almost always annotates" — is a real-world effect and not a batching artifact, we re-decoded 30 stratified validation prompts with batch_size=1 and no padding across base / SFT / DPO v2 / DPO v3. Artifacts: single-prompt-validation/single-prompt-2026-04-23T002137Z/.

metric base dpo-v2 SFT dpo-v3
mypy --strict pass 0.0 % 0.0 % 73.3 % 73.3 %
annotation slot coverage 0.000 0.000 0.971 0.976
black pass 6.7 % 6.7 % 100 % 96.7 %

Three takeaways:

  1. The core claim survives real-world inference. 0 % → 73 % mypy --strict is the direction and magnitude the batched number implies, not a batching artifact.
  2. The batched n=150 and single-prompt n=30 validations are different measurement regimes. They produce different absolute scores, but we no longer attribute that gap specifically to left-padding or batching as a general causal explanation.
  3. SFT and v3 are statistically indistinguishable at n=30 single-prompt. v3's clearer advantage over SFT remains the 52.7 % preference win-rate vs gold in the n=150 batched eval (first model in the pipeline to exceed 50 % vs gold).

HumanEval+ external benchmark (n=164)

We ran a full evalplus HumanEval+ benchmark across base / SFT / DPO v2 / DPO v3. This is a stronger out-of-domain code-generation benchmark than the MyPO internal validation slices above.

subject pass@1 base tests pass@1 plus tests
base 111 / 164 (67.7%) 98 / 164 (59.8%)
dpo-v2 111 / 164 (67.7%) 98 / 164 (59.8%)
sft 99 / 164 (60.4%) 90 / 164 (54.9%)
dpo-v3 92 / 164 (56.1%) 80 / 164 (48.8%)

Interpretation:

  • The MyPO tuning pipeline changes the model's type-hinting behavior on its own prompt distribution, but it does not improve general HumanEval+ correctness.
  • DPO v2 is effectively a no-op on HumanEval+ as well as on the earlier MyPO validation slices.
  • DPO v3 should not be described as a generally stronger coding model than the Qwen base. Its gains are narrow and in-domain.

Alpaca-stripped validation (n=30)

The training prompts are in Stanford-Alpaca format (### Instruction: / ### Input: / ### Output:). To rule out "the models just learned to respond to Alpaca scaffolding" — rather than learning a prompt-shape-robust coding behavior — we re-ran the 30-prompt single-prompt validation with the scaffold removed: the model sees only the bare instruction, delivered as a normal user turn through Qwen's chat template. Artifacts: alpaca-stripped-validation/alpaca-stripped-2026-04-23T005911Z/. Eval job: 69e96eba2aa1660eaffa8d00.

Metric base DPO v2 SFT DPO v3
mypy --strict pass (wrapped) 0.0 % 0.0 % 73.3 % 73.3 %
mypy --strict pass (stripped) 3.3 % 3.3 % 76.7 % 73.3 %
Annotation slot coverage (stripped) 0.00 0.00 0.97 0.94

The +70 pt base→v3 gap survives scaffold removal essentially intact. This is strong evidence that training taught a transferable coding-behavior change (emit bare Python, annotate parameters, satisfy mypy-strict) rather than a scaffold-specific response pattern.

See also SIDE-BY-SIDE.md for concrete example-by-example output comparisons from this run.

Dataset separability note

Independent inspection (dataset-inspection/2026-04-22/) shows the training dataset's chosen/rejected preference signal is trivially separable by surface features:

  • 99.8 % of chosen rows have a type annotation vs 2.8 % of rejected
  • 69.6 % of chosen open with from typing import … vs 0.0 % of rejected
  • 0 % of either column uses markdown fences (so unfencing is learned from imitation of chosen outputs only)

This explains why DPO's rewards/accuracies hit ≈1.0 by epoch 0.3: the preference objective was shallow. Whether the resulting model behavior generalizes is a separate question, and the Alpaca-stripped result above suggests it does at least within this prompt distribution.

Compute & environmental impact

Every training and characterization job in this pipeline runs on Hugging Face Jobs and logs energy + CO₂ with CodeCarbon v3.2.6. Each model repo ships its own emissions.csv; the numbers below are the project-wide totals across all of SFT + DPO v2 + DPO v3 + characterization on AWS us-east-1 (Virginia, USA, PUE 1.0) on a single NVIDIA A10G per job.

Stage Flavor Wall-clock Energy CO₂e Approx cost
SFT training a10g-large 2.32 h (8 340 s) 0.472 kWh 0.174 kg ~$3.50
DPO v2 training (failed no-op) a10g-large 3.04 h (10 938 s) 0.646 kWh 0.238 kg ~$4.60
DPO v3 training (production) a10g-large 1.67 h (6 005 s) 0.363 kWh 0.134 kg ~$2.50
Characterization — generate (4 subjects × 150 prompts) a10g-large 0.26 h (937 s) 0.052 kWh 0.019 kg ~$0.40
Characterization — 6 analysis jobs cpu-upgrade ×6 parallel ~3 min each ~0.01 kWh ~0.004 kg <$0.05
Characterization — rollup report cpu-basic <1 min negligible negligible ~$0
Project cumulative ~7.3 h GPU ~1.55 kWh ~0.57 kg CO₂e ~$11

Cost estimates are approximate, derived from HF Jobs published per-flavor rates at the time of the runs. Energy and CO₂ are measured, not estimated.

Emissions tracking

We use CodeCarbon v3.2.6 to measure the energy and carbon footprint of every training and evaluation job in this repo. Each training script configures DPOConfig(..., report_to=["codecarbon"]) (TRL then manages the tracker) so that emissions.csv is written alongside model weights and uploaded to the Hub. The v4 draft additionally bridges CodeCarbon into the Trackio dashboard via a custom TrainerCallback so that cumulative energy / CO₂ is visible on the same time axis as loss and reward margins.

If you use artifacts or findings from this repo, please also cite CodeCarbon (see citation below).

License

Apache 2.0 (same as base model)

Citations

This project

@software{mypo_training,
  title   = {{MyPO Training: Python Type Hint Fine-Tuning}},
  author  = {Bailey, Joshua Sundance},
  year    = 2026,
  url     = {https://huggingface.co/joshuasundance/mypo-training}
}

CodeCarbon (emissions tracking)

@software{codecarbon,
  author  = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
  title   = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
  year    = 2024,
  doi     = {10.5281/zenodo.11171501},
  url     = {https://github.com/mlco2/codecarbon}
}

DPO

@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
  booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
  year      = 2023
}

TRL

@software{vonwerra2020trl,
  title   = {{TRL: Transformer Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = 2020
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train joshuasundance/mypo-training

Collection including joshuasundance/mypo-training