Post-Contest Cleaning LLM — Bidet AI v0.4+ Candidate Survey
Date: 2026-05-10
Task: noisy speech transcript → polished prose, preserving speaker voice + proper nouns
Target hardware: Pixel 8 Pro (Tensor G3, no usable OpenCL, ~12 GB RAM)
Co-resident: Moonshine STT model
Constraint: ≤4B params; Apache 2.0 / MIT / similar permissive license; Unsloth-fine-tunable; working Android deploy path
TL;DR pick
Qwen3-4B-Instruct-2507 for the v0.4 spike. Highest IFEval score (83.4) of any sub-5B Apache-2.0 model that fits a roughly 3 GB on-device quant budget (~2.6 GB at Q4_K_M; ~4.4 GB at Q8 is over budget, so Q5/Q6 is the realistic shipping point), full Unsloth tutorial coverage, and the cleanest llama.cpp Android path of the candidates. One source cites 15–30 t/s on a Snapdragon 8 Gen 3 for 4B Q4; Tensor G3 will be slower but in the same order of magnitude, and that is the regime where rewrite latency stays usable for a paragraph at a time (worked numbers below).
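Back-of-envelope latency for that regime; the 15–30 t/s figure is from the cited Android guide, while the paragraph length and the pessimistic Tensor G3 floor are assumptions:

```python
# Rough rewrite-latency estimate for one transcript paragraph.
# Assumptions (not measured): ~150 output tokens per polished paragraph;
# 8 tok/s as a pessimistic Tensor G3 floor below the cited
# Snapdragon 8 Gen 3 range of 15-30 tok/s for a 4B Q4 model.
PARAGRAPH_TOKENS = 150
for toks_per_sec in (8, 15, 30):
    seconds = PARAGRAPH_TOKENS / toks_per_sec
    print(f"{toks_per_sec:>2} tok/s -> {seconds:4.1f} s per paragraph")
# ~5-19 s per paragraph: fine for batch cleanup, too slow for live typing.
```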
Candidate matrix
| Model | Params | License | Disk (Q4_K_M / Q8) | IFEval | MMLU | Unsloth | Android path |
|---|---|---|---|---|---|---|---|
| Gemma 4 E2B (baseline) | ~2B effective | Gemma TOU (custom, mostly-permissive) | ~1.5 / ~2.6 GB | not yet published; E4B beats E2B on reasoning/math/code per Google | n/a | yes | LiteRT-LM, llama.cpp |
| Gemma 3n E2B | ~2B effective | Gemma TOU | ~1.5 / ~2.6 GB | not in Google's headline table; ARC-E 75.8, HellaSwag 72.2 | n/a | yes | LiteRT preview, llama.cpp |
| Gemma 3n E4B | ~4B effective | Gemma TOU | ~2.7 / ~4.6 GB | not headline-published; benchmarks emphasize ARC-E 81.6, BoolQ 81.6, PIQA 81.0, HellaSwag 78.6, Winogrande 71.7 | n/a | yes | LiteRT preview, llama.cpp |
| Phi-4-mini-instruct | 3.8B | MIT | ~2.4 / ~4.0 GB | 73.78 (HF model card / leaderboards) | 72.6 (5-shot) | yes (Phi-4-mini-reasoning notebook) | llama.cpp ARM64; no first-party LiteRT |
| Qwen3-4B-Instruct-2507 | 4B | Apache 2.0 | ~2.6 / ~4.4 GB | 83.4 (HF model card) | high; Arena-Hard v2 43.4 | yes (full Qwen3 tutorial) | llama.cpp ARM64, MLC, sherpa-onnx |
| Qwen2.5-3B-Instruct | 3B | Qwen Research License (not OSI; use-restricted) | ~1.9 / ~3.2 GB | reported as a relative weak spot for Qwen2.5 vs siblings | strong | yes | llama.cpp |
| Llama 3.2 3B Instruct | 3B | Llama 3.2 Community (custom; usage-restricted) | ~1.9 / ~3.2 GB | 77.4 (Meta-published, "as strong as 8B") | n/a | yes | llama.cpp; Meta ExecuTorch on Android |
| Llama 3.2 1B Instruct | 1B | Llama 3.2 Community | ~0.7 / ~1.1 GB | 59.5 (Meta) | n/a | yes | same |
Sources for these figures are cited inline in the notes below and collected under Sources at the end.
Per-candidate notes
Gemma 4 E2B (baseline — what the contest forces)
- Contest-locked, so it ships first regardless. No published IFEval at the time of writing.
- Google positions Gemma 4 E-class as the on-device successor to 3n; LiteRT-LM is the official inference path.
- License is Gemma Terms of Use, not OSI — fine for Mark's use case but worth noting if Bidet ever distributes derivatives.
Gemma 3n E4B
- Mature on-device deployment story (this was the prior Kaggle on-device contest target).
- Reported benchmarks are reasoning-heavy (ARC, HellaSwag, BoolQ); IFEval is not in Google's headline table for the 3n line, which makes it a lateral move from Gemma 4 E2B for the rewrite task: bigger context, similar style.
- Selective parameter activation means E4B's runtime RAM footprint sits near 4 GB even though its Q4 disk size is ~2.7 GB.
Phi-4-mini-instruct (3.8B, MIT)
- Cleanest license of the bunch (MIT).
- IFEval 73.78 is not bad but is a known weak spot — Microsoft's own technical report flags strict-instruction-following as the lowest of Phi-4-mini's scores.
- For the specific "rewrite this transcript, keep proper nouns, keep voice" task, IFEval is the closest available proxy. Phi-4-mini under-performs Qwen3-4B on it.
- The Mixture-of-LoRAs architecture in the technical report belongs to the multimodal variant (Phi-4-Multimodal); the text-only instruct model is a dense decoder, so a standard Unsloth LoRA applies, though the published Unsloth notebook targets Phi-4-mini-reasoning rather than this exact checkpoint.
- No first-party Android runtime (no LiteRT, no ExecuTorch); llama.cpp is the deploy path.
Qwen3-4B-Instruct-2507 (4B, Apache 2.0) — recommended
- IFEval 83.4 — best in class at this size, and IFEval is the single benchmark that most directly proxies "follow the rewrite instructions exactly."
- 2025-07 refresh ("2507") explicitly highlighted "significant improvements in general capabilities, including instruction following" in the Qwen3 technical report.
- Apache 2.0 — no license footgun.
- Unsloth has a dedicated Qwen3 fine-tune tutorial; LoRA → GGUF export is documented, which is the exact pattern Bidet already uses for Whisper-mark.
- Android: llama.cpp ARM64 build, Q4_K_M ~2.6 GB on disk; a cited but unverified 15–30 t/s on Snapdragon 8 Gen 3 (expect the lower end or below on Tensor G3). MLC and sherpa-onnx also have Qwen3 paths.
- Risk: 4B at Q5/Q6 plus Moonshine plus Android system overhead lands in the 4–5 GB working-set zone on a 12 GB phone (back-of-envelope sketch below). Tight but viable; Q4_K_M is the safe shipping target and costs only ~1–2 IFEval points in published ablations.
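A minimal working-set sketch for that risk estimate; every figure except the Q4_K_M/Q8 disk sizes quoted above is an assumption, not a measurement:

```python
# Back-of-envelope working set for Qwen3-4B + Moonshine on a 12 GB phone.
weights_q5_gb   = 3.3  # assumed: Q5/Q6 lands between Q4_K_M (2.6) and Q8 (4.4)
kv_cache_gb     = 0.5  # assumed: few-thousand-token context at 4B scale
moonshine_gb    = 0.3  # assumed: small co-resident STT model
runtime_misc_gb = 0.4  # assumed: llama.cpp buffers + app overhead

total_gb = weights_q5_gb + kv_cache_gb + moonshine_gb + runtime_misc_gb
print(f"estimated working set: {total_gb:.1f} GB")  # ~4.5 GB
# Android reserves several GB of the 12 GB for the system, so a 4-5 GB
# app working set is tight but viable, matching the risk note above.
```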
Qwen2.5-3B-Instruct (3B)
- Smaller and faster than Qwen3-4B, but the 3B checkpoint ships under the Qwen Research License (use-restricted, not OSI), not Apache 2.0. Qwen2.5's own tech report admits IFEval is the relative weakness vs Gemma2-9B / Llama3.1-8B at the 7B tier, and the 3B will be weaker still.
- Strictly a fallback if Qwen3-4B can't fit memory budget.
Llama 3.2 3B Instruct
- IFEval 77.4 is the surprise of the chart — Meta says it matches their 8B on this benchmark. Strong rewrite candidate.
- Meta explicitly positions 3.2 1B/3B for "prompt rewriting, summarization, on-device assistants" — exact use case match.
- Two real frictions: (1) the Llama 3.2 Community License is not OSI (commercial use OK, but with attribution and acceptable-use rules); (2) ExecuTorch is the first-party Android path, which is more involved than dropping a GGUF into llama.cpp.
- Solid #2 pick if a permissive license weren't a stated constraint.
Llama 3.2 1B Instruct
- IFEval 59.5: too weak for the cleanup task. Useful as a router or trivial-rewrite tier, not as the main model (routing sketch below).
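A hypothetical routing heuristic to make the tier idea concrete; the filler list and thresholds are invented for illustration, not tuned:

```python
import re

# Hypothetical two-tier router: short, nearly filler-free paragraphs go to
# the 1B trivial-rewrite tier; anything messy goes to the 4B main model.
FILLERS = re.compile(r"\b(um+|uh+|like|you know|i mean)\b", re.IGNORECASE)

def route(paragraph: str) -> str:
    words = paragraph.split()
    filler_ratio = len(FILLERS.findall(paragraph)) / max(len(words), 1)
    if len(words) < 40 and filler_ratio < 0.05:  # illustrative thresholds
        return "llama-3.2-1b"
    return "qwen3-4b-instruct-2507"

print(route("We shipped the, um, you know, the Moonshine integration."))
# -> qwen3-4b-instruct-2507 (two fillers in nine words)
```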
Recommendation
Spike Qwen3-4B-Instruct-2507 at Q4_K_M via llama.cpp on the Pixel 8 Pro (prompt-spike sketch at the end of this section). Reasons in order:
- Highest published IFEval (83.4) of any sub-5B Apache-2.0 instruct model.
- Apache 2.0 license — no derivative-distribution worries when sharing LoRAs.
- Unsloth has a turnkey Qwen3 LoRA → GGUF tutorial, matching the existing Whisper-mark workflow (export sketch after this list).
- llama.cpp ARM64 path is proven; the cited 15–30 t/s on a Snapdragon 8 Gen 3 points to usable interactive latency on Tensor G3 even at the slow end.
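A minimal sketch of that LoRA → GGUF export path, following the general shape of the Unsloth Qwen3 tutorial; the hyperparameters, output directory, and the elided training step are placeholders, not the tutorial's values:

```python
from unsloth import FastLanguageModel

# Load the 4-bit base for QLoRA-style fine-tuning on transcript pairs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach a standard dense LoRA (placeholder rank/alpha).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ... train on transcript -> polished-prose pairs (e.g. trl's SFTTrainer) ...

# Merge the adapter and emit a Q4_K_M GGUF for the Android llama.cpp build.
model.save_pretrained_gguf("bidet-rewrite-gguf", tokenizer,
                           quantization_method="q4_k_m")
```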
Fallback if memory or latency is tight: Llama 3.2 3B Instruct (IFEval 77.4, smaller, but custom license).
Skip: Phi-4-mini (lower IFEval, no first-party Android runtime), Qwen2.5-3B (license + weaker IFEval).
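For the spike itself, a llama-cpp-python sketch of the rewrite prompt, suitable for iterating on a dev box before the on-device ARM64 build; the model path, system prompt wording, and sampling settings are assumptions:

```python
from llama_cpp import Llama

# Prompt-shape spike for the transcript-cleanup task (desktop llama.cpp).
llm = Llama(model_path="qwen3-4b-instruct-2507-q4_k_m.gguf",
            n_ctx=4096, n_threads=8)

SYSTEM = ("Rewrite the user's speech transcript as polished prose. "
          "Preserve the speaker's voice and every proper noun. "
          "Remove fillers and false starts. Do not add content.")

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "so um we we shipped the Moonshine "
                                    "integration on on the Pixel 8 Pro"},
    ],
    temperature=0.2,  # low temperature: this is rewriting, not generating
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```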
Sources
- Qwen3-4B-Instruct-2507 — https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507
- Qwen3 Technical Report — https://arxiv.org/pdf/2505.09388
- Qwen2.5 Technical Report — https://arxiv.org/pdf/2412.15115
- Phi-4-mini-instruct — https://huggingface.co/microsoft/Phi-4-mini-instruct
- Phi-4-Mini Technical Report — https://arxiv.org/html/2503.01743v1
- Llama 3.2 launch (HF blog) — https://github.com/huggingface/blog/blob/main/llama32.md
- Llama 3.2 3B Instruct evals — https://huggingface.co/datasets/meta-llama/Llama-3.2-3B-Instruct-evals
- Gemma 3n developer guide — https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/
- Gemma 3n E4B model card — https://huggingface.co/google/gemma-3n-E4B-it
- Gemma 4 launch — https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
- Gemma 4 E4B model card — https://huggingface.co/google/gemma-4-E4B-it
- Unsloth model catalog — https://unsloth.ai/docs/get-started/unsloth-model-catalog
- Unsloth Qwen3 fine-tune — https://unsloth.ai/docs/models/tutorials/qwen3-how-to-run-and-fine-tune
- Unsloth Phi-4 reasoning fine-tune — https://unsloth.ai/docs/models/tutorials/phi-4-reasoning-how-to-run-and-fine-tune
- Android local-AI guide (4B Q4 on Snapdragon 8 Gen 3 at 15–30 t/s) — https://dev.to/alichherawalla/how-to-run-local-ai-on-your-android-phone-in-2026-no-cloud-no-account-5cbp
- IFEval leaderboard — https://llm-stats.com/benchmarks/ifeval
- LLMs in Your Pockets (mobile inference study) — https://arxiv.org/html/2410.03613v3