Mark's Reports

Post-Contest Cleaning LLM — Bidet AI v0.4+ Candidate Survey

Date: 2026-05-10
Task: noisy speech transcript → polished prose, preserving speaker voice + proper nouns
Target hardware: Pixel 8 Pro (Tensor G3, no usable OpenCL, ~12 GB RAM)
Co-resident: Moonshine STT model
Constraint: ≤4B params; Apache 2.0 / MIT / similar permissive license; Unsloth-fine-tunable; working Android deploy path

TL;DR pick

Qwen3-4B-Instruct-2507 for the v0.4 spike. It has the highest IFEval score (83.4) of any sub-5B Apache-2.0 model that fits a ≤3 GB on-disk budget (~2.6 GB at Q4_K_M; Q8 runs ~4.4 GB, so Q5/Q6 is the realistic shipping point), full Unsloth tutorial coverage, and the cleanest llama.cpp Android path of the candidates. Reported throughput for a 4B Q4 model on a Snapdragon 8 Gen 3 is 15–30 t/s; the Tensor G3 will be slower but in the same ballpark, and that is the regime where rewrite latency stays usable a paragraph at a time.
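
For concreteness, here is a minimal sketch of the rewrite call through llama-cpp-python against a local Q4_K_M GGUF. The file name, system prompt, and sampling settings are illustrative assumptions, not the shipped Bidet AI prompt:

```python
# Minimal transcript-cleanup call. Model path, prompt wording, and
# sampling settings are illustrative assumptions, not shipped values.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-4b-instruct-2507-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,      # room for one transcript paragraph plus the rewrite
    n_threads=4,     # tune per device; Tensor G3 has a 1+4+4 core layout
)

SYSTEM = (
    "Rewrite the user's noisy speech transcript as polished prose. "
    "Preserve the speaker's voice and all proper nouns. "
    "Do not add or remove information."
)

def clean(transcript: str) -> str:
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": transcript},
        ],
        temperature=0.3,  # low temperature: this is rewriting, not ideation
        max_tokens=512,
    )
    return out["choices"][0]["message"]["content"]
```

At the 15 t/s low end, a ~150-token cleaned paragraph decodes in about 10 seconds, which is the "paragraph at a time" regime the TL;DR assumes.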

Candidate matrix

Model | Params | License | Disk (Q4_K_M / Q8) | IFEval | MMLU | Unsloth | Android path
Gemma 4 E2B (baseline) | ~2B effective | Gemma TOU (custom, mostly permissive) | ~1.5 / ~2.6 GB | not yet published; E4B beats E2B on reasoning/math/code per Google | n/a | yes | LiteRT-LM, llama.cpp
Gemma 3n E2B | ~2B effective | Gemma TOU | ~1.5 / ~2.6 GB | not in Google's headline table; ARC-E 75.8, HellaSwag 72.2 | n/a | yes | LiteRT preview, llama.cpp
Gemma 3n E4B | ~4B effective | Gemma TOU | ~2.7 / ~4.6 GB | not headline-published; published benchmarks emphasize ARC-E 81.6, BoolQ 81.6, PIQA 81.0, HellaSwag 78.6, Winogrande 71.7 | n/a | yes | LiteRT preview, llama.cpp
Phi-4-mini-instruct | 3.8B | MIT | ~2.4 / ~4.0 GB | 73.78 (HF model card / leaderboards) | 72.6 (5-shot) | yes (Phi-4-mini-reasoning notebook) | llama.cpp ARM64; no first-party LiteRT
Qwen3-4B-Instruct-2507 | 4B | Apache 2.0 | ~2.6 / ~4.4 GB | 83.4 (HF model card) | high; Arena-Hard v2 43.4 | yes (full Qwen3 tutorial) | llama.cpp ARM64, MLC, sherpa-onnx
Qwen2.5-3B-Instruct | 3B | Qwen Research License (custom; not OSI) | ~1.9 / ~3.2 GB | reported as a relative weak spot vs its Qwen2.5 siblings | strong | yes | llama.cpp
Llama 3.2 3B Instruct | 3B | Llama 3.2 Community (custom; usage-restricted) | ~1.9 / ~3.2 GB | 77.4 (Meta-published; "as strong as 8B") | n/a | yes | llama.cpp; Meta ExecuTorch on Android
Llama 3.2 1B Instruct | 1B | Llama 3.2 Community | ~0.7 / ~1.1 GB | 59.5 (Meta) | n/a | yes | same as 3B

Sources cited inline below.
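
The Disk column tracks the usual GGUF rule of thumb: size ≈ params × bits-per-weight / 8, plus some overhead for embeddings and metadata. A quick sanity check in Python; the bits-per-weight figures are approximate community numbers for llama.cpp quants, not exact:

```python
# Rough GGUF size check: size_GB ≈ params_billion * bpw / 8.
# bpw values are approximate community figures for llama.cpp quants.
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5}

def gguf_gb(params_b: float, quant: str) -> float:
    """Approximate GGUF file size in GB for params_b billion parameters."""
    return params_b * BPW[quant] / 8

for model, p in [("Qwen3-4B", 4.0), ("Phi-4-mini", 3.8), ("Llama-3.2-3B", 3.0)]:
    print(f"{model}: {gguf_gb(p, 'Q4_K_M'):.2f} GB @ Q4_K_M, "
          f"{gguf_gb(p, 'Q8_0'):.2f} GB @ Q8_0")
# Qwen3-4B → ~2.4 / ~4.3 GB vs the table's ~2.6 / ~4.4;
# the gap is embedding/metadata overhead.
```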

Per-candidate notes

Gemma 4 E2B (baseline — what the contest forces)

Gemma 3n E4B

Phi-4-mini-instruct (3.8B, MIT)

Qwen3-4B-Instruct-2507 (4B, Apache 2.0) — recommended

Qwen2.5-3B-Instruct (3B)

Llama 3.2 3B Instruct

Llama 3.2 1B Instruct

Recommendation

Spike Qwen3-4B-Instruct-2507 at Q4_K_M via llama.cpp on the Pixel 8 Pro. Reasons in order:

  1. Highest published IFEval (83.4) of any sub-5B Apache-2.0 instruct model.
  2. Apache 2.0 license — no derivative-distribution worries when sharing LoRAs.
  3. Unsloth has a turnkey Qwen3 LoRA → GGUF tutorial, matching the existing Whisper-mark workflow (sketched just after this list).
  4. The llama.cpp ARM64 path is proven; the reported 15–30 t/s on a Snapdragon 8 Gen 3 implies usable interactive latency on the Tensor G3 even at the slow end.
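
For reason 3, the intended flow is roughly the following, assuming Unsloth's standard Qwen3 recipe; the hyperparameters, dataset, and output path are placeholders, not tuned values:

```python
# Sketch of the Unsloth LoRA fine-tune + GGUF export for the spike.
# Hyperparameters, dataset, and output name are placeholders.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA-style training to fit a single consumer GPU
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ... train with TRL's SFTTrainer on (noisy transcript, clean prose) pairs ...

# Export straight to the quant we plan to spike-test on device:
model.save_pretrained_gguf("bidet-qwen3-4b", tokenizer,
                           quantization_method="q4_k_m")
```

The exported GGUF drops straight into the llama.cpp ARM64 build from reason 4.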

Fallback if memory or latency is tight: Llama 3.2 3B Instruct (IFEval 77.4, smaller, but custom license).

Skip: Phi-4-mini (lower IFEval, no first-party Android runtime) and Qwen2.5-3B (restrictive license, weaker IFEval).

Sources