Mark's Reports

Post-Contest Cleaning LLM — Bidet AI v0.4+ Candidate Survey

Date: 2026-05-10
Task: noisy speech transcript → polished prose, preserving speaker voice + proper nouns
Target hardware: Pixel 8 Pro (Tensor G3, no usable OpenCL, ~12 GB RAM)
Co-resident: Moonshine STT model
Constraint: ≤4B params; Apache 2.0 / MIT / similar permissive license; Unsloth-fine-tunable; working Android deploy path

TL;DR pick

Qwen3-4B-Instruct-2507 for the v0.4 spike. It has the highest IFEval score (83.4) of any sub-5B Apache-2.0 model that fits a ≤3 GB on-disk budget (~2.6 GB at Q4_K_M; Q8 runs ~4.4 GB, so Q5/Q6 is the realistic shipping point), full Unsloth tutorial coverage, and the cleanest llama.cpp Android path of the candidates. Reported throughput for a 4B Q4 model on a Snapdragon 8 Gen 3 is 15–30 t/s; the Tensor G3 will be slower but in the same ballpark, and that is the regime where rewrite latency stays usable a paragraph at a time.
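
For concreteness, here is a minimal sketch of the rewrite call through llama-cpp-python against a local Q4_K_M GGUF. The file name, system prompt, and sampling settings are illustrative assumptions, not the shipped Bidet AI prompt:

```python
# Minimal transcript-cleanup call. Model path, prompt wording, and
# sampling settings are illustrative assumptions, not shipped values.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-4b-instruct-2507-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,      # room for one transcript paragraph plus the rewrite
    n_threads=4,     # tune per device; Tensor G3 has a 1+4+4 core layout
)

SYSTEM = (
    "Rewrite the user's noisy speech transcript as polished prose. "
    "Preserve the speaker's voice and all proper nouns. "
    "Do not add or remove information."
)

def clean(transcript: str) -> str:
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": transcript},
        ],
        temperature=0.3,  # low temperature: this is rewriting, not ideation
        max_tokens=512,
    )
    return out["choices"][0]["message"]["content"]
```

At the 15 t/s low end, a ~150-token cleaned paragraph decodes in about 10 seconds, which is the "paragraph at a time" regime the TL;DR assumes.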

Candidate matrix

Model | Params | License | Disk (Q4_K_M / Q8) | IFEval | MMLU | Unsloth | Android path
Gemma 4 E2B (baseline) | ~2B effective | Gemma TOU (custom, mostly permissive) | ~1.5 / ~2.6 GB | not yet published; E4B beats E2B on reasoning/math/code per Google | n/a | yes | LiteRT-LM, llama.cpp
Gemma 3n E2B | ~2B effective | Gemma TOU | ~1.5 / ~2.6 GB | not in Google's headline table; ARC-E 75.8, HellaSwag 72.2 | n/a | yes | LiteRT preview, llama.cpp
Gemma 3n E4B | ~4B effective | Gemma TOU | ~2.7 / ~4.6 GB | not headline-published; published benchmarks emphasize ARC-E 81.6, BoolQ 81.6, PIQA 81.0, HellaSwag 78.6, Winogrande 71.7 | n/a | yes | LiteRT preview, llama.cpp
Phi-4-mini-instruct | 3.8B | MIT | ~2.4 / ~4.0 GB | 73.78 (HF model card / leaderboards) | 72.6 (5-shot) | yes (Phi-4-mini-reasoning notebook) | llama.cpp ARM64; no first-party LiteRT
Qwen3-4B-Instruct-2507 | 4B | Apache 2.0 | ~2.6 / ~4.4 GB | 83.4 (HF model card) | high; Arena-Hard v2 43.4 | yes (full Qwen3 tutorial) | llama.cpp ARM64, MLC, sherpa-onnx
Qwen2.5-3B-Instruct | 3B | Qwen Research License (custom; not OSI) | ~1.9 / ~3.2 GB | reported as a relative weak spot vs its Qwen2.5 siblings | strong | yes | llama.cpp
Llama 3.2 3B Instruct | 3B | Llama 3.2 Community (custom; usage-restricted) | ~1.9 / ~3.2 GB | 77.4 (Meta-published; "as strong as 8B") | n/a | yes | llama.cpp; Meta ExecuTorch on Android
Llama 3.2 1B Instruct | 1B | Llama 3.2 Community | ~0.7 / ~1.1 GB | 59.5 (Meta) | n/a | yes | same as 3B

Sources cited inline below.
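
The Disk column tracks the usual GGUF rule of thumb: size ≈ params × bits-per-weight / 8, plus some overhead for embeddings and metadata. A quick sanity check in Python; the bits-per-weight figures are approximate community numbers for llama.cpp quants, not exact:

```python
# Rough GGUF size check: size_GB ≈ params_billion * bpw / 8.
# bpw values are approximate community figures for llama.cpp quants.
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5}

def gguf_gb(params_b: float, quant: str) -> float:
    """Approximate GGUF file size in GB for params_b billion parameters."""
    return params_b * BPW[quant] / 8

for model, p in [("Qwen3-4B", 4.0), ("Phi-4-mini", 3.8), ("Llama-3.2-3B", 3.0)]:
    print(f"{model}: {gguf_gb(p, 'Q4_K_M'):.2f} GB @ Q4_K_M, "
          f"{gguf_gb(p, 'Q8_0'):.2f} GB @ Q8_0")
# Qwen3-4B → ~2.4 / ~4.3 GB vs the table's ~2.6 / ~4.4;
# the gap is embedding/metadata overhead.
```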

Per-candidate notes

Gemma 4 E2B (baseline — what the contest forces)

Gemma 3n E4B

Phi-4-mini-instruct (3.8B, MIT)

Qwen3-4B-Instruct-2507 (4B, Apache 2.0) — recommended

Qwen2.5-3B-Instruct (3B)

Llama 3.2 3B Instruct

Llama 3.2 1B Instruct

Recommendation

Spike Qwen3-4B-Instruct-2507 at Q4_K_M via llama.cpp on the Pixel 8 Pro. Reasons in order:

  1. Highest published IFEval (83.4) of any sub-5B Apache-2.0 instruct model.
  2. Apache 2.0 license — no derivative-distribution worries when sharing LoRAs.
  3. Unsloth has a turnkey Qwen3 LoRA → GGUF tutorial, matching the existing Whisper-mark workflow (sketched just after this list).
  4. The llama.cpp ARM64 path is proven; the reported 15–30 t/s on a Snapdragon 8 Gen 3 implies usable interactive latency on the Tensor G3 even at the slow end.
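
For reason 3, the intended flow is roughly the following, assuming Unsloth's standard Qwen3 recipe; the hyperparameters, dataset, and output path are placeholders, not tuned values:

```python
# Sketch of the Unsloth LoRA fine-tune + GGUF export for the spike.
# Hyperparameters, dataset, and output name are placeholders.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA-style training to fit a single consumer GPU
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ... train with TRL's SFTTrainer on (noisy transcript, clean prose) pairs ...

# Export straight to the quant we plan to spike-test on device:
model.save_pretrained_gguf("bidet-qwen3-4b", tokenizer,
                           quantization_method="q4_k_m")
```

The exported GGUF drops straight into the llama.cpp ARM64 build from reason 4.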

Fallback if memory or latency is tight: Llama 3.2 3B Instruct (IFEval 77.4, smaller, but custom license).

Skip: Phi-4-mini (lower IFEval, no first-party Android runtime) and Qwen2.5-3B (restrictive license, weaker IFEval).

Sources