SOTA On-Device STT in 2026 — Verification Report
Built for Bidet AI contest decision. Mark Barnett, 2026-05-10.
TL;DR
- Gemma 4 E2B at ~2× realtime on Pixel 8 Pro CPU is consistent with the math, not slow. Gemma 4's audio encoder emits 25 tokens per second of audio, and CPU decode on a top-tier S26 Ultra is ~46.9 tok/s (Google's own LiteRT-LM number). On Tensor G3 — about half the perf-per-watt of the S26 Ultra's Snapdragon 8 Elite gen 5 — you should expect 20–30 tok/s sustained, which means a 30 s clip (≈750 audio-tokens of prefill plus ~50–200 generated text-tokens) lands right in the 40–70 s band Mark measured. There is no published Gemma 4 E2B audio benchmark on Pixel 8 Pro CPU showing faster, anywhere.
- The straightforward win is GPU backend, not a model swap. S26 GPU decode = 52 tok/s baseline → 91.7 tok/s with multi-token prediction (1.96× speedup), and Mali-G715 prefill on S26 GPU is ~7× faster than CPU. Pixel 8 Pro's Mali-G715 MC11 is the same GPU family. Switching
Backend.CPU→Backend.GPU+ enabling MTP is the highest-leverage single change. - For pure STT (not multimodal reasoning), Whisper-tiny is no longer SOTA — Moonshine Tiny beats it on every metric Mark cares about. 27 M params (vs 39 M), 4.52 % WER LibriSpeech-clean (vs 5.66 %), and 5–8× lower compute cost on edge hardware. License: MIT. Android-on-device support: official Maven Central + sherpa-onnx ports. Drop-in replacement.
- For Bidet AI specifically, Gemma 4 E2B is still the right contest entry because the contest is Gemma 4 Challenge. But for the underlying pure-transcription pipeline (separate from the contest narrative), Mark should know Whisper-tiny is one tier behind Moonshine Tiny and two tiers behind Moonshine Base — and Gemma 4 E2B's transcription accuracy is far behind both (Whisper-large-class accuracy is the upper bound; Gemma 3n 8B was measured at ~13 % WER vs Whisper-large's ~4.4 %).
Q1: Gemma 4 E2B at 2× realtime on Pixel 8 Pro CPU — is this expected?
What Mark measured
- Pixel 8 Pro (Tensor G3, Mali-G715 MC11)
- Gemma 4 E2B + LiteRT-LM 0.11 + Backend.CPU + audio mode
- Chunk 0: 30 s audio → 62 s transcription
- Chunk 1: 32 s audio → 64 s transcription
- Steady state ~2.0× realtime
What Google publishes (litert-community/gemma-4-E2B-it-litert-lm model card)
| Hardware | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT |
|---|---|---|---|---|
| Samsung S26 Ultra (SD 8 Elite Gen 5) | CPU (XNNPACK 4 thr) | 557 | 46.9 | 1.8 s |
| Samsung S26 Ultra | GPU (Adreno) | 3,808 | 52.1 | 0.3 s |
| Samsung S26 Ultra | GPU + MTP (Summarize) | — | 91.7 | — |
| Qualcomm Dragonwing IQ8 | NPU | 3,747 | 31.7 | 0.3 s |
| MacBook Pro M4 Max | GPU | 4,676 | 73.9 | 1.1 s |
No Pixel 8 Pro / Tensor G3 numbers are published anywhere by Google — the published Android numbers are S24 Ultra (older Gemma 3n preview build, slower) and S26 Ultra (current Gemma 4 build, fastest Android available in May 2026).
How that reconciles with Mark's 2× realtime
Audio token math (Gemma 4 audio encoder spec, ai.google.dev/gemma/docs/capabilities/audio): - 16 kHz, 32 ms frames - 25 audio tokens per second of audio - 30 s clip = 750 prefill tokens for the audio portion alone (plus a small instruction prefix)
CPU prefill on S26 Ultra = 557 tok/s → just the prefill of 750 audio-tokens costs ~1.35 s on the fastest Android CPU. Pixel 8 Pro CPU is roughly 0.55–0.65× the throughput of S26 Ultra (Geekbench ST ~1700 vs ~3100 for SD 8 Elite Gen 5; multi-thread ~4500 vs ~9700).
Scaled estimate for Pixel 8 Pro CPU: - Prefill: ~330–370 tok/s → ~2.0–2.3 s for 750 audio-tokens - Decode: ~25–30 tok/s → for a typical transcript of a 30 s clip (~80–150 generated text-tokens), that's another 3–6 s - That alone explains 5–8 s of the 62 s observed, but the real cost is in audio encoder forward pass (Gemma 3n encoder is reported as the bottleneck — Google explicitly shipped a "50% smaller audio encoder" in Gemma 4 to address this), and in non-XNNPACK paths the encoder forward dominates wall time. Empirically observed: ~2× realtime on CPU on Tensor G3 is the expected baseline.
Cross-checks from the field
- Google AI Developers Forum (discuss.ai.google.dev) — only published Gemma 4 E2B + LiteRT-LM 0.10.x on-device test with measured latency was on a Pixel 10a (Tensor G4) GPU backend, structured tool-calling eval (not audio): p50 = 9.3 s, p95 = 26.4 s. With MTP: p50 = 8.8 s, p95 dropped 6.7 s. No CPU audio numbers published. [forum thread]
- GitHub mediapipe-samples #379 — community user reporting "is Gemma on device really this slow?" on Pixel 8 Pro. The thread confirms expectation-mismatch is common; community guidance is "use GPU backend; CPU is slow by design".
- Medium: 'One Model or Two? Whisper vs Gemma 3n Audio Scribe' — measured Gemma 3n 8B audio at ~13.0 % WER vs Whisper-large at ~4.4 % WER. Speed not measured in that piece, but the WER gap by itself argues against Gemma for pure transcription.
- MindStudio Gemma 4 audio encoder writeup — confirms Gemma 4 encoder is ~50 % smaller than Gemma 3n with 40 ms frame duration (vs 160 ms in 3n — note the contradiction with ai.google.dev's "25 tokens/sec" figure; the official docs are the authoritative source). Smaller encoder = faster, but still bottleneck-class on CPU.
Possible inefficiencies in Mark's setup (worth checking)
- Backend.CPU instead of Backend.GPU — single biggest lever. S26 GPU is 7× faster prefill, ~10 % faster decode baseline, 2× faster decode with MTP. Mali-G715 on Pixel 8 Pro should give 2–3× speedup over CPU even without MTP.
- Multi-token prediction (MTP) not enabled — LiteRT-LM 0.11 supports MTP drafters for Gemma 4. Up to 2.0× decode speedup, zero quality loss (Google blog 2026-04). MTP is not on by default in 0.11; needs explicit drafter config.
- Chunk size of 30 s is also the audio encoder's max single-chunk window, so prefill is at its costliest. Smaller chunks (5–10 s) with sliding overlap drop per-chunk prefill to ~125–250 tokens and may give better wall-clock-to-audio ratio in practice.
- Sample rate — Gemma audio expects 16 kHz. If the Pixel mic is delivering 44.1 kHz / 48 kHz and resampling happens in Java, that's pure overhead.
- Batch size = 1 is fine for streaming; not a knob to turn here.
Verdict on Q1
2× realtime on Tensor G3 CPU is the expected baseline, not an inefficiency. No public benchmark contradicts it; the math accounts for it; community reports match it. The available speedup is in the backend choice, not the model: GPU + MTP should land Pixel 8 Pro at ~0.7–1.0× realtime (faster than realtime). Below that floor requires NPU (no public Google delegate yet for Tensor G3) or a smaller/different model.
Q2: Better-than-Whisper-tiny on-device STT — landscape May 2026
Comparison table (verified numbers only)
| Model | Params | Size on disk | License | Android runtime | LibriSpeech clean WER | LibriSpeech other WER | Edge speed (RTF / latency) | Source |
|---|---|---|---|---|---|---|---|---|
| Whisper-tiny.en (OpenAI) | 39 M | ~75 MB (GGML f16) / ~40 MB (TFLite Q8) | MIT | whisper.cpp (NDK), TFLite, sherpa-onnx | 5.66 % | 15.45 % | 2 s for 30 s on Pixel 7 (whisper.tflite); 0.07 RTF on Galaxy S10 (sherpa-onnx) | arXiv 2410.15608 Tbl 3; openai/whisper #506; voiceping benchmark |
| Whisper-base.en | 73 M | ~140 MB GGML | MIT | whisper.cpp, sherpa-onnx | 4.25 % | 10.35 % | ~2× tiny cost | arXiv 2410.15608 Tbl 2 |
| Distil-Whisper distil-small.en | 166 M | ~320 MB | MIT | ONNX, runs on Android via sherpa-onnx | within 4 % WER of Whisper-large-v3 | — | 6× faster than Whisper-large; 5.6× faster than tiny on CPU | huggingface/distil-whisper README |
| Moonshine Tiny (Useful Sensors) | 27.1 M | ~26 MB | MIT | Maven Central (official Android), sherpa-onnx, ONNX | 4.52 % | 11.71 % | 0.7× FLOPs of Whisper-tiny.en; 5× lower compute for 10 s clip; 237 ms on RPi 5; 42.6 tok/s on Galaxy S10 (sherpa-onnx) | arXiv 2410.15608 Tbl 1, 3; voiceping benchmark; moonshine-ai/moonshine README |
| Moonshine Base | 61.5 M | ~58 MB | MIT | same as Tiny | 3.23 % | 8.18 % | 1.6× FLOPs of Whisper-tiny.en (still cheaper than Whisper-base) | arXiv 2410.15608 Tbl 1, 2 |
| Moonshine Tiny Streaming | 34 M | ~34 MB | MIT | Maven, sherpa-onnx | 12.00 % avg (8 datasets) | — | 34 ms latency on MacBook, 237 ms on RPi 5 (the streaming version is the live-mic one) | moonshine-ai README |
| SenseVoice Small (FunASR / Alibaba) | 234 M | ~470 MB | MIT (model), Apache 2 (code) | sherpa-onnx (Android prebuilt APK) | ~3 % (multilingual) | — | 33.6 tok/s on Galaxy S10 (sherpa-onnx); 0.06 RTF | k2-fsa/sherpa-onnx; voiceping benchmark |
| NVIDIA Parakeet TDT 0.6B v2 | 600 M | ~1.2 GB | CC-BY-4.0 | No native Android runtime; NeMo CUDA/CPU | ~6.34 % avg (HF Open ASR) | — | RTFx 3380 server-side (batch 128) | huggingface.co/nvidia/parakeet-tdt-0.6b-v2 |
| Mistral Voxtral-Mini-4B-Realtime | ~3 B (after quant) | ~3 GB quantized | Apache 2.0 | "Runs on smartphone" claim, no Android runtime shipped May 2026 | — | — | "Real-time on smartphone" (Mistral marketing, unverified) | mistral.ai/news/voxtral |
| Microsoft Phi-4-multimodal-instruct | 5.6 B | ~11 GB | MIT | No mobile runtime; ONNX/CUDA only | 6.14 % avg HF Open ASR (#1) | — | Server-class only | microsoft/Phi-4-multimodal-instruct |
| Gemma 4 E2B (audio mode) | ~2 B effective (5 B raw) | ~2.6 GB | Gemma TOU | LiteRT-LM 0.11 (official) | unverified for E2B; Gemma 3n 8B = ~13 % WER | — | ~2× realtime CPU Tensor G3 (Mark's measurement); ~1× realtime expected GPU+MTP | HF model card; ai.google.dev |
| Qualcomm Whisper-Tiny-En (NPU-optimized) | 39 M | ~40 MB | MIT (Whisper) | Qualcomm AI Hub QNN runtime — SD 8 Gen 3+ only, NOT Tensor G3 | matches Whisper-tiny | matches Whisper-tiny | NPU-accelerated on Snapdragon | qualcomm/Whisper-Tiny-En |
What the tools-on-Android benchmark actually shows
VoicePing 2026 offline benchmark (Galaxy S10 / Exynos 9820 / 8 GB / Android 12 — older than Pixel 8 Pro by ~3 generations):
| Model + engine | Inference (1.4 s test clip) | tok/s | RTF |
|---|---|---|---|
| Moonshine Tiny via sherpa-onnx | 1,363 ms | 42.55 | 0.05 |
| SenseVoice Small via sherpa-onnx | 1,725 ms | 33.62 | 0.06 |
| Whisper Tiny via sherpa-onnx | 2,068 ms | 27.08 | 0.07 |
| Whisper Tiny via whisper.cpp | (slow) | — | (~3.6, 51× slower than sherpa-onnx) |
Two big takeaways: - Engine matters more than model. Whisper Tiny on sherpa-onnx (ONNX Runtime Mobile) is 51× faster than the same model on whisper.cpp. Mark's bidet-phone uses whisper.cpp NDK — there is a free 5–10× speedup available on Android by switching the runtime, separate from any model swap. - Moonshine Tiny is unambiguously faster + smaller + more accurate than Whisper-tiny on identical engine + identical Android device.
Specific question: smaller than 100 MB, on-device on Pixel 8 Pro, > 2× realtime, beats Whisper-tiny on WER?
Answer: Yes — Moonshine Tiny. - Size: 26 MB (vs Whisper-tiny 40 MB Q8 / 75 MB f16) - LibriSpeech clean WER: 4.52 % (vs 5.66 %) - LibriSpeech other WER: 11.71 % (vs 15.45 %) - Average across 8 datasets: 12.66 % (vs 12.81 %) - License: MIT - Android: official Maven Central + sherpa-onnx port - Realtime factor on Galaxy S10 (older than Pixel 8 Pro): 0.05 RTF = 20× realtime. On Pixel 8 Pro this should be 25–40× realtime.
Honorable mention: Moonshine Base (58 MB, 3.23 % LibriSpeech-clean WER). Still under 100 MB, beats Whisper-base. The clean-WER number (3.23 %) puts it within 1 percentage point of Whisper-large-v3 turbo (~3 %) at 1/30th the parameter count.
Recommendation for Bidet AI's contest entry
The contest is the Gemma 4 Challenge — Mark is required to use Gemma 4. Within that constraint:
- Don't swap the model. Gemma 4 E2B is the entry; that's the contest's whole point.
- Switch Backend.CPU → Backend.GPU in LiteRT-LM 0.11. Pixel 8 Pro Mali-G715 should give 2–3× end-to-end speedup. Closest published comparison: S26 Ultra GPU does 7× CPU prefill, 1.1× CPU decode baseline, 1.96× CPU decode with MTP. Tensor G3 GPU is weaker but the relative advantage holds.
- Enable multi-token prediction (MTP) drafter. Up to 2× decode speedup, zero quality loss. Per the Google blog (2026-04), this is the explicit "make Gemma 4 fast on Android GPU" lever.
- Drop chunk size from 30 s → 8–10 s with overlap. Cuts per-chunk prefill cost ~3×.
- Validate sample rate is 16 kHz at the recorder layer, not resampled in Java.
After (2)+(3) Pixel 8 Pro should land at ~1× realtime (transcription keeps up with audio). After (4)+(5), comfortably under 1× realtime (faster than realtime), which is the threshold for live captioning use cases.
Outside the contest narrative — what bidet-phone v0.1+ should know for the long run:
The Whisper-tiny path Mark already has on bidet-phone (39 M params, whisper.cpp NDK) is two upgrades behind state of the art for the size/speed niche:
- Engine upgrade: whisper.cpp → sherpa-onnx (ONNX Runtime Mobile) gives 5–50× speedup on the same model, same hardware. Single biggest lever for Mark's existing Whisper flavor of bidet-phone.
- Model upgrade: Whisper-tiny → Moonshine Tiny gives 18 % lower clean-WER, 24 % lower other-WER, 33 % smaller binary, MIT license, official Android Maven. Drop-in for the same use case.
Combined: a Moonshine-Tiny-on-sherpa-onnx Pixel 8 Pro build should hit >30× realtime with better transcripts than today. That's not for the Gemma 4 contest entry — that's for the production pipeline.
Honest scope: Whisper-tiny is no longer SOTA at the small/edge tier — it lost that crown to Moonshine in late 2024 and has stayed lost. For multimodal-LLM-with-audio (single model that transcribes and answers questions about audio), Gemma 4 E2B is genuinely the strongest open-weight on-device option in May 2026. For pure transcription quality at fixed compute budget, Moonshine Base is the new ceiling.
Sources
Q1: Gemma 4 E2B benchmarks
- litert-community/gemma-4-E2B-it-litert-lm — HuggingFace model card — S26 Ultra CPU/GPU/MTP numbers
- google/gemma-3n-E2B-it-litert-lm — HuggingFace model card — older S24 Ultra numbers
- LiteRT-LM repo — google-ai-edge/LiteRT-LM — runtime overview, MTP support
- Gemma audio understanding docs — ai.google.dev — 25 tokens/audio-sec for Gemma 4, 6.25 for 3n, 32 ms frames
- Multi-token prediction blog — blog.google 2026-04 — up to 3× speedup on web; ~2× on mobile GPU
- Gemma 4 E2B + LiteRT-LM 0.10.x on Mali GPU — Google AI Developers Forum — Pixel 10a / Tensor G4 latency p50/p95
- GitHub: 'Is Gemma on device really this slow?' — mediapipe-samples #379 — Pixel 8 Pro community report
- Gemma 4 audio encoder — MindStudio — 50 % smaller encoder, 40 ms frames vs 3n's 160 ms
- Whisper vs Gemma 3n audio scribe — Medium / Ajjay K — Gemma 3n 8B = ~13 % WER vs Whisper-large = 4.4 %
- Welcome Gemma 4 — HuggingFace blog — CoVoST 33.47 (E2B), FLEURS 0.09 (E2B)
Q2: STT landscape
- Moonshine paper — arXiv 2410.15608 | HTML version with WER tables
- Moonshine GitHub — moonshine-ai/moonshine — Maven Central, RTF tables on Mac/Linux/RPi5
- Moonshine on HuggingFace — UsefulSensors/moonshine
- The Decoder: Moonshine 5× faster than Whisper
- VoicePing offline transcription benchmark — 16 models, 4 OSes — sherpa-onnx 51× faster than whisper.cpp on Android
- Best Local STT Models 2026 — onresonant.com
- Moonshine vs Whisper benchmark — modelslab.com 2026
- Best open source STT model in 2026 — Northflank
- On-device Whisper TFLite on Pixel 7 — openai/whisper #506
- Distil-Whisper repo — huggingface/distil-whisper
- sherpa-onnx repo — k2-fsa/sherpa-onnx — Android-prebuilt APKs, Whisper + Moonshine + SenseVoice support
- NVIDIA Parakeet TDT 0.6B v2
- Mistral Voxtral announcement | Voxtral-Mini-4B-Realtime model
- Microsoft Phi-4-multimodal-instruct
- Qualcomm Whisper-Tiny-En NPU build