Take a brain dump. AI cleans up your mess.

Building Bidet AI: a 100% on-device Android brain-dump cleaner with Gemma 4 — the human story, the technical guts, and what surprised us


I'm a middle-school teacher. Twenty-five years. I have adult ADD. The hardest part of my job has always been writing report card comments to parents — six hours of overthinking until I'd fall back on "your child is a joy to teach." Then I figured out I could just talk into an AI, and everything changed.

This is the story of building Bidet AI for the Kaggle Gemma 4 Good Hackathon — a 100% on-device Android app that takes voice brain-dumps and cleans them with Gemma 4 E4B running on the phone's CPU. No cloud. No upload. The audio and the transcript never leave the device.

But this post isn't the contest pitch. It's the developer log — what worked, what didn't, what surprised us, and what we learned about shipping Gemma 4 on a phone in May 2026.

The tagline tells you everything

Take a brain dump. AI cleans up your mess.

The product mechanic is one sentence. You hit Record. You talk for up to 45 minutes — scattered, repeated, stuttering, however your brain comes out. Audio is captured in 30-second chunks and transcribed on-device by sherpa-onnx Moonshine. As each chunk lands, the raw text appears live. Stop when you're done. Then you choose:

Clean for me — Gemma 4 rewrites your dump into the format your brain reads best. For me that's tight bullets, grouped by topic. It's my ADD output, designed for how I learn.

Clean for others — same brain dump, different audience. Email. Report card. Class notes for a kid who missed yesterday's lecture. Or context for the next AI agent in the chain.

Both outputs are produced 100% on-device. Nothing crosses the network.

That's the surface. Now let me tell you what's underneath, because the surface is misleadingly simple.


The architecture (and the version history that taught us each piece)

sherpa-onnx Moonshine for STT

We started with whisper.cpp + Whisper-tiny. It worked, but ONNX-runtime symbol collisions with LiteRT-LM kept biting us. We switched to sherpa-onnx 1.13.1 with the static-link AAR (sherpa-onnx-static-link-onnxruntime-1.13.1.aar) so the statically linked onnxruntime symbols don't collide with the ones LiteRT-LM bundles. We picked Moonshine-Tiny because it's the smallest model sherpa-onnx supports and the Tensor G3 CPU has to share cycles with Gemma cleaning.

Gotcha #1: Moonshine's quantized graph has a hard ~9-second input cap. Audio chunks must be sub-chunked to ≤8 seconds with 0.5s overlap or the encoder silently truncates. We learned this by recording a 30-second session that came back with the first 9 seconds transcribed and nothing else. Sub-chunking costs ~30ms; we might never have caught the truncation if the missing output hadn't been so obvious.
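
Here's a minimal sketch of that sub-chunking step, assuming 16 kHz mono PCM floats; the function name and constants are illustrative, not Bidet's actual code.

```
// Split a 30-second PCM chunk into ≤8 s windows with 0.5 s overlap so each window
// stays under Moonshine's ~9 s input cap. Assumes 16 kHz mono float samples.
fun subChunk(
    samples: FloatArray,
    sampleRate: Int = 16_000,
    maxSeconds: Float = 8.0f,
    overlapSeconds: Float = 0.5f,
): List<FloatArray> {
    val window = (maxSeconds * sampleRate).toInt()
    val stride = ((maxSeconds - overlapSeconds) * sampleRate).toInt()
    if (samples.size <= window) return listOf(samples)

    val pieces = mutableListOf<FloatArray>()
    var start = 0
    while (start < samples.size) {
        val end = minOf(start + window, samples.size)
        pieces += samples.copyOfRange(start, end)
        if (end == samples.size) break
        start += stride          // 7.5 s stride = 0.5 s of overlap between windows
    }
    return pieces
}
```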

Gotcha #2: Moonshine-Tiny hallucinates aggressively during silence and breath gaps. Specific things we see:

• Music-note runs (♪♪♪♪♪♪) when the audio is quiet — Moonshine was trained on YouTube

• Thai-script or CJK trailers when it ran out of English to predict

• The literal phrase "I'm just going to go to the bathroom" inserted during silence (a YouTube-corpus ghost)

• Repeat-token loops: "card card card card…" ×27, "well well well…" ×50, "uh uh uh…" ×60

We built a post-Moonshine sanitizer (~80 lines of regex) that strips these BEFORE the text lands in the DB. Mishears that need cross-context to fix (e.g. "Pixar" for "fix" or "Bidet AI" → "the day AI") stay in — the cleaning model picks those up.
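
To make the idea concrete, here's a hedged slice of what such a sanitizer can look like; the regexes below are illustrative examples, not the exact ~80-line production set.

```
// Strip music-note runs, non-Latin trailers, and repeat-token loops before the
// transcript is persisted. Patterns here are examples only.
private val musicNotes = Regex("[♪♩♫♬]+")
private val nonLatinTrailer =
    Regex("[\\p{IsThai}\\p{IsHan}\\p{IsHiragana}\\p{IsKatakana}]+\\s*$")
private val repeatLoop = Regex("\\b(\\w+)(?:\\s+\\1\\b){3,}", RegexOption.IGNORE_CASE)

fun sanitize(raw: String): String =
    raw.replace(musicNotes, " ")
        .replace(nonLatinTrailer, "")
        .replace(repeatLoop) { it.groupValues[1] }  // "card card card…" -> "card"
        .replace(Regex("\\s{2,}"), " ")
        .trim()
```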

Gemma 4 E4B for cleaning, via LiteRT-LM

The cleaning runs through Google's LiteRT-LM Kotlin SDK on the Pixel 8 Pro Tensor G3 CPU. The model is litert-community/gemma-4-E4B-it-litert-lm from HuggingFace — 2.4GB on disk.

Gotcha #3: Backend.CPU is mandatory on Tensor G3. We tried Backend.GPU (silently hangs — known LiteRT-LM Issue #1860 on Tensor G3) and Backend.NPU (returns NOT_FOUND). CPU is the only path that works. Our engine init now tries NPU first, falls back to CPU on any non-fatal throwable.
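
A sketch of that fallback, paraphrasing the backend names above; the Engine/EngineConfig construction is an assumption about the LiteRT-LM Kotlin API, not verified code.

```
import android.util.Log

// Try NPU first (for devices where it exists), fall back to CPU on any non-fatal
// throwable. GPU is skipped entirely because it hangs on Tensor G3.
// Engine, EngineConfig and Backend are assumed names for the SDK types above.
suspend fun createEngine(modelPath: String): Engine {
    val preferred = listOf(Backend.NPU, Backend.CPU)
    for (backend in preferred) {
        try {
            return Engine.create(
                EngineConfig(
                    modelPath = modelPath,
                    backend = backend,
                    maxNumTokens = 2048,
                )
            )
        } catch (t: Throwable) {
            if (t is Error) throw t   // don't swallow fatal JNI errors
            Log.w("BidetEngine", "Backend $backend unavailable, trying next", t)
        }
    }
    error("No usable LiteRT-LM backend on this device")
}
```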

Gotcha #4: Concurrent gemma access is a native SIGSEGV. This one cost us about three hours today. Our pre-cleaning worker (more on this below) runs gemma.runInference in the background while the user might tap Clean and trigger another runInference. Two callers on the same engine = null pointer dereference inside liblitertlm_jni.so at offset 0x4c9060. The fix: a kotlinx.coroutines.sync.Mutex around the full runInferenceStreaming body (which spans the suspend, so the second caller actually waits for the first's onDone/onError before its conversation-create runs).

If you take one thing from this post: LiteRT-LM is not thread-safe. Serialize every call.
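
Here's roughly what our serialization looks like; the streaming-call signature is a placeholder rather than the verified SDK API, but the Mutex-across-the-suspend shape is the point.

```
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// One Mutex guards every call into the shared engine. The lock is held across the
// whole streaming inference, so a second caller suspends until onDone/onError fires.
// The runInferenceStreaming signature below is assumed, not the real SDK one.
class SerializedGemma(private val engine: Engine) {
    private val inferenceMutex = Mutex()

    suspend fun cleanText(prompt: String, onToken: (String) -> Unit = {}): String =
        inferenceMutex.withLock {                   // second caller waits here
            val result = CompletableDeferred<String>()
            engine.runInferenceStreaming(
                prompt = prompt,
                onToken = onToken,
                onDone = { full -> result.complete(full) },
                onError = { e -> result.completeExceptionally(e) },
            )
            result.await()                          // lock held until done or error
        }
}
```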

Gotcha #5: Engine context budget vs decode speed is a real tradeoff on memory-bandwidth-bound mobile CPUs. We tried bumping EngineConfig.maxNumTokens from 2048 to 8192 so longer brain dumps could clean single-shot. Result: per-token decode speed dropped from ~5 tk/s to ~2 tk/s on E4B/Tensor G3. The larger KV cache eats so much memory bandwidth that the decode slowdown costs more time than the chunking overhead it was supposed to save. We reverted to 2048 with sentence-boundary chunking instead.

If you've only used Gemma in the cloud, this is the lesson: on mobile, you optimize for decode speed, not context length.
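
A minimal sketch of the sentence-boundary chunking, with a rough character cap standing in for the 2048-token budget; the chars-per-token ratio is an assumption, not a measured constant.

```
// Pack whole sentences into pieces that fit the context budget, splitting only
// at sentence boundaries so Gemma never sees a half-finished thought.
fun chunkBySentence(text: String, maxChars: Int = 4_000): List<String> {
    val sentences = text.split(Regex("(?<=[.!?])\\s+"))
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (sentence in sentences) {
        if (current.isNotEmpty() && current.length + sentence.length + 1 > maxChars) {
            chunks += current.toString().trim()
            current.clear()
        }
        current.append(sentence).append(' ')
    }
    if (current.isNotBlank()) chunks += current.toString().trim()
    return chunks
}
```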

Per-chunk pre-cleaning during recording

This is the architectural decision I'm proudest of. Naively, you record → stop → clean. On Tensor G3 CPU, cleaning a 10-minute dump takes 5-10 minutes wall-clock. That's terrible UX.

So we do pre-cleaning: as soon as sherpa-onnx transcribes a 30-second audio chunk, we immediately enqueue that chunk's text for Gemma cleaning in the background. By the time the user stops recording, most of the cleaning is already done. Tap → results in seconds, not minutes.

The implementation is a ChunkCleaner class that owns its own SupervisorJob scope (so it outlives the recording service), reads from a Channel, and writes per-chunk results to sessions//cleanings/_receptive.txt. When the user taps the Clean tab, we check if all chunks have files on disk — if yes, stitch in <100ms; if no, fall back to the on-tap path with no regression.
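
The shape of that worker, sketched below; the file naming, types, and the call into the mutex-guarded wrapper from the earlier sketch are illustrative rather than the production class.

```
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch
import java.io.File

// Owns its own SupervisorJob scope so it outlives the recording service, drains a
// Channel of transcribed chunks, and writes one cleaned file per chunk.
class ChunkCleaner(
    private val gemma: SerializedGemma,     // the mutex-guarded wrapper sketched earlier
    private val cleaningsDir: File,
) {
    data class ChunkJob(val index: Int, val rawText: String)

    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    private val queue = Channel<ChunkJob>(capacity = Channel.UNLIMITED)

    init {
        scope.launch {
            for (job in queue) {                              // drains as chunks arrive
                val cleaned = gemma.cleanText(job.rawText)    // serialized by the mutex
                File(cleaningsDir, "${job.index}_receptive.txt").writeText(cleaned)
            }
        }
    }

    fun enqueue(index: Int, rawText: String) {
        queue.trySend(ChunkJob(index, rawText))
    }
}
```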

This pattern generalizes beyond Bidet. If your app does any expensive transform on a stream of inputs and lets the user request the output later, amortize the cost during the stream, not after. The user's wait is whatever's left in the queue when they tap, not the whole queue.

Foreground service or your process dies

Android 14+ will kill any process at "cached" priority when the screen blanks. A 10-minute Gemma decode does not survive that. Our CleanGenerationService is a FOREGROUND_SERVICE_DATA_SYNC foreground service that holds the process at fgs priority for the duration of the inference. The user sees a notification while it runs.

We initially missed this for the History-screen Clean path (we had it for the live-recording path) and burned a recording when the screen blanked mid-decode. The fix was routing the History path through the same CleanGenerationService instead of running gemma in the ViewModel's scope.

If you're shipping on-device inference on Android: every long-running gemma call must be inside a foreground service. No exceptions.
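
A skeleton of the pattern; the notification plumbing is trimmed and the names mirror the description above rather than the shipping code.

```
import android.app.Notification
import android.app.Service
import android.content.Intent
import android.content.pm.ServiceInfo
import android.os.IBinder
import androidx.core.app.NotificationCompat

class CleanGenerationService : Service() {

    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        // Requires the FOREGROUND_SERVICE_DATA_SYNC permission and
        // android:foregroundServiceType="dataSync" on this service in the manifest.
        startForeground(
            NOTIFICATION_ID,
            cleaningNotification(),
            ServiceInfo.FOREGROUND_SERVICE_TYPE_DATA_SYNC,
        )
        // Kick off the Gemma decode here; call stopSelf(startId) when it finishes.
        return START_NOT_STICKY
    }

    override fun onBind(intent: Intent?): IBinder? = null

    private fun cleaningNotification(): Notification =
        NotificationCompat.Builder(this, "cleaning")  // channel created in Application.onCreate
            .setContentTitle("Cleaning your brain dump")
            .setSmallIcon(android.R.drawable.stat_notify_sync)
            .setOngoing(true)
            .build()

    private companion object { const val NOTIFICATION_ID = 42 }
}
```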


The Unsloth fine-tune (the contest's bonus track)

The Kaggle Gemma 4 contest has a special "Unsloth" prize for fine-tuning a Gemma model on a custom corpus. Bidet's bonus deliverable is a LoRA fine-tune of Gemma 4 E4B on my own brain-dump corpus — a few months of (raw, cleaned) pairs from my actual dumps.

The cleaned labels weren't hand-curated. We auto-distilled them by feeding each raw dump to Gemini 2.5 Pro with a fidelity-first prompt:

Light cleanup. Preserve every proper noun verbatim. Don't invent. Keep speaker voice. Group adjacent thoughts. Drop fillers. ≤30% length reduction.

About 80 (raw, cleaned) pairs come out the other side. One epoch on a free Kaggle T4×2. The fine-tuned model speaks in my voice when it cleans, not in generic-LLM voice.

The pattern generalizes: any user with 50 voice captures and a fidelity-first cleaning prompt can have a small on-device model that cleans in their own voice. The corpus + the prompt + the small-model choice are the three knobs. You don't need hand-labeled data; you just need a frontier model to do the labeling.


Beyond me — for the kids

The wider lens is the part I care about most, and it's also the part that surprised me when it landed.

I teach kids with learning disabilities. Not severe ones — the kind where, with organization and repetition, you can compensate. That's what I've done my whole life without realizing it.

But here's the thing I keep hitting in class. I know this kid knows the material. I teach history. They can tell me the story. They can sit there and walk me through Nixon from Peace with Honor to Watergate. They just can't get it typed out of their fingers or formatted into the five-paragraph structure the rubric asks for.

Imagine if I could give them this:

Tell me the story of Nixon from Peace with Honor to Watergate, and let him go.

Let him get scattered. Let him remember a detail and throw it in. Let him talk for twelve minutes the way I talk for twelve minutes. Then run it through Clean for others with a "Cornell notes" prompt, and I can read his actual understanding instead of guessing at it from a half-page of stilted sentences.

That kid knows the concepts. He's in the concepts. He may not be able to write it. But like me, maybe he can tell me the story. Bidet AI is the bridge between what a brain knows and what a page can show.


What I'd do differently if I started over

A few things that would have saved me time, in case you're shipping your own on-device LLM app:

1. Concurrency review before code. Before any class that touches a shared model, list every caller that could hit it. Identify the mutexes you need. We didn't do this for the pre-cleaner and burned three hours debugging the resulting SIGSEGV.

2. Stable debug keystore in CI. Every CI build signed with a fresh keystore forces a full uninstall + reinstall, and that takes 10 minutes (including the 2.4GB model re-push). One-time setup: commit a stable debug.keystore to the repo and use it in CI. Installs drop from 10 minutes to 90 seconds.

3. JVM-side integration tests with a fake Gemma engine. Unit tests don't catch concurrency bugs. A fake engine that simulates "this call takes 10 seconds to return" would have surfaced our mutex bug without any phone install (see the sketch after this list).

4. Pre-record the contest demo. For a contest video, don't depend on a one-shot live recording having clean transcription. Do 5-10 short rehearsed takes, pick the one Moonshine handled cleanest. Engineering effort: zero.
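
Here's the kind of fake engine item 3 describes, as a hedged sketch: it just delays and counts overlapping calls, so a plain JVM test can catch a missing mutex without touching a device. Names and signatures are illustrative.

```
import java.util.concurrent.atomic.AtomicInteger
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlinx.coroutines.async
import kotlinx.coroutines.delay
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
import kotlinx.coroutines.test.runTest

// Fake engine: no model, just a long delay plus a counter of overlapping calls.
class FakeGemmaEngine {
    private val inFlight = AtomicInteger(0)
    var maxObservedConcurrency = 0
        private set

    suspend fun runInference(prompt: String): String {
        val now = inFlight.incrementAndGet()
        maxObservedConcurrency = maxOf(maxObservedConcurrency, now)
        delay(10_000)                                 // "takes 10 seconds to return"
        inFlight.decrementAndGet()
        return "cleaned: $prompt"
    }
}

class InferenceSerializationTest {
    private val mutex = Mutex()

    @Test
    fun overlappingCallsAreSerialized() = runTest {   // virtual time: finishes instantly
        val fake = FakeGemmaEngine()
        val preClean = async { mutex.withLock { fake.runInference("chunk 7") } }
        val userTap = async { mutex.withLock { fake.runInference("user tapped Clean") } }
        preClean.await(); userTap.await()
        assertEquals(1, fake.maxObservedConcurrency)  // would be 2 without the mutex
    }
}
```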


The closer

Bidet AI started because I needed to write report card comments without losing six hours of my life. It's becoming something bigger because the same shape — talk freely, let the model find the structure — works for any brain that goes faster than its fingers. Teachers. Students. Anyone whose useful thoughts come out scattered.

The whole thing runs on the phone in your pocket. It doesn't need a server. It doesn't need a subscription. It doesn't need your data to leave your hand.

Take a brain dump. AI cleans up your mess.

That's it.


Bidet AI is a Kaggle Gemma 4 Good Hackathon submission for the Build with Gemma 4 track. Submission deadline: May 18, 2026. The full contest pitch and source notes are at reports.thebarnetts.info. Built by Mark Barnett — 25 years teaching middle school, currently testing an on-device AI in the actual classroom.

All proper-noun voices in this essay are mine. The structure was assembled by an AI editor; the words are not.