Mark's Reports

Beyond STT: Architectures past discrete transcribe-then-clean for Bidet AI

Date: 2026-05-10
Question Mark asked: "Speech is the fastest human-to-computer right now... is there a more direct route? Higher tech, next level?"
Bidet's shape: voice in -> text transcript -> AI cleanup -> text out (three discrete steps).


TL;DR for the implementer

Discrete transcribe -> clean -> text is the right shape for Bidet for the foreseeable horizon (5+ years) because Bidet's output is text someone reads. The "skip the transcript" architectures (speech-to-speech, audio-to-intent, BCI) are real and shipping, but they win when the output is also non-text (audio reply, app action, neural prosthesis). Bidet's product surface is a written brain-dump. The transcript is not a bottleneck — it is the deliverable.

The next-level architecture move that does apply to Bidet is collapsing the pipeline into a single audio-conditioned LLM pass: audio embeddings stream directly into a Gemma/USM-style backbone that emits the cleaned text in one forward pass, no separate Whisper stage. Gemma 3n already implements this shape on-device.


1. Speech-to-speech bidirectional models

Gemini Live (Google)

Gemini 2.0 Flash Live is natively multimodal — audio, vision, and text share one model pass. The Live API streams audio in and produces audio out without a discrete intermediate transcript. Cloud-only. Pricing $0.00165/min.

GPT-4o Realtime (OpenAI)

Native audio-to-audio. Preserves prosody, sarcasm, urgency, hesitation in a single pass. Cloud-only. WebRTC + WebSocket endpoints. ~$0.30/min (182x more than Gemini Live).

What they're architecturally doing

Both compress audio into discrete or continuous tokens that the LLM treats as just another modality. The LLM emits audio tokens that a vocoder reconstructs into waveform. No round-trip through readable English text.
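A toy sketch of that token flow, with every component stubbed (the tokenizer, LLM step, and vocoder below are placeholders, not real model calls; `CODEBOOK` is an invented size):

```python
CODEBOOK = 1024  # size of the audio-token vocabulary (stub value)

def audio_tokenizer(waveform):
    # stand-in for a neural codec: compress sample windows into discrete tokens
    return [hash(tuple(waveform[i:i + 4])) % CODEBOOK
            for i in range(0, len(waveform), 4)]

def llm_step(audio_tokens):
    # stand-in for the LLM: attends to audio tokens, emits audio tokens.
    # No readable English transcript exists anywhere in this loop.
    return [(t + 7) % CODEBOOK for t in audio_tokens]

def vocoder(audio_tokens):
    # stand-in for the decoder that reconstructs a waveform from tokens
    return [t / CODEBOOK for t in audio_tokens]

reply = vocoder(llm_step(audio_tokenizer([0.0, 0.1, -0.2, 0.05] * 8)))
```

The point of the sketch: nothing between input and output is readable English, which is why a transcription tap would have to be bolted on separately.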

On-device viability

Neither runs on-device today. Both require cloud GPUs.

Bidet relevance

Low. Bidet's value is the cleaned text artifact — a written transcript Mark hands to a class, pastes into a doc, stores. A speech-to-speech model would need a separate transcription tap to produce that artifact, which puts you right back at the discrete pipeline.

Sources:
- https://speko.ai/benchmark/openai-vs-gemini-live
- https://ai.google.dev/gemini-api/docs/live-api


2. Voxtral (Mistral, open-weights)

Architecture

Voxtral 4B is a transformer + flow-matching + neural codec stack built on Ministral 3B. Voxtral Realtime is a streaming STT variant with a custom causal audio encoder, configurable 240ms-2.4s delay.

On-device viability

Yes — 4B in BF16 is ~8 GB, quantized ~3 GB. Runs on phones with quantization. Open-weights.

Bidet relevance

Medium. Voxtral Realtime is a credible Whisper replacement if we want streaming on-device with sub-second first-token latency. It is still "audio -> text"; it does not skip the transcribe step. It just makes that step faster, on-device, and open-weight.
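A hedged sketch of what the streaming shape buys: partial transcripts start flowing once a configurable delay's worth of audio has buffered. `FRAME_MS`, `stub_decoder`, and the one-word-per-frame behavior are invented for illustration:

```python
DELAY_MS = 240  # configurable lookahead, per the 240 ms - 2.4 s range above
FRAME_MS = 80   # stub frame size

def stub_decoder(frames):
    # stand-in for the causal audio encoder + decoder: one word per frame
    return " ".join(f"w{i}" for i in range(len(frames)))

def stream_transcribe(frames):
    """Emit a partial transcript once DELAY_MS of audio has accumulated."""
    buffered, partials = [], []
    for frame in frames:
        buffered.append(frame)
        if len(buffered) * FRAME_MS >= DELAY_MS:
            partials.append(stub_decoder(buffered))
    return partials

parts = stream_transcribe([b"..."] * 5)
```

With a causal encoder, the decoder never waits for the full utterance, which is where the sub-second first-token latency comes from.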

Sources:
- https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602
- https://venturebeat.com/technology/mistral-drops-voxtral-transcribe-2-an-open-source-speech-model-that-runs-on


3. SeamlessM4T (Meta)

Architecture

Two-stage seq2seq: encoder produces translated text, then a UnitY2 decoder produces speech "unit tokens," then a HiFi-GAN vocoder produces waveform. Optimized for translation, not brain-dump cleanup.
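The two-stage shape can be sketched as three stubbed functions (all bodies are placeholders; only the data flow mirrors the description above):

```python
def text_decoder(src_speech, tgt_lang):
    # stage 1: seq2seq encoder-decoder emits translated text
    return f"[{tgt_lang}] translated text"

def unit_decoder(text):
    # stage 2 (UnitY2): text -> discrete speech "unit tokens"
    return [ord(c) % 100 for c in text]

def hifigan(units):
    # vocoder: unit tokens -> waveform samples
    return [u / 100 for u in units]

wave = hifigan(unit_decoder(text_decoder(b"...", "fra")))
```

Note that text is produced mid-pipeline here too; SeamlessM4T just keeps going past it to speech.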

On-device viability

281M small variant exists for edge inference. unity.cpp ships GGML bindings.

Bidet relevance

Low. SeamlessM4T's payoff is cross-lingual speech-to-speech translation. Bidet does monolingual cleanup. The architecture is informative but not useful here.

Source: https://huggingface.co/facebook/seamless-m4t-v2-large


4. Speech LLMs (audio-embeddings-direct-to-LLM)

What it actually is

The architectural pattern: an audio encoder (Whisper, w2v-BERT, Conformer, USM) produces audio embeddings; a thin "modality adapter" maps those embeddings into the same latent space as the LLM's text token embeddings; the LLM (Gemma, Llama, Qwen) then attends to audio + text jointly and emits text. There is no intermediate written transcript — the LLM consumes audio embeddings directly and writes the cleaned output.

This is the most architecturally interesting option for Bidet because it collapses transcribe -> clean into one forward pass while keeping text as the output.
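A minimal sketch of the encoder -> adapter -> LLM data flow, with toy dimensions and stubbed components (`audio_encoder`, `llm_generate`, and the weight matrix are invented for illustration; real stacks project e.g. 1280-dim audio frames into a 2048+-dim token space):

```python
AUDIO_DIM, LLM_DIM = 4, 8  # toy sizes

def audio_encoder(waveform):
    # stand-in for Whisper/USM: one AUDIO_DIM-vector per ~80 ms frame
    return [[float(s)] * AUDIO_DIM for s in waveform[::80]]

def modality_adapter(frames, weights):
    # thin linear projection into the LLM's token-embedding space
    return [[sum(f[i] * weights[i][j] for i in range(AUDIO_DIM))
             for j in range(LLM_DIM)] for f in frames]

def llm_generate(prefix_embeddings, instruction):
    # stand-in for the LLM: attends to audio embeddings + text jointly
    # and emits cleaned text directly -- no intermediate transcript
    return f"cleaned text from {len(prefix_embeddings)} audio frames"

W = [[0.1] * LLM_DIM for _ in range(AUDIO_DIM)]
frames = audio_encoder([0.5] * 400)  # 400 samples -> 5 frames
out = llm_generate(modality_adapter(frames, W), "Clean up this dictation.")
```

The adapter is the whole trick: it is small enough to train cheaply while the encoder and LLM stay frozen.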

Production examples in 2026

Per the sources below: Gemma 3n (on-device, audio encoder feeding the Gemma backbone directly), NVIDIA NeMo's SpeechLLM stack (the documented encoder-adapter-LLM recipe), and the Gemma 4 E2B/E4B audio encoder line.

Bidet relevance

HIGH. This is the architectural "next level" that fits Bidet's product surface. Mark's contest is on a Gemma 4 stack, the direct successor to this pattern. The Cactus Special Tech prize ($10K) rewards exactly this: a local-first app that "intelligently routes tasks between models" on-device — i.e., let Gemma do double duty as both transcriber and cleaner instead of running Whisper and Gemma serially.

Cost

Annotation cost: lower than voice-to-intent (no labeled intents needed) — paired audio + clean-text examples are abundant.

Sources:
- https://docs.nvidia.com/nemo-framework/user-guide/24.12/nemotoolkit/multimodal/speech_llm/intro.html
- https://ai.google.dev/gemma/docs/gemma-3n
- https://www.mindstudio.ai/blog/gemma-4-audio-encoder-e2b-e4b-speech-recognition


5. Direct audio-to-action / voice-to-intent

What it is

Map a spoken command directly to a structured intent + slot values. No transcript at all. Picovoice Rhino, Resemble, Deepgram all ship this today.
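To make the contrast with Bidet concrete, here is the output shape of voice-to-intent, sketched with a toy regex grammar (real engines such as Picovoice Rhino compile the grammar into a small on-device model, not regex; the patterns and `Intent` type here are invented):

```python
import re
from dataclasses import dataclass

@dataclass
class Intent:
    name: str
    slots: dict

# toy grammar: each entry is (pattern, intent name, slot name)
GRAMMAR = [
    (r"set (?:a )?timer (?:for )?(\d+) min", "set_timer", "minutes"),
    (r"play (.+)", "play_media", "title"),
]

def speech_to_intent(utterance):
    for pattern, name, slot in GRAMMAR:
        m = re.search(pattern, utterance)
        if m:
            return Intent(name, {slot: m.group(1)})
    return None  # open-ended speech (a brain-dump) matches no intent

cmd = speech_to_intent("set timer 5 min")
dump = speech_to_intent("ramble about a 6th-grade history lesson")
```

The `None` branch is the whole argument: a brain-dump has no slot structure to compile against.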

Performance

Sub-200ms latency. ~30% lower intent classification error than cascaded ASR+NLU. Picovoice claims 30-50 KB model size, full on-device.

Bidet relevance

Zero. Bidet is not a command system. It's an open-ended brain-dump. There's no fixed intent grammar to compile. Voice-to-intent is the right tool for "play song X," "set timer 5 min," not "ramble about a 6th-grade history lesson for 90 seconds."

Source: https://picovoice.ai/products/voice/speech-to-intent/


6. Always-on continuous on-device STT

What it is

Pixel's Live Caption + Google ML Kit GenAI Speech Recognition (powered by Gemini Nano via Android AICore, alpha in 2026). The always-listening stage uses a low-power DSP pattern-matcher; full STT only fires on detected speech.

Battery cost

Modern always-on uses dedicated low-power silicon; impact is "a few percentage points per day" — small relative to display/cellular.

Bidet relevance

Medium. This is a UX variant of Bidet, not an architectural alternative. It would let Mark drop the "tap to record" gesture entirely. Could be a v0.4 feature: phone always listens, surfaces a cleaned chunk after a natural pause. Architecturally still transcribe -> clean, just triggered by VAD instead of by tap.
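A sketch of the VAD-gated loop, assuming an energy-threshold stand-in for the low-power DSP stage (`ENERGY_THRESHOLD`, `PAUSE_FRAMES`, and `transcribe_clean` are invented names):

```python
ENERGY_THRESHOLD = 0.1  # stub DSP gate
PAUSE_FRAMES = 3        # consecutive silent frames that end a chunk

def is_speech(frame):
    # stand-in for the low-power pattern matcher / VAD
    return max(abs(s) for s in frame) > ENERGY_THRESHOLD

def always_on(frames, transcribe_clean):
    """Gate full STT behind the VAD; flush a chunk after a natural pause."""
    chunk, silence, outputs = [], 0, []
    for frame in frames:
        if is_speech(frame):
            chunk.append(frame)
            silence = 0
        elif chunk:
            silence += 1
            if silence >= PAUSE_FRAMES:
                outputs.append(transcribe_clean(chunk))
                chunk, silence = [], 0
    if chunk:
        outputs.append(transcribe_clean(chunk))
    return outputs

speech, quiet = [0.5, -0.4], [0.0, 0.0]
out = always_on([speech] * 4 + [quiet] * 3 + [speech] * 2,
                lambda c: f"cleaned ({len(c)} frames)")
```

Note the expensive path (transcribe + clean) only runs per detected utterance, which is what keeps the battery cost at DSP levels.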

Sources:
- https://picovoice.ai/blog/android-speech-recognition/
- https://news.ycombinator.com/item?id=19371890


7. Brain-computer interfaces

Honest 2026 read

BCI is real medicine for ALS / locked-in / paralysis patients. For someone whose voice and hands work, BCI offers no throughput advantage over speech (~150 wpm conversational) and requires brain surgery. Even the best academic BCI throughput (78 wpm) is slower than talking and has a 25% word error rate.

Bidet relevance

Zero for the next 5+ years. BCI is not a faster route for users with intact speech. It is a parallel route for users without it.

Sources:
- https://www.nature.com/articles/s41586-023-06377-x
- https://synchron.com/
- https://neuralink.com/


8. The honest take on Bidet specifically

Bidet's pipeline is audio -> transcript -> cleaned text -> human reads. The "transcribe step" is not friction to be eliminated — it is the mid-point of a write-once-read-many artifact. The final consumer is a human reading text; you cannot skip producing text.

Where the discrete pipeline genuinely could collapse: fuse transcribe + clean into one Gemma 3n / Gemma 4 audio-conditioned forward pass. The audio embedding goes directly into the LLM, which emits cleaned text. No separate Whisper stage. This:
- Cuts latency (one model load, one forward pass).
- Cuts model footprint (one model on disk instead of Whisper + Gemma).
- Improves accuracy in noisy conditions (the LLM can resolve ambiguity using context that pure ASR can't see).
- Maps directly onto the contest's Cactus Special Tech prize ("local-first mobile that intelligently routes tasks between models").
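The latency and footprint claim can be made concrete by counting forward passes. Everything below is a stub (no real models are loaded), but the call pattern mirrors the two architectures:

```python
calls = []  # records one entry per model forward pass

def whisper(audio):
    calls.append("whisper")
    return "um so like the revolution uh started because"

def gemma_clean(text):
    calls.append("gemma")
    return "The revolution started because..."

def encode(audio):
    # the audio encoder is a thin front-end, not a second LLM
    return [0.0] * 16

def gemma_fused(audio_embeddings, instruction):
    calls.append("gemma")  # one model does both jobs
    return "The revolution started because..."

def cascaded(audio):
    return gemma_clean(whisper(audio))  # two models, two passes

def fused(audio):
    return gemma_fused(encode(audio), "Clean this dictation.")  # one pass

cascaded(b"...")
n_cascaded = len(calls)
calls.clear()
fused(b"...")
n_fused = len(calls)
```

On a phone, each avoided forward pass is also an avoided multi-GB model load, which is where most of the wall-clock win comes from.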

This is not "skipping text." It is "skipping the intermediate text that nobody was going to read anyway." The deliverable text is still produced. That's the architectural next level that respects Bidet's product surface.

What does not apply: speech-to-speech (no audio output needed), voice-to-intent (no fixed grammar), BCI (serves users without intact speech; Bidet's users have it). What is a UX variant: always-on continuous capture (worth considering for v0.4).

When to revisit this verdict

- Bidet's deliverable stops being text someone reads (an audio reply or app action becomes the product surface).
- Speech-to-speech models become viable fully on-device at phone-class memory budgets.
- An audio-conditioned Gemma pass beats the Whisper + Gemma cascade on Bidet's noisy dictation inputs, making the fused architecture the default rather than an experiment.
