You asked for a research agent that catches the next Supertonic before you stumble into it on your phone at 2 PM. This is week one of that radar. The headline is that the voice lane just had its loudest 90 days in a long while. Three of the products you're already using or considering — Moonshine on the Bidet phone, Supertonic on the Ray-Bans path, faster-whisper inside Bidet Quick — all sit downstream of releases shipped between February and now. The supply side of cutting-edge voice AI is open, fast, and on-device. The Bidet thesis is being validated by the field, not just by your gut.
This week's three biggest things, in plain English. First, Meituan dropped a 3.5B diffusion-based zero-shot voice cloner three days ago that beats Seed-TTS on the standard benchmark and is MIT-licensed; this is the one you didn't see coming. Second, Hume open-sourced TADA in March, a 1B LLM-based TTS that runs at a 0.09 real-time factor and produced zero hallucinations across a thousand-sample test, which matters because Bidet's cleaning step occasionally injects junk that Supertonic dutifully speaks. Third, Anthropic shipped Claude Code voice mode in early March; the initial rollout reached only about 5% of users, so we should check whether you have it. None of these required GPU cloud money.
I'm structuring this as a real newsletter going forward, not a link dump. Each cutting-edge item gets the full Bidet Check inline so you can see in two seconds what to dig into and what to skim. Verdict legend: 🟢 INVESTIGATE means it could plug into the stack this month, 🟡 WATCH means promising but not actionable yet, 🔴 IGNORE means I checked it for you so you don't have to.
/voice command, push-to-talk on spacebar. We should check if you have it. 9to5mac.com on the rollout

This is the one Mark would not have stumbled into on his own. Meituan (yes, the Chinese food-delivery giant) has a serious AI team called LongCat, and on 2026-05-19 they open-sourced a diffusion-transformer zero-shot TTS model called LongCat-AudioDiT in two sizes (1B and 3.5B). The bigger model takes the new state-of-the-art crown on the Seed-TTS benchmark, lifting speaker-similarity scores from 0.809 to 0.818 on Seed-ZH and from 0.776 to 0.797 on Seed-Hard. The interesting part isn't the score; it's the architecture choice.
Where almost every prior open TTS (XTTSv2, F5-TTS, Spark-TTS, IndexTTS, Fish Speech) hops through mel-spectrograms or learned discrete codecs and then resynthesizes audio with a vocoder, LongCat-AudioDiT operates directly in the waveform latent space. The pipeline is just a waveform variational autoencoder plus a diffusion transformer. They explicitly claim this kills the compounding-errors problem and simplifies the engineering surface. They also fixed a long-standing training-inference mismatch that quietly degraded prior diffusion-TTS models and replaced classifier-free guidance with what they call "adaptive projection guidance." Engineering paper, not academic vaporware. MIT licensed. github.com/meituan-longcat/LongCat-AudioDiT · demo page
Hume AI is one of the better-funded voice startups (formerly a Cornell affective-computing lab), and on 2026-03-10 they released their first open-source TTS, called TADA (Text-Acoustic Dual Alignment). The headline claim is that across 1,000+ test samples from LibriTTS-R, TADA produced zero content hallucinations. For context: every other LLM-based TTS (Spark, IndexTTS, F5, even high-quality commercial systems) will occasionally insert phantom words, repeat phrases, or drop syllables. That's the failure mode you've watched Supertonic hit on long sentences. Hume's argument: traditional approaches generate text tokens and acoustic tokens in separate streams that drift; TADA synchronizes them one-to-one in a single stream. The claim holds up on the demo: quality is on par with Spark-TTS and IndexTTS, but the failure floor is notably lower. Real-time factor is 0.09 (more than 5x faster than competing LLM-based TTS), and the model fits 700 seconds of audio in a 2048-token context window where competing systems run out at 70 seconds. Long-form audiobooks were clearly part of the design brief. hume.ai/blog/opensource-tada · huggingface.co/HumeAI/tada-1b
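To make the two TADA numbers concrete, here is a minimal arithmetic sketch. It assumes the usual definition of real-time factor (synthesis time divided by audio duration) and assumes the competing systems are measured against the same 2048-token window; the function names are illustrative and not part of Hume's code.

```python
def speedup_over_real_time(rtf: float) -> float:
    """Seconds of audio produced per second of compute, assuming RTF = compute time / audio time."""
    return 1.0 / rtf


def tokens_per_audio_second(context_tokens: int, audio_seconds: float) -> float:
    """Average context tokens consumed by each second of audio."""
    return context_tokens / audio_seconds


TADA_RTF = 0.09          # figure quoted in the newsletter
CONTEXT_TOKENS = 2048    # figure quoted in the newsletter

tada_speedup = speedup_over_real_time(TADA_RTF)
tada_density = tokens_per_audio_second(CONTEXT_TOKENS, 700)
other_density = tokens_per_audio_second(CONTEXT_TOKENS, 70)  # assumes the same window size

print(f"TADA generates about {tada_speedup:.1f}x faster than real time")
print(f"TADA spends about {tada_density:.1f} tokens per second of audio")
print(f"A 70-second system spends about {other_density:.1f} tokens per second of audio")
```

Under those assumptions, the 10x gap in audio capacity is simply a 10x difference in tokens spent per second of speech, which is what the one-to-one text and acoustic alignment would be expected to buy.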
Anthropic quietly started rolling out voice mode for Claude Code on 2026-03-03, beating OpenAI's Codex voice mode by exactly one week. Implementation is the simplest possible: /voice to turn it on, hold spacebar to talk, release to send. Speech is transcribed in real time and pasted into the input field. Initial rollout was ~5% of Pro/Max/Team/Enterprise users and Anthropic said it would ramp over the following weeks. Eleven weeks have passed since launch, so the rollout should be at or near 100% by now, but Anthropic has not publicly confirmed. Worth a 30-second check on your Max plan tonight: type /voice in any Claude Code session and see if it accepts. If it does, that's a hands-free coding loop layered on top of the one you already have through "Computer" wake → /ask → ntfy → Ray-Bans. Stack Junkie's hands-on review · Winbuzzer launch coverage
/voice. 48-hour test A. Action: I should check tonight whether your account has the rollout. If yes, A/B for a session and report back whether it's actually better than Bidet Quick for code dictation, or just more convenient.

Three production releases hit the voice-agent benchmarks Sierra just published as τ-Voice (more on that below).

OpenAI's gpt-realtime-1.5 shipped in February 2026 with a 4.1% word error rate vs Whisper-v3's 5.3%, roughly 23% fewer mistakes at the same $0.006 per minute. OpenAI also now recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for best results. tokenmix.ai breakdown.

Google's gemini-2.5-flash-native-audio went generally available on Vertex AI with 30 HD voices in 24 languages and explicit style-prompt steering ("whisper," "speak slowly," accent control). Worth knowing as the closed-source benchmark to compare open-source TTS against. Google's update blog.

Anthropic shipped the Claude Code voice mode (covered above) and has nothing else public in voice this quarter; their lane is text+coding agents, and that's fine.
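The "fewer mistakes" figure in the OpenAI comparison above is a relative error reduction, not a difference in percentage points. A small sketch of that arithmetic, using only the two word error rates quoted in the text (the function name is illustrative):

```python
def relative_error_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline's errors that the improved system avoids."""
    return (baseline_wer - improved_wer) / baseline_wer


WHISPER_V3_WER = 0.053    # 5.3%, as quoted in the newsletter
GPT_REALTIME_WER = 0.041  # 4.1%, as quoted in the newsletter

reduction = relative_error_reduction(WHISPER_V3_WER, GPT_REALTIME_WER)
absolute_gap = WHISPER_V3_WER - GPT_REALTIME_WER

print(f"Relative reduction: {reduction:.1%}")
print(f"Absolute gap: {absolute_gap * 100:.1f} percentage points")
```

The absolute gap is only 1.2 percentage points, so the relative number is the flattering way to state it; both describe the same result.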
On the commercial side, Resemble AI's Chatterbox-Turbo deserves a mention. It's a distilled 350M-parameter open TTS that hits 75ms latency and 6x real-time on a single GPU, with native paralinguistic tags ([cough], [laugh], [chuckle], [sigh]) and 5-second voice cloning. Imperceptible Perth watermarking is baked in. MIT-licensed, 649 likes on Hugging Face, 12+ community Spaces hosting demos. huggingface.co/ResembleAI/chatterbox-turbo. Smaller, faster, and more permissively licensed than TADA, it is arguably the cleanest open-source Supertonic alternative if you ever want one with paralinguistic control.

Wispr Flow added a "Personalized Style" setting in 2026 (set tone per app, from very-casual to formal) and a Pro-only "Command Mode" that lets you voice-edit highlighted text ("make this more concise"). $15/month, closed-source, and relevant only because they're the direct UX competitor to Bidet Quick. The interesting thing is they've quietly become the only major dictation tool on Mac, Windows, iOS, and Android simultaneously. wisprflow.ai
Stuff worth knowing about but not deep-diving today. Each one would slot into the stack in a specific scenario.
Two benchmark papers and one paper that’s honestly more useful as a reading list than as a result. arXiv has been productive on the voice-agent evaluation side this quarter, which usually means the agents themselves are converging on a level worth measuring.
τ-Voice (Sierra Research, 2026-03-17). The most useful research paper of the quarter for someone in Mark's spot. Sierra benchmarked three production voice agents — gpt-realtime-1.5, gemini-live-2.5-flash-native-audio, and grok-voice-agent — on 91 customer-service tasks across three real-world domains. The headline: the best system scored 38% on task completion under realistic conditions (noise, accents, interruptions), half what equivalent text agents achieve. Even on clean audio the best score was 51%. This is the receipt for "voice agents are not actually that close to text agents yet." Useful counterweight to OpenAI/Google marketing. arxiv.org/abs/2603.13686
EVA-Bench (ServiceNow, 2026-05-13). A different angle on the same problem. ServiceNow built an end-to-end framework that orchestrates bot-to-bot audio conversations and scores them on two composite metrics: EVA-A (accuracy — task completion + speech fidelity) and EVA-X (experience — turn-taking, conciseness). The framework lets you directly compare cascade pipelines (STT+LLM+TTS, which is basically what Bidet is) against hybrid (AudioLLM+TTS) and pure speech-to-speech systems. arxiv.org/abs/2605.13841 · huggingface.co/blog/ServiceNow-AI/eva
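The three architecture families EVA-Bench compares differ mainly in where audio is converted to text and back. The sketch below is a hypothetical illustration of that difference, not EVA-Bench's framework or Bidet's code: every type alias, class, and stub stage is invented here, and the stub "audio" is just UTF-8 bytes so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real systems would wrap actual models.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]
AudioLanguageModel = Callable[[bytes], str]   # hears audio, answers in text
SpeechToSpeech = Callable[[bytes], bytes]     # audio in, audio out, no text hop


@dataclass
class CascadePipeline:
    """STT + LLM + TTS: three separate stages, text in the middle."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


@dataclass
class HybridPipeline:
    """AudioLLM + TTS: the model consumes audio directly, then a TTS speaks its text."""
    audio_llm: AudioLanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        return self.tts(self.audio_llm(audio_in))


# Stub stages so the sketch is runnable.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_audio_llm(audio: bytes) -> str:
    return fake_llm(fake_stt(audio))


cascade = CascadePipeline(fake_stt, fake_llm, fake_tts)
hybrid = HybridPipeline(fake_audio_llm, fake_tts)
speech_to_speech: SpeechToSpeech = lambda audio: fake_tts(fake_audio_llm(audio))

cascade_reply = cascade.respond(b"hello")
hybrid_reply = hybrid.respond(b"hello")
s2s_reply = speech_to_speech(b"hello")
```

All three expose the same audio-in, audio-out interface, which is what lets a bot-to-bot harness score them side by side; the cascade is the only one where each stage can be swapped independently.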
X-Voice (2026-05-09). Zero-shot cross-lingual voice cloning across 30 languages, based on F5-TTS plus a flow-matching architecture that uses International Phonetic Alphabet representations. Claims comparable quality to billion-scale models like Qwen3-TTS at a smaller size. Honestly, I'm flagging it because it's recent and on-topic, but most of the claim is "we matched a bigger model on this specific benchmark." Watch for independent community reproductions before treating the result as solid. arxiv.org/abs/2605.05611
One thing worth noting on the academic side — the paper Mark would actually find useful is "Recent Advances in Speech Language Models: A Survey" (Cui et al., October 2024), which lays out the architectural taxonomy for end-to-end speech LLMs that bypass the ASR-LLM-TTS pipeline entirely. Not new, but it's the right framing for understanding why gpt-realtime, gemini-native-audio, and Qwen3-Omni are all converging on the same architecture pattern this year. hf.co/papers/2410.03751
Three actionable threads dropped out of this week's scan. Listed in order of how soon they should land.
Mark: type /voice in any Claude Code session tonight. If it works, that's an answer to whether you can dictate code without needing Bidet Quick to be the universal layer. If not, I'll loop back with how to request the rollout. Either way, low cost to find out.
The Bidet phone runs Moonshine v1 today for on-device STT. v2 hits 50ms latency vs Whisper Tiny's ~290ms (5.8x speedup) at the same accuracy. This is a drop-in replacement, not a rebuild. The Kaggle submission and DEV.to publication are locked, but the next phone build should ship with v2. Add to the backlog as a 30-minute swap once you're not also testing the RAM upgrade.
The 1B version of LongCat-AudioDiT is the cheap experiment. Run it on Apex once the RAM is in, feed it your corpus from Bidet Quick (the system is already silently building a (text, audio) pair archive), and see if it can hit zero-shot voice clone quality with under 30 seconds of your voice. If yes, the Whisper-mark Tier 3 plan (own voice synthesis) gets a serious shortcut. If no, fall back to XTTSv2 fine-tune as originally planned.
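One way to set up that experiment is to pull a reference set of under 30 seconds from the (text, audio) archive before handing it to the model. This is a minimal sketch under loud assumptions: the `Clip` record, its fields, and the file paths are hypothetical, since the archive's real layout isn't described here, and the greedy longest-first choice is just one simple selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical archive record: one transcript paired with one audio file."""
    text: str
    audio_path: str
    duration_s: float


def pick_reference_clips(clips: list[Clip], budget_s: float = 30.0) -> tuple[list[Clip], float]:
    """Greedily take the longest clips that still fit inside the time budget."""
    chosen: list[Clip] = []
    total = 0.0
    for clip in sorted(clips, key=lambda c: c.duration_s, reverse=True):
        if total + clip.duration_s <= budget_s:
            chosen.append(clip)
            total += clip.duration_s
    return chosen, total


# Made-up example data, not real archive contents.
archive = [
    Clip("first note", "clips/a.wav", 12.0),
    Clip("second note", "clips/b.wav", 9.0),
    Clip("third note", "clips/c.wav", 11.0),
    Clip("fourth note", "clips/d.wav", 5.0),
]

reference_clips, reference_seconds = pick_reference_clips(archive)
print([c.audio_path for c in reference_clips], reference_seconds)
```

Longest-first favors fewer, fuller utterances over many short fragments; whether that is the right trade for cloning quality is something the experiment itself would have to show.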
What's NOT changing. The Whisper-mark Tier 2 LoRA fine-tune plan stays exactly as written — nothing in this week's scan replaces "fine-tune large-v3 on Mark's voice + Mark's vocabulary." faster-whisper is still the right inference layer; the Bidet Quick corpus is still the right training data. None of the new releases obsolete that path.
Generic "best TTS in 2026" listicles. BentoML, Hyperstack, and ten other content-marketing sites published the same article. They're all six weeks behind the field. Use Hugging Face's trending feed and Pendrokar's TTS Spaces Arena for real comparison.
VoiceSculptor (January). "Instruction-based voice design" via RAG over speaker descriptions. Interesting research, no practical edge over LongCat or TADA. Skip unless natural-language voice design becomes an explicit Bidet feature.
AISHELL6-whisper (Mandarin whisper-mode dataset). Real research, real dataset, zero relevance to your stack.
xAI grok-voice-agent. Hard skip per anti-Elon policy. It exists, it's a closed-source τ-Voice benchmark entry, that's all you need to know.
Anything tagged "AI agents 2026 awesome list." Two of the top GitHub trending repos this week are aggregator listicles. They're SEO bait. Don't click.
Sunday 2026-05-31 (week 2). Same URL slug pattern: /private/r/2026-05-31-voice-multimodal-radar.html. Source list and methodology documented in project_voice_multimodal_radar_2026-05-22.md in the memory repo for the next agent that runs this. If something major drops mid-week (e.g., a Claude voice consumer release, a WWDC voice surprise), it overrides the cadence.