Voice & Multimodal Radar — week of 2026-05-22

You asked for a research agent that catches the next Supertonic before you stumble into it on your phone at 2 PM. This is week one of that radar. The headline is that the voice lane just had its loudest 90 days in a long while. Three of the products you're already using or considering — Moonshine on the Bidet phone, Supertonic on the Ray-Bans path, faster-whisper inside Bidet Quick — all sit downstream of releases shipped between February and now. The supply side of cutting-edge voice AI is open, fast, and on-device. The Bidet thesis is being validated by the field, not just by your gut.

This week's three biggest things, in plain English. Meituan dropped a 3.5B diffusion-based zero-shot voice cloner three days ago that beats Seed-TTS on the standard benchmark and is MIT-licensed; this is the one you didn't see coming. Hume open-sourced TADA in March, a 1B LLM-based TTS that runs at a 0.09 real-time factor and produced zero hallucinations across a thousand-sample test, which is relevant because Bidet's cleaning step occasionally injects junk that Supertonic dutifully speaks. Anthropic began rolling out Claude Code voice mode in early March to about 5% of users, so we should check whether you have it. None of these required GPU cloud money.

I'm structuring this as a real newsletter going forward, not a link dump. Each cutting-edge item gets the full Bidet Check inline so you can see in two seconds what to dig into and what to skim. Verdict legend: 🟢 INVESTIGATE means it could plug into the stack this month, 🟡 WATCH means promising but not actionable yet, 🔴 IGNORE means I checked it for you so you don't have to.

TL;DR — the three things this week

1. LongCat-AudioDiT: 🟢 INVESTIGATE. Meituan open-sourced a 3.5B diffusion-transformer zero-shot voice cloner on 2026-05-19. New SOTA on Seed benchmark, MIT license, operates in waveform latent space (no mel-spectrogram hop). Could replace XTTSv2 in your future voice-clone plans. huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B

2. Hume TADA: 🟡 WATCH. First open-source TTS from Hume, released 2026-03-10. Llama-3.2-1B base, zero text-acoustic hallucinations on a 1,000-sample LibriTTS-R test, 700-sec long-form context. Llama 3.2 license (not pure MIT). huggingface.co/HumeAI/tada-1b

3. Claude Code voice mode: 🟢 INVESTIGATE (your stack). Anthropic started rolling out voice mode for Claude Code on 2026-03-03 to ~5% of Pro/Max/Team/Enterprise users. /voice command, push-to-talk on spacebar. We should check if you have it. 9to5mac.com on the rollout

Cutting edge — the three that matter

1 · LongCat-AudioDiT 3.5B — the voice cloner you didn't see coming

This is the one Mark would not have stumbled into on his own. Meituan — yes, the Chinese food-delivery giant — has a serious AI team called LongCat, and on 2026-05-19 they open-sourced a diffusion-transformer zero-shot TTS model called LongCat-AudioDiT in two sizes (1B and 3.5B). The bigger model takes the new state-of-the-art crown on the Seed-TTS benchmark, lifting speaker-similarity scores from 0.809 to 0.818 on Seed-ZH and from 0.776 to 0.797 on Seed-Hard. The interesting part isn't the score — it's the architecture choice.

Where almost every prior open TTS (XTTSv2, F5-TTS, Spark-TTS, IndexTTS, Fish Speech) hops through mel-spectrograms or learned discrete codecs and then resynthesizes audio with a vocoder, LongCat-AudioDiT operates directly in the waveform latent space. The pipeline is just a waveform variational autoencoder plus a diffusion transformer. They explicitly claim this kills the compounding-errors problem and simplifies the engineering surface. They also fixed a long-standing training-inference mismatch that quietly degraded prior diffusion-TTS models and replaced classifier-free guidance with what they call "adaptive projection guidance." Engineering paper, not academic vaporware. MIT licensed. github.com/meituan-longcat/LongCat-AudioDiT · demo page

🧪 BIDET CHECK — LongCat-AudioDiT 3.5B

Lane fit: B · Compound: B · Unique angle: A · Build cost: C · 48-hr test: A

Verdict: 🟢 INVESTIGATE

Lane fit: voice generation is the back half of Bidet AI (answer channel to Ray-Bans). Compound: would slot where Supertonic currently sits, but voice-cloning Mark's own voice is the Whisper-mark Tier 3 dream. Unique angle: it's the new SOTA, and the waveform-latent approach is genuinely novel. Build cost C because 3.5B params needs the 64GB RAM upgrade landing Friday, and the smaller 1B version is the cheap-test option. 48-hour test: yes — you've been talking about voice-cloning corpus pipelines for months. Action: when the RAM lands, drop the 1B into a side experiment using the Bidet Quick corpus the system is already building.

2 · Hume TADA — the “TTS that doesn’t go off-script”

Hume AI is one of the better-funded voice startups (formerly a Cornell affective-computing lab), and on 2026-03-10 they released their first open-source TTS, called TADA (Text-Acoustic Dual Alignment). The headline claim is that across 1,000+ test samples from LibriTTS-R, TADA produced zero content hallucinations. For context: every other LLM-based TTS (Spark, IndexTTS, F5, even high-quality commercial systems) will occasionally insert phantom words, repeat phrases, or drop syllables. That's the failure mode you've watched Supertonic hit on long sentences. Hume's argument: traditional approaches generate text tokens and acoustic tokens in separate streams that drift; TADA synchronizes them one-to-one in a single stream.

The technical claim holds up on the demo: quality is on par with Spark-TTS and IndexTTS, but the failure floor is notably lower. Real-time factor is 0.09 (more than 5x faster than competing LLM-based TTS), and the model fits 700 seconds of audio in a 2048-token context window, where competing systems run out at 70 seconds. Long-form audiobooks were clearly part of the design brief. hume.ai/blog/opensource-tada · huggingface.co/HumeAI/tada-1b
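As a back-of-the-envelope check on those TADA numbers, the sketch below works out what a 0.09 real-time factor and 700 seconds in a 2048-token window imply. It assumes RTF means synthesis time divided by audio duration (the usual convention; Hume's post may define it differently), and the function names are illustrative, not from any TADA code.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to generate audio, assuming RTF = synthesis time / audio duration."""
    return audio_seconds * rtf


def tokens_per_audio_second(context_tokens: int, audio_seconds: float) -> float:
    """Average context tokens consumed per second of audio."""
    return context_tokens / audio_seconds


# Figures quoted in this issue: 2048-token window, 700 s (TADA) vs 70 s (competing systems).
tada_rate = tokens_per_audio_second(2048, 700)
baseline_rate = tokens_per_audio_second(2048, 70)

print(f"10 s of speech at RTF 0.09: {synthesis_seconds(10, 0.09):.1f} s to generate")
print(f"TADA: {tada_rate:.1f} tokens/s, 70-second systems: {baseline_rate:.1f} tokens/s")
```

Read that way, the long-form advantage is a roughly tenfold drop in tokens spent per second of audio, which is a separate thing from the speed claim.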

🧪 BIDET CHECK — Hume TADA-1B

Lane fit: B · Compound: C · Unique angle: B+ · Build cost: B · 48-hr test: C

Verdict: 🟡 WATCH

Lane fit: same TTS slot as LongCat. Compound: C because Supertonic-F5 is already working through Ray-Bans, and swapping the path Mark uses 10x/day for a marginal hallucination improvement is high-risk, low-reward today. Unique angle: zero-hallucination claim is real and useful for long-form, but you don't do long-form. License is Llama 3.2 Community (not pure MIT), which matters if Bidet AI ever wants a clean-room license. 48-hr test C because absent a specific failure mode you're hitting with Supertonic, you'd forget about this. Trigger to revisit: if Supertonic starts mis-speaking the cleaned brain-dump output in ways that bother you, TADA is the obvious A/B test.

3 · Claude Code voice mode — check if you have it

Anthropic quietly started rolling out voice mode for Claude Code on 2026-03-03, beating OpenAI's Codex voice mode by exactly one week. The implementation is as simple as it gets: /voice to turn it on, hold spacebar to talk, release to send. Speech is transcribed in real time and pasted into the input field. Initial rollout was ~5% of Pro/Max/Team/Enterprise users, and Anthropic said it would ramp over the following weeks. Eleven weeks have passed since launch, so the rollout should be at or near 100% by now, but Anthropic has not publicly confirmed that. Worth a 30-second check on your Max plan tonight: type /voice in any Claude Code session and see whether it is accepted. If it is, that's a hands-free coding loop layered on top of the one you already have through "Computer" wake → /ask → ntfy → Ray-Bans. Stack Junkie's hands-on review · Winbuzzer launch coverage

🧪 BIDET CHECK — Claude Code voice mode

Lane fit: A · Compound: A · Unique angle: C · Build cost: A · 48-hr test: A

Verdict: 🟢 INVESTIGATE

Lane fit A because this is literally Anthropic shipping the Bidet thesis (voice in, AI out) inside the tool you use most. Compound A because it lives inside Claude Code which you already use daily. Unique angle C because Bidet Quick already does this for you anywhere in Windows — the Claude-native version doesn't capture to corpus, and it doesn't run while you're using Cursor or another tool. Build cost A: literally type /voice. 48-hour test A. Action: I should check tonight whether your account has the rollout. If yes, A/B for a session and report back whether it's actually better than Bidet Quick for code dictation, or just more convenient.

Industry moves — the big labs this week

Three production releases hit the voice-agent benchmarks Sierra just published as τ-Voice (more on that below). OpenAI's gpt-realtime-1.5 shipped in February 2026 with a 4.1% word error rate vs Whisper-v3's 5.3%, roughly 23% fewer mistakes at the same $0.006 per minute. They also now recommend gpt-4o-mini-transcribe over gpt-4o-transcribe for best results. tokenmix.ai breakdown.

Google's gemini-2.5-flash-native-audio went generally available on Vertex AI with 30 HD voices in 24 languages and explicit style-prompt steering ("whisper," "speak slowly," accent control). Worth knowing as the closed-source benchmark to compare open-source TTS against. Google's update blog.

Anthropic shipped the Claude Code voice mode (covered above) and has nothing else public in voice this quarter; their lane is text+coding agents, and that's fine.
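The "fewer mistakes" figure for gpt-realtime-1.5 is a relative reduction in word error rate, not a difference in percentage points. A minimal sketch of that arithmetic, using only the two WER numbers quoted above (the function name is illustrative):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline's errors that the improved system eliminates."""
    return (baseline - improved) / baseline


whisper_v3_wer = 0.053    # 5.3% word error rate, as quoted above
gpt_realtime_wer = 0.041  # 4.1% word error rate, as quoted above

reduction = relative_reduction(whisper_v3_wer, gpt_realtime_wer)
print(f"{reduction:.1%} fewer errors")
```

That works out to about 22.6%, against an absolute gap of 1.2 percentage points.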

On the commercial side, Resemble AI's Chatterbox-Turbo deserves a mention. It's a distilled 350M-parameter open TTS that hits 75ms latency and 6x real-time on a single GPU, with native paralinguistic tags ([cough], [laugh], [chuckle], [sigh]) and 5-second voice cloning. Imperceptible Perth watermarking is baked in. MIT-licensed, 649 likes on Hugging Face, 12+ community Spaces hosting demos. huggingface.co/ResembleAI/chatterbox-turbo. Smaller, faster, and more permissively licensed than TADA, it is arguably the cleanest open-source Supertonic alternative if you ever want one with paralinguistic control.

Wispr Flow added a "Personalized Style" setting in 2026 (set tone per app, from very-casual to formal) and a Pro-only "Command Mode" that lets you voice-edit highlighted text ("make this more concise"). $15/month, closed-source, and relevant only because they're the direct UX competitor to Bidet Quick. The interesting thing is they've quietly become the only major dictation tool on Mac, Windows, iOS, and Android simultaneously. wisprflow.ai

HuggingFace + GitHub watchlist

Stuff worth knowing about but not deep-diving today. Each one would slot into the stack in a specific scenario.

Moonshine v2. Pete Warden's team shipped v2 on 2026-02-12 with sliding-window attention. Tiny variant hits 50 ms latency (5.8x faster than Whisper Tiny), Medium hits 258 ms (43.7x faster than Whisper Large v3) at on-par accuracy. This is the upgrade path for your Bidet phone. github.com/moonshine-ai/moonshine

Qwen3-ASR. Alibaba's open ASR (0.6B + 1.7B) shipped 2026-01-29. 52 languages with language detection. Free alternative to Whisper-large if multilingual ever matters. github.com/QwenLM/Qwen3-ASR

Qwen3-Omni. End-to-end omnimodal LLM: text, audio, images, and video in; text and speech out, with real-time streaming. Closest open competitor to gpt-realtime. github.com/QwenLM/Qwen3-Omni

Voicebox (jamiepine). Open-source voice studio in Rust/Tauri: 5 TTS engines, 23 languages, system-wide dictation, Claude/ChatGPT MCP voice integration. 22k stars as of April. Closest analogue to what Bidet AI could become as a desktop product. github.com/jamiepine/voicebox

Spokenly. Standalone Mac dictation app exposing a local MCP server so any agent can call its STT. Useful pattern reference even if you don't use the app. spokenly.app

pyannote Precision-2. +14% accuracy over Precision-1, +28% over OSS 3.1. If diarization (who-said-what in OMI transcripts) ever becomes interesting again, this is the open-source path. pyannote.ai/blog/precision-2

Research papers worth knowing about

Two benchmark papers and one paper that’s honestly more useful as a reading list than as a result. arXiv has been productive on the voice-agent evaluation side this quarter, which usually means the agents themselves are converging on a level worth measuring.

τ-Voice (Sierra Research, 2026-03-17). The most useful research paper of the quarter for someone in Mark's spot. Sierra benchmarked three production voice agents — gpt-realtime-1.5, gemini-live-2.5-flash-native-audio, and grok-voice-agent — on 91 customer-service tasks across three real-world domains. The headline: the best system scored 38% on task completion under realistic conditions (noise, accents, interruptions), half what equivalent text agents achieve. Even on clean audio the best score was 51%. This is the receipt for "voice agents are not actually that close to text agents yet." Useful counterweight to OpenAI/Google marketing. arxiv.org/abs/2603.13686

EVA-Bench (ServiceNow, 2026-05-13). A different angle on the same problem. ServiceNow built an end-to-end framework that orchestrates bot-to-bot audio conversations and scores them on two composite metrics: EVA-A (accuracy — task completion + speech fidelity) and EVA-X (experience — turn-taking, conciseness). The framework lets you directly compare cascade pipelines (STT+LLM+TTS, which is basically what Bidet is) against hybrid (AudioLLM+TTS) and pure speech-to-speech systems. arxiv.org/abs/2605.13841 · huggingface.co/blog/ServiceNow-AI/eva
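To make the "cascade" category concrete, here is a minimal sketch of an STT+LLM+TTS pipeline as three swappable stages. Everything in it is hypothetical: the type aliases, the class, and the stub stages are illustrations, not EVA-Bench's API or Bidet's actual code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures, for illustration only.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadePipeline:
    """STT -> LLM -> TTS, each stage a separate, swappable component."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # a transcription error here flows into every later stage
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
print(pipeline.respond(b"check the radar"))
```

The cascade's appeal is that any one stage can be swapped on its own; the cost is that each stage only sees the previous stage's output, which is the trade-off a speech-to-speech system avoids.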

X-Voice (2026-05-09). Zero-shot cross-lingual voice cloning across 30 languages, based on F5-TTS plus a flow-matching architecture using International Phonetic Alphabet representation. Claims comparable quality to billion-scale models like Qwen3-TTS at smaller size. Honestly, I'm flagging it because it's recent and on-topic, but most of the claim is "we matched a bigger model on this specific benchmark." Wait for independent community reproductions before treating the result as solid. arxiv.org/abs/2605.05611

One thing worth noting on the academic side — the paper Mark would actually find useful is "Recent Advances in Speech Language Models: A Survey" (Cui et al., October 2024), which lays out the architectural taxonomy for end-to-end speech LLMs that bypass the ASR-LLM-TTS pipeline entirely. Not new, but it's the right framing for understanding why gpt-realtime, gemini-native-audio, and Qwen3-Omni are all converging on the same architecture pattern this year. hf.co/papers/2410.03751

Mark's stack — what this means concretely

Three actionable threads dropped out of this week's scan. Listed in order of how soon they should land.

This week — check Claude Code voice mode

Mark: type /voice in any Claude Code session tonight. If it works, that's an answer to whether you can dictate code without needing Bidet Quick to be the universal layer. If not, I'll loop back with how to request the rollout. Either way, low cost to find out.

This week — queue Moonshine v2 for the Bidet phone

The Bidet phone runs Moonshine v1 today for on-device STT. v2 hits 50ms latency vs Whisper Tiny's ~290ms (5.8x speedup) at the same accuracy. This is a drop-in replacement, not a rebuild. The Kaggle submission and DEV.to publication are locked, but the next phone build should ship with v2. Add to the backlog as a 30-minute swap once you're not also testing the RAM upgrade.
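The speedup figures in this issue can be sanity-checked by multiplying each quoted latency by its quoted speedup to recover the implied baseline. A small sketch (the function name is illustrative; the inputs are the numbers quoted in this issue, not new measurements):

```python
def baseline_latency_ms(latency_ms: float, speedup: float) -> float:
    """Latency of the slower baseline implied by a measured latency and a speedup factor."""
    return latency_ms * speedup


# Moonshine v2 Tiny: 50 ms at 5.8x faster than Whisper Tiny.
whisper_tiny_ms = baseline_latency_ms(50, 5.8)
# Moonshine v2 Medium: 258 ms at 43.7x faster than Whisper Large v3.
whisper_large_v3_ms = baseline_latency_ms(258, 43.7)

print(f"Implied Whisper Tiny latency: {whisper_tiny_ms:.0f} ms")
print(f"Implied Whisper Large v3 latency: {whisper_large_v3_ms / 1000:.1f} s")
```

The Tiny figure reproduces the ~290 ms quoted above. The Large v3 figure implies roughly 11 seconds, which is why the Medium comparison looks so lopsided.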

After the 64GB lands — A/B LongCat-AudioDiT-1B against Supertonic

The 1B version of LongCat-AudioDiT is the cheap experiment. Run it on Apex once the RAM is in, feed it your corpus from Bidet Quick (the system is already silently building a (text, audio) pair archive), and see if it can hit zero-shot voice clone quality with under 30 seconds of your voice. If yes, the Whisper-mark Tier 3 plan (own voice synthesis) gets a serious shortcut. If no, fall back to XTTSv2 fine-tune as originally planned.
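One way to set up the "under 30 seconds of your voice" test is to pick reference clips from the pair archive until a time budget is used up. The sketch below assumes a hypothetical record layout (text, audio path, duration in seconds); the real Bidet Quick archive format is not documented here, and the greedy longest-first selection is just one reasonable choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical (text, audio) archive record; field names are assumptions."""
    text: str
    audio_path: str
    duration_s: float


def pick_reference_clips(pairs: list[Pair], budget_s: float = 30.0) -> list[Pair]:
    """Greedily pick the longest clips whose total duration stays within the budget."""
    chosen: list[Pair] = []
    total = 0.0
    for pair in sorted(pairs, key=lambda p: p.duration_s, reverse=True):
        if total + pair.duration_s <= budget_s:
            chosen.append(pair)
            total += pair.duration_s
    return chosen


archive = [
    Pair("first note", "clips/a.wav", 12.0),
    Pair("second note", "clips/b.wav", 9.5),
    Pair("third note", "clips/c.wav", 20.0),
    Pair("fourth note", "clips/d.wav", 4.0),
]
reference = pick_reference_clips(archive)
print([p.audio_path for p in reference], sum(p.duration_s for p in reference))
```

Keeping the budget as a parameter makes it easy to rerun the same A/B at 10 or 15 seconds and see where clone quality falls off.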

What's NOT changing. The Whisper-mark Tier 2 LoRA fine-tune plan stays exactly as written — nothing in this week's scan replaces "fine-tune large-v3 on Mark's voice + Mark's vocabulary." faster-whisper is still the right inference layer; the Bidet Quick corpus is still the right training data. None of the new releases obsolete that path.

Carry-forward watch list — what to look for next Sunday

Skip these

Not worth your time this week

Generic "best TTS in 2026" listicles. BentoML, Hyperstack, and ten other content-marketing sites published the same article. They're all six weeks behind the field. Use Hugging Face's trending feed and Pendrokar's TTS Spaces Arena for real comparison.

VoiceSculptor (January). "Instruction-based voice design" via RAG over speaker descriptions. Interesting research, no practical edge over LongCat or TADA. Skip unless natural-language voice design becomes an explicit Bidet feature.

AISHELL6-whisper (Mandarin whisper-mode dataset). Real research, real dataset, zero relevance to your stack.

xAI grok-voice-agent. Hard skip per anti-Elon policy. It exists, it's a closed-source τ-Voice benchmark entry, that's all you need to know.

Anything tagged "AI agents 2026 awesome list." Two of the top GitHub trending repos this week are aggregator listicles. They're SEO bait. Don't click.

Next radar fires

Sunday 2026-05-31 (week 2). Same URL slug pattern: /private/r/2026-05-31-voice-multimodal-radar.html. Source list and methodology documented in project_voice_multimodal_radar_2026-05-22.md in the memory repo for the next agent that runs this. If something major drops mid-week (e.g., a Claude voice consumer release, a WWDC voice surprise), it overrides the cadence.