Supertonic 3 local TTS — listen test

Voice: M1 (default male English) · Quality steps: 8 (medium) · Speed: 1.05x · Sample rate: 44100 Hz mono

Listen to all 5 in order. The point is to evaluate quality on the exact kinds of strings the Tasker Say pipeline handles today: numbers, dates, acronyms (TP3, OMI, HTTP), punctuation, ntfy alert phrasing, "Computer answer" responses. Then tell me if you want me to replace GoogleTTS with Supertonic in the Ray-Bans speak chain (post-Saturday hardware upgrade).

1. ntfy alert (Bidet down style)

01_bidet_alert · 8.77s audio · 756 KB

"Bidet unreachable. HTTP zero zero zero, time 25 seconds. Likely transient — CF tunnel or Whisper job blocking."

2. "Computer" answer-style response (long sentence, multiple names + times)

02_computer_ask_response · 12.85s audio · 1.1 MB

"You have three calendar events today. First one at 10 AM with William about Priority Landscape, then lunch at 12:30 with Kim, and parent-teacher conference at 3 PM."

3. Numbers + dates + dollar amounts (the worst case for most TTS)

03_numbers_dates · 12.35s audio · 1.0 MB

"B and H package arrives April 22, 2026. Order number one one two nine two one eight zero two three. Total: one thousand five dollars and ninety three cents."

4. Acronyms + semicolons + your tech-speak

04_punctuation_acronyms · 10.31s audio · 888 KB

"TP3 ingest stalled. Newest row is 63 minutes old. Watcher tipping at threshold; OMI webhook flaky, omi-api-poll reliable."

5. Short burst (quick-fire alert)

Performance on G16 CPU (no GPU):

Real-time factor (RTF): 0.22 - 0.36x — meaning audio generates in 22-36% of its playback duration. Anything under 1.0 is faster than real-time. 0.22 is genuinely impressive on CPU.
49 seconds of total audio generated in ~12 seconds wall time, not counting the one-time model load (13.7s).
Memory: runs on ONNX Runtime with a modest footprint. After the Saturday upgrade to 64GB RAM, this runs alongside Postgres + ollama + everything else with room to spare.
Disk footprint: model + deps ~200 MB.
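
To make the RTF numbers easy to check, here is a minimal sketch of the arithmetic using the totals quoted above. It assumes the ~12 s figure is synthesis time only, with the 13.7 s model load paid once on top of it. The function and constant names are illustrative and not part of Supertonic or the test script.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute seconds of processing per second of audio (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Totals quoted in this report; the wall time is approximate.
TOTAL_AUDIO_S = 49.0      # total audio across all 5 samples
SYNTHESIS_WALL_S = 12.0   # assumed to exclude model load
MODEL_LOAD_S = 13.7       # one-time cost at startup

# Warm: model already loaded, which is the normal case for a long-running service.
warm_rtf = real_time_factor(SYNTHESIS_WALL_S, TOTAL_AUDIO_S)

# Cold: the very first batch after startup also pays the model load.
cold_rtf = real_time_factor(SYNTHESIS_WALL_S + MODEL_LOAD_S, TOTAL_AUDIO_S)

print(f"warm RTF: {warm_rtf:.2f}")  # 0.24
print(f"cold RTF: {cold_rtf:.2f}")  # 0.52
```

The warm figure lands inside the 0.22 to 0.36 range reported per clip. Even the cold figure stays under 1.0, so under that assumption only the first utterance after a restart would feel slow.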

Verdict you're being asked to make

After listening, which is it?

YES — sounds noticeably better than GoogleTTS → I deploy Supertonic as an HTTP endpoint on Apex post-Saturday hardware, modify Tasker Say to call it. Ray-Bans audio quality steps up across the board (ntfy speak, Computer answers, every spoken alert).
MEH — about the same as GoogleTTS → not worth the integration work right now. Bank the install for future use (Bidet AI standalone offline pipeline still benefits, but not urgent).
NO — sounds worse → drop it, GoogleTTS stays the default.