Project Euphonia + Tab 4 redesign
One-line verdicts
Euphonia integration: partial - the dataset and trained models are not public, but Google ships an Apache-2.0 toolkit on GitHub for collecting personal speech samples and fine-tuning open-source ASR. Euphonia-the-corpus is closed; Euphonia-the-method is open and reproducible.
Tab 4 redesign: split into two clearly-named tabs. The two axes (helping a user understand vs. helping a user be understood) serve different populations with different barriers and different output shapes; collapsing them into one prompt picker hides half the value.
Part 1 - Project Euphonia, deep dive
What it is, and what is currently true (May 2026 reading)
Project Euphonia is Google Research's program to make automatic speech recognition (ASR) work for people whose speech falls outside the narrow distribution that mainstream ASR was trained on - dysarthria (ALS, cerebral palsy, Parkinson's, post-stroke), Down syndrome speech, deaf speech, stuttering, and accent variation. It launched in 2019 and has grown from a corpus-collection effort into a public release of methods and tooling.[1][2]
The published numbers as of the February 2025 Frontiers paper:[2]
- Over 1.5 million utterances from ~3,000 speakers in the proprietary English corpus.
- International expansion in progress with 132 speakers across Spanish (38), Japanese (76), French (14), and Hindi (4).
- Evaluation metric: LATTEScore (a meaning-preservation score backed by Gemini Nano-1, ROC AUC ~0.89 on French and Spanish test sets) rather than raw word error rate. The paper explicitly does not publish a head-to-head WER table against Whisper or USM.
The performance claim worth quoting comes from a separate Google Research paper on personalization with limited data: personalized models reach 62% relative WER improvement for ALS speakers and 35% for accented speech, with absolute WER landing at roughly 10% for mild dysarthria and 20% for more severe cases.[3] A 2025 EUSIPCO paper using TTS-augmented dysarthric speech reduces character error rate from 36-51% down to 7.3% and WER from 65-76% down to 18.3%.[4] The pattern is clear: speaker-adaptive personalization - not a generic better-than-Whisper model - is what makes disordered-speech ASR work.
What is actually shippable for an outside developer
This is the part the headlines miss. The Euphonia dataset and the trained Google models are not publicly downloadable. The official FAQ says so directly: "Project Euphonia's dataset is not available to outside teams or companies."[1] What Google does ship is open-source infrastructure under Apache-2.0 on GitHub:[5]
- project-euphonia-app - a Flutter mobile app for recording paired (audio, transcript) phrases into a Firebase bucket the developer controls.
- Colab notebooks - reference recipes for fine-tuning open-source ASR models on the data the developer collected.
- Cloud Run deployment scaffold - turn the fine-tuned model into a private web service (the pattern is sketched below).
There is no Euphonia REST API, no model on Hugging Face, no quota or pricing page. The path Google explicitly endorses is "use our toolkit, collect your own data, fine-tune your own model." For larger-scale data needs, Google points outside developers to the Speech Accessibility Project at the University of Illinois Urbana-Champaign, which is the academic-consortium dataset Google itself contributes to.[1]
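Of those three pieces, the Cloud Run step is the easiest to picture. The sketch below is not Google's scaffold; it is a minimal illustration of the same pattern, assuming a locally saved fine-tuned checkpoint (the ./whisper-personalized path is a placeholder) and FastAPI plus the Hugging Face transformers pipeline, neither of which the Euphonia toolkit prescribes.

```python
# Minimal sketch of the "fine-tuned model behind a private web service" pattern.
# Not Google's Cloud Run scaffold. Assumes fastapi, python-multipart, and
# transformers are installed, ffmpeg is available for audio decoding, and a
# fine-tuned checkpoint (model + processor files) sits at ./whisper-personalized.
from fastapi import FastAPI, UploadFile
from transformers import pipeline

app = FastAPI()

# Load the personalized checkpoint once at startup.
asr = pipeline("automatic-speech-recognition", model="./whisper-personalized")

@app.post("/transcribe")
async def transcribe(audio: UploadFile):
    # The pipeline accepts raw audio bytes (wav, flac, mp3, ...) directly.
    result = asr(await audio.read())
    return {"text": result["text"]}
```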
What about Project Relate?
Relate is the consumer Android app that sits on top of the Euphonia models. It onboards a user with 500 recorded phrases and builds a personalized model "within days," then offers live captions of the user's speech, a synthesized clearer-voice replay, and Google Assistant integration.[6][7] The hard fact for the contest: "Relate is not accepting new users at the moment" per the official Project Relate site.[6] There is no exposed API, no model export, no SDK. Existing users keep working; outsiders cannot wrap it.
The closest viable alternatives
If the goal is "personalized ASR for non-standard speech inside a shipping app," the realistic 2026 options are:
- Voiceitt 2 - a commercial API specifically for dysarthric speech, accessible through their SDK and ChatGPT integration. The technology is production-ready and has a clinical trial running through 2026.[8] It is a paid service, not open-source.
- Whisper fine-tuning on the user's own samples - the same pattern Mark is already running for the Whisper-mark project. A public Whisper checkpoint (large-v3) plus a few hours of in-domain audio plus a LoRA adapter is the open-source equivalent of what Euphonia does internally (a minimal adapter sketch follows this list).
- WhisperKitAndroid / whisper.cpp on-device - on-device inference is solved; on-device fine-tuning is not (it is still a server-side step), but the personalized adapter can be downloaded and merged at first run.[9]
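The adapter recipe in the first bullet is small enough to sketch. The code below fine-tunes a public Whisper checkpoint with a LoRA adapter using Hugging Face transformers, peft, and datasets; the enrollment.jsonl path, its field names, and every hyperparameter are placeholders rather than the actual Whisper-mark pipeline configuration.

```python
# Sketch: LoRA fine-tuning of a public Whisper checkpoint on a user's own
# recordings. Paths, field names, and hyperparameters are illustrative only.
import torch
from datasets import Audio, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Train a small LoRA adapter on the attention projections instead of all weights.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"]),
)

# Enrollment data: one JSON object per line with "audio" (path) and "text" fields
# (a hypothetical layout; match whatever the collection step actually wrote).
ds = load_dataset("json", data_files="enrollment.jsonl", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"],
        return_tensors="pt").input_features[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)

def collate(batch):
    features = torch.stack([torch.tensor(b["input_features"]) for b in batch])
    padded = processor.tokenizer.pad(
        [{"input_ids": b["labels"]} for b in batch], return_tensors="pt")
    # Padding positions are ignored by the loss.
    labels = padded.input_ids.masked_fill(padded.attention_mask == 0, -100)
    return {"input_features": features, "labels": labels}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="whisper-personalized",
        per_device_train_batch_size=2,
        learning_rate=1e-4,
        num_train_epochs=3,
        remove_unused_columns=False,
        report_to="none",
    ),
    train_dataset=ds,
    data_collator=collate,
)
trainer.train()

# Merge the adapter into the backbone so the result loads like a normal checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-personalized")
processor.save_pretrained("whisper-personalized")
```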
Verdict for the contest submission
bidet-phone cannot truthfully claim "Project Euphonia integration" - there is nothing to integrate. It can truthfully claim alignment with the Euphonia methodology: speaker-adaptive personalization on top of an open ASR backbone, mirroring the pattern Google itself documents. The honest framing for the submission narrative:
"bidet-phone follows the speaker-adaptation pattern documented by Project Euphonia (Tobin et al., 2025) and the Speech Accessibility Project: a public Whisper backbone is fine-tuned on a user's own recorded samples, producing a personal model that matches their voice instead of the population average. The corpus is the user's, not ours."
A 1-2 evening proof of concept is feasible: ship a 10-phrase enrollment screen in the app that mirrors the Euphonia audio-tool flow (Apache-2.0, citable), package the captured WAV+text into the same JSONL format the Colab notebooks consume, and surface a "personalize this device" button that triggers the existing Whisper-mark pipeline. The contest readers can audit the methodology paper and see the pattern matches.
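The packaging half of that proof of concept is only a few lines. Below is a minimal sketch that pairs each recorded WAV with its transcript and writes one JSON line per phrase; the field names, directory layout, and file naming are assumptions and should be adjusted to whatever the Euphonia Colab notebooks actually expect.

```python
# Sketch: package enrollment recordings as a JSONL manifest for fine-tuning.
# The field names ("audio", "text") and the wav/txt pairing are assumptions.
import json
from pathlib import Path

def write_manifest(recordings_dir: str, out_path: str = "enrollment.jsonl") -> int:
    """Pair each recorded WAV with its transcript file and emit one JSON line per phrase."""
    recordings = Path(recordings_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for wav in sorted(recordings.glob("*.wav")):
            transcript = wav.with_suffix(".txt")
            if not transcript.exists():
                continue  # skip phrases the user abandoned or re-recorded
            entry = {
                "audio": str(wav),
                "text": transcript.read_text(encoding="utf-8").strip(),
            }
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    print(f"wrote {write_manifest('enrollment_recordings')} phrase(s) to enrollment.jsonl")
```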
Part 2 - Tab 4 redesign: "understand" vs "be understood"
Mark's framing question:
"Is the output to help themselves understand, or is the output to help them be understood? That's the customizable one. Do I need to rephrase this project? Do I need to have two customizable tabs? One so that you can be understood, and one so that you can understand?"
The research literature confirms these are two distinct accessibility paradigms with different evidence bases, different populations, and different output shapes. A single tab cannot cover both without losing the affordance.
The two-axis taxonomy
| | AXIS A - Output to help the user understand | AXIS B - Output to help the user be understood |
|---|---|---|
| Direction | External text/speech is reformatted FOR the user. | The user's speech is normalized so others can parse it. |
| Populations served | Reading-related: phonological-processing differences (dyslexia), motor-writing differences (dysgraphia), tangent-driven thinking, low-literacy, English-language learners, intellectual disability, post-stroke aphasia (receptive). | Speech-production differences: dysarthria (ALS, CP, Parkinson's, stroke), stuttering, articulation differences, deaf speech, post-stroke aphasia (expressive), heavy accent in a non-native language. |
| Barrier | Cognitive load on decoding, sequencing, or sustaining attention. The information exists; the path to it is steep.[10][11][12] | Listener parse-rate. The intent exists; transmission is degraded by motor, fluency, or articulation noise.[3][4][13] |
| What AI output looks like | Simplification, summarization, syllable-aware rephrasing, structural annotation (bold/highlight/underline), tangent-organizer outlines, plain-language passes. | Cleaned transcript, disfluency-removed paraphrase, "what I think they meant" interlocutor summary, formal-tone rewrite, clinical-grade preserve-disfluencies transcript. |
| Evidence quality | Strong. LARF (n=148) showed 5.87/10 vs 5.51 control on retrieval, p=.034, with strongest gains for severe dyslexia.[12] LaMPost (n=19) found rewrite and subject-line features were popular but accuracy thresholds were not yet met by 2022 LLMs.[10] | Strong. Personalized ASR: 62% relative WER improvement on ALS speech.[3] Aphasia + ChatGPT case work shows more sentences with fewer errors when self-content is paired with AI editing.[14] |
Do they overlap?
Yes - one person can sit on both axes (a dyslexic adult with mild dysarthria; an English-language learner with a stutter; a stroke survivor with both expressive aphasia and reading fatigue). This is exactly the argument against hiding the second axis. If both modes are needed, both need a discoverable home in the UI. A single "customizable" picker forces the user to know which mode they want before they can pick a chip, and the chip labels can't telegraph the mode without becoming long.
Augmentative and Alternative Communication (AAC) literature treats these as separate clinical functions: receptive support vs. expressive support, the latter further split into "communication breakdown repair" vs. "primary expressive output."[13][15] bidet-phone's UI should honor that split.
Preset chips - Axis A (understand)
- Rewrite at roughly an 8th-grade reading level. Replace idioms with literal phrasing. One idea per sentence. Bold dates, names, and numbers. (Backed by LARF annotation results showing strongest gains for severe-dyslexia readers.[12])
- Convert to a short bulleted list with a one-line lead. Keep verbs concrete. Hide subordinate clauses. (For readers whose decoding fatigue spikes on long lines.)
- Group related ideas, surface the through-line, label digressions explicitly as "side note." Preserve all content - reorder, do not delete. (For tangent-driven thinking; mirrors graphic-organizer interventions in the executive-function literature.[16])
- Use the 3,000 most common English words. Avoid phrasal verbs. Define jargon inline on first use. (For English language learners and low-literacy readers.)
- Convert to a story a 7-year-old can follow. Short sentences, concrete nouns, named characters where helpful. (For early readers and intellectual-disability accessibility.)
Preset chips - Axis B (be understood)
- Smooth disfluencies, expand abbreviations, pick the most likely word at hesitation points. Output is what a downstream LLM should parse. Removes filled pauses ("um," "uh"), repetitions, and false starts. (Mirrors Voiceitt's normalization layer.[8])
- Preserve disfluencies, repetitions, and timing. Mark unclear words with [?]. Useful for SLP review, IEP documentation, or self-monitoring. (The opposite default of the AI-clean preset; both are legitimate.)
- Rewrite the cleaned transcript in workplace or academic register. Preserve the speaker's content and intent; upgrade the surface. (For users who want to "sound right" in a written context.)
- Write a one-sentence "what I think they meant" gloss, plus the cleaned transcript below. Designed for a hearing partner who cannot fully parse the original speech in real time. (Mirrors AAC communication-breakdown-repair patterns.[13])
- Treat the speech as draft input for a written document. Add punctuation, capitalize correctly, paragraph appropriately. The user's words; the user's thoughts; cleaner-than-spoken format. (For users who find typing harder than speaking, including stroke survivors using ChatGPT-assisted writing per the aphasia case-report literature.[14])
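Mechanically, each chip is just a named system-prompt preset handed to whichever LLM does the rewriting. A minimal sketch of that mapping for the Axis B chips follows; the dictionary keys, the abbreviated prompt wording, and build_messages are illustrative placeholders, not bidet-phone's shipped prompts.

```python
# Sketch: Axis B preset chips as named system-prompt templates. Keys and
# abbreviated prompt wording are illustrative, not the app's actual prompts.
SPEAK_FOR_ME_PRESETS = {
    "ai-clean": (
        "Smooth disfluencies, expand abbreviations, and pick the most likely word "
        "at hesitation points. Remove filled pauses, repetitions, and false starts."
    ),
    "clinical": (
        "Preserve disfluencies, repetitions, and timing. Mark unclear words with [?]."
    ),
    "formal": (
        "Rewrite the cleaned transcript in a workplace or academic register, "
        "preserving the speaker's content and intent."
    ),
    "interlocutor-summary": (
        "Write a one-sentence 'what I think they meant' gloss, then the cleaned transcript."
    ),
    "letter-perfect": (
        "Treat the speech as draft input for a written document: add punctuation, "
        "capitalize, and paragraph, keeping the user's own words."
    ),
}

def build_messages(chip: str, transcript: str) -> list[dict]:
    """Turn a chip selection plus a raw ASR transcript into a chat-style LLM request."""
    return [
        {"role": "system", "content": SPEAK_FOR_ME_PRESETS[chip]},
        {"role": "user", "content": transcript},
    ]
```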
Naming the two tabs
"Customizable tab" is too generic. After the research, the cleanest pair that does not require disability labels and is intelligible to a stranger reading the contest demo:
- "Read for me" - takes external content (a paragraph, an email, a textbook page, a transcript of someone else talking) and reshapes it for the user. Axis A.
- "Speak for me" - takes the user's own speech and reshapes it for an outside reader, listener, or downstream AI. Axis B.
Alternatives that surfaced and were rejected: "Translate FOR me / FROM me" reads as clever but loads the verb "translate" with cross-language baggage. "Input mode / Output mode" is engineer-speak. "Assist reading / Assist speaking" is fine but slightly clinical. "Read for me / Speak for me" parses on first read and matches what the buttons actually do.
Recommendation - one tab or two?
Two. The research evidence base is independent on each axis, the populations partially overlap but not fully, and the user has to know "is the source me or someone else?" before any preset chip becomes meaningful. A single tab forces that disambiguation into prose explanation that contest judges and first-time users will not read.
Concretely for bidet-phone post-PR-#17:
- Rename Tab 4 ("Customizable") to "Read for me". The four chips already shipped (short-sentences, bullets, tangent-organizer, plain-language) all live here. They are all Axis A. No work is wasted.
- Add Tab 5 "Speak for me". Five chips: AI-clean, clinical, formal, interlocutor-summary, letter-perfect. This is the new surface and the contest narrative hook - it is the half of the accessibility space the rest of the field underserves (a configuration sketch follows this list).
- Default tab on app launch is whichever was used last. New users see "Speak for me" first because that is bidet-phone's differentiator vs. existing reading-support apps.
- Public copy avoids clinical labels. Each chip has a one-line description that tells the user what the output will look like, not what condition it is for.
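For concreteness, here is a minimal sketch of how the two-tab structure and the launch-default rule could be expressed as configuration; the tab ids, chip ids, and settings key are placeholders, not existing bidet-phone code.

```python
# Sketch: two-tab layout plus the "last used, else Speak for me" launch rule.
# Tab ids, chip ids, and the settings key are illustrative placeholders.
TABS = {
    "read_for_me": {   # Axis A: reshape external content for the user
        "label": "Read for me",
        "chips": ["short-sentences", "bullets", "tangent-organizer", "plain-language"],
    },
    "speak_for_me": {  # Axis B: reshape the user's speech for others
        "label": "Speak for me",
        "chips": ["ai-clean", "clinical", "formal", "interlocutor-summary", "letter-perfect"],
    },
}

def launch_tab(settings: dict) -> str:
    """Open the last-used tab; new users land on 'Speak for me' first."""
    return settings.get("last_tab", "speak_for_me")
```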
This split also cleanly aligns with the Euphonia methodology framing in Part 1: "Speak for me" is the surface the personalized-ASR work plugs into. The two parts of this dossier converge on the same UI.
Sources
- [1] Project Euphonia FAQ - Google Research
- [2] Tobin et al. (2025), Project Euphonia: advancing inclusive speech recognition through expanded data collection and evaluation, Frontiers in Language Sciences
- [3] Personalizing ASR for Dysarthric and Accented Speech with Limited Data, Google Research
- [4] Improved Dysarthric Speech to Text Conversion via TTS Personalization, EUSIPCO 2025
- [5] google/project-euphonia-app on GitHub (Apache-2.0)
- [6] Project Relate, Google Research
- [7] Google Blog - A communication tool for people with non-standard speech
- [8] Voiceitt - inclusive voice AI for non-standard speech
- [9] argmaxinc/WhisperKitAndroid - on-device Whisper for Android
- [10] Goodman et al. (2022), LaMPost: AI-assisted Email Writing Prototype for Adults with Dyslexia, ASSETS
- [11] A Customizable AI-Powered Automatic Text Simplification Tool for Supporting In-Situ Text Comprehension, ASSETS 2024
- [12] Let AI Read First (LARF): Enhancing Reading Abilities Through AI Annotation, arXiv 2504.00941
- [13] Augmentative and Alternative Communication, ASHA
- [14] Harnessing AI for aphasia: ChatGPT's role in supporting written expression, Frontiers in Rehabilitation Sciences (2025)
- [15] Augmentative and alternative communication, Wikipedia overview
- [16] Executive Functions and Writing Skills in Children with and without ADHD, PMC8957092