Mark's Reports

Gemma 4 E4B audio-mode cold-start on Pixel 8 Pro — research

Date: 2026-05-09. For: Mark Barnett, bidet-phone, Kaggle Gemma 4 Good Hackathon (deadline 2026-05-18).

TL;DR

Three things are true and they compound.

  1. The shared LiteRT-LM Engine in BidetSharedLiteRtEngineProvider.kt is built with backend = Backend.GPU(), but the Pixel 8 Pro's Tensor G3 has no OpenCL — Google's own LiteRT-LM tracker has this filed as Issue #1860, and the Edge Gallery app itself warns Pixel 8 users that "E4B will exceed memory and can crash" (HF discussion #2).
  2. Gemma 4 audio mode on Android is early — not production-ready — and the only public working demo is Google's own AI Edge Gallery "Audio Scribe", which uses E2B (~2.5 GB), not E4B (~3.66 GB).
  3. The 68-second startForegroundDelayMs is the AOSP ActiveServices metric SystemClock.elapsedRealtime() − createRealTime — i.e. the system's measurement of how long the service took to call startForeground() after creation, not engine load time.

Together these point at the synchronous Hilt/inject path on RecordingService colliding with the GPU-init failure inside the shared engine provider, not at the model file size.

Recommendation, lowest risk first: ship a dual-flavor APK: Whisper (default) + Gemma audio (experimental toggle). The Cactus prize description explicitly rewards routing between local + cloud / multiple models — Whisper-tiny → Gemma-text routing fits that frame at least as well as single-model audio-in. Then file two upstream issues (LiteRT-LM, HF discussion) so the bug report itself becomes part of the contest narrative.


The problem in plain English

When you tap Record in the gemma flavor:

  1. Android starts RecordingService (a foreground service for the microphone).
  2. Hilt has to inject the BidetSharedLiteRtEngineProvider and a few other things.
  3. That provider, the first time it's acquire()-d, builds a 3.66 GB LiteRT-LM Engine configured for Backend.GPU() + audio CPU encoder.
  4. On Pixel 8 Pro the Backend.GPU() constructor silently succeeds — but the backend is unusable, because Tensor G3 exposes no OpenCL library. So the engine sits in initialize() doing setup work that is going nowhere, on whatever thread the first acquire() was called from.
  5. After ~68 seconds with no startForeground() call reaching the system, Android's ActiveServices watchdog logs startForegroundDelayMs:68453 and kills the process.

Mark's existing code already has the placeholder-notification trick (startForegroundWithStartingPlaceholder() is called before the engine work, on the onStartCommand thread), so the placeholder should register inside Android's 5-second window. That it doesn't suggests the failure happens earlier than performAsyncStartup — most likely the Hilt entry point is doing something blocking, or the placeholder notification path itself is delayed by a competing main-thread caller (e.g. the Compose Activity tearing down to the background while a bind is still pending).
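If the blocking really is on the injection path, one low-effort mitigation is to make the provider's expensive work lazy, so construction during injection is cheap and the cost moves to first use on a thread of our choosing. A pure-Kotlin sketch of the shape (class and field names are hypothetical, not Mark's actual code):

```kotlin
// Hypothetical sketch: construction is cheap; the expensive engine build
// happens only on the first read of `engine`, on whichever thread reads it.
class LazyEngineProvider(private val buildEngine: () -> Any) {
    var built = false
        private set

    // `lazy` defers buildEngine() until the property is first accessed.
    val engine: Any by lazy {
        built = true
        buildEngine()
    }
}
```

With Hilt specifically, injecting `dagger.Lazy<BidetSharedLiteRtEngineProvider>` into the service achieves the same deferral without modifying the provider itself.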


What we know is true (verified)

1. Backend.GPU() on Pixel 8 Pro does not work — confirmed open bug

"On Pixel 8 Pro (Tensor G3), constructing Backend.GPU() succeeds without throwing any exception. However, when inference is attempted, it crashes with: Can not find OpenCL library. The Tensor G3 chip does not expose OpenCL, so GPU inference is not available on this device. The current behaviour is misleading: construction appears to succeed, which gives the caller no indication that GPU is unavailable, and the error only surfaces at runtime during inference."

LiteRT-LM Issue #1860, opened 2026-04-14, still open.

The reporter recommends manually catching and falling back to Backend.CPU(), or maintaining a chip allowlist. There is no SDK API to detect GPU availability before committing.

Mark's current config (BidetSharedLiteRtEngineProvider.kt lines ~228):

return EngineConfig(
    modelPath = modelFile.absolutePath,
    backend = Backend.GPU(),   // ← will silently fail on Tensor G3
    visionBackend = null,
    audioBackend = if (requireAudio) Backend.CPU() else null,
    maxNumTokens = maxNumTokens,
    cacheDir = cacheDir,
)

This is the single most likely root cause. Backend.CPU() on Tensor G3 is the only path that actually runs end-to-end.

2. Gemma 4 E4B does not fit reliably on Pixel 8 — confirmed by Google itself

"I downloaded the E4B model for use with Edge Gallery on a Pixel 8. It will crash the app immediately when attempting to benchmark or chat... E2B works. I also missed the edge gallery update. It even notified me that E4B will exceed memory and can crash."

HF discussion #2 on litert-community/gemma-4-E4B-it-litert-lm, April 2026.

A second user (HF discussion #10) reports an identical crash on a Snapdragon 8+ Gen 1 device (12 GB RAM / 512 GB storage). The LiteRT Community recommended fix is "update to AI Edge Gallery 1.011" and switch to E2B.

E4B is ~3.66 GB on disk; Pixel 8 Pro has 12 GB system RAM. Resident weights + KV cache + audio encoder + Android system processes leaves very little headroom. Gallery itself gates E4B behind a memory-warning dialog for this exact reason.

3. The 68-second number is from ActiveServices.java, and it measures service-create-to-startForeground

From AOSP frameworks/base/services/core/java/com/android/server/am/ActiveServices.java:

final long delayMs = SystemClock.elapsedRealtime() - r.createRealTime;
if (delayMs > mAm.mConstants.mFgsStartForegroundTimeoutMs) {
    ...
    final String temp = "startForegroundDelayMs:" + delayMs;

Source: ActiveServices.java line ~2495.

This is logged when a service was created but startForeground() was not called within mFgsStartForegroundTimeoutMs (default ~10s on most builds). The 68453ms means the service object existed for 68 seconds before startForeground() finally landed (or never did). The r.mStartForegroundCount == 0 branch fires only on the startService() (not startForegroundService()) launch path — so check whether the launch site is calling Context.startForegroundService(intent) vs plain startService(intent).
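The watchdog arithmetic is simple enough to mirror. This pure-Kotlin paraphrase of the AOSP check (not the actual framework code) makes the semantics concrete — the clock starts at service creation, not at engine-load start:

```kotlin
// Paraphrase of the ActiveServices check: delay is measured from service
// creation (createRealTime) to the moment startForeground() lands.
fun startForegroundDelayMs(createRealTime: Long, nowRealTime: Long): Long =
    nowRealTime - createRealTime

// Default window is ~10s on most builds (mFgsStartForegroundTimeoutMs).
fun exceedsFgsTimeout(delayMs: Long, fgsStartForegroundTimeoutMs: Long = 10_000): Boolean =
    delayMs > fgsStartForegroundTimeoutMs
```

Mark's observed 68453 ms is nearly seven times the default window — this was not a marginal miss.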

4. Engine init time guidance is "up to 10 seconds — use a background thread"

Google's own Android getting-started doc:

"The engine.initialize() method can take a significant amount of time (e.g., up to 10 seconds) to load the model. It is strongly recommended to call this on a background thread or coroutine to avoid blocking the UI thread."

10 seconds is for the text-only models on a healthy backend. Multimodal (audio + vision) compounds that. There is no initializeAsync() or prewarm() API surface in the public Kotlin Engine class (engine.h on main) — you manage threading yourself.

There is a cacheDir parameter that "can improve 2nd load time" by caching compiled GPU graphs — but the C-API setter has a known bug (Issue #2152, fixed but not in 0.11) that doesn't propagate cacheDir to vision/audio executors, so multimodal models re-compile their encoder on every engine create. Mark's app sets cacheDir = context.getExternalFilesDir(null)?.absolutePath, which is fine — but on 0.11 the audio-encoder portion of that cache may not be honored.

5. There is a real, related, "field report" with 10 findings — relevant context

Issue #2202 (open, 2026-05-08, no response yet) is a sustained-testing field report from Lee at "UncreatedLabs" running Gemma 4 E2B on Pixel 10a (Tensor G4, Mali GPU). Findings 1, 2, 3, 4, 9 are all relevant; finding #9 directly references "Foreground service start from BroadcastReceiver interacts badly with Conversation lifecycle" — they couldn't get a clean repro but flagged it. Mark's bug is the cleaner, reproducible repro of that.

6. There's a documented audio-encoder lazy-load behaviour

"The audio encoder is offloaded from memory by the Engine as no active Session requires it."

— Multiple sources including LiteRT-LM README.

This means the audio encoder is a sub-asset that gets paged in/out as sessions need it. First use of audio (i.e. Mark's first Record tap) triggers a lazy load of the audio encoder weights — which is additional cost on top of the base Gemma load. This is consistent with the observed 60+ s delay being audio-mode-specific.
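The described page-in/page-out behaviour is essentially session ref-counting over a lazily loaded sub-asset. A minimal pure-Kotlin sketch of that shape (hypothetical illustration, not the LiteRT-LM implementation):

```kotlin
// Hypothetical sketch: the encoder loads on the first session that needs it
// and unloads when the last such session closes.
class LazySubAsset(private val load: () -> Unit, private val unload: () -> Unit) {
    private var refs = 0
    var loaded = false
        private set

    fun acquire() {
        if (refs++ == 0) { load(); loaded = true }
    }

    fun release() {
        check(refs > 0) { "release() without matching acquire()" }
        if (--refs == 0) { unload(); loaded = false }
    }
}
```

The practical implication for Mark: the first Record tap pays this load even if the base engine is already warm, so any pre-warm pass should open (and hold) an audio-capable session, not just build the engine.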

7. Pixel 8 / Pixel 9 GPU on LiteRT-LM has multiple still-open bugs

Pattern: Gemma 4 + GPU + Android is currently a sharp-edged path. CPU is the well-trodden route.

8. The standard cold-start fix in working public Android-Gemma apps is "init in Application.onCreate"

From the most recent published prototype I found (Commencis Android Gemma 3n offline-inference prototype, Medium, April 2026):

"Moving initialization to Application's onCreate removed a long first-use delay... CPU inference consistently beats GPU by about 20%, especially during cold starts."

This is the canonical pattern and Mark's gemma flavor doesn't currently do it.


What others have done (working precedent)

  1. Google's AI Edge Gallery — Audio Scribe screen (source: github.com/google-ai-edge/gallery). Uses MediaPipe-based runner with E2B (not E4B) on Pixel devices. Loads on app start, not record-time. The only first-party working demo. No Hilt singleton service pattern; uses MediaPipe LlmInference.createFromOptions() (their previous-generation API surface) rather than LiteRT-LM directly.
  2. Commencis prototype — Galaxy S24/S22, CPU backend, init in Application.onCreate, MediaPipe API. Source is the Medium write-up above (no public GitHub).
  3. Edge Gallery Audio Scribe — runs on Pixel 8 Pro per user reports, but the gallery app shows a memory-exceed warning before allowing E4B to load on Pixel 8, and recommends E2B.
  4. No public end-to-end working demo of Gemma 4 E4B audio mode on Tensor G3 / Pixel 8 Pro that I could find. None on GitHub. None on the LiteRT-LM samples directory. The closest precedent is E2B + Pixel 10a in Lee's field-report (Issue #2202), which works on a Mali GPU not present on Pixel 8.

Workarounds, ranked by likelihood-to-land-in-9-days

A. Switch GPU → CPU on Tensor G3 + add a chip allowlist (< 30 min, near-zero risk)

In BidetSharedLiteRtEngineProvider.buildEngineConfig:

val backend: Backend = when {
    // NOTE: verify the exact string on-device — Build.SOC_MODEL (API 31+) may
    // report a codename rather than the marketing name "Tensor G3".
    Build.SOC_MODEL.contains("Tensor G3", ignoreCase = true) -> Backend.CPU()  // No OpenCL on G3
    isOpenClAvailable() -> Backend.GPU()  // isOpenClAvailable() is app code to write; not an SDK API
    else -> Backend.CPU()
}

This single change is the most likely fix for the foreground-service ANR. Per Issue #1860, GPU on Tensor G3 doesn't work; the silent-success-then-fail behavior is exactly what produces multi-tens-of-seconds dead waits.
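There is no SDK call for the `isOpenClAvailable()` check, but a filesystem probe for the OpenCL loader is a common heuristic. A sketch, parameterized so it can be exercised off-device (the listed paths are the usual Android locations, not a guaranteed set):

```kotlin
import java.io.File

// Common locations for the OpenCL loader library on Android devices.
// Heuristic only: presence of the file does not guarantee a working driver.
val DEFAULT_OPENCL_PATHS = listOf(
    "/vendor/lib64/libOpenCL.so",
    "/system/vendor/lib64/libOpenCL.so",
    "/system/lib64/libOpenCL.so",
)

fun isOpenClAvailable(paths: List<String> = DEFAULT_OPENCL_PATHS): Boolean =
    paths.any { File(it).exists() }
```

A stricter variant would attempt `System.loadLibrary("OpenCL")` inside a try/catch, at the cost of actually touching the driver during startup.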

B. Pre-warm the engine in BidetApplication.onCreate on a background thread (2-3h, low risk)

@HiltAndroidApp
class BidetApplication : Application() {
    @Inject lateinit var sharedEngineProvider: BidetSharedLiteRtEngineProvider
    override fun onCreate() {
        super.onCreate()
        // Don't block — kick off the load and let it finish in the background.
        CoroutineScope(SupervisorJob() + Dispatchers.Default).launch {
            try {
                sharedEngineProvider.acquire(requireAudio = true, maxNumTokens = MAX_TOKENS)
            } catch (t: Throwable) {
                // Log; user will see the error when they tap Record.
            }
        }
    }
}

By the time the user finds the Record button, the engine is warm. Trade-off: this commits ~3.6 GB of working set as soon as the app starts, which means the OS may kill bidet-phone in the background more aggressively. Acceptable for a contest demo where the app is the foreground task.

C. Add a "Loading model…" gate to the Record button (1-2h, low risk)

Bind the Compose Record button to sharedEngineProvider.isReady. If the engine isn't warm, show "Loading model… 28%" with a determinate progress (estimated from elapsed time). User can't tap Record until ready. Combine with B, not a substitute. The gate kills the demo crash; B kills the wait time.
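For the determinate "Loading model… 28%" display, a clamped elapsed-over-expected ratio is enough; `expectedMs` would come from a measured cold load on the target device (hypothetical helper, not existing app code):

```kotlin
// Time-based progress estimate. Caps at 99% so it never claims completion;
// the real isReady signal is what flips the UI to done.
fun estimatedLoadPercent(elapsedMs: Long, expectedMs: Long): Int {
    require(expectedMs > 0) { "expectedMs must be positive" }
    return ((elapsedMs * 100) / expectedMs).coerceIn(0L, 99L).toInt()
}
```

Driving this from a Compose `LaunchedEffect` ticking once a second keeps the bar moving without any engine-side progress API (which LiteRT-LM does not currently expose).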

D. Switch from E4B to E2B (few hours, depends on quality acceptance test)

E2B is ~2.5 GB (vs 3.66 GB for E4B) and is the model the working public demo uses. Per HF discussion #2, "E2B works" on Pixel 8, "E4B crashes." It's worse on benchmarks but if the contest demo has to not crash on the judge's device, E2B is the safer pick.

E. Pin LiteRT-LM to 0.10.x or 0.11.0 cleanly (1h)

Lee's field report (#2202) is on 0.10.x. Issue #1850 has a working 0.10.1-from-source patch for Pixel 8 GPU. Issue #2225 (just opened 2026-05-09) reports a new SIGSEGV on 0.11.0 Linux/Vulkan. The 0.11.0 release may itself be the regression — check what version Mark's gradle file actually pins.

F. Run the engine in android:process=":ai" ISOLATED from the recording service (half day, medium risk)

But: Issue #2028 reports SIGSEGV on second createConversation() when LiteRT-LM is in a :ai sub-process on iQOO. Don't do this. Stay in the main process.

G. File the issue upstream + ask (see "Where to ask for help")

If A-D don't solve it in a day, the bug needs to go upstream so it's not just Mark's problem.


The honest fallback: ship Whisper as the default APK

Mark already has a whisper flavor, and the Cactus prize narrative explicitly rewards routing:

"Cactus is a low-latency engine for mobile devices & wearables that runs locally on edge devices with hybrid routing of complex tasks to cloud models like Gemini and Google DeepMind."

(Cactus on YC, search results 2026-05-09)

Whisper-tiny → Gemma-text is two-stage routing on-device, which is the same pattern Cactus rewards (small + specialized → larger + generalist). The single-model audio-in path is sexier but untested in the wild and publicly known to crash on the most likely judge device.
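The two-stage route is a tiny composition; the point for the contest narrative is that each stage is an independently swappable model. A pure-Kotlin sketch with stubbed model calls (all names hypothetical):

```kotlin
// Stage 1: a small specialist turns audio into rough text (Whisper-tiny).
// Stage 2: a larger generalist cleans it up (Gemma text mode, local or cloud-routed).
fun transcribeAndClean(
    audio: ByteArray,
    transcribe: (ByteArray) -> String,
    clean: (String) -> String,
): String = clean(transcribe(audio))
```

Because the stages are plain function parameters, swapping Gemma-local for a cloud model (the Cactus "hybrid routing" story) is a one-argument change.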

Decision frame: do you want to ship a working app that demonstrates routing (E2B + Whisper, dual-flavor), or a slide-deck demo of an unfinished single-model architecture (E4B audio, currently broken)?

Mark's principle from MEMORY.md: "correct over fast, polished over rough." The polished artifact is Whisper-flavor + a Gemma-cleanup tab. The rough artifact is the gemma-flavor as it currently exists.


Where to ask for help

In rough order of "how likely they answer":

1. HuggingFace discussion on the model card (highest signal — Google staff watch this)

URL: https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/discussions/new

The marissaw / chettydetty / linazhao128 maintainers are responsive (1-2 day turnaround per existing threads). HF discussion #2 already establishes this is the expected behaviour on Pixel 8.

2. GitHub issue on google-ai-edge/LiteRT-LM (the canonical bug tracker)

URL: https://github.com/google-ai-edge/LiteRT-LM/issues/new

Use the draft body below. The maintainers (whhone, yuhuichen1015) triage actively; issues do get resolved, though they open faster than they close. Lee's #2202 is sitting unresponded, which means the team is currently bandwidth-constrained for "diagnostic field report" type issues — a tight, single-symptom, single-repro report has a much better chance of getting attention than a 10-finding survey.

3. Google AI Edge Discord (publicly indexed)

The official AI Edge community chat. Slower than GitHub but better for "is this expected? how do other folks handle it?" questions.

4. Kaggle competition forum — Gemma 4 Good Hackathon

URL: https://www.kaggle.com/competitions/gemma-4-good-hackathon/discussion

Appropriate venue. Kaggle competition discussions are explicitly Q&A. If other contestants are hitting the same wall, the maintainers will surface a hot-fix or workaround note. Don't ask "is my app broken" — ask "is anyone else hitting startForegroundDelayMs on Tensor G3 with E4B audio mode?"

5. r/androiddev (low priority — wrong audience for this niche)

Most r/androiddev folks have never heard of LiteRT-LM. Skip.


Draft GitHub issue body

To file at https://github.com/google-ai-edge/LiteRT-LM/issues/new. Copy-paste-ready:

# `Backend.GPU()` + Gemma 4 E4B audio mode on Pixel 8 Pro Tensor G3 → 68s startForegroundDelayMs ANR (Hilt-injected service)

## Environment

- **Device:** Pixel 8 Pro, Google Tensor G3, 12 GB RAM, Android 15
- **LiteRT-LM:** `com.google.ai.edge.litertlm:litertlm-android:0.11.0`
- **Model:** [`litert-community/gemma-4-E4B-it-litert-lm`](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm) (~3.66 GB `.litertlm` on disk)
- **Backend config:** `backend = Backend.GPU()`, `audioBackend = Backend.CPU()`, `maxNumTokens = 8192`
- **App:** [bidet-ai/bidet-phone](https://github.com/bidet-ai/bidet-phone) — fork of `google-ai-edge/gallery`, Hilt-injected `RecordingService` with `foregroundServiceType="microphone"`
- **Architecture:** singleton `BidetSharedLiteRtEngineProvider` (`@Singleton @Inject`) holds the live `Engine`; first acquire is from inside `RecordingService.performAsyncStartup()` on `Dispatchers.Default`

## Symptom

User taps Record → app dies ~60s later, no session row written, no error UI surfaced. Logcat shows:

ActivityManager: ...startForegroundDelayMs:68453


The OS-level mic indicator dot blinks on (capture engine started successfully) and then off when the process dies.

## Reproduction

1. Fresh install gemma-flavor APK on Pixel 8 Pro (Tensor G3).
2. Complete first-run model download (E4B `.litertlm`, 3.66 GB to external files dir).
3. Grant `RECORD_AUDIO` permission.
4. Tap Record from welcome screen (no prior chat-tab interaction → `BidetSharedLiteRtEngineProvider.acquire(requireAudio = true, ...)` is the first call).

Expected: recording UI appears within a few seconds, capture pipeline starts.
Actual: 60+ second wait, then `startForegroundDelayMs` log, process killed by `ActiveServices`. No `INFO`-level log from LiteRT-LM ever lands in logcat — the engine appears stuck in `initialize()`.

## Hypothesis

Per [#1860](https://github.com/google-ai-edge/LiteRT-LM/issues/1860): `Backend.GPU()` constructs without exception on Tensor G3 even though Tensor G3 has no OpenCL. We suspect `Engine.initialize()` is doing OpenCL discovery work that has no fast-fail path on this device, blocking the calling thread for tens of seconds before either crashing or falling back. Combined with our service being created via Hilt + a 3.66 GB model file, total time-to-`startForeground` exceeds the AOSP `mFgsStartForegroundTimeoutMs` window.

## Asks

1. **Fast-fail or graceful CPU fallback in `Backend.GPU()` constructor** when OpenCL is unavailable (related: [#1860](https://github.com/google-ai-edge/LiteRT-LM/issues/1860), still open). Even a "no OpenCL, throwing" log line at construct time would let app code make a one-line fallback decision.
2. **Public guidance on multi-gigabyte audio-mode model load** in a foreground service context — either an `initializeAsync` API that returns a Flow of progress events, or sample code showing the canonical pattern for this on Android 14/15.
3. **Confirmation that Pixel 8 Pro Tensor G3 + E4B audio mode is currently expected to work end-to-end.** If it's not, a note in the model card README would have saved us a week.

Happy to attach the full logcat capture (~80 KB), the `EngineConfig` dump, and the `BidetSharedLiteRtEngineProvider` source if it'd help triage. Also happy to test any patch you'd want validated on Tensor G3 — that's our daily-driver dev device.

cc @whhone @yuhuichen1015 (per assignees on related audio-backend issues)

Sources

Primary — LiteRT-LM upstream

HuggingFace model discussions

Working demos / prototypes

Gallery cross-references

AOSP for the startForegroundDelayMs semantics

Mark's app code referenced

Contest