Mark's Reports

plan · 2026-04-14

Local LLM Harness Plan

Author: G16 Claude (research + architecture)

For: Mark Barnett

Date: 2026-04-13

Status: Plan — not yet built. Ends with a 4-hour build target.


0. The short version


1. Interchangeable harness pattern — the survey

You asked: one interface, many backends, swap per task. Here are the real options and which one wins.

Recommendation: LiteLLM Proxy on Apex, fronting everything. Claude Code on G16 talks to it via OpenAI-compatible base URL. Your CLI tools, scripts, and custom agents all hit the same endpoint.
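Concretely, "same endpoint" means every client just swaps its base URL; the call shape never changes. A sketch, assuming the proxy is reachable at apex.tailnet:4000 over Tailscale (the port used in section 6) and defines a "local-default" model alias:

```shell
# Point any OpenAI-compatible client at the proxy instead of a vendor API.
export OPENAI_BASE_URL="http://apex.tailnet:4000/v1"
export OPENAI_API_KEY="sk-litellm"   # LiteLLM virtual key, not a provider key

# Identical request regardless of which backend ultimately answers:
curl -s "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "local-default", "messages": [{"role": "user", "content": "ping"}]}'
```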


2. Linking G16 + Apex — can the two computers share inference?

Short answer: yes, but you probably shouldn't for daily work.

The options:

Honest verdict for your workflow:

The 3090 at 24 GB runs everything you actually use (Gemma 3 27B int4, Qwen 32B int4, Llama 3.3 70B Q3 with offload). The G16 4070 Laptop (8 GB) adds ~33% more VRAM if linked — not enough to change what models you can run, and you'd pay for it in network latency on every token.

What G16 should actually do: stay as the client (Claude Code, CLI, editor) hitting the Apex LiteLLM endpoint over Tailscale. If you're on the patio with just the laptop and no Tailscale, G16 can run a local Ollama with Gemma 3 4B as an offline fallback. That's the right split.

Exception: if you ever want to try Llama 3.3 70B at Q4_K_M (~40 GB), llama.cpp RPC across both boxes gives you ~32 GB combined VRAM + system RAM offload and makes it plausible. Document the recipe, don't run it daily.
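For the record, the recipe is roughly this. Flag names follow the llama.cpp RPC README and should be verified against the build you compile; hostnames and the model filename are placeholders, and both boxes need a build configured with RPC enabled:

```shell
# On G16 (worker): expose the 4070's 8 GB over the tailnet.
# Requires llama.cpp built with -DGGML_RPC=ON on both machines.
./rpc-server --host 0.0.0.0 --port 50052

# On Apex (driver): split the 70B Q4_K_M across the 3090 + remote 4070,
# spilling the remainder to system RAM.
./llama-server -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --rpc g16.tailnet:50052 \
  -ngl 99   # offload as many layers as the combined VRAM will take
```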


3. Free web services — which to wire in

LiteLLM makes each of these a 3-line YAML addition. Worth wiring up:

Recommended routing philosophy:

1. Try local Gemma 3 27B first (free, private, no rate limit).

2. If local is busy or the task needs a bigger brain, fall back to Cerebras free (1M tokens/day is huge).

3. If Cerebras is rate-limited, fall back to Gemini Flash-Lite (1K requests/day).

4. If that's gone, fall back to Cloudflare Workers AI (10K Neurons/day).

5. Paid Anthropic Claude is the "premium lane" — called explicitly, not via fallback, so you never accidentally burn tokens.

LiteLLM does all of this with a single fallbacks: block in the YAML config.
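The cascade in steps 1-4 maps onto that config roughly as follows. The alias names (local-gemma, cerebras-free, and so on) are illustrative, not real; each would have to match a model_list entry, and the exact key layout should be checked against the current LiteLLM docs:

```yaml
# Fallback cascade from section 3. Every alias here is a placeholder that
# must be defined in the same config's model_list.
router_settings:
  fallbacks:
    - local-gemma: [cerebras-free, gemini-flash-lite, cf-workers-ai]
# The paid Claude lane is deliberately NOT in the chain: it is called
# explicitly, never reached by fallback.
```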


4. Gemma specifically — what's current, what runs

You singled out Gemma specifically. Here's the state as of this week:

Recommended local default on Apex: Gemma 3 27B QAT today → Gemma 4 31B Q4 as soon as the Ollama build is stable.


5. Budget — do you need a bigger computer?

No. Not yet. Here's the reasoning.

Your actual workloads:

None of that needs more than 24 GB VRAM. The 3090 is genuinely sufficient. The only thing it can't do comfortably is Llama 70B-class models at Q4 — and for those, you have Cerebras free tier giving you 70B at 2,000 tokens/sec for free, which is faster than any local 70B rig short of a dual-A100 box.

If you did upgrade, the 2026 sweet spots:

1. Used second 3090 ($700-900) — 48 GB total. Best dollar-per-VRAM. Case/PSU check needed.

2. RTX 5090 (~$2,000) — 32 GB, faster. Incremental.

3. Used A6000 ($3,500-4,500) — 48 GB single card. "Stop thinking about VRAM" option.

4. Mac Studio M4 Ultra ($5,000+) — unified memory runs Llama 70B unquantized. Different category.

5. Wait for Blackwell consumer refresh late 2026.

My recommendation: Current hardware is fine for 6-9 months. Re-evaluate when Gemma 5 / Llama 5 drop and when you've actually hit a wall Cerebras-free can't cover. If you do hit that wall, a second used 3090 in Apex is the best move.


6. Architecture sketch


                  +-----------------------------------+
                  |          G16 (laptop)             |
                  |                                   |
                  |  Claude Code (mission control)    |
                  |  CLI tools / editor / scripts     |
                  |  Offline fallback: Ollama +       |
                  |      Gemma 3 4B (patio mode)      |
                  +----------------+------------------+
                                   |
                                   |  Tailscale (private, low-latency)
                                   |  OpenAI-compatible API calls
                                   v
         +--------------------------------------------------+
         |                  APEX (Win11, 3090)              |
         |                                                  |
         |   +------------------------------------------+   |
         |   |   LiteLLM Proxy  (Docker, port 4000)     |   |
         |   |   - Unified OpenAI-compatible endpoint   |   |
         |   |   - Routing table (task -> backend)      |   |
         |   |   - Fallback cascade                     |   |
         |   |   - Spend + token logging                |   |
         |   +-----+------+------+------+------+--------+   |
         |         |      |      |      |      |            |
         |         v      v      v      v      v            |
         |   +--------+ +----+ +-----+ +-----+ +-------+    |
         |   | Ollama | | CB | |Gemin| | CF  | |Anthro |    |
         |   | local  | |free| | free| |WAI  | | paid  |    |
         |   | Gemma  | |1M/d| |1K/d | |10K  | | Max   |    |
         |   | 27/31B | |    | |     | |neur | |       |    |
         |   +--------+ +----+ +-----+ +-----+ +-------+    |
         |                                                  |
         |   (TP3 Neural Stack runs beside it, untouched)   |
         +--------------------------------------------------+

Routing logic (LiteLLM router block):
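A plausible shape for that block, using LiteLLM's provider/model prefix convention and its os.environ key references. The alias names, the api_base, and the model tags are assumptions to be pinned down at build time:

```yaml
model_list:
  - model_name: local-default            # brain dumps, digests, rewrites
    litellm_params:
      model: ollama/gemma3:27b-it-qat
      api_base: http://localhost:11434
  - model_name: big-brain                # 70B-class reasoning, free lane
    litellm_params:
      model: cerebras/llama-3.3-70b
      api_key: os.environ/CEREBRAS_API_KEY
  - model_name: claude-premium           # explicit-only, never a fallback target
    litellm_params:
      model: anthropic/<current-claude-model>   # fill in at build time
      api_key: os.environ/ANTHROPIC_API_KEY
```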

Deployment:

1. Claude Code (already set up) — change ANTHROPIC_BASE_URL or use a model alias to route through LiteLLM when you want a local model.

2. A brain CLI wrapper — brain "dump this: ..." hits LiteLLM with your default task routing.

3. MCP bridge — expose LiteLLM as an MCP server so any Claude Desktop / Antigravity session can call it as a tool.

What does NOT change:


7. Build this weekend — 4-hour minimum viable

Goal: By Sunday night, Mark can say brain "summarize today's TP3 deltas" from the G16 and get an answer from Gemma 27B on Apex — with Cerebras as a free fallback and Claude as the explicit premium lane.

Hour 1 — LiteLLM on Apex (script, no clicking):

I will write one script for Apex Claude to run. The script:

1. Installs Docker Desktop if missing (skip if present).

2. Writes C:\litellm\config.yaml with: existing Ollama backend (Gemma 3 27B + whatever else is already pulled), Cerebras free, Gemini free, Cloudflare Workers AI, Anthropic paid. Env vars for keys read from a .env file — I will rotate any keys needed, you won't copy-paste.

3. Launches the LiteLLM container on port 4000 bound to the Tailscale IP.

4. Registers a Windows scheduled task so it starts with Apex (like the TP3 Mission Control pattern we already use).
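The launch in step 3 would look something like this (Windows cmd syntax with ^ continuations; the Tailscale IP is a placeholder, and the image tag should be checked against LiteLLM's current releases):

```shell
REM Run once from C:\litellm. Binding to the Tailscale IP keeps the proxy
REM reachable from G16 but invisible to the LAN at large.
docker run -d --name litellm --restart unless-stopped ^
  -v C:\litellm\config.yaml:/app/config.yaml ^
  --env-file C:\litellm\.env ^
  -p 100.x.y.z:4000:4000 ^
  ghcr.io/berriai/litellm:main-latest ^
  --config /app/config.yaml --port 4000
```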

Hour 2 — G16 client glue:

1. brain CLI: one bash script in ~/bin/brain that posts to http://apex.tailnet:4000/v1/chat/completions with task routing.

2. Claude Code: set env var so a /local slash-command or a --local flag routes through LiteLLM to Gemma instead of Anthropic when I explicitly ask.

3. Test round-trip from the patio: laptop → Tailscale → Apex → Gemma → back.
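A minimal sketch of the brain wrapper from step 1. The endpoint, the "local-default" alias, and the jq dependency are all assumptions; the installed script would end with a final line calling brain "$@":

```shell
#!/usr/bin/env bash
# ~/bin/brain — send a prompt to the LiteLLM proxy on Apex.
# Assumptions: apex.tailnet resolves over Tailscale, "local-default" is a
# model alias in the LiteLLM config, and jq is installed.
BRAIN_URL="${BRAIN_URL:-http://apex.tailnet:4000/v1/chat/completions}"
BRAIN_MODEL="${BRAIN_MODEL:-local-default}"

# Build the OpenAI-style request body; jq handles quoting and escaping.
build_payload() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

brain() {
  build_payload "$BRAIN_MODEL" "$*" \
    | curl -s "$BRAIN_URL" -H "Content-Type: application/json" -d @- \
    | jq -r '.choices[0].message.content'
}

# Installed as ~/bin/brain with a final line:  brain "$@"
```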

Hour 3 — Pull Gemma 4 and benchmark:

1. On Apex, ollama pull gemma3:27b-it-qat (current) and attempt ollama pull gemma4:31b-it if the build has landed this week.

2. Run a three-prompt benchmark: brain dump summary, daily digest against TP3 row counts, Legacy Soil paragraph rewrite. Capture tokens/sec and subjective quality.

3. Commit results to the AI_Library as Gemma_Benchmark_2026-04-13.md so we have a baseline.
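For the tokens/sec capture in step 2, a small helper that scrapes the timing stats ollama run --verbose prints after the response (the "eval rate" line format is an assumption to verify against the installed build):

```shell
# Pull generation speed out of ollama's --verbose stats.
extract_tps() {
  grep -E '^eval rate:' | awk '{print $3}'
}

# Usage on Apex (the stats go to stderr, hence 2>&1):
#   ollama run gemma3:27b-it-qat "Summarize today's TP3 deltas" --verbose 2>&1 | extract_tps
```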

Hour 4 — Wire in the free fallbacks and smoke-test them:

1. Verify Cerebras key works (I will issue the curl, you don't touch a browser unless the signup requires it — and if it does, I hand you a prefilled link).

2. Verify Gemini Flash-Lite and Cloudflare Workers AI keys.

3. Force a local failure (stop Ollama), confirm LiteLLM cascades to Cerebras automatically, restart Ollama, confirm it flips back.

4. Add one Slack notification: "LiteLLM fallback fired — used [provider] because [local/primary] failed." That way you know when you're drifting off local without checking logs.
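The checks in steps 1 and 3 reduce to a few curls. The Cerebras /v1/models route is assumed from its OpenAI-compatible API; endpoint and alias names are the same placeholders used elsewhere in this plan:

```shell
# Key sanity check, no browser needed:
curl -s https://api.cerebras.ai/v1/models -H "Authorization: Bearer $CEREBRAS_API_KEY"

# Force the cascade: stop Ollama on Apex (tray icon or taskkill), then ask
# for the local alias and watch LiteLLM's logs show the fallback firing:
curl -s http://apex.tailnet:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-default", "messages": [{"role": "user", "content": "which backend answered?"}]}'
```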

Deliverables at the end of the 4 hours:

What I need from you before I start: nothing. I will draft the scripts, rotate any keys, issue the install commands through Apex Claude, and report back. If a signup form blocks me (Cerebras / Groq sometimes do), I will hand you exactly one link, not a 20-step checklist.


Sources: LiteLLM docs, llama.cpp RPC README, Gemma 3/4 release notes, Cerebras/Groq/Gemini/Cloudflare pricing — verified April 2026.