Local LLM Harness Plan
Author: G16 Claude (research + architecture)
For: Mark Barnett
Date: 2026-04-13
Status: Plan — not yet built. Ends with a 4-hour build target.
0. The short version
- Pick LiteLLM as the harness. One YAML file, OpenAI-compatible endpoint, handles routing and fallback. Everything else plugs into it.
- Apex (RTX 3090, 24 GB) is the local inference host. G16 stays as mission control — its 8 GB laptop 4070 is not where your daily driver lives.
- Gemma 3 27B QAT (int4) is the 2026 sweet spot for your box. Fits in ~14 GB, leaves headroom for embeddings and context.
- Link the two machines with llama.cpp RPC over Tailscale if and only if you want to run something bigger than 27B. For daily work, you won't need to.
- Stitch in Cerebras, Groq, Gemini Flash-Lite, and Cloudflare Workers AI free tiers as "free cloud fallbacks" behind the harness. Anthropic stays as paid premium.
- No, you don't need a bigger computer yet. The 3090 covers every workload you have. Re-evaluate when Gemma 5 / Llama 5 land or when you want 70B-class reasoning locally.
1. Interchangeable harness pattern — the survey
You asked: one interface, many backends, swap per task. Here are the real options and which one wins.
- LiteLLM Proxy — winner. Single YAML. OpenAI-compatible endpoint fronting Ollama, llama.cpp, Groq, Gemini, Anthropic, Cloudflare Workers AI, OpenRouter — 100+ backends. Built-in routing, retries, fallback, spend tracking. One-command Docker install.
- LangChain LLM abstraction — no. Library, not a service. Forces a Python code change for every swap. You're an architect, not a coder.
- Claude Agent SDK — use it as a client, not the harness. It supports custom model providers via an OpenAI-compatible base URL, so Claude Code can point at LiteLLM and LiteLLM routes to Gemma. That's the integration point.
- OpenRouter — use it as one backend inside LiteLLM for niche models, but it's paid and not local.
- aisuite (Andrew Ng) — smaller ecosystem, fewer providers. LiteLLM is a superset.
- llama.cpp server + custom proxy — only as a backend. Write-your-own-proxy is the complexity we're avoiding.
Recommendation: LiteLLM Proxy on Apex, fronting everything. Claude Code on G16 talks to it via OpenAI-compatible base URL. Your CLI tools, scripts, and custom agents all hit the same endpoint.
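For concreteness, a minimal config.yaml in that shape might look like the sketch below. The model identifiers, hostnames, and env-var names are assumptions, not values verified against this setup; check the LiteLLM provider docs for exact prefixes.

```yaml
# Hypothetical LiteLLM proxy config — identifiers are placeholders.
model_list:
  # Local Ollama on Apex (LiteLLM's ollama/ prefix routes to the Ollama API)
  - model_name: gemma-local
    litellm_params:
      model: ollama/gemma3:27b-it-qat
      api_base: http://localhost:11434

  # Cerebras free tier, keyed from the environment
  - model_name: cerebras-llama70b
    litellm_params:
      model: cerebras/llama-3.3-70b
      api_key: os.environ/CEREBRAS_API_KEY

  # Paid premium lane — called explicitly by alias, never via fallback
  - model_name: claude-premium
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  num_retries: 2
```

Claude Code and any OpenAI-compatible client then point at http://apex.tailnet:4000/v1 and request models by alias (gemma-local, claude-premium), so swapping a backend is a YAML edit, not a client change.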
2. Linking G16 + Apex — can the two computers share inference?
Short answer: yes, but you probably shouldn't for daily work.
The options:
- llama.cpp RPC mode — The real deal. One machine is the "primary" (tokenizer + scheduler); the other runs a lightweight rpc-server that exposes its GPU to the primary. The primary splits model layers across both GPUs using --tensor-split. A 2026 AMD demo ran a 1-trillion-parameter model across four nodes this way. Tailscale gives you the low-latency private network for free — no extra networking setup. This is the right answer if you ever want to run a model that doesn't fit on the 3090 alone.
- Ollama distributed — Not natively supported as of April 2026. There are community wrappers (OLOL, Hive), but they're load balancers across separate Ollama instances, not true layer splitting. Skip.
- vLLM distributed — Production-grade tensor parallelism. Heavier install, requires matching CUDA environments, worth it only if you're serving many concurrent users. Not your use case.
- Ray Serve — Overkill. Enterprise orchestration. No.
- exo / distributed-inference frameworks — Young, fragile, moving fast. Revisit in six months.
Honest verdict for your workflow:
The 3090 at 24 GB runs everything you actually use (Gemma 3 27B int4, Qwen 32B int4, Llama 3.3 70B Q3 with offload). The G16 4070 Laptop (8 GB) adds ~33% more VRAM if linked — not enough to change what models you can run, and you'd pay for it in network latency on every token.
What G16 should actually do: stay as the client (Claude Code, CLI, editor) hitting the Apex LiteLLM endpoint over Tailscale. If you're on the patio with just the laptop and no Tailscale, G16 can run a local Ollama with Gemma 3 4B as an offline fallback. That's the right split.
Exception: if you ever want to try Llama 3.3 70B at Q4_K_M (~40 GB), llama.cpp RPC across both boxes gives you ~32 GB combined VRAM + system RAM offload and makes it plausible. Document the recipe, don't run it daily.
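A rough version of that recipe, following the llama.cpp RPC README (the Tailscale hostname, port, and split ratio here are illustrative, not tested on these boxes):

```shell
# On G16 (the secondary): expose the 4070 to the primary.
# Bind to the Tailscale interface only — never a public IP; rpc-server
# has no authentication.
rpc-server --host g16.tailnet --port 50052

# On Apex (the primary): run the model, offloading a slice of layers
# to G16 over RPC. The 3,1 split roughly mirrors the 24 GB vs 8 GB
# VRAM ratio; tune it empirically.
llama-server \
  --model llama-3.3-70b-q4_k_m.gguf \
  --rpc g16.tailnet:50052 \
  --tensor-split 3,1 \
  --ctx-size 8192
```

Every token pays a network round trip for the offloaded layers, which is why this stays a documented recipe rather than a daily configuration.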
3. Free web services — which to wire in
LiteLLM makes each of these a 3-line YAML addition. Worth wiring up:
- Cerebras Cloud — 1M tokens/day, 30 RPM, no card. Llama 3.3 70B, Qwen3 32B/235B, GPT-OSS 120B. Fastest inference on the planet. Your default free fast cloud.
- Groq — free tier with RPM/RPD caps, no card. Llama + Mixtral, ultra-low latency. Secondary fallback.
- Gemini Flash-Lite — 15 RPM, 1,000 req/day, 1M context. Feed it whole documents.
- Gemini 2.5 Flash — 10 RPM, 250 req/day. Multimodal when you need vision.
- Cloudflare Workers AI — 10,000 Neurons/day free. Llama 3.3 70B, Qwen QwQ, DeepSeek-R1-Distill. You already use Cloudflare — no new billing surface.
- HuggingFace Inference — small free credit. Niche models only. Low priority.
- Together AI / Mistral La Plateforme — skip unless needed; LiteLLM can add either in 30 seconds.
- Anthropic — no free API tier. You already pay Claude Max — that's the premium lane.
Recommended routing philosophy:
1. Try local Gemma 3 27B first (free, private, no rate limit).
2. If local is busy or the task needs a bigger brain, fall back to Cerebras free (1M tokens/day is huge).
3. If Cerebras is rate-limited, fall back to Gemini Flash-Lite (1K requests/day).
4. If that's gone, fall back to Cloudflare Workers AI (10K Neurons/day).
5. Paid Anthropic Claude is the "premium lane" — called explicitly, not via fallback, so you never accidentally burn tokens.
LiteLLM does all of this from a single fallbacks: block in the YAML.
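Sketched out, that block might look like this (the alias names are placeholders; the shape follows LiteLLM's router_settings documentation):

```yaml
# Hypothetical fallback cascade — each entry maps a model alias to the
# ordered list LiteLLM tries when that alias fails or is rate-limited.
router_settings:
  fallbacks:
    - gemma-local: ["cerebras-llama70b", "gemini-flash-lite", "cf-workers-ai"]
  # Deliberately no entry for the Claude alias: the premium lane is
  # only ever called explicitly, so it can never fire by accident.
```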
4. Gemma specifically — what's current, what runs
You named Gemma by name. Here's the state as of this week:
- Gemma 4 (released April 2, 2026): sizes are E2B, E4B, 26B MoE, 31B Dense. Native function calling, structured JSON output, system instructions, multimodal (text + audio + vision), Apache 2.0. This is the current flagship.
- Gemma 3 (still widely supported): sizes 1B, 4B, 12B, 27B. QAT int4 builds published by Google. The 27B QAT fits in ~14 GB VRAM with 128K context.
- Comparison: Gemma 3 27B beats Llama 3.1 70B on several reasoning benchmarks at a third the VRAM — strong on instruction-following and writing. Qwen 3 32B is stronger at code and math. Mistral Small 3 (24B) is faster and excellent at structured output. Llama 3.3 70B still leads open-model reasoning but needs Q3 and heavy offload on a 3090. Gemma 4 31B Dense becomes the new baseline once community int4 builds land (expected within weeks).
- Tool calling / MCP: Gemma 4 has native function calling in all sizes. That means it can drive MCP servers the same way Claude and GPT do. For your workflow — where TP3 and Legacy Soil both want agentic tool use — this is the unlock.
- Which size on the 3090:
- Gemma 3 27B QAT int4 → fits, ~14 GB, your daily driver today.
- Gemma 4 31B Dense Q4 → will fit, ~18-20 GB, your daily driver once Ollama builds drop.
- Gemma 4 26B MoE → fits easily, faster than dense, good for batch work like the daily digest.
- Gemma 4 E4B → fits on G16 laptop, your offline patio model.
Recommended local default on Apex: Gemma 3 27B QAT today → Gemma 4 31B Q4 as soon as the Ollama build is stable.
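The "~14 GB" figure above is just weight arithmetic. A back-of-envelope check (ignoring KV cache and activation overhead, which add several more GB at long context):

```python
def weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed for the model weights alone, in GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Gemma 3 27B at int4: 27e9 params * 4 bits ≈ 13.5 GB of weights,
# which lines up with the ~14 GB figure once format overhead is added.
print(round(weight_vram_gb(27, 4), 1))    # → 13.5

# Llama 3.3 70B at Q4_K_M (~4.5 bits/weight effective) ≈ 39.4 GB,
# which is why it needs the two-box RPC recipe or heavy offload.
print(round(weight_vram_gb(70, 4.5), 1))  # → 39.4
```

The same arithmetic explains the Gemma 4 31B Dense Q4 estimate: roughly 31 × 4.5 / 8 ≈ 17 GB of weights, hence the ~18-20 GB figure with overhead.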
5. Budget — do you need a bigger computer?
No. Not yet. Here's the reasoning.
Your actual workloads:
- Classroom brain dumps (voice → text → structured notes) — 4B model is plenty.
- Daily digest (summarize overnight TP3 deltas + Drive activity) — 12-27B.
- Content drafting (Legacy Soil pages, cover letters, blog) — 27-31B.
- Some coding — Claude Max handles anything serious; local Gemma or Qwen 32B for "quick refactor."
None of that needs more than 24 GB VRAM. The 3090 is genuinely sufficient. The only thing it can't do comfortably is Llama 70B-class models at Q4 — and for those, you have Cerebras free tier giving you 70B at 2,000 tokens/sec for free, which is faster than any local 70B rig short of a dual-A100 box.
If you did upgrade, the 2026 sweet spots:
1. Used second 3090 ($700-900) — 48 GB total. Best dollar-per-VRAM. Case/PSU check needed.
2. RTX 5090 (~$2,000) — 32 GB, faster. Incremental.
3. Used A6000 ($3,500-4,500) — 48 GB single card. "Stop thinking about VRAM" option.
4. Mac Studio M4 Ultra ($5,000+) — unified memory runs Llama 70B unquantized. Different category.
5. Wait for Blackwell consumer refresh late 2026.
My recommendation: Current hardware is fine for 6-9 months. Re-evaluate when Gemma 5 / Llama 5 drop and when you've actually hit a wall Cerebras-free can't cover. If you do hit that wall, a second used 3090 in Apex is the best move.
6. Architecture sketch
+-----------------------------------+
| G16 (laptop) |
| |
| Claude Code (mission control) |
| CLI tools / editor / scripts |
| Offline fallback: Ollama + |
| Gemma 4 E4B (patio mode) |
+----------------+------------------+
|
| Tailscale (private, low-latency)
| OpenAI-compatible API calls
v
+--------------------------------------------------+
| APEX (Win11, 3090) |
| |
| +------------------------------------------+ |
| | LiteLLM Proxy (Docker, port 4000) | |
| | - Unified OpenAI-compatible endpoint | |
| | - Routing table (task -> backend) | |
| | - Fallback cascade | |
| | - Spend + token logging | |
| +-----+------+------+------+------+--------+ |
| | | | | | |
| v v v v v |
| +--------+ +----+ +-----+ +-----+ +-------+ |
| | Ollama | | CB | |Gemin| | CF | |Anthro | |
| | local | |free| | free| |WAI | | paid | |
| | Gemma | |1M/d| |1K/d | |10K | | Max | |
| | 27/31B | | | | | |neur | | | |
| +--------+ +----+ +-----+ +-----+ +-------+ |
| |
| (TP3 Neural Stack runs beside it, untouched) |
+--------------------------------------------------+
Routing logic (LiteLLM router block):
- task "brain_dump" → Gemma 3 27B local, fallback Cerebras Llama 3.3 70B
- task "long_context_summary" → Gemini Flash-Lite (1M ctx), fallback Gemma local
- task "daily_digest" → Gemma 4 26B MoE, fallback Cerebras
- task "code" → Qwen 32B local, fallback Claude paid (explicit, not auto)
- task "premium_reasoning" → Claude paid, no fallback (intentional)
- default → Gemma 27B local → Cerebras → Gemini → Cloudflare WAI
Deployment:
- Apex: LiteLLM in Docker, Ollama as a Windows service (already running with your 628K row embedding stack — don't touch it). LiteLLM registers the existing Ollama instance as a backend.
- G16: nothing installed except pointing Claude Code and any CLI tools at http://apex.tailnet:4000/v1 as the OpenAI base URL. One environment variable.
- How Mark talks to it:
1. Claude Code (already set up) — change ANTHROPIC_BASE_URL or use a model alias to route through LiteLLM when you want a local model.
2. A brain CLI wrapper — brain "dump this: ..." hits LiteLLM with your default task routing.
3. MCP bridge — expose LiteLLM as an MCP server so any Claude Desktop / Antigravity session can call it as a tool.
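The Claude Code route in item 1 is just environment plumbing. A sketch, assuming the Tailscale hostname above and a LiteLLM virtual key (both placeholders):

```shell
# Route Claude Code through the LiteLLM proxy instead of api.anthropic.com.
# LiteLLM can serve an Anthropic-compatible /v1/messages passthrough
# alongside the OpenAI-style endpoint.
export ANTHROPIC_BASE_URL="http://apex.tailnet:4000"
export ANTHROPIC_AUTH_TOKEN="sk-local"   # LiteLLM virtual key, not a real Anthropic key

claude   # this session now hits LiteLLM, which routes per the config
```

Unset the two variables (or open a fresh shell) and Claude Code goes straight back to Anthropic, which keeps the premium lane one step away at all times.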
What does NOT change:
- TP3 Neural Stack runs untouched. Postgres + pgvector + MinIO keep doing their thing.
- Your existing Ollama on Apex keeps its model cache and its embedding workload. LiteLLM is a wrapper, not a replacement.
- Your existing Claude Max subscription is unchanged and remains the premium lane.
7. Build this weekend — 4-hour minimum viable
Goal: By Sunday night, Mark can say brain "summarize today's TP3 deltas" from the G16 and get an answer from Gemma 27B on Apex — with Cerebras as a free fallback and Claude as the explicit premium lane.
Hour 1 — LiteLLM on Apex (script, no clicking):
I will write one script for Apex Claude to run. The script:
1. Installs Docker Desktop if missing (skip if present).
2. Writes C:\litellm\config.yaml with: existing Ollama backend (Gemma 3 27B + whatever else is already pulled), Cerebras free, Gemini free, Cloudflare Workers AI, Anthropic paid. Env vars for keys are read from a .env file — I will rotate any keys needed; you won't have to copy-paste anything.
3. Launches the LiteLLM container on port 4000 bound to the Tailscale IP.
4. Registers a Windows scheduled task so it starts with Apex (like the TP3 Mission Control pattern we already use).
Hour 2 — G16 client glue:
1. brain CLI: one bash script in ~/bin/brain that posts to http://apex.tailnet:4000/v1/chat/completions with task routing.
2. Claude Code: set env var so a /local slash-command or a --local flag routes through LiteLLM to Gemma instead of Anthropic when I explicitly ask.
3. Test round-trip from the patio: laptop → Tailscale → Apex → Gemma → back.
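A minimal sketch of that brain script (the hostname, default alias, and env-var names are assumptions carried over from the plan above, not a finished tool):

```shell
#!/usr/bin/env bash
# brain — send a prompt to the LiteLLM endpoint on Apex.
# Usage: brain "summarize today's TP3 deltas"
set -euo pipefail

ENDPOINT="${BRAIN_ENDPOINT:-http://apex.tailnet:4000/v1/chat/completions}"
MODEL="${BRAIN_MODEL:-gemma-local}"   # a LiteLLM alias, not a raw model name
PROMPT="$*"

# jq -n builds the JSON payload safely (no shell-quoting bugs),
# curl posts it, and the final jq extracts just the reply text.
jq -n --arg model "$MODEL" --arg prompt "$PROMPT" \
  '{model: $model, messages: [{role: "user", content: $prompt}]}' \
| curl -sS "$ENDPOINT" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${LITELLM_API_KEY:-sk-local}" \
    -d @- \
| jq -r '.choices[0].message.content'
```

Drop it in ~/bin/brain, chmod +x, and the routing defaults can be overridden per call with BRAIN_MODEL=claude-premium for the explicit premium lane.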
Hour 3 — Pull Gemma 4 and benchmark:
1. On Apex, ollama pull gemma3:27b-it-qat (current) and attempt ollama pull gemma4:31b-it if the build has landed this week.
2. Run a three-prompt benchmark: brain dump summary, daily digest against TP3 row counts, Legacy Soil paragraph rewrite. Capture tokens/sec and subjective quality.
3. Commit results to the AI_Library as Gemma_Benchmark_2026-04-13.md so we have a baseline.
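For the tokens/sec capture in step 2, a sketch of the helper logic, with the throughput math kept separate from the network call (endpoint and alias are assumptions; the response shape follows the standard OpenAI chat-completions usage block):

```python
import json
import time
import urllib.request

def tokens_per_sec(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput from the 'usage' block of a chat completion response."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench(endpoint: str, model: str, prompt: str) -> float:
    """POST one prompt to an OpenAI-compatible endpoint, return tokens/sec."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_sec(usage["completion_tokens"], time.monotonic() - start)

# e.g. bench("http://apex.tailnet:4000/v1/chat/completions",
#            "gemma-local", "Summarize today's TP3 deltas.")
```

Run it once per benchmark prompt and write the three numbers straight into the Gemma_Benchmark_2026-04-13.md baseline.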
Hour 4 — Wire in the free fallbacks and smoke-test them:
1. Verify Cerebras key works (I will issue the curl, you don't touch a browser unless the signup requires it — and if it does, I hand you a prefilled link).
2. Verify Gemini Flash-Lite and Cloudflare Workers AI keys.
3. Force a local failure (stop Ollama), confirm LiteLLM cascades to Cerebras automatically, restart Ollama, confirm it flips back.
4. Add one Slack notification: "LiteLLM fallback fired — used [provider] because [local/primary] failed." So you know when you're drifting off local without checking logs.
Deliverables at the end of the 4 hours:
- Apex: LiteLLM running, auto-starting, surviving reboot.
- G16: brain CLI working, Claude Code optional local routing working.
- Shared Google Doc (via Zapier MCP) with the benchmark numbers.
- One-paragraph status in Slack: "Local harness live. Gemma 27B daily driver. Cerebras + Gemini + Cloudflare as free fallback. Claude Max untouched. Daily cost floor is now $0 for routine work."
What I need from you before I start: nothing. I will draft the scripts, rotate any keys, issue the install commands through Apex Claude, and report back. If a signup form blocks me (Cerebras / Groq sometimes do), I will hand you exactly one link, not a 20-step checklist.
Sources: LiteLLM docs, llama.cpp RPC README, Gemma 3/4 release notes, Cerebras/Groq/Gemini/Cloudflare pricing — verified April 2026.