AI Radar — Multi-Agent Design
Filed 2026-04-13. Architecture blueprint for a daily-running AI/tooling scout that closes the "I didn't know that existed" gap. Born from the chrome-devtools-mcp miss on 2026-04-13.
Why this exists
Mark loses hours because capability gaps in his agents are invisible until something fails. Tonight it was chrome-devtools-mcp: a tool that would have unblocked browser work in thirty seconds had been sitting on GitHub for months, and nobody told him. The cost isn't the tool; it's the compounding miss across a year of those.
AI Radar is a small, killable, budget-bounded multi-agent loop that runs overnight and hands Mark a ranked "try this / skip this" verdict before coffee. Not a general news reader — a filter tuned to his stack, his stamina, and his integration goals.
Prior art (what we're borrowing from)
Surveyed the field before writing. The pattern is well-established:
- GPT-Researcher (assafelovic/gpt-researcher) — planner → gatherer → summarizer loop, the clearest open-source reference.
- Auto-News (finaldie/auto-news) — multi-source personal aggregator (RSS, Tweets, YouTube, Reddit) with LLM filtering. Closest shape to what Mark wants.
- ai-news-bot (giftedunicorn) — cron + RSS + Claude/DeepSeek + HTML email. Proves the minimal pipeline works.
- OpenClaw Daily Digest (HarrisHan/ai-daily-digest) — 92 Karpathy-curated feeds, concurrent fetch → score → summarize. Zero-dep philosophy worth copying.
- LangGraph supervisor pattern — graph of typed-state nodes with conditional routing. Overkill for MVP but the right end-state if Radar ever grows to five+ agents.
Takeaway: Scanner → Filter → Digest is the standard shape. What makes Radar worth building is step 3 (Sandbox Test) and stack-awareness (cross-referencing candidates against TP3/Bidet/Legacy Soil state). Everyone else stops at "here's a list." Radar goes one step further: "I tried it, here's the one-line install."
The four agents
1. Scanner agent — raw candidate gatherer
Job: pull everything, dedupe, hand off. No scoring, no judgment.
Sources (tiered by trust/cost):
| Tier | Source | Access | Notes |
|---|---|---|---|
| A | Gmail newsletters (AI Daily Brief, Latent Space, Ben's Bites, TLDR AI, Import AI) | Gmail MCP, label AI-Newsletters | Already subscribed. Free. Highest signal-to-noise. |
| A | Anthropic / OpenAI / Google / DeepMind blogs | RSS | No auth. Canonical. |
| A | GitHub releases for Mark's starred + followed repos | gh api user/starred, /repos/:owner/:repo/releases/latest | Filter to repos touched in last 90 days. |
| B | r/LocalLLaMA, r/ClaudeAI, r/MachineLearning top-of-day | Reddit .json endpoints, no auth | Cap at top 15 each. |
| B | HackerNews top + Show HN | Algolia API, free | Filter by keyword whitelist (LLM, MCP, agent, Claude, Ollama, RAG, Whisper, local). |
| B | Product Hunt AI launches | RSS | Low signal but catches launches. |
| C | Mark's podcast show notes (week's episodes) | RSS of subscribed feeds | Transcript grep for tool names. |
| C | Hugging Face trending models + papers | HF MCP (already wired) | Capped to last 24h. |
Dedup: fingerprint sha256(lowercased_title[:80] + canonical_domain). SQLite at ~/.ai_radar/seen.db with 30-day TTL. Cross-source URL collapse: two sources pointing at the same repo merge into one candidate with both citations.
Output: {id, title, url, source, snippet, published_at, raw_text} JSON to ~/.ai_radar/runs/YYYY-MM-DD/candidates.json. No LLM calls here — pure fetch + parse + hash. Token cost: zero.
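A minimal sketch of the dedup step described above. The `seen` table schema and the helper names (`fingerprint`, `open_seen_db`, `is_new`) are illustrative, not an existing implementation:

```python
import hashlib
import sqlite3
import time
from urllib.parse import urlparse

def fingerprint(title: str, url: str) -> str:
    """Dedup key: sha256 of lowercased title prefix + canonical domain."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    return hashlib.sha256((title.lower()[:80] + domain).encode()).hexdigest()

def open_seen_db(path: str = ":memory:") -> sqlite3.Connection:
    """Assumed schema for ~/.ai_radar/seen.db: one row per fingerprint."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY, first_seen REAL)")
    return db

def is_new(db: sqlite3.Connection, fp: str, ttl_days: int = 30) -> bool:
    """Expire rows past the TTL, then insert fp if unseen. True means new candidate."""
    cutoff = time.time() - ttl_days * 86400
    db.execute("DELETE FROM seen WHERE first_seen < ?", (cutoff,))
    if db.execute("SELECT 1 FROM seen WHERE fp = ?", (fp,)).fetchone():
        return False
    db.execute("INSERT INTO seen (fp, first_seen) VALUES (?, ?)", (fp, time.time()))
    return True
```

Because the fingerprint lowercases the title and canonicalizes the domain, the same repo surfacing from two sources collapses to one key, which is what makes the cross-source merge cheap.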
2. Filter + Analyze agent — the one that thinks
Job: score every candidate, keep top N, explain why.
Scoring rubric (0–5 each, weights in parens):
- Stack fit (×3) — does it touch Claude Code, MCP, TP3 (Postgres/pgvector/MinIO/Ollama), Bidet AI (Whisper/Gemini), Legacy Soil (Cloudflare Pages/Workers), or middle-school teaching workflow?
- Integration value (×3) — does it help connect more accounts/devices/services? This is tonight's lesson: a new MCP server beats a new model.
- Effort-to-adopt (×2, inverted) — 5 = one-line install, 0 = full weekend. Mark is a teacher, not a coder.
- Maturity (×1) — released stable = 5, preview = 3, alpha = 1. Not a dealbreaker, just a tag.
- Novelty (×2) — penalizes duplicates of things already installed. Checks against a short YAML at ~/.ai_radar/stack_state.yaml listing Mark's current MCP servers, models, and key tools.
Weighted total out of 55. Below 20 drops off the radar. Above 35 goes to Sandbox. 20–35 goes into "mentioned, not tested."
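The rubric and thresholds above, as a small sketch (the axis keys and bucket names are assumptions; only the weights and cutoffs come from the rubric):

```python
# Weights from the rubric: 5 * (3 + 3 + 2 + 1 + 2) = 55 max.
WEIGHTS = {"stack_fit": 3, "integration": 3, "effort": 2, "maturity": 1, "novelty": 2}

def weighted_score(axes: dict[str, int]) -> int:
    """Each axis is scored 0-5 (effort already inverted so 5 = easy)."""
    return sum(WEIGHTS[k] * axes[k] for k in WEIGHTS)

def route(score: int) -> str:
    """Below 20 drops; 20-35 is mentioned-not-tested; above 35 goes to Sandbox."""
    if score < 20:
        return "drop"
    if score <= 35:
        return "mentioned_not_tested"
    return "sandbox"
```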
Stack-state YAML is the load-bearing file. Hand-curated first, then updated nightly by the Digest agent as Mark adopts things. Keys: mcp_servers, models_local, models_cloud, active_projects, known_gaps (free-text, e.g. "browser automation"). known_gaps is what lets Radar say "this fills your browser automation gap" instead of generic scoring.
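A minimal illustrative shape for stack_state.yaml. Every entry below is a made-up example, not Mark's actual state; only the key names come from the design:

```yaml
# Hand-curated first, then updated nightly by the Digest agent.
mcp_servers:
  - gmail            # example entries only
  - huggingface
models_local:
  - whisper
models_cloud:
  - claude-sonnet
active_projects:
  - TP3
  - Bidet AI
  - Legacy Soil
known_gaps:
  - browser automation   # free-text; also drives gap-driven scanning
```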
LLM call: one batched Sonnet call per ~20 candidates, structured JSON output. Budget: ~15K input / ~3K output tokens per run. Sonnet not Opus — filtering doesn't need the big model.
3. Sandbox Test agent — actually tries the thing
Job: for the top 3 scored candidates, attempt a real install in isolation, report what happened.
Hard safety rules (non-negotiable):
1. All work happens inside docker run --rm containers OR a dedicated scratch dir at ~/.ai_radar/sandbox/<run_id>/. Never in ~/.claude/, ~/projects/, or anything touching production config.
2. Network access: yes (installs need it). Filesystem access: only the sandbox dir. Credential access: none. env -i strips the environment before tool install.
3. Timeout: 5 minutes hard kill per candidate. 15 minutes total for stage 3.
4. If the thing wants an API key or OAuth, test stops with verdict needs_credential. Radar never writes credentials. That rule is from Mark's working rules and it is absolute.
5. No sudo. No global installers: pip install and npm install run inside the scratch dir only (a venv or a local node_modules), never system-wide.
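One way the scratch-dir, stripped-env, and hard-timeout rules could be enforced around each install attempt. This approximates `env -i` by keeping only PATH (so installers still resolve) and redirecting HOME into the sandbox; all names here are illustrative:

```python
import os
import subprocess
import tempfile

def sandbox_run(cmd: list[str], timeout_s: int = 300) -> dict:
    """Run one install/test command in a throwaway dir with a stripped
    environment and a hard kill at timeout_s (rule: 5 min per candidate)."""
    scratch = tempfile.mkdtemp(prefix="ai_radar_")
    # env -i approximation: no credentials, no user config, HOME inside sandbox.
    env = {"PATH": os.environ.get("PATH", ""), "HOME": scratch}
    try:
        proc = subprocess.run(cmd, cwd=scratch, env=env,
                              capture_output=True, text=True, timeout=timeout_s)
        verdict = "worked" if proc.returncode == 0 else "failed"
        stderr_tail = proc.stderr[-1000:]
    except subprocess.TimeoutExpired:
        verdict, stderr_tail = "failed", "hard timeout"
    return {"verdict": verdict, "stderr_tail": stderr_tail}
```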
Three test archetypes:
- MCP server → spin an ephemeral config, attempt a list_tools handshake, capture stdout/stderr, record the install command.
- CLI tool / Python package → python -m venv, install, run --help, capture version.
- Model / weights → don't download full weights (bandwidth + disk). Read the model card via hf hub_repo_details, record size and license, verdict metadata_only.
Output: {verdict: worked|failed|needs_credential|metadata_only, install_cmd, stderr_tail, time_seconds, notes} to candidates_tested.json.
Token cost: near-zero. Mostly subprocess execution. Only calls the LLM to summarize stderr on failure (~500 tokens).
4. Recommendation + Digest agent — writes the morning
Job: produce a file that looks and feels like /AI_Library/Daily_Compilations/2026-04-14_morning_digest.md.
Format (fixed sections):
1. ## Radar — top 3 for <date> — ranked, each with: verdict icon, one-line install, why it matters to your stack, estimated adopt time.
2. ## Try this — the single item most worth Mark's weekend time, with the exact command and the exact file it would touch.
3. ## Skip this — one or two items that scored high but Sandbox broke, or that duplicate something installed. Explains why so Mark learns the filter.
4. ## Already have an equivalent — mercy killings. Stops Radar suggesting the same thing twice.
5. ## Mentioned, not tested — the 20–35 score band. Titles + links only, no prose.
6. ## Half-life watch — items aging out of the radar (see creative section).
Write path:
- /AI_Library/Daily_Compilations/<date>_radar.md (parallel to the morning digest, not replacing it)
- Auto-append an entry to /AI_Library/Reports_Site/manifest.json with category: "digest" and a short description. The existing build.py handles the rest.
- Optional Slack post to #morning-digest when SLACK_ENABLED=1.
Token cost: one Sonnet call, ~5K in / ~2K out.
Total run budget (all four agents): ~25K input / ~6K output tokens. At current Sonnet pricing that's under $0.20/day. Weekend-safe.
Creative section (outside the obvious)
1. "Tonight's update breaks your workflow" alerts. Before scoring, Scanner cross-references every GitHub release against stack_state.yaml. If a repo Mark depends on (chrome-devtools-mcp, claude-code, omi-mcp, etc.) ships a breaking change, it jumps to the top of the digest as a red banner, not a normal candidate. This is the inverse of the original problem — not "what did I miss" but "what just broke under me."
2. Self-hosted RSS bridge for API-less sites. Small Cloudflare Worker at radar-rss.thebarnetts.info using RSSHub or FreshRSS for sites without feeds (vendor blogs, small tool authors who publish only on Twitter/X or personal sites). Lets Radar watch Simon Willison's blog, Karpathy's tweets, HF org announcements — all through a single RSS contract. Decouples Scanner from the long tail of weird source formats.
3. Half-life scoring. Every candidate starts with a freshness score of 1.0. Each day it isn't acted on, multiply by 0.7. When it drops below 0.2 it's retired to a "stale" archive and stops appearing in the digest. Prevents the radar from nagging Mark about the same tool for two weeks. The archive is searchable so "whatever happened to that thing you mentioned Tuesday?" still works.
4. Gap-driven scanning. known_gaps in stack_state.yaml isn't just a scoring input — Scanner also runs targeted searches for those gap terms on HN and GitHub. If Mark's gap list says "browser automation," Radar actively hunts for new browser-automation MCPs instead of waiting for one to show up passively. This is the feature that would have caught chrome-devtools-mcp the day it was posted.
5. Adoption feedback loop. When Mark replies to a digest with "installed it" (or runs a radar adopted <id> command), Radar moves that item into stack_state.yaml under the right category automatically. The filter gets smarter over time without Mark hand-editing YAML. Bonus: it can now warn him when that tool ships a breaking release (loop back to idea #1).
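The half-life math from idea 3 is tiny but worth pinning down: with a 0.7 daily decay and a 0.2 floor, an item retires on its fifth un-acted day (0.7^4 ≈ 0.24, 0.7^5 ≈ 0.17). A sketch with illustrative names:

```python
def freshness(days_unacted: int, decay: float = 0.7) -> float:
    """Freshness starts at 1.0 and is multiplied by `decay` each idle day."""
    return decay ** days_unacted

def is_stale(days_unacted: int, floor: float = 0.2) -> bool:
    """Below the floor the item moves to the searchable stale archive."""
    return freshness(days_unacted) < floor
```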
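Idea 4's targeted hunt could build its HN queries like this. The Algolia HN endpoint and its `query`/`tags`/`numericFilters` parameters are real; the helper itself is an illustrative sketch that only constructs URLs (no fetching):

```python
import time
from urllib.parse import urlencode

def gap_queries(known_gaps: list[str]) -> list[str]:
    """One HN Algolia search URL per gap term, limited to the last 24 hours."""
    since = int(time.time()) - 86400
    return [
        "https://hn.algolia.com/api/v1/search?" +
        urlencode({"query": gap, "tags": "story",
                   "numericFilters": f"created_at_i>{since}"})
        for gap in known_gaps
    ]
```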
Deployment plan
Host: Apex, not G16. Apex is always on (G16 sleeps with the lid), already runs the scheduled-task pattern (TP3 Launch At Logon, Apex-TP3-Mission-Control), has Postgres if we ever move seen-items out of SQLite, and keeps G16 clean as mission-control. Radar becomes the third scheduled task alongside the existing two.
Schedule: AI-Radar-Nightly at 03:00 America/New_York — after the TP3 ingestion loop, before Mark wakes, after the US AI-blog firehose has settled.
Manual trigger: radar now shell alias on both G16 and Apex, resolves to Start-ScheduledTask -TaskName "AI-Radar-Nightly" on Apex (per the never-SSH-launch rule). On G16 it's ssh apex 'powershell Start-ScheduledTask ...'. Results appear in Drive within ~10 minutes either way.
Output destinations (in priority order):
1. /AI_Library/Daily_Compilations/<date>_radar.md — always written, always synced via Drive Desktop.
2. manifest.json append → Cloudflare Pages rebuild → public URL per Mark's clickable-links rule.
3. Shared memory MCP: one summary memory per run, type radar_digest, so Apex Claude and G16 Claude see it on next wake.
4. Slack #morning-digest — opt-in via env flag, off by default until tokens are stable.
5. Email — not in MVP. Gmail drafts work but add failure modes.
Failure modes and graceful degradation:
| Failure | Behavior |
|---|---|
| Source timeout (any single source) | Log it, continue. Digest footer lists which sources were skipped. |
| Token budget exceeded mid-run | Skip Sandbox stage, still produce digest from Filter output. Digest header flags "no sandbox this run." |
| Sandbox container fails to start | Top 3 get verdict: untested, digest runs anyway. |
| LLM API 5xx | Retry once with jitter, then fall back to Haiku for Filter stage, then fall back to "no-score dump" (Scanner output + dedup, no ranking). |
| Drive sync offline | Write to ~/.ai_radar/runs/ locally. Next run detects backlog and pushes. |
| stack_state.yaml missing or malformed | Use last-good snapshot from stack_state.yaml.bak. Never hard-crash on Mark's hand-edits. |
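The LLM 5xx row in the table might look like this sketch: retry each model once with jitter before falling through, and return None so the caller emits the no-score dump. The exception type stands in for an API 5xx and the model names are placeholders:

```python
import random
import time

def call_with_fallback(call, models=("sonnet", "haiku")):
    """Try each model in order, retrying each once with jitter on failure.
    Returns None when every model fails (caller falls back to no-score dump)."""
    for model in models:
        for attempt in range(2):
            try:
                return call(model)
            except RuntimeError:  # stand-in for an API 5xx error
                time.sleep(random.uniform(0, 0.1) * attempt)
    return None
```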
Kill switch: ~/.ai_radar/STOP file. If it exists, the scheduled task exits immediately with a Slack notice. Creating that file is the one-line "stop everything" Mark has if Radar goes sideways.
Budget guardrail: Radar checks ~/.ai_radar/token_budget.json at startup. Weekend runs (Fri/Sat/Sun) have a hard cap of 15K input tokens — half the weekday budget — enforced by truncating Scanner output before Filter sees it. Respects the "tokens are scarce on weekends" rule.
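A sketch of the weekend cap check. The 15K weekend figure is stated above; the 30K weekday figure is an assumption derived from "half the weekday budget":

```python
import datetime

WEEKDAY_CAP = 30_000  # assumed: weekend cap is stated to be half of this
WEEKEND_CAP = 15_000  # hard cap on Fri/Sat/Sun per the guardrail

def input_token_cap(today: datetime.date) -> int:
    """Python weekday(): Mon=0 .. Sun=6, so >= 4 covers Fri/Sat/Sun."""
    return WEEKEND_CAP if today.weekday() >= 4 else WEEKDAY_CAP
```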
Credentials: Radar runs under a read-only scoped Anthropic API key separate from Mark's main key, stored in Apex's .env. GitHub access is an unscoped public-read token. Gmail is the existing OMI-authorized read scope. No write scopes anywhere. If a source needs auth Radar doesn't have, it skips the source and logs it — never prompts, never stores a new credential.
What this does NOT do (by design)
- Does not auto-install anything into Mark's real environment. Sandbox is throwaway; adoption is always Mark's explicit call.
- Does not replace the morning digest Claude writes. Radar is a sidecar section/file, not a replacement.
- Does not send outbound messages to humans other than Mark (no tweets, no GitHub issues, no emails).
- Does not run during the day. One shot, overnight. If Mark wants a midday check, that's radar now.
- Does not scan paid sources or anything behind a login wall it doesn't already have read access to.
Minimum viable version we could build this weekend
Scope for Saturday afternoon — roughly 4 hours of delegated Cursor work, Mark reviews, then schedule it.
What's in the MVP:
1. Scanner v0 — three sources only: RSS for Anthropic + Simon Willison + HN Algolia /search?tags=story&query=MCP. Python script, feedparser + requests. Writes candidates.json. No dedup database yet — in-memory per run.
2. Filter v0 — single Sonnet call, rubric from section 2, but only three scoring axes (stack fit, integration value, effort). Hand-written stack_state.yaml seeded with Mark's current MCP list plus known_gaps: [browser automation, audio diarization, daily radar itself].
3. Sandbox v0 — SKIPPED. Too much infra for weekend one. Filter output goes straight to Digest with verdict: untested. We add Sandbox in week 2 once the pipeline is proven.
4. Digest v0 — one Sonnet call producing <date>_radar.md in the existing digest folder, one manifest.json append, done. No Slack, no email.
5. Scheduled task v0 — Apex Task Scheduler entry, 03:00 nightly, wrapped in a .ps1 that writes a heartbeat file so we can tell if it ran.
6. radar now alias — one-liner on G16 that SSHes to Apex and fires the scheduled task.
7. Kill switch — the STOP file check. Non-negotiable even in MVP.
What's explicitly deferred to v1/v2:
- Sandbox agent (v1, next weekend)
- Gmail newsletter ingestion (v1 — needs Gmail MCP auth verification first)
- Half-life scoring (v1)
- GitHub releases diff + breaking-change banner (v1)
- RSS bridge Worker (v2)
- Adoption feedback loop (v2)
- LangGraph migration (v3 if ever — only if we hit the three-agent-complexity wall)
Success criterion for MVP: one morning next week Mark wakes up, opens /AI_Library/Daily_Compilations/<date>_radar.md, and finds at least one item he hadn't seen yet that's worth 30 minutes of his time. That's the whole bar. If it hits it once in the first week, the architecture is validated and we build out v1. If it doesn't, we kill it — no sunk-cost expansions.
First task to hand Cursor: write scanner_v0.py (three sources, writes candidates.json, 80 lines max) and filter_v0.py (one Sonnet call, rubric hardcoded, writes scored.json, 100 lines max). Both to ~/projects/ai-radar/ on G16 first, tested locally, then deployed to Apex via the same Drive-sync pattern the rest of the stack uses.
Belongs in the reports site under category plan. Add to manifest.json when ready.