Fresh Eyes Review — Is TP3 Healthy? Three Agents. One Verdict.
The one-minute version
Your direction is right. The plumbing is wrong. Meta acquired Limitless (the closest competitor) in December 2025. Zuckerberg is running two parallel personal-AI projects, first reported in April 2026. StoryFile (the closest shipped "posthumous AI" product) filed Chapter 11 trying to do what you want. Nobody has solved this cleanly — you're on the frontier, not behind it.
The daily fires are not an architecture error, they're an operator-surface error. ~17 running daemons, ~15 scheduled tasks, 6 places that decide "local or cloud," a health check that returns 200 OK while the queue processor is hung for 40 hours. Every daily fire is the stack demanding attention that a thinner, more honest design would not demand.
You do NOT need to scrap TP3. The data layer (Postgres + pgvector + MinIO, 646,000 embedded rows) is sound and is exactly what the 2026 industry consensus still recommends. What needs replacing is the PowerShell-supervised Python monster on Windows — not the data, not OMI, not the model.
The cost: 5–10 days of AI-agent time (not yours). One focused rebuild sprint of the plumbing. When it's done, target is weeks between fires, not hours.
Where all three agents agreed
Where they disagreed
Three paths forward — and the one I recommend
A. Triage-only (2–3 days)
What: fix the top 3 incidents and stop. Collapse to one embedding model; make /health return numbers everywhere; kill SSE MCP duplicates; delete Redis. Leave everything else.
Tradeoff: Fastest. Probably downgrades daily fires to weekly ones. Doesn't address the underlying "15 scheduled tasks, 17 daemons" sprawl. You'll be back here in 2 months.
B. Rebuild the operator layer, keep the data (RECOMMENDED)
What: 5–10 days of AI-agent time (Cursor/Claude, not you). Move all Python services into Docker Compose with proper healthchecks and restart: always. Replace the custom BaseHTTPRequestHandler ingest worker with FastAPI in a container. One embedding service fronting Ollama (and an opt-in Gemini re-embed job that never runs silently). One table, one embedding dimension. Kill PowerShell supervisor layer entirely. Add hybrid search (BM25 + RRF) while we're in there.
Keep: Postgres (with the 646K rows), MinIO, Cloudflare tunnel + Workers + Pages, OMI integration, Bidet, Oracle.
Tradeoff: One focused sprint of my/Cursor's time. You review the plan, we ship. After: target is weeks between fires, not hours. This is what the post-mortem's "top 5 highest-leverage changes" actually costs in calendar time.
C. Pivot away from wearable capture
What: Drop OMI as primary capture. Rely on Bidet (your brain-dump app) + phone mic + periodic interview sessions. Turn TP3 into a curated-interview corpus more than a passive-capture lake.
Tradeoff: Sacrifices always-on passive capture. BUT — this is ethically and technically the path the "Digital Uncle Mark" endgame needs anyway (per Agent 3's research). You may want to do this eventually; the question is whether now is the right time. My read: not yet. Get the plumbing right first (Option B), then decide if wearable-OMI stays in the mix or becomes one of several capture sources.
My read on options: B. The post-mortem said it plainly — "data layer is right, operator layer is wrong." We have working infrastructure that took months to build. Throwing it out to go clean-slate (Agent 1) is emotionally tempting after a frustrating week, but the 646K rows are load-bearing for the whole Uncle Mark vision. Refactor the plumbing, keep the data.
What I'd do first: if you green-light Option B, I write a migration plan that moves services one-at-a-time into Docker Compose with no downtime. You review it. Then I execute. This gives you a checkpoint before anything risky runs.
Clean-Slate Architect
Brief: "If you had nothing running, what would you build from zero today?" — agent given user context, goal, hardware, constraints, but forbidden from reading any existing code.
1. High-level diagram
PHONE (Pixel / Samsung) - OMI app + Claude mobile
- OMI captures voice all day
- POSTs transcript JSON to capture.thebarnetts.info
|
v HTTPS (Cloudflare Tunnel)
APEX (Windows 11, always-on, Docker Desktop)
Capture API --> Durable Queue --> Embed Worker
(FastAPI, (Postgres (Ollama nomic-
one box) outbox) embed-text-v2,
768-d, GPU)
| |
| writes raw JSONL | upserts
v v
Postgres 16 + pgvector (single DB, ~50GB budget)
- memories(id, text, embedding, meta, ingested_at)
- outbox(id, payload, state, tries, next_try_at)
- health(component, last_ok_at, last_err)
^ ^
| |
Query API <-- Local LLM Watchdog
(FastAPI, (Ollama (every 60s;
same proc) llama-3.1-8B) posts to Slack)
|
v HTTPS
CLOUDFLARE (edge - free tier)
- Tunnel: capture / query routes
- Pages: ask.thebarnetts.info static chat UI
- Access: Mark's Google login only
MARK (dinner table, on a phone) -> types a question
-> Pages UI -> Query API -> Postgres search + local LLM
-> sourced answer with citations
Three running things on Apex, one Postgres, one edge. That is the whole system.
2. Why this shape
One process, one database, one box. Not a microservice mesh. Not a "neural stack" of six containers. A FastAPI app with three routes (/capture, /ask, /health) and a background task for embedding. Every moving part you add is a part that can go silent while the dashboard says "green." Mark has been burned three times by exactly that. The cure is fewer parts.
Async ingest via a durable outbox, not a message broker. OMI webhooks land, get written to a Postgres outbox table in the same transaction as the raw transcript, and return 200 OK in under 100 ms. A background task on the same process drains the outbox. No Redis. No RabbitMQ. No Celery. If Apex reboots, the outbox is still there. If embedding is down, the outbox backs up visibly — and that backup is the alarm. Queue depth is the health signal.
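A sketch of the outbox write, assuming the table shapes from the diagram above (column names are illustrative); the point is that the memory row and its embed job commit in one transaction, so a crash can never leave a transcript without a pending job:

```sql
-- Both rows commit or neither does: the outbox row IS the embed job.
BEGIN;

WITH new_memory AS (
    INSERT INTO memories (text, meta, ingested_at)
    VALUES ('gotta remember Noel wants to try welding', '{}', now())
    RETURNING id
)
INSERT INTO outbox (payload, state, tries, next_try_at)
SELECT jsonb_build_object('memory_id', id), 'pending', 0, now()
FROM new_memory;

COMMIT;

-- Queue depth doubles as the health signal:
SELECT count(*) AS queue_depth FROM outbox WHERE state = 'pending';
```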
Embedding model: Ollama nomic-embed-text-v2 (768-d) on GPU. Nomic v2 is open-weights, runs locally, is Matryoshka-trained (you can truncate to 256-d later if storage matters), and performs within a hair of the big cloud embedders on retrieval benchmarks. Loaded once, kept warm. 768-d means a million memories fits comfortably in Postgres on a consumer SSD. Crucially: no cloud fallback. If Ollama is down, the outbox backs up and the watchdog pings Slack within 5 minutes. We never silently spend money to paper over a local outage.
Vector store: Postgres + pgvector, HNSW index. Not Pinecone, not Qdrant, not Weaviate, not Chroma. Why: you already need Postgres for the outbox, the health table, and auditing. Adding a second storage engine doubles the things that can go out of sync. pgvector with HNSW handles a million rows at <50 ms query latency on a midrange box — well past what Mark will generate in years. One DB. One backup. One thing to restore.
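What that looks like as DDL, using the columns from the diagram above (a sketch; `$1` stands for the query embedding bound as a parameter):

```sql
-- One table, one dimension (768-d), per the design above.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE memories (
    id          bigserial PRIMARY KEY,
    text        text NOT NULL,
    embedding   vector(768),
    meta        jsonb DEFAULT '{}',
    ingested_at timestamptz DEFAULT now()
);

-- HNSW index on cosine distance; serves ~1M rows well under 50 ms.
CREATE INDEX ON memories USING hnsw (embedding vector_cosine_ops);

-- Top-8 nearest memories for a query embedding:
SELECT id, text, ingested_at
FROM memories
ORDER BY embedding <=> $1
LIMIT 8;
```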
Query-side LLM: Ollama llama-3.1-8B-instruct (Q4_K_M) on Apex GPU. Good enough for RAG synthesis over retrieved chunks. Opt-in escalation to Claude API exists as a button in the UI ("Use Claude for this one"), never as automatic fallback. Budget is protected by there being no automatic path to the credit card.
Frontend: Cloudflare Pages static site + Cloudflare Access. A single HTML page with a text box and a results pane. Hosted on Pages (free, no server). Protected by Cloudflare Access tied to Mark's Google login — no passwords, no tokens to rotate, no "did I commit a secret." The UI talks to ask.thebarnetts.info which tunnels to Apex.
Monolith vs microservices: monolith. Two reasons. (a) The operator is an AI agent that reads one FastAPI repo and reasons about it end-to-end; a microservice fleet forces the AI to reason across a distributed system, which it will do badly. (b) Mark has ~14 GB RAM on Apex. Docker Desktop + six services + Postgres + Ollama is how you discover swap thrashing at 2 AM.
3. What I would NOT build
- No silent cloud fallback. Cloud calls are a button, never a code path.
- No separate "ingest worker" container. It's an
asyncio.create_taskinside the same FastAPI process. One thing to restart. - No Kubernetes, no docker-compose with six services, no service mesh. Three containers total.
- No MinIO, no object store. Raw transcripts are JSONL files in a dated folder, mirrored nightly to Drive.
- No separate "shared memory" table, no second embedding dimension. One table, one dimension, forever, until we intentionally migrate.
- No "health status" string. Health is numbers: queue depth, last-success-age, row count delta per hour.
- No cron jobs Mark has to know about. A single supervisor task keeps the app running.
- No MCP gymnastics for Mark's day-to-day queries. MCP is for agents talking to the system. Mark talks to a web page.
- No reranker, no query rewriter, no hybrid BM25+vector on day one. Ship vector-only. Add rerank when retrieval quality demonstrably fails, not before.
4. Happy-path day
7:42 AM, bike commute. OMI captures Mark muttering "gotta remember Noel wants to try welding this summer."
- OMI phone app POSTs to `/capture`.
- Handler chunks the transcript (semantic split, ~400 tokens), inserts rows into `memories` AND `outbox` in one Postgres transaction.
- Returns 200 OK in 40 ms. OMI is done.
- Background task wakes, pulls the outbox batch, calls Ollama `nomic-embed-text-v2`, writes embeddings back to `memories`, marks outbox rows done.
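The "~400 tokens, semantic split" step can be as small as sentence packing. A sketch, not TP3's actual chunker; the whitespace word count is a crude stand-in for real token counting:

```python
import re

def chunk_transcript(text: str, max_tokens: int = 400) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens words.

    Crude stand-in: counts whitespace-separated words, not model tokens.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))   # flush the full chunk
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```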
6:45 PM, dinner. Kim asks "didn't Noel mention something about welding?"
- Mark opens `ask.thebarnetts.info` on his phone. Cloudflare Access passes him through via Google.
- He types: "did Noel say anything about welding this summer?"
- Query embeds with same Nomic model.
- pgvector HNSW returns top-8 nearest memories in ~30 ms.
- Local llama-3.1-8B synthesizes with strict RAG prompt ("cite the timestamp of each snippet; if snippets don't support an answer, say so").
- Answer renders in 2–4 seconds with clickable timestamps: "Yes — on 2026-04-23 at 7:42 AM you said 'gotta remember Noel wants to try welding this summer.'"
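The strict RAG prompt from the synthesis step, sketched; the exact wording is illustrative, only the two rules (cite timestamps, admit gaps) come from the design above:

```python
def build_rag_prompt(question: str, snippets: list[tuple[str, str]]) -> str:
    """Assemble a strict RAG prompt from (timestamp, text) snippet pairs."""
    context = "\n".join(f"[{ts}] {text}" for ts, text in snippets)
    return (
        "Answer using ONLY the snippets below. Cite the timestamp of each "
        "snippet you use. If the snippets do not support an answer, say so.\n\n"
        f"Snippets:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```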
5. Failure modes and their guards
| Failure | Guard |
|---|---|
| Ollama embedding down / cold | Outbox depth climbs. Watchdog: depth > 50 OR oldest pending > 15 min → Slack ping. Query keeps working; capture keeps accepting. Nothing silently spends money. |
| Ollama chat model down at query time | /ask returns clean error: "Local LLM unavailable. Retry, or press 'Use Claude' to send this one query to paid API." Button, not automatic. |
| Postgres down | Capture returns 503. OMI retries. Watchdog pings within 60s. Nothing lost. |
| Disk fills | Nightly check: <20 GB → warning; <5 GB → capture refuses new writes with 507. |
| Apex power loss | Supervisor ("At system startup", not at logon) brings Docker + app back. Outbox durable. Nothing lost. |
| Re-embedding needed | Create memories_v2, backfill, atomically swap pointer, drop old after a week. No mid-flight mixing. |
| "Healthy" but stale | /health returns numbers, not words: last_capture_age_seconds, queue_depth, row_count_last_24h. If OMI hasn't fired in 2 hours during waking hours, alarm. |
| Answer quality regression | Nightly, re-run fixed 20-question eval set with known-good answers. Pass rate drop → Slack. |
6. Ops posture
- Backups: `pg_dump` nightly 3 AM → Google Drive via rclone. 30 dailies + 12 monthlies. Quarterly AI-run restore drill.
- Observability: /health = JSON of numbers. Watchdog hits it every 60s; Slack on state change (not "green every minute"). External Cloudflare Worker cron pings public URL every 5 min — catches tunnel/DNS breaks internal watchdog can't see.
- Daily 7 AM digest to Slack: "rows added last 24h: 312. Oldest unembedded: 4s. Query p50: 180 ms. Backup: OK."
- Upgrades: Single Docker image, tagged by date. Promote via `docker compose up -d`. One-command rollback to previous tag.
7. Bill of materials
| Layer | Pick | Why (one line) |
|---|---|---|
| App framework | FastAPI (Python 3.12) | One process, type hints, async native, AI can reason about it. |
| Database | Postgres 16 + pgvector + HNSW | One engine for data+queue+vectors+audit. |
| Embedding | Ollama nomic-embed-text-v2 (768-d) | Local, open, Matryoshka-friendly, top-tier open retrieval. |
| Generation | Ollama llama-3.1-8B-instruct Q4_K_M | Runs on consumer GPU, good enough for RAG, zero per-token cost. |
| Queue | Postgres outbox table | One storage layer; depth = alarm. |
| Auth | Cloudflare Access + Google SSO | No passwords, no rotation, existing login. |
| Edge / DNS | Cloudflare Tunnel + Pages | Free, durable, no port-forward. |
| Frontend | Single static page (HTML + htmx, ~300 lines) | No build step, no framework churn. |
| Alerts | Slack #tp3-alarms + #tp3-daily | Already locked channel; no Meta/Telegram mixing. |
| Backups | rclone → Google Drive | 5 TB already available. |
| External pinger | Cloudflare Worker cron | Free, off-Apex, catches what internal watchdog can't. |
8. Honest gaps — what this design does NOT do yet
- No long-term memory summarization. Day 90, not day 1.
- No multi-user / family access. "Digital Uncle Mark" for the nephews is a real project, not a flag.
- No voice interface. Text first. Voice is its own feature.
- No non-OMI sources. Gmail, Calendar, Drive — each a separate adapter to the same table.
- No voice-style fine-tune. Retrieval gets accurate sourced answers; it does not get "sounds like Uncle Mark."
- No deep eval beyond 20 questions. Real recall@k on growing gold set is month-three.
- No mobile-native app. A web page works; an app is polish.
- No automatic "forgetting." Consent/retention policy is a product decision, not something to reflexively add.
Closing note
The reason the current system fires daily is not a bug. It is an architectural shape — too many parts, silent fallbacks, status strings instead of numbers, cron jobs nobody watches. The shape above is smaller on purpose. Three processes. One database. One embedding dimension. No automatic path to a bill. Numbers as health. When it breaks, you know it broke before dinner. That is the whole product.
Post-Mortem on What You Built
Brief: "Independent senior engineer, brutally honest. Read the code and failure history. Is this salvageable or rebuild? No softening."
One-paragraph verdict
The vision is right. The implementation has too many hand-built moving parts for a one-person operation, and that is the root cause of the daily fires — not OMI, not Postgres, not any single bug. You built a local-first digital twin on top of Windows + WSL2 + Docker + a SQLite queue + a custom Python webhook server + two pairs of MCP servers + a Cloudflare tunnel + Cloudflare Workers + Ollama + Gemini fallback + half a dozen PowerShell supervisors + a watchdog that watches another watchdog. Each piece individually is reasonable. Added together, on one 13.8 GB box, it is a full-time SRE job. The architecture is salvageable, but only if you ruthlessly cut moving parts and stop trusting "HTTP 200" as a health signal. Keep the data (Postgres + MinIO + the 646K rows). Rebuild the plumbing thinner. Do not scrap TP3 — scrap the current operator surface around it.
1. What's actually wrong (architectural, not tactical)
1a. You have no real health signal. HTTP 200 has been lying.
Until the 2026-04-22 patch, tp3_omi_ingest_worker.py returned status: "running" whenever the HTTP listener thread was up. The queue-processor threads could hang for 40 hours and /health stayed green. That's not a bug; that's a design error. The health check checked whether the web server answered the phone, not whether the business was running.
Evidence beyond /health: tp3_effectiveness_checks.py queried only tp3_memories for weeks while ingest was writing to tp3_memories_local. Reported RED (ingest dead) while ingest was healthy. tp3_heartbeat_check.py queried columns that never existed — every hourly run silently errored with UndefinedColumn. A health layer wrong in the opposite direction is arguably worse — it trains you to ignore alarms.
1b. Fallback complexity exceeds what it's protecting.
TP3_USE_LOCAL_EMBED toggles Ollama vs Gemini. The worker probes Ollama dim at startup. MCP has a separate flag. tool_search_memories_unified queries both tables with two different pipelines and glues results. Bidet and Oracle each have their own "local first, fallback to cloud" logic. Six places make the local-vs-cloud decision, each with its own failure mode.
That is how the $60/month budget burn happened. Ollama died. Bidet fell through to paid Gemini. You didn't know because Ollama's death was invisible to Bidet, and Bidet's "I'm on Gemini now" was invisible to you.
The clean shape: one embedding service fronting both Ollama and Gemini, logging every switch, per-month spend counter baked in. Nothing else makes that decision.
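A sketch of that clean shape (class name, prices, and budget are hypothetical): one decision point, every switch logged, spend metered and capped so the $60 burn cannot repeat silently:

```python
import logging

logger = logging.getLogger("tp3-embed")

class EmbedRouter:
    """Single decision point for local-vs-cloud embedding, with a spend meter."""

    def __init__(self, local, cloud, monthly_budget_usd: float = 10.0,
                 cost_per_call_usd: float = 0.01):
        self.local, self.cloud = local, cloud          # callables: text -> vector
        self.budget = monthly_budget_usd
        self.cost_per_call = cost_per_call_usd
        self.spent_usd = 0.0

    def embed(self, text: str) -> list[float]:
        try:
            return self.local(text)
        except Exception as exc:
            # Every switch is logged and metered -- never silent.
            if self.spent_usd + self.cost_per_call > self.budget:
                raise RuntimeError("cloud embed budget exhausted") from exc
            logger.warning("local embed failed (%s); falling back to cloud", exc)
            self.spent_usd += self.cost_per_call
            return self.cloud(text)
```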
1c. The 768-d vs 3072-d split is doing real architectural damage.
Two embedding tables, two dimensions — sounded smart. In practice: every search fans out to both. MCP twin.memory_write failed silently for days from dim mismatch. tp3_shared_memory was rebuilt as 768-d after confusion. Both health checks shipped querying the wrong table. Classic premature optimization. At 635K rows, Ollama 768-d is fine. Pick one.
1d. Supervisors and watchdogs have become a coping mechanism.
tp3_service_control.ps1 + AGSupervisor (broken, bypassed) + TP3 Launch At Logon + Ollama Watchdog v3 + Freshness Alarm v2 + Stack Ping + Docker Daily Refresh + DriveHealthWatcher + Weekly Backup + GDrive Backfill Resume + Oracle Direct Launch + Shared-Memory Sync + Mark-Facing Auto-Publish + heartbeat check + effectiveness check + daily upcoming + morning digest.
15+ scheduled tasks on one Windows box, several existing because earlier layers are unreliable. The Ollama watchdog exists because Ollama dies. The freshness alarm exists because the worker lies about health. Each new watchdog papers over a crash rather than fixing the crash. Every watchdog is also a thing that can fail.
1e. Windows is not the right host for this shape.
ProactorEventLoop [Errno 22] is a known Windows asyncio bug class. You've had two confirmed multi-day outages from it. PowerShell 5 mojibakes em-dashes. project_tp3_ingest_fixes.md point #3 is literally "stale Windows ProactorEventLoop — restart the worker." That's not a bug you fix; it's a platform choice you made. WSL2 is on the same box. Running services as Docker containers alongside Postgres eliminates this class of failures entirely.
2. Count the moving parts
Daemons (~17): Docker Engine, Postgres, MinIO, Redis (reserved, unused, eating 1 GB RAM), ingest worker :8944, MCP omi stdio, MCP omi SSE :8933, MCP biometric stdio, MCP biometric SSE :8934, Ollama :11434, cloudflared, 3 Cloudflare Workers, status-dashboard, Oracle :8000, Bidet :8955, plus Antigravity/Cursor/Claude MCP clients.
Scheduled tasks (~15+).
External services (~10): Cloudflare, Tailscale (with a split-DNS rule that already broke things), OMI, Gemini, OpenRouter, Google Drive, Samsung Health, ntfy, Slack, GitHub.
Justified? About 7 daemons are legitimate. The other 10+ are duplicates (stdio + SSE of same class), defensive watchdogs, or leftover scaffolding. Roughly half of the moving parts are cargo-cult or Band-Aid.
3. The "restart fixes it" smell
Three major incidents trace to "kill the worker, it comes back clean":
- 2026-04-02 to 04-04: items stuck in `processing` with [Errno 22]
- 2026-04-20 to 04-22: 40-hour silent outage (Ollama cold + retry-forever loop)
- 2026-04-23: 18,500 failures with [Errno 22], 2,871 stuck items
Operationally useful, architecturally damning. What it means:
- Resource leak. Long-running Python on Windows accumulates asyncio-event-loop state. Right fix: replace the long-running worker with a short-lived one, or run on Linux-under-Docker.
- No circuit breaker. Ollama cold → embed timeouts with WinError 10053. Worker caught and retried. Forever. Healthy systems fail loudly after N timeouts in M minutes and flip a /health flag.
- No restart-on-stuck. Linux systemd has `Restart=on-failure` and `WatchdogSec`; Windows Scheduled Tasks can't. You bolted on the freshness alarm — a 30-minute human-polling layer. Right answer is process-level liveness. Do not write more PowerShell.
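The missing circuit breaker, sketched: trip after N failures inside a sliding M-minute window, then flip the health flag instead of retrying forever. Names and defaults are illustrative:

```python
import time
from collections import deque

class CircuitBreaker:
    """Trip after n_failures within window_s seconds; stay tripped until a success."""

    def __init__(self, n_failures: int = 5, window_s: float = 300.0,
                 clock=time.monotonic):
        self.n_failures, self.window_s, self.clock = n_failures, window_s, clock
        self.failures: deque[float] = deque()
        self.tripped = False

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.n_failures:
            self.tripped = True    # /health should now report unhealthy

    def record_success(self) -> None:
        self.failures.clear()
        self.tripped = False
```

When `tripped` is true, the worker exits non-zero (or flips /health) rather than looping on a cold Ollama.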
4. Salvageable or rebuild?
Salvageable. Do not scrap. 646,000 embedded rows took months. Postgres + pgvector + MinIO is the right data layer and does not move.
Cut list (delete, don't replace):
- Redis container (reserved, unused, 1 GB RAM)
- SSE MCP variants (stdio is canonical; SSE is duplicate surface)
- Biometric MCP as separate daemon (fold into one unified MCP)
- The `tp3_memories` / `tp3_memories_local` split (collapse to one)
- AGSupervisor (formally remove; already bypassed)
- `tp3_service_control.ps1` PowerShell supervision (replace with Docker Compose `restart: always`)
- Half the scheduled tasks
Top 5 highest-leverage changes, in order:
1. Move Python services into Docker Compose alongside Postgres and MinIO. `restart: always`, proper healthchecks (`test: curl -f http://localhost:8944/health`); the worker exits non-zero when unhealthy, so Compose restarts it. Kills the "Windows asyncio + PowerShell supervisor + scheduled-task watchdog" stack in one move.
2. Collapse to one embedding model. Ollama 768-d everywhere, with Gemini as an explicit operator-toggled re-embedding job for high-signal rows, not a per-request fallback.
3. One embedding service. Tiny internal HTTP service (`tp3-embed:8080`) fronting Ollama. Every other service calls it. Local-vs-cloud lives in one file. Per-month spend counter baked in.
4. Real health checks. Three signals: queue not stuck, last-ingest within threshold, embed service responding. Worker exits non-zero when unhealthy; Compose restarts it.
5. Kill SSE MCP servers; keep stdio only. OAuth gateway handles remote access. Dual stdio+SSE doubles edit surface without benefit.
Optional #6: Replace the custom BaseHTTPRequestHandler (1,045-line worker) with FastAPI + uvicorn in a container — real middleware, real logging, 200-line worker.
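A hedged Compose fragment for change #1 (build path, database user, and volume name are placeholders). Note that `restart: always` acts on process exit, which is why the worker must exit non-zero when unhealthy; a failing healthcheck alone only marks the container unhealthy:

```yaml
services:
  ingest:
    build: ./ingest            # hypothetical path to the FastAPI worker
    restart: always            # restarts on any non-zero exit
    ports: ["8944:8944"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8944/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    restart: always
    volumes: ["pgdata:/var/lib/postgresql/data"]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U tp3"]
      interval: 30s

volumes:
  pgdata:
```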
5. The observability problem
Why "the numbers don't change" became the trust signal before any dashboard: it's the only signal that cannot lie. /health lied. Slack notifications lied. Effectiveness checks lied. Heartbeat check lied. Dashboards lied. Row count delta is a SQL count(*) against the actual table, so it's the only one you trust. Correct, self-taught observability instinct: tell me the thing the system cannot fake. Honor it.
One page (or ntfy message) every morning with four numbers:
- Row count delta in last 24h.
- Oldest pending queue item age.
- Embed service RTT (p95 ms).
- $ spent on cloud APIs in last 24h.
If any crosses a threshold, page. Everything else is noise.
6. OMI-specific risk
You have bet Layer 1 of your digital twin on OMI hardware and their webhook API. The bet on OMI as primary capture is shaky. Small company. Wearable-hardware startups have high mortality. If OMI disappears tomorrow: you lose always-on passive capture; keep the 635K rows already ingested; keep Bidet and Samsung Health. Survivable but significant quality drop.
The current architecture is too coupled to OMI's webhook format and MCP server.
- Stay on OMI for now — no better-aligned hardware alternative today.
- Abstract the capture interface. Ingest worker should accept a normalized transcript envelope; a thin `omi_adapter.py` translates OMI webhooks into it. Write the adapter while OMI works, so you're not porting under pressure.
- Treat Bidet as your insurance policy. Your app, your code, your pipeline. Invest proportionally.
- Weekly schema-drift test — post a known-good payload, fail loud if shape changes.
Final word
The data architecture is right. The operator architecture is wrong. You're running an SRE-heavy stack with zero SREs. Every daily fire is the stack demanding attention a thinner, more honest design would not demand. Fix is not more watchdogs — it's fewer moving parts, honest health signals, process-level self-recovery. Call it 5–10 days of Cursor/Claude time (not yours) in one focused rebuild sprint. When it's done, TP3 should go weeks between fires, not hours. Stay the course on TP3. Change direction on how it's operated.
What the Outside World Is Doing
Brief: "Pure external research. What has Zuck said? What's OMI's community doing? What do Limitless, Personal.ai, Rewind, open-source projects look like in April 2026? Cite sources."
Top-line
Your wearable-audio digital-twin idea is directionally correct, not a dead end. Meta's December 2025 acquisition of Limitless and Zuckerberg's two parallel 2026 projects confirm the category is real and being chased by the biggest player in social with unlimited money. Nobody is winning cleanly yet. OMI, Limitless, and Plaud all still fight BLE and reliability bugs in their own GitHubs in 2026. Khoj Cloud shut its hosted service April 15, 2026. StoryFile filed Chapter 11. You're not behind a moat; you're on the frontier.
1. What Zuckerberg is actually doing
Two separate Zuck projects, first reported April 2026, often conflated.
Project A — The "CEO Agent" (Wall Street Journal). A personal AI that pulls information for him faster than his org chart can. It's not a twin of him; it's a twin for him. Reaches across Meta's internal files and chat logs. Described as "on-demand information tool," not autonomous decision-maker. Meta employees already have MyClaw (internal files/chat) and Second Brain (Claude-backed "personal chief of staff").
Project B — The Photorealistic Zuck Clone. An AI character that can interact with ~79,000 Meta employees on his behalf, trained on his public statements, mannerisms, tone, strategic thinking. Meta's recent acquisitions (PlayAI for voice, WaveForms) feed the photorealism. Zuckerberg personally codes 5–10 hours a week on it.
Architecture? Almost nothing concrete is public. No papers, no leaked schema. Anyone claiming to know the vector DB or embedding model is making it up. What's known: training corpus is public statements + internal comms; output is real-time conversational + animated avatar; builds on MyClaw + Second Brain, which are Claude-backed.
Sources: WebProNews, Tom's Hardware, The Next Web, Gizmodo, PYMNTS (April 2026).
Takeaway for Mark: The big guys aren't ahead of you on architecture. They're ahead on budget, staff, and voice models. Your idea is the same shape as what Meta is building. Different form factor (wearable vs. keyboard+avatar); same goal. You're directionally correct.
2. The 2026 landscape
OMI / Based Hardware. Open-source AI wearable. nRF5340 SoC, Zephyr RTOS, BLE to phone. Backend: Python/FastAPI + Firestore + Redis + Pinecone + Deepgram/Speechmatics/Soniox STT + LangChain + Silero VAD. 50+ apps in marketplace. Known pain points: BLE disconnects, battery indicator jumping (GitHub issue #2782), CONFIG_OMI_* firmware flags. Changelog shows ongoing BLE stability fixes — reliability is an open problem for them too.
Limitless (formerly Rewind). Meta acquired December 2025. Pendant sales ended, Rewind desktop app sunset. Existing pendants keep working free for a year. Moved from Rewind's strict local-only to "Confidential Cloud" with encryption and "Consent Mode." Model-agnostic — users pick GPT-5, Claude, or Gemini.
Personal.ai. Three pillars: Connectivity Identifier, Continuous Memory, Optimized Inference (on NVIDIA AI Grid). Distinguishes with domain-tuned Small Language Models ("40× cheaper than hosted LLMs") + telephony-native voice stack, sub-500ms.
Screenpipe. 16k+ GitHub stars, MIT license, $400 lifetime. Records screen + mic 24/7, 100% locally. Whisper local. Cross-platform. Key architecture choice: doesn't record every second — listens for meaningful events (app switches, clicks, typing pauses, scrolling) and snapshots only on change. Each capture pairs screenshot + accessibility tree. Most instructive open-source reference for your use case.
Khoj AI. Open-source self-hosted "second brain." Khoj Cloud shut down April 15, 2026 — hosted service couldn't sustain. Self-hosted code continues. Canonical cautionary tale for hosted personal-AI unit economics.
Claude (Anthropic). Memory shipped free + Pro March 2026. Remembers preferences, projects, working style across every conversation automatically. Projects = persistent workspaces. Cowork = desktop agent. Claude Code = terminal agent. You already have this via Max.
Google Gemini / Project Astra. Up to 10 minutes in-session memory + persistent cross-conversation memory. Gemini Live shipped. Smart-glasses slipped to late 2026. Workspace side-panel integration is where Google actually leads.
Apple Intelligence / iOS 19 (Spring 2026). iPhone 17 Pro shipped with 12GB RAM for a 3B-parameter resident model. Substantially behind Gemini + Claude on memory/agentic behavior.
Open-source memory frameworks — LongMemEval scores: Letta ~83%, Zep/Graphiti ~71%, Mem0 ~49%, OMEGA ~95%. Usage split: Mem0 = drop-in API, Letta = full runtime where agents self-edit memory, Zep = episodic/temporal, Cognee = local-first with graph reasoning.
3. Architecture consensus — what the winners have in common
- Hybrid search (BM25 + vector + RRF) is default. Dense-only = ~78% recall@10, BM25-only = ~65%, Hybrid via RRF = 91% with 6ms overhead. Personal-memory queries (proper nouns, dates, phrases) are exactly where BM25 helps most.
- pgvector still default for personal scale. Under 5M vectors, single-digit-ms queries. Pinecone/Qdrant/Weaviate only win above tens of millions. Your TP3 is in pgvector's sweet spot. Don't migrate.
- 80% of failures are in ingestion/chunking, not the LLM. 40–60% of RAG systems never reach production. Stale context is the #1 silent failure. This is exactly your failure mode. Not a you-problem, an everyone-problem.
- Local-first is where 2026 momentum is going for personal-AI data. Khoj Cloud's shutdown is the warning; Personal.ai's "SLMs 40× cheaper" hype matters.
- Mic quality beats diarization cleverness every time. Plaud's dominance is dual-mic beamforming, not fancier models.
- Evaluation framework (RAGAS) is now expected. Winners measure context precision, recall, faithfulness, answer relevance on every deploy.
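RRF itself is tiny, which is part of why hybrid search is cheap to adopt. A sketch (k=60 is the conventional RRF constant):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists (e.g. one from BM25, one from pgvector).

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by BOTH retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```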
4. What OMI-specific builders do
- Pendant = dumb capture device. All intelligence server-side. OMI firmware is deliberately thin. Failure mode avoided: "pendant gets smarter, batteries die, BLE chokes." You've made this choice — stick with it.
- BLE is the enemy. OMI's changelog repeatedly flags BLE stability fixes across 2025–2026. Not your personal failure — it's the hardest part of the whole stack, and the mothership is still working on it.
- OMI uses Pinecone — opposite direction from the pgvector recommendation for personal scale. Your move to pgvector is more efficient than the mothership.
- $10M OMI app-store payout pool is real. Third-party devs building reference integrations worth skimming at omi.devpost.com.
- Pain point: desktop auto-meeting detection — pendant knows about conversations; desktop doesn't know about Zoom/Meet without manual linking. Nobody has shipped a clean solution.
5. Honest take on posthumous-AI ("Digital Uncle Mark")
Category exists. It's struggling. Nobody has pulled this off well yet.
- StoryFile filed Chapter 11 (~$4.5M owed). Reorganizing with a "fail-safe" so families keep access if the company folds. Approach: pre-recorded interview + retrieval — not generative.
- HereAfter AI — same approach, interview corpus, no generative fabrication. $4–$8/month or $99–$199 one-time.
- Eternos, Project December — more aggressive generative simulations. Cambridge, Hastings Center, Schwartz Reisman researchers have published repeatedly in 2024–2026 calling for safeguards against "unwanted digital hauntings."
- Legal: no US federal law prevents building bots from the dead or living. Some state right-of-publicity claims; family consent not legally required most jurisdictions.
- Ethics consensus: retrieval-only, interview-sourced, clearly AI-labeled, time-bounded, designed to help the living grieve, not simulate presence. Generative "talk to Mark after he's dead" is exactly what researchers warn against.
What this means for you: the long-term vision is viable, but the path that ships safely is interview-sourced retrieval, not a generative Uncle-Mark-bot trained on everything. Your TP3 capture work is raw material either way. The category hasn't produced a clear winner — you have room.
6. What Mark should steal from the winners — 10 specific patterns
- Adopt hybrid search (BM25 + vector + RRF). Single highest-leverage change. One line via LangChain EnsembleRetriever or LlamaIndex QueryFusionRetriever. ~15-point recall jump.
- Don't migrate off pgvector. You're in the sweet spot. Migrating to Pinecone would be a downgrade for your scale.
- Steal Screenpipe's event-triggered capture. Trigger heavier pipelines only on meaningful events (speaker change, silence break, keyword, location change). Lighter + more reliable than continuous ingestion.
- Install a RAGAS evaluation loop. Nightly, fixed question set. Formalizes "is TP3 healthy" into objective ground truth.
- Use Letta (MemGPT) patterns for long-lived identity. Letta's 83% vs Mem0's 49% on LongMemEval. For a twin that stays coherent across years, this is the winning direction.
- Pendant stays a dumb device (OMI's lesson). All intelligence server-side. Don't let smart logic creep into firmware.
- Mic quality > diarization model. Plaud's edge is dual-mic beamforming. If reliability is the complaint, the mic is the first suspect.
- Structure posthumous path as retrieval-only from interview corpora. Cambridge + Hastings research converge. Also the most reliable — can't hallucinate.
- Treat Claude Memory + Projects as the default memory layer for agentic work you're already doing. You pay for Max. Use it explicitly.
- Design for family access from day one. StoryFile's fail-safe pattern is the template for "Digital Uncle Mark." Structure so Kim, Shannon, nephews can export and query without you in the loop.
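The hybrid-search item above is easy to sketch without any framework. Reciprocal Rank Fusion just merges the ranked ID lists from a BM25 pass and a vector pass by summing 1/(k + rank) per document; LangChain's EnsembleRetriever does essentially this under the hood. A minimal sketch, where `bm25_ids` and `vector_ids` are hypothetical result lists standing in for the two real retrievers:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked ID lists.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the conventional damping constant.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a BM25 pass and a pgvector pass.
bm25_ids = ["d3", "d1", "d7"]
vector_ids = ["d1", "d9", "d3"]
fused = rrf_fuse([bm25_ids, vector_ids])
# "d1" wins: ranked near the top by both retrievers.
```

The same ~15 lines work whether the lists come from Postgres full-text search, pgvector, or a framework retriever, which is why this is a one-day change rather than a migration.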
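The RAGAS bullet can start even simpler: before wiring up RAGAS proper, a nightly job over a fixed question set with known source documents already yields a hard pass/fail number. A minimal stand-in, where `EVAL_SET`, `retrieval_recall`, and `fake_retrieve` are all hypothetical names; the real retriever would replace the fake one:

```python
def retrieval_recall(eval_set, retrieve, top_k=5):
    """Fraction of fixed eval questions whose expected source
    document appears in the top-k retrieved IDs."""
    hits = 0
    for question, expected_doc in eval_set:
        if expected_doc in retrieve(question)[:top_k]:
            hits += 1
    return hits / len(eval_set)

# Hypothetical fixed eval set: (question, doc that must come back).
EVAL_SET = [
    ("Where did we leave the DC trip planning?", "note-dc-trip"),
    ("What did the dentist say in March?", "note-dentist-mar"),
]

def fake_retrieve(question):   # stand-in for the real retriever
    return ["note-dc-trip", "note-misc"]

score = retrieval_recall(EVAL_SET, fake_retrieve)
# One of two expected docs retrieved: score == 0.5.
```

Run nightly, chart the number, and "is TP3 healthy" stops being a feeling. RAGAS-style metrics (faithfulness, answer relevance) layer on top of the same fixed question set later.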
7. What Mark should avoid — from failed postmortems
- Don't hoard knowledge without a forcing function. "Papers" postmortem (1,237 hours, 17 versions): "The more knowledge I saved, the less I actually learned." Build in a forcing function so queries actually get asked.
- Don't trust "AI-organizes-itself" marketing. Reality is "a hyperactive librarian who's smart but sometimes completely misses the point." Expect to hand-tune chunking and prompts.
- Don't build without sales/distribution if you want it to outlive you. Cydoc shut down Aug 2025 after 7 years: "grossly underestimating sales and marketing." Khoj Cloud shut down April 2026 for the same reason.
- Don't let stale retrieval hide. Top silent failure. Your freshness alarm is correctly targeted.
- Don't assume wearable hardware is done. OMI, Limitless, Plaud — all still have open BLE/battery/reliability issues in 2026. Budget for offline buffering + phone-mic fallback.
- Don't conflate "CEO Agent" with "Digital Twin." AI-for-you ≠ AI-as-you. Build the first. Second is endgame with guardrails.
- Don't go all-cloud. Khoj Cloud shutdown = warning. Local-first is where survivors live. Apex-hosted is the right shape.
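The "stale retrieval" bullet is the easiest one to make concrete: a freshness alarm is a single comparison against the newest ingested row's timestamp. A minimal sketch under the assumption that rows carry a `created_at` column (the timestamp would come from something like `SELECT max(created_at) FROM chunks`; the function itself is pure so it's trivially testable):

```python
from datetime import datetime, timedelta, timezone

def freshness_status(newest_row_at, max_age_hours=6, now=None):
    """Return (ok, age_hours): ok flips to False once the newest
    ingested row exceeds the threshold, so /health can report a
    number instead of a bare 200 OK."""
    now = now or datetime.now(timezone.utc)
    age_hours = (now - newest_row_at).total_seconds() / 3600
    return age_hours <= max_age_hours, age_hours

now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
ok, age = freshness_status(now - timedelta(hours=40), now=now)
# A 40-hour-old newest row trips the alarm that a status-string
# health check would have slept through.
```

This is the whole pattern behind "replace status strings with numbers": every health endpoint returns measured quantities (age, queue depth, last-success time), and alerting compares them to thresholds.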
Executive summary
Your wearable-audio digital-twin idea is directionally correct. Meta's Limitless acquisition and Zuckerberg's two parallel 2026 projects confirm the category. Nobody is winning cleanly. OMI, Limitless, Plaud all fight BLE and reliability bugs. Khoj Cloud shut April 15, 2026. StoryFile filed Chapter 11. You're on the frontier.
2026 industry consensus on architecture is boring and clear: pgvector under 5M vectors (fine), hybrid BM25+vector via RRF (+15 recall points, one day's work), event-triggered capture (Screenpipe's pattern), RAGAS nightly eval. 80% of RAG failures are ingestion/chunking, not the LLM — your freshness alarm is correctly targeted.
Top 3 takeaways:
- Add hybrid search (BM25 + vector + RRF). Biggest quality jump for one day's work.
- Your freshness-alarm instinct is industry best-practice. Keep doubling down.
- For "Digital Uncle Mark," go retrieval-only from interview corpora, not generative. Cambridge + Hastings researchers converge; StoryFile + HereAfter validate it commercially.
So what next?
You asked three agents for three views. They agree on the shape: keep the data, rebuild the plumbing, kill silent cloud fallbacks, replace status strings with numbers. The only real disagreement is scope — clean-slate (Agent 1) vs refactor-in-place (Agent 2). Since you've got 646K rows that took months to embed, refactor is the right call.
If you greenlight Option B, next deliverable is a migration plan document: which services move into Docker Compose in which order, how to keep ingest live during the move, what we delete and when. You review that before anything risky runs.
If you want to wait, that's fine too — the triage-only patches from 2026-04-22 already stop the worst of the bleeding. The daily fires will come back, but slower. The question is whether to do this sprint now, or after the DC trip, or after the school year ends.