Spotify podcast history into TP3 — spec + three-path eval (May 20, 2026)

Three paths: the evaluation

Path 1. Spotify Web API direct (/me/player/recently-played + /shows + Spotify-native transcripts)
Transcript coverage: ~10–15% of episodes. Spotify launched transcripts in 2024, but coverage is patchy and there is no API to fetch them; they're rendered in-app only. The community has begged for an endpoint for years; Spotify confirmed (April 2026) they don't share audio/podcast content with Anthropic for training, which is the same wall here.
Your one-time effort: One OAuth login (~3 min). Authorization Code w/ PKCE, durable refresh token, runs for years without re-auth.
Ongoing cost: $0. Read-only scopes are free. Feb-2026 catch: Dev-Mode apps now require the owner to hold active Premium or the app dies. Verify before relying.
Verdict: Listening signal yes, transcripts no. Useful as half of the hybrid.

Path 2. YouTube cross-reference (your insight). For each Spotify episode, find the same episode on YouTube and pull auto-captions via youtube-transcript-api or yt-dlp.
Transcript coverage: ~75–90% for the podcasts you actually listen to (NLW's AI Daily Brief, Matt Wolfe, Lex, Hard Fork, Acquired, etc.; all cross-post to YouTube with auto-captions). The miss cases are interview shows that go Spotify-exclusive (rare for your taste) and the occasional episode where the YouTube upload date drifts from the Spotify release date by a few days.
Your one-time effort: None beyond Path 1's OAuth. YouTube transcript fetch needs no auth at low volume from a residential IP, which is exactly Apex's situation.
Ongoing cost: $0 if we use auto-captions. YouTube Data API v3 quota is the constraint for episode-matching: search.list is documented at 100 units per call, so the default 10k/day free quota covers roughly 100 matches per day. That is plenty for nightly runs; a large historical backfill has to be spread over several days.
Verdict: Highest-coverage transcript path. But it needs Path 1's listening signal to know what to look up.

Path 3. Hybrid: Spotify signal + YouTube transcripts + Whisper fallback. Spotify history → YouTube transcript → Whisper on Apex GPU for the ~10% of episodes YouTube misses.
Transcript coverage: ~95%+. Path 2 catches the bulk; Apex's faster-whisper handles the long-tail episodes that aren't on YouTube but have a downloadable preview/RSS audio (a few major shows still publish open RSS).
Your one-time effort: Same as Path 2.
Ongoing cost: Whisper compute is local on Apex GPU, so the only cost is electricity. After Saturday's 64 GB RAM upgrade this gets noticeably more comfortable; faster-whisper large-v3 fits with room for the rest of TP3 running alongside.
Verdict: Recommended for Phase 2. The Whisper leg only fires for misses, so most days it does nothing.

Not on the list: audio download/re-encode of Spotify streams
Transcript coverage, one-time effort, ongoing cost: n/a.
Verdict: Off-table. ToS-breaking, DRM-cracking, and you said you don't need it. spotDL is also broken under Feb-2026 API changes, so it is not even an option to debate.

Recommended path — the hybrid, in two phases

Path 3 (hybrid) is the answer, but built in two clearly separated phases so we don't waste cycles on transcripts before you've eyeballed the listening signal. The reasoning:

Spotify alone is a dead end for the goal you actually have (transcripts in TP3). Their transcripts are in-app rendering only and the company explicitly walled them off this year. We already proved that in the May 18 Spotify integration research: the wall is structural, not configuration.
YouTube alone misses the listening signal. Without Spotify telling us what you listened to and when, we'd be guessing from your YouTube subscriptions — and you listen to plenty of podcasts via Spotify that you'd never click play on inside the YouTube app. The Spotify history is the ground truth of what actually landed in your ears.
Whisper as fallback only. Spinning up Whisper for every episode is wasteful when YouTube already auto-captions 75–90%. Whisper earns its keep only on the misses, and the post-upgrade Apex is the right home for it.
Saturday's hardware upgrade changes the math. Pre-upgrade, faster-whisper large-v3 is uncomfortable on Apex while TP3's ollama and the rest of the stack are up. Post-Saturday with 64 GB, the Whisper leg becomes routine. This is why Phase 2 lands on the other side of the RAM install.

Phase 1 — the ~2-hour MVP build first

Goal: a flat JSON file of your Spotify listening history, no transcripts yet, just so you can see what you've been consuming. This alone is genuinely valuable: you'll likely look at it and notice patterns you hadn't realized were there.

What it does

One OAuth login (interactive, ~3 min — only you can click "approve" in your real signed-in Chrome, but I drive the rest).
Pulls /me/player/recently-played in a loop with before= cursor pagination, walking backward as far as the endpoint allows.
For each item, joins to /shows/{id} + /episodes/{id} to capture: show name, episode title, episode description, release date, duration, Spotify URL, listened-at timestamp.
Writes the result to ~/spotify_history/listening_history.json on Apex — persistent path, NOT /tmp (which gets wiped on the WSL2 bounce).
Adds a tiny ntfy summary so we both know the dump landed and how many episodes it captured.
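
The backward walk in the second step can be sketched as below. This is a minimal sketch, not the final script: the HTTP call is injected as a hypothetical fetch_page callable so the loop can be exercised without a token, and the response shape ("items", "cursors" with a "before" value) is an assumption to check against the live response on the first dry run.

```python
from typing import Callable, Iterator, Optional

# Hypothetical fetcher: takes the `before` cursor (None on the first call) and
# returns the decoded JSON body of GET /me/player/recently-played.
PageFetcher = Callable[[Optional[str]], dict]


def walk_recently_played(fetch_page: PageFetcher, max_pages: int = 50) -> Iterator[dict]:
    """Yield play items newest-first until the API stops returning older pages."""
    before: Optional[str] = None
    for _ in range(max_pages):  # hard stop so a bad cursor cannot loop forever
        page = fetch_page(before)
        items = page.get("items") or []
        if not items:
            return
        yield from items
        cursors = page.get("cursors") or {}
        next_before = cursors.get("before")
        if not next_before or next_before == before:
            return  # no older page, or the cursor did not advance
        before = next_before


# Canned pages standing in for real HTTP responses (values are made up).
_fake_pages = {
    None: {
        "items": [{"played_at": "2026-05-19T10:00:00Z"},
                  {"played_at": "2026-05-18T09:00:00Z"}],
        "cursors": {"before": "1779094800000"},
    },
    "1779094800000": {
        "items": [{"played_at": "2026-05-17T08:00:00Z"}],
        "cursors": None,
    },
}
history = list(walk_recently_played(lambda before: _fake_pages[before]))
```

Injecting the fetcher keeps the stop conditions (empty page, missing cursor, stalled cursor, page cap) testable on their own; the real fetcher would wrap the authenticated request and the refresh-token handling.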

The honest scope limit on Phase 1

Spotify's recently-played endpoint returns at most 50 items per call, and the before= cursor walks backward about 50 plays at a time with no guaranteed deep history. Real-world community testing puts the usable window at the most recent few weeks to a few months, not a clean 6–12 months. Spotify's API reference has also stated that this endpoint does not return podcast episodes, so the first dry run must confirm that episodes actually appear in the response before anything else is built on it. The "last year" target requires a different mechanism: your Spotify Privacy → Account Privacy → "Download your data" extended export, which gives a full streaming history back to account creation as a JSON file Spotify emails you within ~5–30 days. Phase 1 covers the recent window via API right now; the long-tail backfill arrives via the privacy export, which I request on your behalf with one click in your signed-in browser. Both sources flow into the same listening_history.json.
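
Folding both sources into one file could look roughly like this. It is a sketch under assumptions: the API item shape ("track" and "played_at"), the export field names ("spotify_episode_uri" and "ts"), and the output record shape are all guesses to verify against real data. Deduplicating on episode plus UTC calendar day is a design choice made here because the two sources are unlikely to agree on the exact timestamp of a play.

```python
import json
from pathlib import Path


def normalize_api_item(item: dict) -> dict:
    # Assumed shape: played object under "track", timestamp under "played_at".
    played = item.get("track") or {}
    return {"episode_uri": played.get("uri"),
            "listened_at": item.get("played_at"),
            "origin": "api"}


def normalize_export_row(row: dict) -> dict:
    # Assumed export field names: "spotify_episode_uri" and "ts".
    return {"episode_uri": row.get("spotify_episode_uri"),
            "listened_at": row.get("ts"),
            "origin": "privacy_export"}


def merge_history(api_items, export_rows) -> list:
    """Merge both sources, keeping one record per episode per UTC calendar day."""
    records = [normalize_api_item(i) for i in api_items]
    records += [normalize_export_row(r) for r in export_rows]
    merged = {}
    for rec in records:
        if not rec["episode_uri"] or not rec["listened_at"]:
            continue  # e.g. music rows in the export carry no episode URI
        key = (rec["episode_uri"], rec["listened_at"][:10])
        merged.setdefault(key, rec)  # first source seen wins (the API here)
    return sorted(merged.values(), key=lambda r: r["listened_at"])


def write_history(records, path: Path) -> None:
    """Write to a persistent path (not /tmp), creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


# Made-up sample rows for illustration only.
api_items = [{"track": {"uri": "spotify:episode:abc"},
              "played_at": "2026-05-19T10:00:00.000Z"}]
export_rows = [
    {"spotify_episode_uri": "spotify:episode:abc", "ts": "2026-05-19T10:42:10Z"},
    {"spotify_episode_uri": "spotify:episode:xyz", "ts": "2025-11-02T07:15:00Z"},
    {"spotify_episode_uri": None, "ts": "2025-11-03T07:15:00Z"},
]
merged = merge_history(api_items, export_rows)
```

In the real script, write_history(merged, Path.home() / "spotify_history" / "listening_history.json") would finish the run.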

Effort breakdown

15 min: Spotify Developer App creation (driven in your signed-in Chrome).
5 min: OAuth login (you click approve, that's it).
45 min: Python script wrapping recently-played + show/episode joins, refresh-token persistence on a non-/tmp path.
20 min: privacy-export request submitted via your Chrome session.
15 min: ntfy summary + first dry-run + verifying the JSON exists and is non-empty.
~1.5–2 hours total.

Phase 2 — the full pipeline on your green-light, post-Saturday

Goal: every episode in listening_history.json becomes a row in tp3_memories_local with source='spotify_podcasts', full transcript embedded, searchable through the same /ask + brief surfaces the Boys SMS threads now flow through.

The pipeline, step by step

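Episode → YouTube match. For each Spotify episode, query the YouTube Data API v3 search.list with q="{show name} {episode title}" and a publishedAfter/publishedBefore window of ±5 days around the release date. search.list does not return durations, so read them for the candidates with one videos.list call (part=contentDetails). The first result with a duration within ±10% of the Spotify duration is a confident match. (search.list is documented at 100 quota units per call and videos.list at 1, so the default 10k/day quota covers about 100 lookups per day: fine for nightly runs, but the backfill has to be spread over several days.) Cache the YouTube video ID against the Spotify episode ID so we never re-look-up.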
Transcript pull. First try youtube-transcript-api on Apex (fast lane, no API key). If it raises RequestBlocked or comes back empty, fall back to yt-dlp --write-auto-subs --skip-download with --cookies-from-browser firefox and the bgutil-ytdlp-pot-provider plugin (robust lane, handles the 2026 PO-token wall). Both routes are battle-tested in the YouTube power research.
Whisper fallback. If neither YouTube route yields a transcript and the show has an open RSS feed, pull the MP3 and run faster-whisper large-v3 on the Apex GPU. Otherwise mark the episode as transcript_status='unavailable' and move on. No retries against ToS-grey audio sources.
TP3 ingest. One row per episode in tp3_memories_local with: source='spotify_podcasts', doc = full transcript text, metadata_json = {show, episode, release_date, listened_at, duration_ms, spotify_url, youtube_url, transcript_status, transcript_source}. ON CONFLICT (source, external_id) DO NOTHING where external_id = Spotify episode ID. Embed via the local nomic-embed-text ollama model (already running, already used by the rest of TP3).
Cron + fail-loud. Nightly run at ~3 AM ET, ntfy summary on tp3_cursor_report in the existing Mark-speak shape ("12 new podcast episodes ingested, 1 transcript missing, took 7 min"). Non-zero exit + FATAL log if zero episodes or the JSON isn't fresh — per the fail-loud hard rule.
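
The fallback order in steps 2 and 3 can be sketched as a small chain runner. This is an illustration only: the lane functions are hypothetical stubs standing in for the real youtube-transcript-api, yt-dlp, and faster-whisper calls, and the returned field names simply mirror the metadata keys listed in step 4.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each lane takes an episode record and returns transcript text, or None/"" on a miss.
Lane = Tuple[str, Callable[[dict], Optional[str]]]


def fetch_transcript(episode: dict, lanes: Sequence[Lane]) -> dict:
    """Try each lane in order; the first non-empty transcript wins."""
    errors = []
    for name, fetch in lanes:
        try:
            text = fetch(episode)
        except Exception as exc:  # a blocked or broken lane must not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return {"transcript_status": "ok", "transcript_source": name,
                    "doc": text, "errors": errors}
    return {"transcript_status": "unavailable", "transcript_source": None,
            "doc": None, "errors": errors}


# Hypothetical stub lanes; real ones would call the actual tools.
def captions_api_lane(episode: dict) -> Optional[str]:
    raise RuntimeError("RequestBlocked")  # simulate a blocked fast lane


def ytdlp_lane(episode: dict) -> Optional[str]:
    return ""  # simulate: no auto-captions found


def whisper_lane(episode: dict) -> Optional[str]:
    # Only attempt Whisper when the show has an open RSS audio URL.
    return "transcribed text" if episode.get("rss_audio_url") else None


LANES = [("youtube_transcript_api", captions_api_lane),
         ("yt_dlp", ytdlp_lane),
         ("faster_whisper", whisper_lane)]

hit = fetch_transcript({"rss_audio_url": "https://example.com/ep.mp3"}, LANES)
miss = fetch_transcript({}, LANES)
```

Collecting per-lane errors instead of swallowing them gives the nightly ntfy summary something concrete to report when a lane starts failing.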

Effort breakdown

1.5 hr: YouTube matcher + match-confidence heuristic + cache layer.
2 hr: transcript-fetch wrapper (transcript-api → yt-dlp → Whisper fallback chain) with proper artifact verification at each stage.
1 hr: faster-whisper Apex GPU integration (mostly a config job — the model is already on disk for Bidet).
1.5 hr: TP3 ingest layer + dedup + embedding job.
1 hr: cron + ntfy + fail-loud + first end-to-end dry run on ~20 episodes.
1–2 hr: backfill the privacy-export historical episodes (mostly waiting on the YouTube matcher to chew through).
~8–9 hours total. Conservatively budget 10 with verification time.
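
For the ingest and dedup item, the ON CONFLICT (source, external_id) DO NOTHING rule behaves as in the sketch below. Assumptions are loud here: an in-memory SQLite table stands in for the real tp3_memories_local, the column list is reduced to the fields the spec names, and the embedding step is left out entirely.

```python
import json
import sqlite3

# Stand-in table; the real tp3_memories_local schema is not shown in this spec.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tp3_memories_local (
        source        TEXT NOT NULL,
        external_id   TEXT NOT NULL,
        doc           TEXT,
        metadata_json TEXT,
        UNIQUE (source, external_id)
    )
    """
)


def ingest_episode(conn: sqlite3.Connection, episode_id: str,
                   transcript: str, metadata: dict) -> bool:
    """Insert one episode row; return False if it was already ingested."""
    cur = conn.execute(
        "INSERT INTO tp3_memories_local (source, external_id, doc, metadata_json) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT (source, external_id) DO NOTHING",
        ("spotify_podcasts", episode_id, transcript, json.dumps(metadata)),
    )
    conn.commit()
    return cur.rowcount == 1


meta = {"show": "Example Show", "transcript_status": "ok"}
first = ingest_episode(conn, "episode-123", "full transcript text", meta)
second = ingest_episode(conn, "episode-123", "full transcript text", meta)
row_count = conn.execute("SELECT COUNT(*) FROM tp3_memories_local").fetchone()[0]
```

Because a re-run is a no-op for rows that already exist, the nightly job and the backfill can overlap without creating duplicates.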

Sample queries Mark could run once it's live

Same surface as the Boys SMS threads now — /ask over TP3, or directly through Claude with TP3-search in context. Realistic examples:

"What did NLW say about Claude Code last month?" → vector search over source='spotify_podcasts' + show LIKE '%AI Daily Brief%' + listened_at > now() - 30 days, returns the 3–5 matching transcript chunks with episode + timestamp.
"All podcast moments mentioning Anthropic in the last 90 days." → same pattern, broader show filter, ranks by mention density.
"Which podcasts have I listened to most this quarter?" → pure metadata aggregation, no transcripts needed — instant. (Phase 1 alone answers this.)
"Find the episode where Matt Wolfe talked about local Llama running on a Mac mini cluster." → semantic search across all transcripts, scoped to his channel, returns episode + the exact 30-second window.
"What does my listening history say I care about that I haven't built yet?" → LLM aggregation pass over the last 90 days of show notes + transcripts, surfaces recurring topics the brief hasn't already actioned. (This is the long-arc payoff — the twin starts to know what's been going in your ears that you haven't done anything with.)

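The "listened to most this quarter" query needs nothing beyond the Phase 1 file. A minimal sketch, assuming each record carries a "show" name and an ISO-8601 "listened_at" value (both field names are assumptions about the final JSON shape), and treating "this quarter" as the trailing 90 days:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def top_shows(records, now: datetime, days: int = 90, limit: int = 5):
    """Count plays per show inside the trailing window, most-played first."""
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for rec in records:
        # Normalize a trailing "Z" so fromisoformat accepts it on older Pythons.
        listened = datetime.fromisoformat(rec["listened_at"].replace("Z", "+00:00"))
        if listened >= cutoff:
            counts[rec["show"]] += 1
    return counts.most_common(limit)


# Made-up records for illustration only.
sample = [
    {"show": "Show A", "listened_at": "2026-05-01T12:00:00Z"},
    {"show": "Show A", "listened_at": "2026-04-10T12:00:00Z"},
    {"show": "Show B", "listened_at": "2026-05-10T12:00:00Z"},
    {"show": "Show A", "listened_at": "2025-12-01T12:00:00Z"},  # outside the window
]
ranking = top_shows(sample, now=datetime(2026, 5, 20, tzinfo=timezone.utc))
```

Passing now in explicitly keeps the result reproducible; the live version would pass datetime.now(timezone.utc).
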
Risks worth flagging

Spotify API rate limits. The Feb-2026 Dev-Mode reset capped a lot of endpoints — recently-played is still 50/call, but the show/episode joins are now per-call (no bulk). At ~100 episodes/month listened that's ~200 calls/month total — nowhere near the rate limit. Not a real risk.
YouTube transcript availability. The biggest live risk. YouTube broke youtube-transcript-api for days during the Dec 2025 / Jan 2026 backend changes. We mitigate with the dual-lane fetch (api → yt-dlp), but a third tightening by YouTube could blow up both lanes simultaneously. If that happens, the Whisper fallback catches the new episodes but the historical backfill stalls until the community ships a fix. Build the failure path before the happy path, per fail-loud.
Spotify Premium dependency. The Feb-2026 rule that Dev-Mode apps die if the owner's Premium lapses applies to this pipe. If you ever drop Premium, Phase 1's ongoing dump breaks. Worth pinning your renewal in the security tracker.
Copyright / personal-use posture. Storing transcripts of podcasts you've listened to, on your own infrastructure, for your own search — widely understood as personal fair use, same legal posture as the Boys SMS thread storage. What we don't do: publish transcripts publicly, share them with anyone outside your household, or train models on them. The TP3 row is private, the /private/?key= gate is in front of /ask, and no transcript ever lands on the public dashboard. If you ever want to share an excerpt publicly, that's a per-quote decision, not a default.
Privacy export turnaround. Spotify's "Download your data" historical export takes 5–30 days to arrive. Phase 1's recent-API window covers the gap; backfill arrives async.
YouTube cross-post drift. Some shows publish the YouTube version 1–3 days after the Spotify version (Acquired does this, IIRC). The ±5-day search window handles it, but the matcher's confidence score should mark drift > 2 days as "review."
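
The match rule and the drift flag can be combined in one small classifier. This is a sketch of the heuristic as described (±5-day window, ±10% duration, drift over 2 days marked "review"); the Candidate type and the returned dictionary are hypothetical names introduced for illustration, and candidates are assumed to arrive in search-result order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass
class Candidate:
    video_id: str
    published: date
    duration_s: int


def classify_match(spotify_release: date, spotify_duration_s: int,
                   candidates: Iterable[Candidate],
                   window_days: int = 5, duration_tol: float = 0.10,
                   review_drift_days: int = 2) -> Optional[dict]:
    """Return the first candidate inside both tolerances, or None."""
    for cand in candidates:
        drift = abs((cand.published - spotify_release).days)
        if drift > window_days:
            continue  # outside the release-date window
        if abs(cand.duration_s - spotify_duration_s) > duration_tol * spotify_duration_s:
            continue  # duration too far from the Spotify episode
        status = "review" if drift > review_drift_days else "confident"
        return {"video_id": cand.video_id, "drift_days": drift, "status": status}
    return None


release = date(2026, 5, 10)
drifted = classify_match(release, 3600, [
    Candidate("clip", date(2026, 5, 11), 1800),    # right date, wrong length
    Candidate("full", date(2026, 5, 13), 3700),    # 3 days late, length close
])
same_day = classify_match(release, 3600, [Candidate("ep", date(2026, 5, 11), 3550)])
no_match = classify_match(release, 3600, [Candidate("old", date(2026, 5, 20), 3600)])
```

Returning the drift alongside the status lets the cache store why a match was flagged, so the "review" pile can be triaged without re-querying YouTube.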
