
Spotify podcast history into TP3 — spec

Research + spec only. No OAuth triggered, no API keys requested, no builds shipped tonight. Decision doc to green-light Phase 1.
Lede. The thing you actually want — the last 6–12 months of podcast listening as searchable knowledge in TP3 — doesn't come from Spotify. Spotify is the listening signal ("what did Mark hear, when, how long"); the transcripts live on YouTube, where ~90% of major podcasts cross-post and auto-captions are free. The hybrid pipe is the only one that hits real coverage at zero ongoing cost. Phase 1 is a two-hour Spotify listening-history dump into flat JSON so you can see what you've actually been consuming before we spend a single LLM token on transcripts. Phase 2 is the YouTube cross-reference + transcript pull + TP3 ingest, ~6–10 hours, gated on your "go."

Three paths — the evaluation

Path 1. Spotify Web API direct
/me/player/recently-played + /shows + Spotify-native transcripts
Transcript coverage: ~10–15% of episodes. Spotify launched transcripts in 2024, but coverage is patchy and there is no API to fetch them; they're rendered in-app only. The community has begged for an endpoint for years; Spotify confirmed (April 2026) they don't share audio/podcast content with Anthropic for training, which is the same wall here.
Your one-time effort: One OAuth login (~3 min). Authorization Code w/ PKCE, durable refresh token, runs for years without re-auth.
Ongoing cost: $0. Read-only scopes are free. Feb-2026 catch: Dev-Mode apps now require the owner to hold active Premium or the app dies. Verify before relying.
Verdict: Listening signal yes, transcripts no. Useful as half of the hybrid.

Path 2. YouTube cross-reference (your insight)
For each Spotify episode, find the same episode on YouTube, pull auto-captions via youtube-transcript-api or yt-dlp
Transcript coverage: ~75–90% for the podcasts you actually listen to (NLW's AI Daily Brief, Matt Wolfe, Lex, Hard Fork, Acquired, etc., all of which cross-post to YouTube with auto-captions). The miss cases are interview shows that go Spotify-exclusive (rare for your taste) and the occasional episode where the YouTube upload date drifts from the Spotify release date by a few days.
Your one-time effort: None beyond Path 1's OAuth. YouTube transcript fetch needs no auth at low volume from a residential IP, which is exactly Apex's situation.
Ongoing cost: $0 if we use auto-captions. YouTube Data API v3 quota for episode-matching is ~3–5 units per lookup when matching against a channel's uploads playlist (playlistItems.list costs 1 unit per call); at that rate the 10k/day free quota covers thousands of matches. A search.list call costs 100 units, so search-based matching would cap out near 100 lookups a day.
Verdict: Highest-coverage transcript path. But it needs Path 1's listening signal to know what to look up.

Path 3. Hybrid: Spotify signal + YouTube transcripts + Whisper fallback
Spotify history → YouTube transcript → Whisper on Apex GPU for the ~10% YouTube-misses
Transcript coverage: ~95%+. Path 2 catches the bulk; Apex's faster-whisper handles the long-tail episodes that aren't on YouTube but have a downloadable preview/RSS audio (a few major shows still publish open RSS).
Your one-time effort: Same as Path 2.
Ongoing cost: Whisper compute is local on Apex GPU; the only cost is electricity. After Saturday's 64 GB RAM upgrade this gets noticeably more comfortable: faster-whisper large-v3 fits with room for the rest of TP3 running alongside.
Verdict: Recommended for Phase 2. The Whisper leg only fires for misses, so most days it does nothing.

Not on the list: audio download/re-encode of Spotify streams
Verdict: Off-table. ToS-breaking, DRM-cracking, and you said you don't need it. spotDL is also broken under the Feb-2026 API changes, so it is not even an option to debate.
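
For reference, Path 1's one-time login (Authorization Code with PKCE) starts by generating a verifier/challenge pair and an authorize URL. The sketch below shows only that first step under stated assumptions: the function names, the client ID, and the redirect URI are placeholders rather than anything from this spec, and the scope is the read-only one for recently-played history. It is an illustration, not the Phase 1 build.

```python
import base64
import hashlib
import secrets
from typing import Tuple
from urllib.parse import urlencode

AUTHORIZE_URL = "https://accounts.spotify.com/authorize"


def make_pkce_pair(verifier_bytes: int = 64) -> Tuple[str, str]:
    """Return (code_verifier, code_challenge) for the S256 PKCE method."""
    # token_urlsafe emits only URL-safe characters; 64 random bytes give an
    # 86-character verifier, inside the 43-128 range RFC 7636 requires.
    verifier = secrets.token_urlsafe(verifier_bytes)
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    # The challenge is base64url(SHA-256(verifier)) with '=' padding removed.
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge


def build_authorize_url(
    client_id: str,
    redirect_uri: str,
    challenge: str,
    scope: str = "user-read-recently-played",
) -> str:
    """Build the URL opened once in the browser to grant read-only access."""
    params = {
        "client_id": client_id,          # placeholder: the app's client ID
        "response_type": "code",
        "redirect_uri": redirect_uri,    # placeholder: must match the app config
        "code_challenge_method": "S256",
        "code_challenge": challenge,
        "scope": scope,
    }
    return f"{AUTHORIZE_URL}?{urlencode(params)}"
```

The verifier stays on the local machine and is sent only when the returned code is exchanged for tokens; the refresh token from that exchange is what lets later runs proceed without another login.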

Recommended path — the hybrid, in two phases

Path 3 (hybrid) is the answer, but built in two clearly separated phases so we don't waste cycles on transcripts before you've eyeballed the listening signal. The reasoning:

Phase 1 — the ~2-hour MVP build first

Goal: a flat JSON file of your Spotify listening history, no transcripts yet, just so you can see what you've been consuming. This alone is valuable: you'll likely spot patterns you hadn't noticed.

What it does

The honest scope limit on Phase 1

Spotify's recently-played endpoint returns at most 50 items per call, and the before= cursor walks backward about 50 plays at a time with no guaranteed deep history. Real-world community testing puts the usable window at the most recent few weeks to a few months, not a clean 6–12 months. Spotify's API reference also notes that this endpoint currently does not support podcast episodes, so confirm it returns episode plays at all before building on it; if it doesn't, the privacy export below becomes the primary source rather than the backfill. The "last year" target requires a different mechanism: your Spotify Privacy → Account Privacy → "Download your data" extended export, which gives a full streaming history back to account creation as a JSON file Spotify emails you within ~5–30 days. Phase 1 covers the recent window via API right now; the long-tail backfill arrives via the privacy export, which I request on your behalf with one click in your signed-in browser. Both sources flow into the same listening_history.json.
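
A minimal sketch of how the two sources could land in one listening_history.json. The export field names (ts, ms_played, episode_name, episode_show_name, spotify_episode_uri) reflect the extended streaming history format as I understand it and should be checked against the real file. The output entry shape, the function names, and the rule of de-duplicating on episode plus UTC calendar day are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def normalize_export_record(rec: Dict) -> Optional[Dict]:
    """Map one privacy-export row to a flat entry; return None for non-podcast rows."""
    uri = rec.get("spotify_episode_uri")
    if not uri:
        return None  # music track or other row without an episode URI
    # The export timestamp is assumed to be ISO 8601 with a trailing 'Z'.
    played_at = datetime.fromisoformat(rec["ts"].replace("Z", "+00:00"))
    return {
        "episode_uri": uri,
        "show": rec.get("episode_show_name"),
        "episode": rec.get("episode_name"),
        "played_at": played_at.astimezone(timezone.utc).isoformat(),
        "ms_played": rec.get("ms_played", 0),
        "source": "privacy_export",
    }


def merge_history(*batches: List[Dict]) -> List[Dict]:
    """Combine entries from several sources into one chronologically sorted list."""
    merged: Dict = {}
    for batch in batches:
        for entry in batch:
            # Assumption: timestamps from the API and the export will not line
            # up exactly, so one episode on one UTC day counts as one entry.
            key = (entry["episode_uri"], entry["played_at"][:10])
            merged.setdefault(key, entry)  # the first source passed in wins
    return sorted(merged.values(), key=lambda e: e["played_at"])
```

Passing the API batch first means a play seen by both sources keeps the API entry; reversing the argument order would prefer the export, which carries ms_played.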

Effort breakdown

Phase 2 — the full pipeline on your green-light, post-Saturday

Goal: every episode in listening_history.json becomes a row in tp3_memories_local with source='spotify_podcasts', full transcript embedded, searchable through the same /ask + brief surfaces the Boys SMS threads now flow through.

The pipeline, step by step

Effort breakdown

Sample queries Mark could run once it's live

Same surface as the Boys SMS threads now — /ask over TP3, or directly through Claude with TP3-search in context. Realistic examples:

Risks worth flagging

Decision Mark owns