AI Content Pipeline Deep Dive (1/5): Ingestion
written by Stefan Christoph
- 15 minutes read
TL;DR: The ingestion phase is not about reading more. It is about building a system that reads for you, files what matters, and surfaces connections between ideas you captured weeks apart, all at near-zero marginal cost. This post shows the exact configuration, tools, and constraints that power the first stage of the AI content pipeline I described in the parent article. Two layers: continuous feeds that monitor 13 YouTube channels daily, and ad-hoc captures that turn forwarded emails and quick notes into full research artifacts. The key insight: captured items are not bookmarks. They are research queue entries that produce structured notes with vault connections and blog potential assessments.
This is Part 1 of a five-part series diving into the implementation details of the AI content pipeline I described in [The AI Content Pipeline: How I Publish 3x a Week Without a Content Team][1]. That post covered the architecture — the five phases, the division of labor between human judgment and agent execution, and the philosophy of why this works. This series covers the implementation.
Each part maps to one phase of the pipeline. This one covers Ingestion. The subsequent parts cover Research (Thu Jun 4), Collaborative Writing (Tue Jun 9), Editing (Thu Jun 11), and Publishing (Tue Jun 16).
Why Ingestion Is the Foundation
Most people’s content workflow starts with a blank page and the question “what should I write about?”
That is the wrong starting point. By the time you are staring at a blank page, you have already lost, because the ideas that become good posts do not arrive on demand. They accumulate over days and weeks from the work you are already doing, from conversations that spark a thought, from articles that challenge an assumption, from customer questions that reveal a pattern you had not articulated yet.
The ingestion layer solves this by running continuously in the background, monitoring sources, capturing signals, and building a searchable knowledge base that makes “what should I write about?” a question you answer by browsing rather than brainstorming. I wrote about this shift in [The Bottleneck Moved][2]: when AI accelerates one part of a workflow, the constraint moves elsewhere. For content creation, the constraint was never the writing itself. It was having something worth writing about, with the evidence already gathered.
The Two Layers
The ingestion system has two complementary layers that serve different purposes but feed into the same knowledge base.
Layer 1: Continuous Feeds (Automated)
I monitor 13 YouTube channels, several newsletters, and a podcast feed. The agent checks for new content on a daily cadence during the morning brief and processes everything it finds without me lifting a finger.
Here is the channel monitor configuration, the master list of sources with relevance focus and last intake dates:
| Channel | Focus | Last Intake |
|---------|-------|-------------|
| AI Engineer | Agentic coding, AI infrastructure, MCP, evals | 2026-05-26 |
| Dwarkesh Patel | Deep interviews with tech/AI leaders | 2026-05-26 |
| Machine Learning Street Talk | ML research deep dives | 2026-05-26 |
| No Priors | AI/enterprise CEO interviews | 2026-05-26 |
| Matt Pocock | TypeScript, AI coding workflows | 2026-05-26 |
| Latent Space | AI engineering, infrastructure | 2026-05-26 |
| Pragmatic Engineer | Software engineering leadership | 2026-05-26 |
This list is not static. I started with 20+ channels and pruned to 13 over three months based on signal-to-noise ratio, applying a simple heuristic: at least 2 episodes per month that rate ⭐⭐⭐, content that generates actual posts with traceable lineage, and perspectives I cannot get from internal channels. Channels that consistently produce only ⭐ content get dropped. If a channel has not produced a ⭐⭐⭐ episode in 6 weeks, it is a candidate for removal.
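The six-week rule lends itself to a small check. The following is a minimal sketch, assuming the dates of ⭐⭐⭐ episodes are available per channel; the `three_star_dates` mapping, the function name, and the channel names are hypothetical and not part of the actual skill configuration.

```python
from datetime import date, timedelta


def removal_candidates(three_star_dates, today, max_gap_weeks=6):
    """Return channels with no three-star episode in the last `max_gap_weeks` weeks.

    `three_star_dates` maps a channel name to the dates of its three-star
    episodes. This structure is an assumption made for illustration.
    """
    cutoff = today - timedelta(weeks=max_gap_weeks)
    flagged = []
    for channel, dates in three_star_dates.items():
        latest = max(dates, default=None)
        # A channel with no three-star episode at all is also a candidate.
        if latest is None or latest < cutoff:
            flagged.append(channel)
    return sorted(flagged)


# Hypothetical history, for illustration only.
history = {
    "Channel A": [date(2026, 5, 20), date(2026, 5, 5)],
    "Channel B": [date(2026, 3, 30)],
    "Channel C": [],
}
candidates = removal_candidates(history, today=date(2026, 5, 26))
print(candidates)
```

A flagged channel is only a candidate: the decision to drop it stays a human judgment.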
When new content arrives, the agent does not just bookmark it. The workflow is a four-step process that transforms a raw video URL into a structured research artifact:
- Fetch metadata via yt-dlp --flat-playlist: title, duration, upload date, URL
- Rate relevance ⭐/⭐⭐/⭐⭐⭐ based on my encoded work themes
- For ⭐⭐⭐ episodes: Download transcript, analyze content, create a research note with summary, key quotes, vault connections, and blog potential assessment
- Update the inventory with links to research notes
The rating is not arbitrary. It is deterministic based on theme matching. The agent knows my work themes because they are encoded in the skill configuration: agentic AI architecture gets ⭐⭐⭐, a generic product announcement gets ⭐, and the threshold for “download transcript and create a full research note” is ⭐⭐⭐ only. This keeps the system from drowning in low-signal content while ensuring that nothing genuinely relevant slips through.
Override rules matter. For MLST (Machine Learning Street Talk), any episode over 45 minutes automatically rates ⭐⭐⭐ regardless of topic match, because that channel’s depth consistently warrants deep analysis — their recent episode with Michael I. Jordan on collective intelligence [3] is a perfect example of content that would not match narrow keyword filters but produced one of my most-read posts. You will have your own channels where the signal-to-noise ratio is high enough to justify always going deep.
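To make the interplay between theme matching and override rules concrete, here is a minimal sketch. It assumes simple keyword matching on the episode title; the `THEME_STARS` keywords and their star values are hypothetical stand-ins for the encoded work themes, and only the MLST 45-minute override comes from the rule described above.

```python
# Hypothetical keyword -> stars mapping; the real themes live in the skill configuration.
THEME_STARS = {"agentic": 3, "evals": 3, "typescript": 2}

# Channel -> duration threshold in minutes above which an episode always rates three stars.
ALWAYS_DEEP = {"Machine Learning Street Talk": 45}


def rate_episode(channel, title, duration_minutes):
    """Return a 1-3 star rating: the override rule first, then theme matching."""
    threshold = ALWAYS_DEEP.get(channel)
    if threshold is not None and duration_minutes > threshold:
        return 3
    title_lower = title.lower()
    matched = [stars for keyword, stars in THEME_STARS.items() if keyword in title_lower]
    # No theme match means the lowest rating.
    return max(matched, default=1)


print(rate_episode("Machine Learning Street Talk", "Collective intelligence", 90))
print(rate_episode("No Priors", "New product announcement", 40))
```

Because the same inputs always produce the same rating, a surprising result can be traced back to either a keyword or an override entry.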
Layer 2: Ad-Hoc Captures (Semi-Automated)
On top of the continuous feeds, I capture individual items as I encounter them in daily work:
- Forward an email to myself (the agent scans self-sent emails daily)
- Drop a note in the inbox folder
- Tell the agent “add this to the reading list”
The capture mechanism is deliberately lightweight. The friction of capturing must be near zero, or you will not do it consistently, and consistency is what separates a system from a habit you tried once. The processing happens later, automatically, during the next morning brief.
The reading list intake constraint from the skill:
Sources to scan:
- Self-sent emails (personal + work address)
- Items tagged #research or #follow-up in meeting notes
- Unchecked TODOs in recent meeting notes
Filter out:
- Event registrations
- Personal forwards (receipts, etc.)
- Action-item emails (subject contains "Answer", "Respond")
Rule: Every item that passes filtering MUST produce a research note.
Listing items without research notes is a pipeline failure.
The key insight, and the one that separates this from a bookmarking tool: captured items are not bookmarks. They are research queue entries. Every item that passes the filter gets a full research note — summary, key quotes, vault connections, blog potential assessment. If I forward myself an article about context engineering, the next morning brief will fetch that article, analyze it against my existing vault of 2,000+ notes, find connections to prior posts and customer conversations, and produce a structured research note I can later turn into a post.
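The filter rules and the "every item must produce a research note" rule can be expressed as a small check. This is a sketch under stated assumptions: the `kind` labels and the dictionary shape of an item are hypothetical, and only the "Answer"/"Respond" subject rule is taken from the constraint block above.

```python
ACTION_WORDS = ("answer", "respond")                # from the filter rule above
SKIP_KINDS = {"event_registration", "personal"}     # hypothetical classification labels


def passes_filter(item):
    """Return True if a captured item should enter the research queue."""
    if item["kind"] in SKIP_KINDS:
        return False
    subject = item["subject"].lower()
    return not any(word in subject for word in ACTION_WORDS)


def missing_notes(items, noted_subjects):
    """Return subjects that passed filtering but have no research note yet."""
    return [
        item["subject"]
        for item in items
        if passes_filter(item) and item["subject"] not in noted_subjects
    ]


# Hypothetical captures, for illustration only.
items = [
    {"subject": "Context engineering article", "kind": "article"},
    {"subject": "Please respond by Friday", "kind": "article"},
    {"subject": "Your ticket for the summit", "kind": "event_registration"},
    {"subject": "Agent memory paper", "kind": "article"},
]
missing = missing_notes(items, noted_subjects={"Context engineering article"})
print(missing)
```

A non-empty result is what the constraint calls a pipeline failure: an item was listed without a research note.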
The Skill Configuration: Steps 1–4
Here is the actual constraint block that governs the ingestion-relevant steps in the post-creation pipeline. These are the instructions the agent follows when processing source material for a new post:
Step 1 — Read Source Material:
Constraints:
- If source_note provided: read the full vault note
- If source_url provided: fetch via web_fetch, extract key arguments
and quotes
- If source_url is YouTube: use yt-dlp for transcript
(never ask user to paste title + description)
- If fetch fails (SPA, paywall, auth-required): suggest
"File > Print > Save as PDF" as extraction method
- Identify key arguments, quotes, and insights matching user's focus
- Extract 2-3 strong direct quotes for potential use in the post
Step 2 — Follow the Reference Chain:
Constraints:
- Identify links within the source to related pieces
(companion articles, referenced blog posts, research papers)
- Fetch and summarize the 2-4 most relevant linked articles
- Prioritize: same series, supports argument, provides counterpoints
- Do NOT follow more than 5 links (diminishing returns)
- If reference is a PDF: offer to extract figures for embedding
with attribution
Step 3 — Find Own Prior Work:
Constraints:
- Search user's website (schristoph.online/blog/) for related posts
- Grep vault's 100-myLinkedIn/ folder for related drafts and posts
- Identify pieces that can be naturally referenced
(same topic, supporting argument, prior exploration of theme)
- Aim for 2-5 own references (builds body of work without being
self-promotional)
- Fetch full content of own articles before writing any derivative
content — do NOT draft from titles or descriptions alone
Step 4 — Find Vault Connections:
Constraints:
- Grep vault for keywords from the topic
(service names, concepts, people)
- Check: customer workstreams, initiatives, podcast episodes,
meeting notes, SIFT entries
- Maximum 3 grep calls (diminishing returns)
- Vault connections inform the writing but are NOT necessarily
referenced in the published post — they provide context
These four steps run before a single word of the post is written. The result is a rich context package: source material analyzed, reference chain followed, own prior work identified, and vault connections surfaced. The writing step (Step 7 in the full pipeline) then has everything it needs to produce a draft that is grounded in evidence rather than generated from thin air.
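The Step 2 limits (prioritize, fetch the 2-4 most relevant, never more than 5) can be sketched as a ranking function. This assumes each link has already been labeled with its relation to the source; the relation labels, the function name, and the example.com URLs are hypothetical.

```python
# Priority order from Step 2: same series, supports argument, provides counterpoints.
PRIORITY = {"same_series": 0, "supports_argument": 1, "counterpoint": 2}
MAX_LINKS = 5  # hard cap from the constraint block


def select_links(links, limit=4):
    """Pick up to `limit` links, ordered by relation priority.

    `links` is a list of (url, relation) tuples. Links with an unknown
    relation are skipped. The sort is stable, so links with the same
    relation keep their original order.
    """
    if limit > MAX_LINKS:
        raise ValueError(f"limit must not exceed {MAX_LINKS}")
    ranked = sorted(
        (link for link in links if link[1] in PRIORITY),
        key=lambda link: PRIORITY[link[1]],
    )
    return [url for url, _ in ranked[:limit]]


# Placeholder links, for illustration only.
links = [
    ("https://example.com/a", "counterpoint"),
    ("https://example.com/b", "unrelated"),
    ("https://example.com/c", "same_series"),
    ("https://example.com/d", "supports_argument"),
    ("https://example.com/e", "supports_argument"),
    ("https://example.com/f", "same_series"),
]
selected = select_links(links)
print(selected)
```

Enforcing the cap in code rather than in prose means an over-eager agent run fails loudly instead of quietly following a sixth link.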
How This Differs from RSS + AI Summaries
If you are thinking “I could do this with Feedly and an AI summarizer,” you are half right. Three differences matter:
Vault integration. An RSS summary tells you what an article said. A research note tells you how it connects to your customer’s container migration project, the podcast episode you analyzed last week, and the question someone asked in a meeting three days ago. The connections are the value, not the summary.
Selective depth. Most RSS+AI tools summarize everything equally, which means you get 200 summaries of equal weight and no signal about which ones deserve your attention. This system rates and filters before processing; only ⭐⭐⭐ content gets the full treatment, which means the research notes you actually read are pre-filtered for relevance to your work.
Pipeline feed-forward. Research notes have a “blog potential” field that makes them discoverable when you are looking for post topics. The ingestion layer does not just inform you — it feeds the next pipeline stage directly. When I sit down to plan next week’s content, I filter research notes by “blog potential: high” and I have a curated list of ideas with source material already analyzed, quotes already extracted, and connections already mapped.
The Tools
yt-dlp for Discovery and Transcripts
[yt-dlp][4] is the backbone of the video intake. The pipeline never downloads the actual video, only metadata and subtitles:
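# List new episodes since last intake.
# Note: in --flat-playlist mode yt-dlp may not know upload dates, which would
# leave --dateafter with nothing to filter on; the extractor argument below
# requests approximate dates (they can be slightly off).
yt-dlp --flat-playlist \
  --extractor-args "youtubetab:approximate_date" \
  --print "%(upload_date)s|%(duration_string)s|%(title)s|%(url)s" \
  --dateafter 20260520 \
  --playlist-end 150 \
  "https://www.youtube.com/@AIEngineer"
# Download auto-generated subtitles for a specific episode (no video)
yt-dlp --write-auto-sub --sub-lang en \
  --skip-download --convert-subs srt \
  -o "/tmp/video-%(id)s" \
  "https://www.youtube.com/watch?v=VIDEO_ID"
# Clean SRT to readable text: drop cue numbers, timestamp lines, blank lines,
# and inline tags, then remove repeated lines
sed '/^[0-9]*$/d; /^[0-9][0-9]:[0-9][0-9]/d; /^$/d; s/<[^>]*>//g' \
  /tmp/video-*.srt | awk '!seen[$0]++'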
A note on transcript quality: auto-generated YouTube subtitles have roughly 85–90% accuracy for English speakers with clear audio. For heavily accented speakers or dense technical jargon, accuracy drops noticeably. The mitigation is twofold: use professional transcript sources where available (MLST publishes via rescript.info, for example), and remember that research notes are summaries, not verbatim citations. Any direct quote used in a published post gets manually verified against the source video.
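For anyone who prefers to do the transcript cleanup inside a script rather than with sed and awk, the following sketch roughly mirrors that step in Python. It is an illustrative equivalent, not the pipeline's actual code, and it shares the same limitations: a spoken line consisting only of digits is dropped, and legitimately repeated lines are removed.

```python
import re

TAG = re.compile(r"<[^>]*>")          # inline tags such as <c>...</c>
TIMESTAMP = re.compile(r"^\d\d:\d\d")  # lines like 00:00:01,000 --> 00:00:03,000


def clean_srt(srt_text):
    """Reduce SRT subtitle text to unique, readable lines in original order."""
    seen = set()
    lines = []
    for raw in srt_text.splitlines():
        line = TAG.sub("", raw).strip()
        # Skip blank lines, cue numbers, and timestamp lines.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        # Auto-generated captions repeat lines across cues; keep the first only.
        if line in seen:
            continue
        seen.add(line)
        lines.append(line)
    return "\n".join(lines)


sample = "\n".join([
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "hello <c>world</c>",
    "",
    "2",
    "00:00:03,000 --> 00:00:05,000",
    "hello world",
    "this is a test",
])
cleaned = clean_srt(sample)
print(cleaned)
```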
The Knowledge Base Structure
Everything lands in a structured vault that mirrors how I think about the content:
06-knowledge/research/
├── youtube-channel-monitor.md # Master channel list
├── ai-engineer-channel-inventory.md # Per-channel episode list
├── mlst-channel-inventory.md
├── answers/2026/05/ # Research notes by month
│ ├── context-engineering-patterns.md
│ ├── agent-memory-goldfish-to-elephant.md
│ └── ...
└── transcripts/ # Raw transcripts (temporary)
Each research note follows a consistent structure: summary, key quotes with timestamps, vault connections (wikilinks to related customer work, initiatives, or prior posts), and a blog potential assessment. That last field is what makes the ingestion layer feed the content pipeline directly. It is the bridge between “I consumed this” and “I should write about this.”
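One way to picture that structure is as a small data class that renders to markdown. This is a sketch under assumptions: the field names, heading layout, and the "Blog potential:" line format are hypothetical, since the exact note template is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchNote:
    """Hypothetical in-memory shape of a research note."""
    title: str
    summary: str
    key_quotes: list[tuple[str, str]] = field(default_factory=list)  # (timestamp, quote)
    vault_connections: list[str] = field(default_factory=list)       # names of linked notes
    blog_potential: str = "low"

    def to_markdown(self):
        """Render the note as markdown with wikilinks for vault connections."""
        lines = [f"# {self.title}", "", "## Summary", self.summary, "", "## Key quotes"]
        lines += [f"- [{timestamp}] {quote}" for timestamp, quote in self.key_quotes]
        lines += ["", "## Vault connections"]
        lines += [f"- [[{name}]]" for name in self.vault_connections]
        lines += ["", f"Blog potential: {self.blog_potential}"]
        return "\n".join(lines)


def high_potential(notes):
    """Return titles of notes worth considering as post topics."""
    return [note.title for note in notes if note.blog_potential == "high"]


note = ResearchNote(
    title="Context engineering patterns",
    summary="Short summary.",
    key_quotes=[("12:34", "A placeholder quote.")],
    vault_connections=["agent-memory-goldfish-to-elephant"],
    blog_potential="high",
)
markdown = note.to_markdown()
print(markdown)
```

Keeping the blog potential as its own field is what makes a filtered view such as `high_potential` a one-liner.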
Two mechanisms prevent the vault from becoming a graveyard of unread notes. First, the “blog potential” field creates a filtered view so you only see high-potential items when looking for topics. Second, the morning brief surfaces connections between new content and existing research notes, which keeps older notes alive by linking them to fresh context. Notes that never get referenced in 90 days are not deleted. They are naturally deprioritized. The vault is a searchable archive, not a reading list with guilt attached.
“But Doesn’t This Create a Filter Bubble?”
Yes, and that’s a tension worth sitting with.
The ingestion layer is not for general education. It is for content production. A filter bubble is useful when your goal is depth over breadth. You can’t write with authority about everything. But here’s the thing: the best content comes from connecting your domain expertise to something unexpected. The post that resonates isn’t “here’s another take on the thing everyone is already talking about.” It’s “here’s a pattern from evolutionary biology that explains why your agent architecture keeps failing.”
That novelty requires leaving the bubble deliberately. Not randomly, but strategically. My mitigations: override rules for high-trust channels (MLST gets ⭐⭐⭐ regardless of topic match because its depth consistently surprises me), the ad-hoc capture layer for anything that breaks the pattern, and a weekly review that surfaces ⭐ items recurring across multiple sources (which signals a blind spot worth investigating). I also keep two channels in the monitor that are deliberately outside my domain, one on economics and one on design, because the cross-pollination produces my most original thinking. The filter bubble is the default. Breaking it is the practice.
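The weekly blind-spot review can be sketched as a grouping step. It assumes each rated item carries a topic label, which is a hypothetical field: how topics are assigned is not specified here, and the source names below are placeholders.

```python
from collections import defaultdict


def recurring_low_rated(items, min_sources=2):
    """Return one-star topics that appear in at least `min_sources` distinct sources.

    `items` is a list of (topic, source, stars) tuples.
    """
    sources_by_topic = defaultdict(set)
    for topic, source, stars in items:
        if stars == 1:
            sources_by_topic[topic].add(source)
    return sorted(
        topic for topic, sources in sources_by_topic.items() if len(sources) >= min_sources
    )


# Hypothetical week of rated items, for illustration only.
week = [
    ("robotics", "Channel A", 1),
    ("robotics", "Channel B", 1),
    ("robotics", "Channel B", 1),   # same source twice counts once
    ("pricing", "Channel A", 1),
    ("agent memory", "Channel C", 3),
]
blind_spots = recurring_low_rated(week)
print(blind_spots)
```

Counting distinct sources rather than raw mentions keeps one prolific channel from manufacturing a trend on its own.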
From Inputs to Ideas: The Human Part
Raw inputs are not ideas.
An article about a new vulnerability is not a blog post. A customer question about container security is not a narrative. A podcast episode about collective intelligence is not an argument. The creative leap — connecting dots, finding the angle, identifying what is worth saying that has not been said — remains fundamentally human.
But agents help you see connections you would miss. When I review my accumulated inputs, the agent surfaces patterns: “You have saved three articles about AI-driven vulnerability discovery this week. Your last meeting included a question about automated security scanning. There is a thread in a security-focused channel about the same topic.” That pattern recognition does not write the post. It tells me where the energy is, where multiple independent signals converge on the same theme.
The human decision remains: what is my angle? What do I believe about this topic that is worth arguing? What would I want to read if someone else wrote it?
For the [732-Byte Wake-Up Call][5] post, the inputs were a YouTube video about the Copy Fail exploit, Anthropic’s Mythos cybersecurity assessment, and a Kubernetes security analysis. The angle (that these two events arriving simultaneously signal a fundamental shift in the security equilibrium) was mine. The agent did not generate that thesis. But it made sure all three inputs were in front of me at the same time, with their connections already mapped.
When It Breaks
Three failure modes to watch for, because every system has them:
Source goes dark. A channel stops posting or changes format. The inventory shows stale dates, which surfaces in the morning brief as “no new episodes in 3 weeks.” The fix is simple: investigate, and either update the channel URL or drop it from the list.
Rating drift. Your work themes evolve but the rating criteria do not, which means the system gradually becomes less relevant to your actual interests. Fix: review the skill configuration quarterly and update the theme keywords. I do this as part of a broader content plan review.
Processing backlog. Skip morning briefs for a week and 50+ episodes queue up. The system handles this gracefully (it processes in batches, oldest first), but the research notes pile up unreviewed. The fix is discipline, not tooling. The system cannot force you to read what it produces.
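The "batches, oldest first" behavior for a backlog can be sketched in a few lines. The batch size and the tuple shape are assumptions for illustration; the date format matches the `upload_date` value (YYYYMMDD) printed by yt-dlp, which sorts correctly as a string.

```python
def batches_oldest_first(episodes, batch_size=10):
    """Yield lists of episodes, oldest upload date first.

    `episodes` is a list of (upload_date, title) tuples with dates as
    YYYYMMDD strings. `batch_size` is a hypothetical setting.
    """
    ordered = sorted(episodes, key=lambda episode: episode[0])
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Hypothetical backlog, for illustration only.
backlog = [
    ("20260522", "Episode E"),
    ("20260518", "Episode A"),
    ("20260520", "Episode C"),
    ("20260519", "Episode B"),
    ("20260521", "Episode D"),
]
batches = list(batches_oldest_first(backlog, batch_size=2))
for batch in batches:
    print([title for _, title in batch])
```

Processing oldest first means an interrupted run leaves only the newest episodes unprocessed, which the next morning brief picks up naturally.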
The Cost
A question you are probably asking: what does this cost to run?
Near-zero marginal cost. yt-dlp is free and open-source (Unlicense), runs locally, and requires no API keys. Transcript processing uses the same LLM session that is already running for the morning brief. No separate inference calls. Storage is plain markdown files in an Obsidian vault. The only real cost is LLM inference for analyzing transcripts and generating research notes, which is bundled into the existing agent session that I am already paying for.
No separate API keys. No cloud services beyond what you already have. No subscriptions. The infrastructure cost of running this system is effectively the cost of the AI assistant you are already using for other work.
Making It Replicable
You do not need my exact setup. The pattern is transferable:
- Pick 5–10 trusted sources that consistently produce content relevant to your work
- Set up automated monitoring — yt-dlp for YouTube, RSS for blogs, email scanning for ad-hoc captures
- Define a rating system that separates “interesting” from “worth deep analysis”
- Create research notes for high-rated items — summary, quotes, connections to your work, blog potential
- Review weekly and let patterns emerge
The tools are interchangeable. You could use Whisper instead of YouTube’s auto-captions. You could use Notion instead of Obsidian. You could use a different LLM. The discipline is what matters: the ingestion layer works because it runs every day whether you feel creative or not. When inspiration strikes, the material is already there, organized, connected, and waiting.
What’s Next
I’m currently working on Part 2, which covers the Research phase: how the agent verifies claims against a trust hierarchy of sources, follows reference chains to build evidence, and flags contradictions for human resolution rather than silently picking winners. Expect it around Thursday, June 4.
Sources
[1] S. Christoph, The AI Content Pipeline: How I Publish 3x a Week Without a Content Team (May 2026) — the parent article describing the full five-phase pipeline architecture
[2] S. Christoph, The Bottleneck Moved: What 10 Studies Say About AI Developer Productivity (May 2026) — why accelerating one workflow phase shifts the constraint elsewhere
[3] S. Christoph, Intelligence Is Collective, Not Artificial (May 2026) — the Michael I. Jordan episode on collective intelligence that emerged from the MLST override rule
[4] yt-dlp — GitHub (Unlicense) — open-source tool for downloading YouTube metadata and subtitles without video
[5] S. Christoph, The 732-Byte Wake-Up Call (May 2026) — example of how reference chain following transforms a single source into an original synthesis
[6] T. Kimutai, The Ingestion-Aware Content Strategy (2026) — external perspective on designing content workflows around ingestion patterns
About the Author
Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.
This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.
❤️ Created with the support of AI (Kiro)