AI Content Pipeline Deep Dive (2/5): Research
written by Stefan Christoph
- 16 minutes read
TL;DR: AI agents are confidently wrong about 1 in 10 factual claims. The research phase of a content pipeline isn’t “ask the agent what’s true”; it’s a system of constraints that physically prevent the agent from presenting claims without first fetching a real document. This post shows the trust hierarchy, reference chain following, selective verification, and vault connection patterns that turn AI-assisted research from fiction generator into fact-checker. The key insight: this is tool-use enforcement, not prompt engineering. You don’t ask nicely. You architect the system so lying is structurally impossible.
This is Part 2 of a five-part series dissecting the content pipeline I described in [The AI Content Pipeline][1]. [Part 1][2] covered Ingestion: how raw material flows in and patterns emerge. This post covers what happens after you have a thesis and need evidence.
The Moment That Changed Everything
I was reviewing an article my agent had drafted. The sources section looked clean: numbered references, proper formatting, plausible titles. One citation pointed to an AWS blog post about a feature I’d never heard of. The title sounded right. The URL structure looked legitimate.
I clicked it. 404.
The blog post didn’t exist. The agent had fabricated a reference that looked exactly like a real AWS publication — correct URL pattern, plausible title, appropriate date. If I hadn’t clicked, it would have gone into a published article with my name on it. I wrote about this experience in detail in [The Citation Crisis][3], where I traced the problem to its roots: 110,000+ scholarly papers from 2025 alone contain hallucinated citations that passed peer review. If expert reviewers at NeurIPS can’t catch fabricated references, what chance does a solo author have without systematic verification?
That experience rewired how I think about AI-assisted research. The problem isn’t that language models are occasionally wrong. The problem is that they’re wrong in ways that look exactly right. And the fix isn’t better prompting. It’s better architecture.
Why Prompting Fails and Systems Work
Here’s what happens when you ask an AI agent to “research” a topic without constraints: it generates plausible-sounding paragraphs that mix real facts with subtle fabrications. It cites papers that don’t exist. It attributes quotes to the wrong people. It presents outdated information as current. A 2025 Springer review of fact-checking approaches found that even state-of-the-art LLMs struggle with binary hallucination detection: the best models score F1 as low as 0.625 on “hard” category hallucinations [4].
This isn’t a bug you can prompt-engineer away.
Language models predict likely text, not true text. The research phase of the pipeline exists to turn that liability into an asset: use the agent’s breadth to find sources quickly, but verify everything against primary documents before it reaches the reader.
The key insight (and this took me months of building to internalize) is that this is tool-use enforcement, not prompt engineering. You don’t ask the agent to “please cite sources” (it’ll hallucinate them with perfect formatting). You architect the system so the agent physically cannot present claims without first fetching a real document. The constraint lives in the system, not in the prompt.
Think of it like input validation in software engineering. You don’t trust user input. You don’t trust model output either. The difference is that model output looks trustworthy, which makes it more dangerous than obviously malformed user input.
The Trust Hierarchy
Not all sources are equal. My pipeline enforces a strict ranking that determines default authority when sources conflict:
The trust hierarchy from my skill configuration, sources ranked by authority:
| Tier | Source | Strengths | Weaknesses |
|---|---|---|---|
| 1 | Official docs | Authoritative, versioned, current | Lags releases, misses edge cases |
| 2 | Primary sources | The actual paper, CVE, commit | May lack context |
| 3 | Expert analysis | Recognized experts, peer-reviewed | Opinions mixed with facts |
| 4 | Community signals | Real-time, tribal knowledge | Stale, unverified |
| 5 | LLM memory | Broad, always available | Knowledge cutoff, confidently wrong |
The hierarchy isn’t about infallibility. Official docs lag releases, community signals go stale, expert analysis mixes opinion with fact. The point is default authority: when sources contradict each other, the agent flags the conflict rather than silently picking a winner. The human resolves it.
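A minimal sketch of how the ranking and the flag-the-conflict rule could be encoded. The names `Tier`, `Evidence`, and `review` are illustrative assumptions, not the actual skill configuration, and comparing statements by exact string match is a deliberate simplification.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Lower value means higher default authority (mirrors the table above)."""
    OFFICIAL_DOCS = 1
    PRIMARY_SOURCE = 2
    EXPERT_ANALYSIS = 3
    COMMUNITY_SIGNAL = 4
    LLM_MEMORY = 5


@dataclass(frozen=True)
class Evidence:
    tier: Tier
    source: str      # URL or label describing where the statement came from
    statement: str   # what this source says about the claim


def review(evidence: list[Evidence]) -> dict:
    """Rank evidence by tier and flag disagreement instead of resolving it."""
    ranked = sorted(evidence, key=lambda e: e.tier)
    distinct = {e.statement for e in ranked}
    return {
        "default_authority": ranked[0] if ranked else None,
        "conflict": len(distinct) > 1,  # True means: hand it to the human
        "ranked": ranked,
    }


result = review([
    Evidence(Tier.LLM_MEMORY, "model recall", "feature is GA"),
    Evidence(Tier.OFFICIAL_DOCS, "docs page (hypothetical)", "feature is in preview"),
])
print(result["conflict"], result["default_authority"].tier.name)
```

The function never picks a winner on its own. It only reports which source holds default authority and whether the sources disagree, which keeps the final decision with the human.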
Why not trust LLM memory at all? Because it’s right often enough to be dangerous. A 90% accuracy rate means 1 in 10 claims is wrong, and you can’t tell which one without checking. The cost of fetching a source is seconds. The cost of publishing a wrong claim is your credibility. That asymmetry justifies the rule absolutely.
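To make the asymmetry concrete, here is a small calculation of how per-claim error compounds across a post. It assumes each claim is wrong independently of the others, which is a simplification, and it takes the 90% figure as the illustrative number used above, not a measured rate.

```python
def p_at_least_one_wrong(accuracy: float, claims: int) -> float:
    """Probability that at least one of `claims` independent claims is wrong."""
    return 1.0 - accuracy ** claims


# 5 to 15 verifiable claims is the range this post uses for a typical article.
for n in (5, 10, 15):
    print(n, round(p_at_least_one_wrong(0.90, n), 2))
# 5 0.41
# 10 0.65
# 15 0.79
```

Under those assumptions, an unverified post with ten factual claims is more likely than not to contain at least one error.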
I explored this asymmetry in [Your AI Judge Needs a Judge][5] — the same principle applies whether you’re evaluating model outputs for quality or verifying factual claims. The judge (or the researcher) needs external grounding, not self-assessment.
The Hard Constraint
This single rule prevents the most common failure mode:
From my ground-truth steering file — the constraint that governs all research behavior:
# Hard Constraint
#
# Do NOT present product/feature claims to the user until you have
# fetched at least one official doc source in the same turn.
# This applies regardless of whether the claim originates from
# vault notes, LLM memory, Slack, or email.
# If docs are unavailable or inconclusive, say so explicitly
# rather than presenting unverified content as fact.
Every factual claim must be backed by a fetched source, not memory. The agent can use its training data to find where to look (“I think this is documented in the Bedrock user guide”), but it cannot present the claim as fact until it has actually retrieved and read the document. The distinction is subtle but critical: memory guides the search, verification confirms the result.
In practice, this means the agent makes 5-15 web fetches per article. Each one takes seconds. The total verification overhead for a 2,000-word post is maybe 2-3 minutes of automated fetching. Compare that to the hours a human would spend manually checking the same claims, or worse, the reputational cost of publishing something wrong.
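One way to move the constraint out of the prompt and into code is a gate that refuses to release a claim unless its source was fetched in the same turn. This is a sketch under assumptions: `Turn`, `record_fetch`, `present`, and the example URL are hypothetical names, not the tooling described in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


class UnverifiedClaimError(RuntimeError):
    """Raised when a claim is presented without a source fetched this turn."""


@dataclass
class Turn:
    """Tracks which documents were actually fetched during one agent turn."""
    fetched: dict = field(default_factory=dict)  # url -> document text

    def record_fetch(self, url: str, text: str) -> None:
        if text.strip():  # an empty response does not count as a source
            self.fetched[url] = text

    def present(self, claim: str, source_url: Optional[str] = None) -> str:
        """Release a claim only if its source was fetched in this same turn."""
        if source_url is None or source_url not in self.fetched:
            raise UnverifiedClaimError(f"No fetched source for: {claim!r}")
        return f"{claim} [source: {source_url}]"


turn = Turn()
turn.record_fetch("https://docs.example.com/feature-x", "Feature X is in preview.")
print(turn.present("Feature X is in preview", "https://docs.example.com/feature-x"))

try:
    turn.present("Feature Y is GA")  # recalled from memory, never fetched
except UnverifiedClaimError as err:
    print("blocked:", err)
```

The gate only proves that a fetch happened. Whether the document actually supports the claim still has to be checked by reading it, which is the step the worked example later in this post walks through.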
Selective Verification: Not Everything Needs a Source
Not every sentence needs a source fetch. The agent distinguishes between factual claims and analytical framing:
- Factual claims (trigger verification): specific numbers, feature availability, dates, attributions, quotes, version details, API behavior
- Analytical framing (author’s domain): interpretations, opinions, predictions, personal experience, “I believe X will happen”
In practice, a 1,500-word post contains 5-15 verifiable claims. The agent checks those. Your analysis, opinions, and framing are yours to own. That’s where your voice lives. You can write “I think this changes the security landscape” without a citation, but “AWS announced X on date Y” needs one. “In my experience, teams struggle with Z” is analytical framing. “Teams report a 40% reduction in Y” is a factual claim that needs a source.
This distinction matters because over-verification kills the writing. If every sentence requires a footnote, you end up with an academic paper, not a blog post. The skill is knowing which sentences carry factual weight and which carry argumentative weight. The agent learns this distinction from the constraint structure, not from a prompt asking it to “be careful.”
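The split between factual claims and framing can be approximated with a crude first-pass filter. The patterns below are illustrative assumptions chosen to match the examples in this section; they are nowhere near a complete classifier, and the function name `needs_verification` is hypothetical.

```python
import re

# Signals that a sentence carries factual weight (illustrative, not exhaustive).
FACTUAL_PATTERNS = [
    r"\d",                                                      # numbers, dates, versions
    r"\b(announced|released|supports|launched|deprecated)\b",   # availability
    r"\b(according to|said|reports?)\b",                        # attributions, quotes
]
# Signals that the author is speaking in their own voice.
FRAMING_PATTERNS = [
    r"\b(I think|I believe|in my experience|in my view)\b",
]


def needs_verification(sentence: str) -> bool:
    """True if the sentence looks like a factual claim rather than framing."""
    if any(re.search(p, sentence, re.IGNORECASE) for p in FRAMING_PATTERNS):
        return False
    return any(re.search(p, sentence, re.IGNORECASE) for p in FACTUAL_PATTERNS)


for s in [
    "I think this changes the security landscape",
    "AWS announced X on date Y",
    "In my experience, teams struggle with Z",
    "Teams report a 40% reduction in Y",
]:
    print(needs_verification(s), "|", s)
```

Letting a framing phrase win outright is a known weakness: “I think AWS released 3 models” would slip through unchecked. A filter like this is only useful for building the candidate list that a human or agent then reviews.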
Following the Reference Chain
Good research doesn’t stop at the first source. Articles link to other articles. Papers cite other papers. The reference chain is where depth comes from, and where the difference between a summary and an original synthesis emerges.
Step 2 of the post-creation skill — Follow the Reference Chain:
# Step 2: Follow the Reference Chain
#
# Fetch articles linked from the source to build a richer reference base.
#
# Constraints:
# - Identify links within the source article to related pieces
#   (companion articles, referenced blog posts, research papers)
# - Fetch and summarize the most relevant linked articles (typically 2-4)
# - Prioritize links that are:
#   - From the same series
#   - Directly support the argument
#   - Provide counterpoints
# - Do NOT follow more than 5 links (diminishing returns)
# - If any reference is a PDF: offer to extract figures and embed
#   them in the article with attribution
Two design decisions deserve explanation.
First, the “do not follow more than 5 links” constraint prevents the agent from spidering through an entire citation graph, burning context window and time on increasingly tangential material. I’ve watched agents follow 15+ links when unconstrained — by link 8, the material is so far from the original thesis that it adds noise, not signal. The sweet spot is 2-4 linked articles that directly strengthen or challenge your thesis.
Second, “provide counterpoints” is listed as a prioritization criterion. This is deliberate and hard-won. Without it, the agent cherry-picks sources that confirm your thesis, a digital confirmation bias that feels productive but produces weaker arguments. The constraint forces exposure to disagreement, which either strengthens your argument (you can address the counterpoint) or reveals a flaw worth acknowledging honestly.
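Both design decisions can be expressed as a small selection function. This is a sketch, assuming the relevance flags on each link have already been set (by the agent or a human); the `Link` type, the scoring, and the rule that swaps in a counterpoint when none made the cut are my illustrative assumptions, not the skill's actual logic.

```python
from dataclasses import dataclass

MAX_LINKS = 5  # hard cap from the skill: never follow more than 5 links


@dataclass(frozen=True)
class Link:
    url: str
    same_series: bool = False
    supports_argument: bool = False
    is_counterpoint: bool = False

    @property
    def priority(self) -> int:
        # Count how many of the three prioritization criteria the link meets.
        return sum((self.same_series, self.supports_argument, self.is_counterpoint))


def select_links(links: list[Link], limit: int = 4) -> list[Link]:
    """Pick the links to fetch: relevant ones only, best first, capped."""
    limit = min(limit, MAX_LINKS)
    relevant = [link for link in links if link.priority > 0]
    chosen = sorted(relevant, key=lambda link: link.priority, reverse=True)[:limit]
    # Keep at least one counterpoint in the selection if any exists.
    counterpoints = [link for link in relevant if link.is_counterpoint]
    if chosen and counterpoints and not any(link.is_counterpoint for link in chosen):
        chosen[-1] = counterpoints[0]
    return chosen
```

Even if a caller asks for more, `select_links` never returns more than `MAX_LINKS` entries, so the cap cannot be overridden by a generous `limit` argument.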
Example: How Reference Chains Build Depth
When I wrote The 732-Byte Wake-Up Call, the source material was a YouTube video about the Copy Fail vulnerability. The reference chain led to:
- The actual CVE (primary source, Tier 2) — gave me exact technical details
- Anthropic’s Mythos cybersecurity assessment (primary source, Tier 2) — gave me the AI angle
- A kernel security mailing list thread (community signal, Tier 4) — gave me the “27 years” context
- Two expert blog posts analyzing implications (Tier 3) — gave me the “what this means” framing
Each source added a layer the previous one lacked. Without the reference chain, the post would have been a summary of a YouTube video, competent but shallow. With it, it became an original synthesis connecting three separate developments into a thesis about the security equilibrium shifting. The reference chain is what transforms “I watched a video about X” into “here’s what X means when you connect it to Y and Z.”
Finding Your Own Prior Work
One of the most underrated research steps: searching your own published content for connections. This is where isolated articles become a knowledge graph.
Step 3 of the post-creation skill — Find Own Prior Work:
# Step 3: Find Own Prior Work
#
# Search the user's published content for articles that connect to the topic.
#
# Constraints:
# - Search user's website (default: schristoph.online/blog/) for related posts
# - Grep vault drafts folder for related content
# - Identify pieces that can be naturally referenced in the new post
# (same topic, supporting argument, prior exploration of the theme)
# - Aim for 2-5 own references (builds body of work without being
# self-promotional)
# - MUST fetch full content of own articles before writing any derivative
# content (teasers, summaries, references). Do NOT draft from titles
# or descriptions alone — this leads to inaccurate representations
# of the user's own work
That last constraint exists because of a failure mode I hit early and hard: the agent would reference my own articles based on their titles alone, misrepresenting what they actually argued. It would write “as I explored in [previous post], the key challenge is X”, but the previous post actually argued Y. The title suggested X; the content said Y. Now the agent reads the full content before citing anything, including my own work. This is the same verification principle applied inward.
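The “read before you cite” rule can be sketched as a function that withholds citation material until the full body has been retrieved. The `fetch_full_text` callable and the `min_words` threshold are assumptions introduced for illustration; the threshold is only a rough guard against treating a teaser or error page as the article.

```python
from typing import Callable, Optional


def cite_own_post(url: str, title: str,
                  fetch_full_text: Callable[[str], Optional[str]],
                  min_words: int = 200) -> dict:
    """Return citation material only after the full article body was read."""
    body = fetch_full_text(url)
    if not body or len(body.split()) < min_words:
        # A title, teaser, or failed fetch is not enough to describe the post.
        return {"url": url, "title": title, "citable": False, "body": None}
    return {"url": url, "title": title, "citable": True, "body": body}
```

Any summary or teaser is then drafted from `body`, never from `title`, so a misleading title cannot leak into the description of the post.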
The benefit of self-referencing isn’t vanity. It’s coherence. When posts reference each other, readers who find one post discover a network of related thinking. A reader who lands on this post and follows the link to [The Citation Crisis][3] gets the empirical evidence behind the trust hierarchy. Someone who reads that and follows the link to [LLMs Don’t Do Math][6] gets the deeper principle: models predict what answers look like, not what answers are. The network compounds.
For this post alone, the agent found connections to five of my prior articles, each adding a different dimension to the research verification argument. That’s not something I would have tracked manually across 60+ published posts.
Vault Connections: Private Context That Shapes Public Writing
Not everything that informs your writing should appear in the published post. Customer conversations, internal meeting notes, field observations: these shape your perspective without being citable. But they’re what make the difference between writing from theory and writing from experience.
Step 4 of the post-creation skill — Find Vault Connections:
# Step 4: Find Vault Connections
#
# Search the vault for related content that can inform the post.
#
# Constraints:
# - Grep vault for keywords from the topic (service names, concepts, people)
# - Check: customer workstreams, initiatives, podcast episodes,
# meeting notes, SIFT entries
# - Maximum 3 grep calls (diminishing returns)
# - Vault connections inform the writing but are NOT necessarily
# referenced in the published post — they provide context
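A budgeted vault search along these lines might look as follows. It is a sketch that assumes the vault is a folder of Markdown notes and uses a plain substring match in place of a real `grep` call; `search_vault` and `MAX_SEARCHES` are hypothetical names.

```python
from pathlib import Path

MAX_SEARCHES = 3  # mirrors "Maximum 3 grep calls"


def search_vault(vault: Path, keywords: list[str]) -> dict[str, list[Path]]:
    """Search Markdown notes for up to MAX_SEARCHES keywords (case-insensitive)."""
    hits: dict[str, list[Path]] = {}
    notes = sorted(vault.rglob("*.md"))
    for keyword in keywords[:MAX_SEARCHES]:  # keywords beyond the budget are ignored
        needle = keyword.lower()
        hits[keyword] = [
            note for note in notes
            if needle in note.read_text(encoding="utf-8", errors="ignore").lower()
        ]
    return hits
```

The result maps each keyword to the notes that mention it. Those notes are context for the writer, not citations, so nothing here feeds the published source list.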
When I write about verification pipelines, my vault has notes from actual customer conversations about governance gaps — enterprises where AI-generated internal documents cited plausible-sounding whitepapers that didn’t exist, and nobody caught it for months because the titles sounded right. I can’t cite those conversations. But they make my writing specific rather than generic. They’re why I can write “the compound failure pattern is especially dangerous in enterprise” with conviction rather than speculation.
My vault also contains the ground-truth steering file itself — the very rules that govern how my agent researches. That’s a meta-connection: the system I’m writing about is the system I’m writing with. The research skill verifies claims for this article using the same trust hierarchy this article describes. It’s turtles all the way down, but each turtle is grounded in a fetched document.
Research and Enrichment: Going Beyond Verification
Beyond verifying claims, the research phase actively enriches the post with context the author might not have considered:
Step 6 of the post-creation skill — Research and Enrich:
# Step 6: Research and Enrich
#
# Connect the post topic to relevant services, industry trends,
# or external context.
#
# Constraints:
# - If user requests a connection to an AWS service or topic:
# do web research to find the relevant angle
# - Verify ALL factual claims via web search or official docs
# before including them (posts are public and attributed to the user)
# - Search kiro.dev and aws.amazon.com when referencing those products
# - Keep product tie-ins natural and relevant — MUST NOT read like an ad
# - May suggest a tie-in if obvious, but MUST ask before including
The “MUST NOT read like an ad” constraint is critical and I enforce it ruthlessly. Technical professionals have finely tuned BS detectors: the moment a post reads like product marketing, you lose the reader’s trust permanently. The enrichment step adds relevant context (a related paper, a competing approach, an industry trend), not sales copy.
For this article, the enrichment step surfaced the Springer systematic review on fact-checking approaches [4], the tianpan.co production pipeline guide [7], and the Braintrust hallucination detection tools comparison [8]. None of these were in my original research plan. The agent found them because the enrichment step searches broadly, then I decide what’s relevant. Three of the ten results were useful. That’s a good hit rate for automated research.
Verification in Practice: A Worked Example
Here’s what verification looks like for a typical claim in one of my posts:
Claim in draft: “Amazon Bedrock now supports cross-region inference for all foundation models.”
Verification steps:
- Agent searches AWS docs for “Bedrock cross-region inference”
- Fetches the actual documentation page
- Checks: is it all models or a subset? Is it GA or preview? Which regions?
- Finds: it’s GA but only for specific models, not all
- Flags the discrepancy: “Your claim says ‘all foundation models’ but docs show a subset. Here’s the current list.”
The human decides: soften the claim, add the caveat, or restructure the argument. The agent caught the error before it went live. Without the verification constraint, that “all foundation models” claim would have published — it sounds right, it’s the kind of thing you’d expect to be true, and no reader would check. But it’s wrong. And wrong claims with your name on them compound into credibility erosion that’s invisible until someone calls you on it publicly.
In practice, roughly 1 in 3 posts has at least one claim that needs correction after verification. The most common catches:
- Features that are preview-not-GA (the agent’s training data often predates the GA announcement)
- Version numbers that changed between training cutoff and today
- Quotes attributed to the wrong person (the agent confuses who said what at which conference)
- “All” claims that should be “most” or “supported models include”
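The “all versus subset” catch from the worked example can be sketched as a scope check that runs after the documentation has been fetched and parsed. The model names are placeholders, and `check_scope` plus the quantifier list are illustrative assumptions; extracting the documented list from a real docs page is out of scope here.

```python
import re
from typing import Optional

# Words that turn a claim into a statement about every item in a set.
UNIVERSAL = re.compile(r"\b(all|every|any|always|never)\b", re.IGNORECASE)


def check_scope(claim: str, documented: set, universe: set) -> Optional[str]:
    """Flag a universal claim when the fetched docs only cover a subset."""
    if not UNIVERSAL.search(claim):
        return None  # no universal quantifier, nothing to flag here
    missing = sorted(universe - documented)
    if not missing:
        return None  # the docs really do cover everything
    return ("Claim uses a universal quantifier but docs omit: "
            + ", ".join(missing)
            + ". Consider 'supported models include ...'.")


flag = check_scope(
    "Amazon Bedrock now supports cross-region inference for all foundation models.",
    documented={"model-a", "model-b"},             # placeholder names from the docs
    universe={"model-a", "model-b", "model-c"},    # placeholder full catalog
)
print(flag)
```

The function returns a message for the human instead of rewriting the claim, which matches the division of labor above: the agent flags the discrepancy and the author decides how to fix it.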
The Feedback Loop: Research Improves the Researcher
There’s a compounding effect that isn’t obvious until you’ve run this system for months. Every verification failure teaches the system something. When the agent cites a non-existent blog post and gets caught, that failure mode gets encoded into the steering rules. When it confuses preview and GA status, the verification step gets more specific about checking availability status.
My ground-truth steering file has grown from 12 lines to over 100 lines over six months. Each addition represents a failure that was caught, analyzed, and prevented from recurring. The system learns, not through model fine-tuning, but through constraint accumulation. Every mistake makes the next article more reliable.
This is the same pattern I described in [Building Agents That Read the Web Right][9]: the agent doesn’t get smarter, but the system around it gets more precise about what “smart” means in context.
Making It Replicable
The research pattern works with any AI agent that can browse the web. You don’t need my specific tooling. You need these principles:
Establish your trust hierarchy. Which sources do you trust most for your domain? Rank them explicitly. Write it down. The ranking resolves conflicts before they become arguments.
Never accept LLM memory as fact. Always verify against a fetched source. Memory guides the search. Verification confirms the result. These are different operations.
Follow 2-4 reference links from your primary source, including counterpoints. Depth comes from connections, not from the first Google result.
Search your own prior work. Build coherence across your body of writing. Isolated articles are forgettable. Connected articles are a knowledge graph readers explore.
Use private context to inform, not cite. Your experience makes the writing specific. Your sources make it credible. These serve different functions.
Distinguish facts from framing. Verify claims. Own your analysis. The boundary between them is where your voice lives.
The constraint that matters most: do not present claims until you’ve fetched a source. Everything else is optimization. That one rule is the difference between AI-assisted research and AI-generated fiction. It’s the difference between a system that makes you faster and a system that makes you wrong faster.
What’s Next
I’m working towards Part 3, which covers Collaborative Writing: how the agent assists drafting without replacing your voice — voice matching, iterative feedback, and the discipline of never letting the agent write your first draft. The research phase gives you evidence. The writing phase turns evidence into argument. The boundary between them is where most AI-assisted writing goes wrong. Expect it around Tuesday, June 9.
Sources
[1] Christoph, S. “The AI Content Pipeline: How I Publish 3x a Week Without a Content Team” — /blog/ai-content-pipeline/
[2] Christoph, S. “AI Content Pipeline Deep Dive: Ingestion” — /blog/ai-pipeline-ingestion/
[3] Christoph, S. “The Citation Crisis: What AI Hallucinations Mean for Your Enterprise” — /blog/the-citation-crisis/
[4] “Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models” (Springer, Artificial Intelligence Review, 2025) https://link.springer.com/article/10.1007/s10462-025-11454-w
[5] Christoph, S. “Your AI Judge Needs a Judge” — /blog/your-ai-judge-needs-a-judge/
[6] Christoph, S. “LLMs Don’t Do Math (They Predict What Math Looks Like)” /blog/llms-dont-do-math/
[7] “Building a Hallucination Detection Pipeline for Production LLMs” (tianpan.co, April 2026) https://tianpan.co/blog/2026-04-10-hallucination-detection-pipeline-production
[8] “Best Hallucination Detection Tools for LLM Applications (2026)” (Braintrust, 2026) https://www.braintrust.dev/articles/best-hallucination-detection-tools-2026
[9] Christoph, S. “Building Agents That Read the Web Right” — /blog/building-agents-that-read-the-web-right/
About the Author
Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.
This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.
❤️ Created with the support of AI (Kiro)