Building Agents That Read the Web Right
The Other Side of the Coin
In a recent article, I made my website AI-agent friendly [1], adding llms.txt, Markdown output, and content negotiation to a Hugo site on AWS. That article was about the producer side: how to serve content that agents can consume efficiently.
But I left a question unanswered: what does it look like from the agent’s perspective?
In this article, I’m building two agents. Same task, same website, same model. The only difference: one reads the web the old way, the other uses the infrastructure I just built. The code is written with the Strands Agents SDK [2], an open-source framework from AWS for building AI agents in Python.
The Experiment
The task is simple: research my blog, discover what articles are available, pick one about AI agents, and summarize it.
Two agents attempt this:
- Agent V1 fetches the homepage HTML, parses what it can see, then fetches an article as HTML
- Agent V2 fetches `llms.txt` first, discovers the full site catalog, then fetches an article as clean Markdown

Same task, same number of requests. Different discovery, different format.
Both use Claude Sonnet on Amazon Bedrock. Both make the same number of HTTP requests. The difference is what they request and what they receive. The code is copy-paste ready. You can point it at any site with an llms.txt file, not just mine.
Agent V1: The HTML Struggle
The first agent has one tool: `fetch_webpage`. It fetches URLs and returns raw HTML.
"""
Agent V1: The Struggle — Researching a website using standard web search.
"""
from strands import Agent, tool
from strands.models import BedrockModel
import urllib.request
import urllib.error
@tool
def fetch_webpage(url: str) -> str:
"""Fetch a webpage and return its raw HTML content.
Args:
url: The URL to fetch.
Returns:
The raw HTML content of the page.
"""
req = urllib.request.Request(
url,
headers={"User-Agent": "StrandsResearchAgent/1.0"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
content = resp.read().decode("utf-8", errors="replace")
content_type = resp.headers.get("Content-Type", "unknown")
size = len(content.encode("utf-8"))
return (
f"[Fetched {url}]\n"
f"[Content-Type: {content_type}]\n"
f"[Size: {size:,} bytes]\n\n"
f"{content}"
)
except urllib.error.URLError as e:
return f"Error fetching {url}: {e}"
model = BedrockModel(
model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
region_name="us-west-2",
)
agent = Agent(
model=model,
tools=[fetch_webpage],
system_prompt=(
"You are a research agent. Your task is to research content on "
"websites using the fetch_webpage tool. Analyze what you receive "
"and extract useful information. Be explicit about any difficulties "
"you encounter with the content format."
),
)
result = agent(
"Research the blog at https://schristoph.online. "
"First, fetch the homepage to see what articles are available. "
"Then fetch one article about AI agents and summarize its key points. "
"Report on: (1) how many articles you found, (2) the key points of "
"the article you read, and (3) any difficulties with the content format."
)
# Token usage
usage = result.metrics.accumulated_usage
print(f"Input tokens: {usage['inputTokens']:,}")
print(f"Output tokens: {usage['outputTokens']:,}")
print(f"Total tokens: {usage['totalTokens']:,}")
The agent fetches the homepage, receives 17,886 bytes of HTML: navigation bars, CSS classes, meta tags, structured data, footer links. Somewhere in there are article titles and excerpts. The model has to parse through all of it to find the content.
Result: The agent found 7 articles. Only what’s rendered on the homepage. My blog has over 250 posts. The agent is blind to more than 95% of the content because it’s paginated or linked from archive pages it never visits.
It then fetches one article as HTML (26,156 bytes), extracts the content, and produces a reasonable summary. But it consumed 33,894 input tokens to get there. Nearly half of that is HTML noise the model had to process and discard.
Agent V2: The Smart Way
The second agent has two tools: `fetch_llms_txt` for discovery and `fetch_markdown` for content.
"""
Agent V2: The Smart Way — Using llms.txt and content negotiation.
"""
from strands import Agent, tool
from strands.models import BedrockModel
import urllib.request
import urllib.error
@tool
def fetch_llms_txt(base_url: str) -> str:
"""Fetch the llms.txt file from a website to discover its structure.
The llms.txt standard provides a structured index of a site's content
with direct links to Markdown versions of each page.
Args:
base_url: The base URL of the website (e.g. https://schristoph.online).
Returns:
The contents of the llms.txt file, or an error message.
"""
url = f"{base_url.rstrip('/')}/llms.txt"
req = urllib.request.Request(
url,
headers={"User-Agent": "StrandsResearchAgent/2.0"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
content = resp.read().decode("utf-8", errors="replace")
size = len(content.encode("utf-8"))
return (
f"[Fetched llms.txt from {base_url}]\n"
f"[Size: {size:,} bytes]\n\n"
f"{content}"
)
except urllib.error.URLError as e:
return f"No llms.txt found at {url}: {e}"
@tool
def fetch_markdown(url: str) -> str:
"""Fetch a page as clean Markdown using content negotiation.
Sends Accept: text/markdown to request Markdown format. If the
URL already ends in .md, fetches it directly.
Args:
url: The URL to fetch (can be an .md URL from llms.txt).
Returns:
The Markdown content of the page.
"""
req = urllib.request.Request(
url,
headers={
"User-Agent": "StrandsResearchAgent/2.0",
"Accept": "text/markdown, text/plain;q=0.9, */*;q=0.1",
},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
content = resp.read().decode("utf-8", errors="replace")
content_type = resp.headers.get("Content-Type", "unknown")
size = len(content.encode("utf-8"))
return (
f"[Fetched {url}]\n"
f"[Content-Type: {content_type}]\n"
f"[Size: {size:,} bytes]\n\n"
f"{content}"
)
except urllib.error.URLError as e:
return f"Error fetching {url}: {e}"
model = BedrockModel(
model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
region_name="us-west-2",
)
agent = Agent(
model=model,
tools=[fetch_llms_txt, fetch_markdown],
system_prompt=(
"You are a research agent that uses the llms.txt standard and "
"content negotiation to efficiently consume website content. "
"Always start by fetching llms.txt to discover the site structure. "
"Then use the Markdown URLs from llms.txt to fetch articles. "
"Report on the content format you received and its quality."
),
)
result = agent(
"Research the blog at https://schristoph.online. "
"First, discover the site structure using llms.txt. "
"Then fetch one article about AI agents and summarize its key points. "
"Report on: (1) how many articles you found, (2) the key points of "
"the article you read, and (3) the quality of the content format."
)
# Token usage
usage = result.metrics.accumulated_usage
print(f"Input tokens: {usage['inputTokens']:,}")
print(f"Output tokens: {usage['outputTokens']:,}")
print(f"Total tokens: {usage['totalTokens']:,}")
The agent fetches llms.txt, a structured catalog of every article on the site, with titles, summaries, and direct links to Markdown versions. One request, and it knows the entire site.
Result: The agent discovered all 250+ articles. It picked the most relevant one, fetched it as clean Markdown (15,230 bytes, zero HTML noise), and produced a detailed, accurate summary.
The Numbers
Here’s where it gets interesting. And honest.
Per-Article: Markdown Wins Clearly
For the same article, fetched as HTML vs Markdown:
| Metric | HTML | Markdown | Change |
|---|---|---|---|
| Response size | 26,156 bytes | 15,230 bytes | -42% |
| HTML tags to parse | 439 | 0 | |
| Nav/header/footer elements | 4 | 0 | |
| Noise ratio | 47% | ~0% | |
| Content-Type | text/html | text/markdown | |
The Markdown version is 42% smaller and 100% signal. No navigation, no CSS classes, no meta tags. Just the article content with its headings, code blocks, and links intact. Markdown still has structural elements (headings, image references, citations), but those are semantic structure that helps the agent understand the content, unlike HTML’s presentational noise.
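The tag counts and noise ratio in the table above are the author's measurements. A rough way to reproduce similar numbers for any page is a small stdlib-only script; `TextExtractor` and `noise_ratio` below are hypothetical names for this sketch, not code from the experiment:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and count opening tags while parsing."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def noise_ratio(html: str) -> tuple[int, float]:
    """Return (tag count, fraction of bytes that are not visible text)."""
    parser = TextExtractor()
    parser.feed(html)
    text = re.sub(r"\s+", " ", "".join(parser.text_parts)).strip()
    total = len(html.encode("utf-8"))
    signal = len(text.encode("utf-8"))
    return parser.tag_count, 1 - signal / max(total, 1)
```

Run against a Markdown response, the same function reports zero tags and a near-zero noise ratio, since almost every byte is visible text.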
End-to-End: Not What I Expected
| Metric | V1 (HTML) | V2 (llms.txt + MD) |
|---|---|---|
| Articles discovered | 7 | 250+ |
| Discovery coverage | <5% of site | 100% of site |
| Article content format | HTML (47% noise) | Markdown (0% noise) |
| Input tokens | 33,894 | 104,979 |
| Output tokens | 1,119 | 1,155 |
| Tool calls | 3 | 3 |
I expected V2 to use fewer tokens. It used three times more.
The reason: my llms.txt file is 121KB. It contains titles and opening paragraphs for all 250+ posts, including Unicode bold characters from LinkedIn-style posts that are multi-byte. The agent ingests the entire catalog in one request, and that catalog is bigger than the homepage HTML.
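The byte cost of those characters is easy to underestimate. Mathematical sans-serif bold letters live outside Unicode's Basic Multilingual Plane, so each one takes 4 bytes in UTF-8 (and often several tokens) versus 1 byte for its ASCII equivalent:

```python
plain = "Bold"
# "Bold" written in LinkedIn-style mathematical sans-serif bold characters
fancy = "\U0001d5d5\U0001d5fc\U0001d5f9\U0001d5f1"

# Same four visible letters, four times the bytes on the wire.
print(len(plain.encode("utf-8")))  # 4
print(len(fancy.encode("utf-8")))  # 16
```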
I went in expecting a story about token efficiency. What I got was a story about discovery quality. V1 found 7 articles, whatever the homepage renders. V2 found all of them. Same number of tool calls, but fundamentally different awareness of what exists.
“But Couldn’t the Agent Just Crawl?”

One has the full picture. The other is guessing.
Fair question. An agent with a basic fetch_webpage tool could discover all articles, the same way a human would. Follow the “Older posts” link, parse the next page, repeat. Or use a search engine: `site:schristoph.online AI agents`.
The question isn’t whether it’s possible. It’s what it costs.
| Approach | Requests | Tokens (est.) | Coverage | Dependencies |
|---|---|---|---|---|
| Homepage only | 1 | ~18K | 5% | None |
| Crawl all pagination | ~15 | ~270K | 100% | Consistent HTML structure |
| sitemap.xml + fetch | 1 + N | varies | 100% | Fetches HTML, includes non-content pages |
| External search | 1-2 | varies | Partial, relevance-ranked | Search API key |
| llms.txt (verbose) | 1 | ~105K | 100% | Site implements standard |
| llms.txt (optimized) | 1 | ~63K | 100% | Site implements standard |
Crawling pagination would cost ~15 requests and ~270K tokens of mostly HTML noise (navigation bars, footers, and CSS classes repeated on every page). External search gives you 10-20 results ranked by someone else’s algorithm, not the site owner’s structure. And you’re adding a third-party dependency.
llms.txt gives complete coverage in a single request. The site owner curates what the agent sees. No parsing ambiguity, no dependency on HTML structure staying consistent, no search API key.
The advantage isn’t “can vs can’t.” It’s 1 structured request vs 15 noisy ones.
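Part of what keeps the single request cheap on the agent side is that the llms.txt entry format is regular: one `- [title](url)` line per page, with an optional `: description` suffix. A minimal parser fits in a regex; `parse_catalog` below is an illustrative helper under that assumption, not part of the article's agents:

```python
import re

# One llms.txt entry per line: "- [title](url)" plus an optional ": description".
LINK_RE = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?:: (?P<desc>.*))?$")


def parse_catalog(llms_txt: str) -> list[dict]:
    """Extract title/url/description entries from an llms.txt catalog."""
    entries = []
    for line in llms_txt.splitlines():
        m = LINK_RE.match(line.strip())
        if m:
            entries.append({"title": m["title"], "url": m["url"],
                            "desc": m["desc"] or ""})
    return entries
```

Compare that to HTML pagination, where the selectors for titles, links, and "Older posts" differ per theme and break when the theme changes.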
The Verbosity Trade-Off
That 121KB llms.txt bothered me. I looked at what others are doing:
| Site | llms.txt Size | Links | Style |
|---|---|---|---|
| FastHTML (spec author) | 4.8 KB | ~20 | Titles + short descriptions |
| Strands Agents | 42 KB | ~450 | Titles only, nested hierarchy |
| Stripe Docs | 92 KB | ~650 | Titles + one-line descriptions |
| Vercel Docs | 332 KB | ~1,900 | Titles + one-line descriptions |
| My blog | 121 KB | 279 | Titles + ~160 char excerpts |
My file is disproportionately large for its link count. The culprit: Hugo’s `.Summary` template dumps the article’s opening paragraph as the description. Conversational openers, Unicode bold characters, emoji. Content that helps a human browsing a feed, but doesn’t help an agent decide which article to fetch.
The llms.txt spec [5] is intentionally minimal about this. It says descriptions are optional: each link can have a `: description` after it, or not. The guidance is simply: “Use concise, clear language.”
So I created a concise variant (titles only, no descriptions) and ran the same experiment. Then I went further: I actually fixed the Hugo template and deployed it. The optimized llms.txt is now live on my site. Zero multi-line spillover, descriptions collapsed and truncated, and long LinkedIn-style titles left without redundant descriptions.
Here are the real numbers, all measured against the production site:
| Metric | V1 (HTML) | V2 (old llms.txt) | V2 (optimized llms.txt) |
|---|---|---|---|
| Catalog size | 17,886 B | 121,321 B | 65,003 B |
| Articles discovered | 7 | 250+ | 250+ |
| Discovery coverage | <5% | 100% | 100% |
| Input tokens | 33,894 | 104,979 | 62,740 |
| Total tokens | 35,013 | 106,134 | 63,552 |
| Tool calls | 3 | 3 | 3 |
The optimized llms.txt cuts tokens by 40% compared to the verbose version, while maintaining complete discovery. The file went from 121KB to 65KB, zero spec violations, zero duplicate descriptions.
The honest math: even the optimized version uses ~29K more tokens than V1’s homepage-only approach. But those 29K tokens buy you 100% site coverage instead of 5%. That’s the real trade-off: ~29K extra tokens for complete discovery vs saving tokens and being blind to 95% of the content.
For well-titled blog posts, titles alone are enough for an agent to pick the right article. Descriptions add value when titles are ambiguous, but for most content sites, a concise llms.txt is the sweet spot.
Fixing the Template
So I dug into the Hugo template to fix it. Diagnosing why my llms.txt was bloated revealed three issues:
| Issue | Count | Impact |
|---|---|---|
| Description duplicates title | 217 / 279 (78%) | Tokens wasted repeating information |
| Multi-line description spillover | 187 lines | Breaks the llms.txt spec format |
| Verbose descriptions (~160 chars avg) | 277 / 279 | Opening paragraphs, not summaries |
The root cause: Hugo’s .Summary dumps the article’s opening paragraph, including line breaks that spill into the next line, violating the spec’s one-entry-per-line format. For LinkedIn-style short posts, the title is the first line of content, so the description just repeats it.
The fix is a smarter template that applies three rules:
- Prefer frontmatter `description` — if the author wrote a hand-crafted summary, use it
- Skip descriptions for long titles — posts with 80+ character titles (typical LinkedIn reposts) are self-descriptive
- Collapse and truncate everything else — strip formatting, collapse whitespace, cap at 100 characters
The Hugo template below is the concrete implementation, but the principles apply to any static site generator or CMS: WordPress, Next.js, Jekyll, and Astro all have equivalent template mechanisms.
```go-html-template
# {{ .Site.Title }}

> {{ .Site.Params.description }}

## Blog Posts

{{ range (where .Site.RegularPages "Section" "blog").ByDate.Reverse -}}
{{- $title := .Title -}}
{{- $url := printf "%sindex.md" .Permalink -}}
{{- $desc := "" -}}
{{- if .Params.description -}}
{{- $desc = .Params.description -}}
{{- else if le (len $title) 80 -}}
{{- $desc = .Summary | plainify | replaceRE "\\s+" " " | truncate 100 -}}
{{- end -}}
{{- if $desc -}}
- [{{ $title }}]({{ $url }}): {{ $desc }}
{{ else -}}
- [{{ $title }}]({{ $url }})
{{ end -}}
{{- end }}

## Optional

- [About]({{ "about/" | absURL }})
- [Imprint]({{ "imprint/" | absURL }})
```
The key functions: `plainify` strips HTML/Markdown formatting. `replaceRE "\\s+" " "` collapses newlines into spaces, fixing the multi-line spillover. `truncate 100` caps length with an automatic “…” suffix.
The `## Optional` section follows the llms.txt spec [5]. Agents that need a shorter context can skip it.
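For generators without Hugo's pipe syntax, the same three rules fit in a few lines of Python. This is a sketch, not the deployed template: `catalog_entry` is a hypothetical helper that mirrors the template logic, using the 80-character title and 100-character description thresholds described above:

```python
import re


def catalog_entry(title: str, url: str, summary: str = "",
                  description: str = "") -> str:
    """Render one llms.txt link line using the three rules from the post."""
    if description:                       # Rule 1: prefer hand-written frontmatter
        desc = description.strip()
    elif len(title) > 80:                 # Rule 2: long titles are self-descriptive
        desc = ""
    else:                                 # Rule 3: collapse whitespace, cap length
        desc = re.sub(r"\s+", " ", summary).strip()
        if len(desc) > 100:
            desc = desc[:100].rstrip() + "…"
    return f"- [{title}]({url}): {desc}" if desc else f"- [{title}]({url})"
```

The whitespace collapse is what prevents multi-line spillover: whatever the summary looks like, the output is always a single catalog line.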
The long-term improvement: add a `description` field to each post’s frontmatter. This is the same field Hugo uses for `<meta name="description">` in HTML, so it improves both SEO and agent discoverability with a single edit:
```yaml
---
title: "Making My Website AI-Agent Friendly"
description: "Adding llms.txt, Markdown output, and CloudFront content negotiation to a Hugo site. Before/after: 51% fewer tokens, correct answers."
---
```
I deployed this change. The optimized llms.txt is live: 65KB (down from 121KB), zero spec violations, zero duplicate descriptions. The numbers in the comparison table above are measured against the production site.
What Strands Makes Easy
A few things worth noting about the Strands Agents SDK that made this experiment simple:
Tools are just functions
The `@tool` decorator turns any Python function into something the agent can call. The docstring becomes the tool description the model uses to decide when to invoke it. No schema files, no registration boilerplate.
Token tracking is built in
`result.metrics.accumulated_usage` gives you input, output, and total tokens across the entire agent interaction, including all tool calls. No manual counting needed.
The model decides the workflow
I didn’t hardcode “fetch llms.txt first, then fetch an article.” The system prompt suggests the approach, and the model figures out the execution order. V2’s agent autonomously decided which article was most relevant to the task from the 250+ available. The pattern is model-agnostic. Any model with tool-use support works. Smaller models with tighter context windows actually benefit more from the reduced token count.
The Pattern for Agent Builders
If you’re building agents that consume web content, here’s the practical takeaway:
- Check for llms.txt first. One HTTP request tells you if the site is agent-friendly. If it exists, you get a structured catalog with direct links to clean content. If it doesn’t, fall back to HTML; the `Accept: text/markdown` header is ignored gracefully by servers that don’t support it.
- Use content negotiation. Send `Accept: text/markdown` in your requests. Sites that support it will serve you clean Markdown from the same URL. Sites that don’t will serve HTML as usual.
- Prefer .md URLs when available. If llms.txt gives you direct Markdown links, use them. No content negotiation needed, no ambiguity about what you’ll receive.
- Budget for discovery. A full llms.txt costs tokens to ingest, but it’s a one-time cost per site that enables precise content selection. Cache it across a session. The upfront investment amortizes across follow-up queries.
And if you’re a site owner generating llms.txt:
- Keep descriptions concise, or skip them. Titles alone are sufficient for well-named content. A 60KB concise catalog performs as well as a 120KB verbose one for article selection, at half the token cost. Save the detail for the articles themselves.
The Chicken-and-Egg Problem

Same website, different experience.
As I noted in the previous article [1], almost no AI agent actually uses llms.txt or content negotiation today. Dries Buytaert found zero `Accept: text/markdown` requests in a month of logs [3]. SonicLinker analyzed 2 million AI-agent requests and found zero requests for `/llms.txt` [4].
But here’s the thing: agents don’t request Markdown because almost no website offers it. And websites don’t offer it because no agent requests it. Someone has to break the cycle on both sides.
The previous article broke it on the producer side. This one breaks it on the consumer side. Two working agents, running against a real website, with real token counts. The infrastructure works. The discovery and efficiency gains are real. And as I showed in the previous article [1], the answer quality difference is equally dramatic. The same model gave a wrong answer from HTML and a correct answer from Markdown, for the same question. What’s missing is adoption.
If you’re building agents that consume web content, try it. The code above is copy-paste ready. Point it at any site with an llms.txt file and see the difference.
💬 Are you building agents that consume web content? What’s your approach to handling HTML noise? Have you tried llms.txt or content negotiation?
Sources:
[1] Making My Website AI-Agent Friendly — Here’s What Changed: schristoph.online
[2] Strands Agents SDK — Open-source AI agent framework from AWS: strandsagents.com / GitHub
[3] Dries Buytaert — Markdown, llms.txt and AI crawlers (March 2026): dri.es
[4] SonicLinker — We analyzed 2M AI-agent requests. None asked for llms.txt. (February 2026): soniclinker.com
[5] llms.txt Specification (Jeremy Howard, Answer.AI): llmstxt.org