Building Agents That Read the Web Right
The Other Side of the Coin
In a recent article, I made my website AI-agent friendly [1], adding llms.txt, Markdown output, and content negotiation to a Hugo site on AWS. That article was about the producer side: how to serve content that agents can consume efficiently.
But I left a question unanswered: what does it look like from the agent’s perspective?
In this article, I’m building two agents. Same task, same website, same model. The only difference: one reads the web the old way, the other uses the infrastructure I just built. The code is written with the Strands Agents SDK [2], an open-source framework from AWS for building AI agents in Python.
The Experiment
The task is simple: research my blog, discover what articles are available, pick one about AI agents, and summarize it.
Two agents attempt this:
- Agent V1 fetches the homepage HTML, parses what it can see, then fetches an article as HTML
- Agent V2 fetches `llms.txt` first, discovers the full site catalog, then fetches an article as clean Markdown

Same task, same number of requests. Different discovery, different format.
Both use Claude Sonnet on Amazon Bedrock. Both make the same number of HTTP requests. The difference is what they request and what they receive. The code is copy-paste ready. You can point it at any site with an llms.txt file, not just mine.
Agent V1: The HTML Struggle
The first agent has one tool: `fetch_webpage`. It fetches URLs and returns raw HTML.
"""
Agent V1: The Struggle — Researching a website using standard web search.
"""
from strands import Agent, tool
from strands.models import BedrockModel
import urllib.request
import urllib.error
@tool
def fetch_webpage(url: str) -> str:
"""Fetch a webpage and return its raw HTML content.
Args:
url: The URL to fetch.
Returns:
The raw HTML content of the page.
"""
req = urllib.request.Request(
url,
headers={"User-Agent": "StrandsResearchAgent/1.0"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
content = resp.read().decode("utf-8", errors="replace")
content_type = resp.headers.get("Content-Type", "unknown")
size = len(content.encode("utf-8"))
return (
f"[Fetched {url}]\n"
f"[Content-Type: {content_type}]\n"
f"[Size: {size:,} bytes]\n\n"
f"{content}"
)
except urllib.error.URLError as e:
return f"Error fetching {url}: {e}"
model = BedrockModel(
model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
region_name="us-west-2",
)
agent = Agent(
model=model,
tools=[fetch_webpage],
system_prompt=(
"You are a research agent. Your task is to research content on "
"websites using the fetch_webpage tool. Analyze what you receive "
"and extract useful information. Be explicit about any difficulties "
"you encounter with the content format."
),
)
result = agent(
"Research the blog at https://schristoph.online. "
"First, fetch the homepage to see what articles are available. "
"Then fetch one article about AI agents and summarize its key points. "
"Report on: (1) how many articles you found, (2) the key points of "
"the article you read, and (3) any difficulties with the content format."
)
# Token usage
usage = result.metrics.accumulated_usage
print(f"Input tokens: {usage['inputTokens']:,}")
print(f"Output tokens: {usage['outputTokens']:,}")
print(f"Total tokens: {usage['totalTokens']:,}")
The agent fetches the homepage, receives 17,886 bytes of HTML: navigation bars, CSS classes, meta tags, structured data, footer links. Somewhere in there are article titles and excerpts. The model has to parse through all of it to find the content.
Result: The agent found 7 articles. Only what’s rendered on the homepage. My blog has over 250 posts. The agent is blind to more than 95% of the content because it’s paginated or linked from archive pages it never visits.
It then fetches one article as HTML (26,156 bytes), extracts the content, and produces a reasonable summary. But it consumed 33,894 input tokens to get there. Nearly half of that is HTML noise the model had to process and discard.
Agent V2: The Smart Way
The second agent has two tools: `fetch_llms_txt` for discovery and `fetch_markdown` for content.
"""
Agent V2: The Smart Way — Using llms.txt and content negotiation.
"""
from strands import Agent, tool
from strands.models import BedrockModel
import urllib.request
import urllib.error
@tool
def fetch_llms_txt(base_url: str) -> str:
"""Fetch the llms.txt file from a website to discover its structure.
The llms.txt standard provides a structured index of a site's content
with direct links to Markdown versions of each page.
Args:
base_url: The base URL of the website (e.g. https://schristoph.online).
Returns:
The contents of the llms.txt file, or an error message.
"""
url = f"{base_url.rstrip('/')}/llms.txt"
req = urllib.request.Request(
url,
headers={"User-Agent": "StrandsResearchAgent/2.0"},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
content = resp.read().decode("utf-8", errors="replace")
size = len(content.encode("utf-8"))
return (
f"[Fetched llms.txt from {base_url}]\n"
f"[Size: {size:,} bytes]\n\n"
f"{content}"
)
except urllib.error.URLError as e:
return f"No llms.txt found at {url}: {e}"
@tool
def fetch_markdown(url: str) -> str:
"""Fetch a page as clean Markdown using content negotiation.
Sends Accept: text/markdown to request Markdown format. If the
URL already ends in .md, fetches it directly.
Args:
url: The URL to fetch (can be an .md URL from llms.txt).
Returns:
The Markdown content of the page.
"""
req = urllib.request.Request(
url,
headers={
"User-Agent": "StrandsResearchAgent/2.0",
"Accept": "text/markdown, text/plain;q=0.9, */*;q=0.1",
},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
content = resp.read().decode("utf-8", errors="replace")
content_type = resp.headers.get("Content-Type", "unknown")
size = len(content.encode("utf-8"))
return (
f"[Fetched {url}]\n"
f"[Content-Type: {content_type}]\n"
f"[Size: {size:,} bytes]\n\n"
f"{content}"
)
except urllib.error.URLError as e:
return f"Error fetching {url}: {e}"
model = BedrockModel(
model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
region_name="us-west-2",
)
agent = Agent(
model=model,
tools=[fetch_llms_txt, fetch_markdown],
system_prompt=(
"You are a research agent that uses the llms.txt standard and "
"content negotiation to efficiently consume website content. "
"Always start by fetching llms.txt to discover the site structure. "
"Then use the Markdown URLs from llms.txt to fetch articles. "
"Report on the content format you received and its quality."
),
)
result = agent(
"Research the blog at https://schristoph.online. "
"First, discover the site structure using llms.txt. "
"Then fetch one article about AI agents and summarize its key points. "
"Report on: (1) how many articles you found, (2) the key points of "
"the article you read, and (3) the quality of the content format."
)
# Token usage
usage = result.metrics.accumulated_usage
print(f"Input tokens: {usage['inputTokens']:,}")
print(f"Output tokens: {usage['outputTokens']:,}")
print(f"Total tokens: {usage['totalTokens']:,}")
The agent fetches llms.txt, a structured catalog of every article on the site, with titles, summaries, and direct links to Markdown versions. One request, and it knows the entire site.
Result: The agent discovered all 250+ articles. It picked the most relevant one, fetched it as clean Markdown (15,230 bytes, zero HTML noise), and produced a detailed, accurate summary.
The Numbers
Here’s where it gets interesting. And honest.
Per-Article: Markdown Wins Clearly
For the same article, fetched as HTML vs Markdown:
| Metric | HTML | Markdown | Change |
|---|---|---|---|
| Response size | 26,156 bytes | 15,230 bytes | -42% |
| HTML tags to parse | 439 | 0 | |
| Nav/header/footer elements | 4 | 0 | |
| Noise ratio | 47% | ~0% | |
| Content-Type | text/html | text/markdown | |
The Markdown version is 42% smaller and 100% signal. No navigation, no CSS classes, no meta tags. Just the article content with its headings, code blocks, and links intact. Markdown still has structural elements (headings, image references, citations), but those are semantic structure that helps the agent understand the content, unlike HTML’s presentational noise.
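The tag counts and noise ratio in the table above are the author's measurements. A rough way to reproduce similar numbers for any page is a small stdlib-only script; `TextExtractor` and `noise_ratio` below are hypothetical names for this sketch, not code from the experiment:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and count opening tags while parsing."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def noise_ratio(html: str) -> tuple[int, float]:
    """Return (tag count, fraction of bytes that are not visible text)."""
    parser = TextExtractor()
    parser.feed(html)
    text = re.sub(r"\s+", " ", "".join(parser.text_parts)).strip()
    total = len(html.encode("utf-8"))
    signal = len(text.encode("utf-8"))
    return parser.tag_count, 1 - signal / max(total, 1)
```

Run against a Markdown response, the same function reports zero tags and a near-zero noise ratio, since almost every byte is visible text.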
End-to-End: Not What I Expected
| Metric | V1 (HTML) | V2 (llms.txt + MD) |
|---|---|---|
| Articles discovered | 7 | 250+ |
| Discovery coverage | <5% of site | 100% of site |
| Article content format | HTML (47% noise) | Markdown (0% noise) |
| Input tokens | 33,894 | 104,979 |
| Output tokens | 1,119 | 1,155 |
| Tool calls | 3 | 3 |
I expected V2 to use fewer tokens. It used three times more.
The reason: my llms.txt file is 121KB. It contains titles and opening paragraphs for all 250+ posts, including Unicode bold characters from LinkedIn-style posts that are multi-byte. The agent ingests the entire catalog in one request, and that catalog is bigger than the homepage HTML.
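The byte cost of those characters is easy to underestimate. Mathematical sans-serif bold letters live outside Unicode's Basic Multilingual Plane, so each one takes 4 bytes in UTF-8 (and often several tokens) versus 1 byte for its ASCII equivalent:

```python
plain = "Bold"
# "Bold" written in LinkedIn-style mathematical sans-serif bold characters
fancy = "\U0001d5d5\U0001d5fc\U0001d5f9\U0001d5f1"

# Same four visible letters, four times the bytes on the wire.
print(len(plain.encode("utf-8")))  # 4
print(len(fancy.encode("utf-8")))  # 16
```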
I went in expecting a story about token efficiency. What I got was a story about discovery quality. V1 found 7 articles, whatever the homepage renders. V2 found all of them. Same number of tool calls, but fundamentally different awareness of what exists.
“But Couldn’t the Agent Just Crawl?”

One has the full picture. The other is guessing.
Fair question. An agent with a basic fetch_webpage tool could discover all articles, the same way a human would. Follow the “Older posts” link, parse the next page, repeat. Or use a search engine: `site:schristoph.online AI agents`.
The question isn’t whether it’s possible. It’s what it costs.
| Approach | Requests | Tokens (est.) | Coverage | Dependencies |
|---|---|---|---|---|
| Homepage only | 1 | ~18K | 5% | None |
| Crawl all pagination | ~15 | ~270K | 100% | Consistent HTML structure |
| sitemap.xml + fetch | 1 + N | varies | 100% | Fetches HTML, includes non-content pages |
| External search | 1-2 | varies | Partial, relevance-ranked | Search API key |
| llms.txt (verbose) | 1 | ~105K | 100% | Site implements standard |
| llms.txt (optimized) | 1 | ~63K | 100% | Site implements standard |
Crawling pagination would cost ~15 requests and ~270K tokens of mostly HTML noise (navigation bars, footers, and CSS classes repeated on every page). External search gives you 10-20 results ranked by someone else’s algorithm, not the site owner’s structure. And you’re adding a third-party dependency.
llms.txt gives complete coverage in a single request. The site owner curates what the agent sees. No parsing ambiguity, no dependency on HTML structure staying consistent, no search API key.
The advantage isn’t “can vs can’t.” It’s 1 structured request vs 15 noisy ones.
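Part of what keeps the single request cheap on the agent side is that the llms.txt entry format is regular: one `- [title](url)` line per page, with an optional `: description` suffix. A minimal parser fits in a regex; `parse_catalog` below is an illustrative helper under that assumption, not part of the article's agents:

```python
import re

# One llms.txt entry per line: "- [title](url)" plus an optional ": description".
LINK_RE = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?:: (?P<desc>.*))?$")


def parse_catalog(llms_txt: str) -> list[dict]:
    """Extract title/url/description entries from an llms.txt catalog."""
    entries = []
    for line in llms_txt.splitlines():
        m = LINK_RE.match(line.strip())
        if m:
            entries.append({"title": m["title"], "url": m["url"],
                            "desc": m["desc"] or ""})
    return entries
```

Compare that to HTML pagination, where the selectors for titles, links, and "Older posts" differ per theme and break when the theme changes.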
The Verbosity Trade-Off
That 121KB llms.txt bothered me. I looked at what others are doing:
| Site | llms.txt Size | Links | Style |
|---|---|---|---|
| FastHTML (spec author) | 4.8 KB | ~20 | Titles + short descriptions |
| Strands Agents | 42 KB | ~450 | Titles only, nested hierarchy |
| Stripe Docs | 92 KB | ~650 | Titles + one-line descriptions |
| Vercel Docs | 332 KB | ~1,900 | Titles + one-line descriptions |
| My blog | 121 KB | 279 | Titles + ~160 char excerpts |
My file is disproportionately large for its link count. The culprit: Hugo’s `.Summary` template dumps the article’s opening paragraph as the description. Conversational openers, Unicode bold characters, emoji. Content that helps a human browsing a feed, but doesn’t help an agent decide which article to fetch.
The llms.txt spec [5] is intentionally minimal about this. It says descriptions are optional: each link can have a `: description` after it, or not. The guidance is simply: “Use concise, clear language.”
So I created a concise variant (titles only, no descriptions) and ran the same experiment. Then I went further: I actually fixed the Hugo template and deployed it. The optimized llms.txt is now live on my site. Zero multi-line spillover, descriptions collapsed and truncated, and long LinkedIn-style titles left without redundant descriptions.
Here are the real numbers, all measured against the production site:
| Metric | V1 (HTML) | V2 (old llms.txt) | V2 (optimized llms.txt) |
|---|---|---|---|
| Catalog size | 17,886 B | 121,321 B | 65,003 B |
| Articles discovered | 7 | 250+ | 250+ |
| Discovery coverage | <5% | 100% | 100% |
| Input tokens | 33,894 | 104,979 | 62,740 |
| Total tokens | 35,013 | 106,134 | 63,552 |
| Tool calls | 3 | 3 | 3 |
The optimized llms.txt cuts tokens by 40% compared to the verbose version, while maintaining complete discovery. The file went from 121KB to 65KB, zero spec violations, zero duplicate descriptions.
The honest math: even the optimized version uses ~29K more tokens than V1’s homepage-only approach. But those 29K tokens buy you 100% site coverage instead of 5%. That’s the real trade-off: ~29K extra tokens for complete discovery vs saving tokens and being blind to 95% of the content.
For well-titled blog posts, titles alone are enough for an agent to pick the right article. Descriptions add value when titles are ambiguous, but for most content sites, a concise llms.txt is the sweet spot.
Fixing the Template
So I dug into the Hugo template to fix it. Diagnosing why my llms.txt was bloated revealed three issues:
| Issue | Count | Impact |
|---|---|---|
| Description duplicates title | 217 / 279 (78%) | Tokens wasted repeating information |
| Multi-line description spillover | 187 lines | Breaks the llms.txt spec format |
| Verbose descriptions (~160 chars avg) | 277 / 279 | Opening paragraphs, not summaries |
The root cause: Hugo’s .Summary dumps the article’s opening paragraph, including line breaks that spill into the next line, violating the spec’s one-entry-per-line format. For LinkedIn-style short posts, the title is the first line of content, so the description just repeats it.
The fix is a smarter template that applies three rules:
- Prefer frontmatter `description` — if the author wrote a hand-crafted summary, use it
- Skip descriptions for long titles — posts with 80+ character titles (typical LinkedIn reposts) are self-descriptive
- Collapse and truncate everything else — strip formatting, collapse whitespace, cap at 100 characters
The Hugo template below is the concrete implementation, but the principles apply to any static site generator or CMS: WordPress, Next.js, Jekyll, and Astro all have equivalent template mechanisms.
```go-html-template
# {{ .Site.Title }}

> {{ .Site.Params.description }}

## Blog Posts

{{ range (where .Site.RegularPages "Section" "blog").ByDate.Reverse -}}
{{- $title := .Title -}}
{{- $url := printf "%sindex.md" .Permalink -}}
{{- $desc := "" -}}
{{- if .Params.description -}}
{{- $desc = .Params.description -}}
{{- else if le (len $title) 80 -}}
{{- $desc = .Summary | plainify | replaceRE "\\s+" " " | truncate 100 -}}
{{- end -}}
{{- if $desc -}}
- [{{ $title }}]({{ $url }}): {{ $desc }}
{{ else -}}
- [{{ $title }}]({{ $url }})
{{ end -}}
{{- end }}

## Optional

- [About]({{ "about/" | absURL }})
- [Imprint]({{ "imprint/" | absURL }})
```
The key functions: `plainify` strips HTML/Markdown formatting. `replaceRE "\\s+" " "` collapses newlines into spaces, fixing the multi-line spillover. `truncate 100` caps length with an automatic “…” suffix.
The `## Optional` section follows the llms.txt spec [5]. Agents that need a shorter context can skip it.
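For generators without Hugo's pipe syntax, the same three rules fit in a few lines of Python. This is a sketch, not the deployed template: `catalog_entry` is a hypothetical helper that mirrors the template logic, using the 80-character title and 100-character description thresholds described above:

```python
import re


def catalog_entry(title: str, url: str, summary: str = "",
                  description: str = "") -> str:
    """Render one llms.txt link line using the three rules from the post."""
    if description:                       # Rule 1: prefer hand-written frontmatter
        desc = description.strip()
    elif len(title) > 80:                 # Rule 2: long titles are self-descriptive
        desc = ""
    else:                                 # Rule 3: collapse whitespace, cap length
        desc = re.sub(r"\s+", " ", summary).strip()
        if len(desc) > 100:
            desc = desc[:100].rstrip() + "…"
    return f"- [{title}]({url}): {desc}" if desc else f"- [{title}]({url})"
```

The whitespace collapse is what prevents multi-line spillover: whatever the summary looks like, the output is always a single catalog line.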
The long-term improvement: add a `description` field to each post’s frontmatter. This is the same field Hugo uses for `<meta name="description">` in HTML, so it improves both SEO and agent discoverability with a single edit:
```yaml
---
title: "Making My Website AI-Agent Friendly"
description: "Adding llms.txt, Markdown output, and CloudFront content negotiation to a Hugo site. Before/after: 51% fewer tokens, correct answers."
---
```
I deployed this change. The optimized llms.txt is live: 65KB (down from 121KB), zero spec violations, zero duplicate descriptions. The numbers in the comparison table above are measured against the production site.
What Strands Makes Easy
A few things worth noting about the Strands Agents SDK that made this experiment simple:
Tools are just functions
The `@tool` decorator turns any Python function into something the agent can call. The docstring becomes the tool description the model uses to decide when to invoke it. No schema files, no registration boilerplate.
Token tracking is built in
`result.metrics.accumulated_usage` gives you input, output, and total tokens across the entire agent interaction, including all tool calls. No manual counting needed.
The model decides the workflow
I didn’t hardcode “fetch llms.txt first, then fetch an article.” The system prompt suggests the approach, and the model figures out the execution order. V2’s agent autonomously decided which article was most relevant to the task from the 250+ available. The pattern is model-agnostic. Any model with tool-use support works. Smaller models with tighter context windows actually benefit more from the reduced token count.
The Pattern for Agent Builders
If you’re building agents that consume web content, here’s the practical takeaway:
- Check for llms.txt first. One HTTP request tells you if the site is agent-friendly. If it exists, you get a structured catalog with direct links to clean content. If it doesn’t, fall back to HTML; the `Accept: text/markdown` header is ignored gracefully by servers that don’t support it.
- Use content negotiation. Send `Accept: text/markdown` in your requests. Sites that support it will serve you clean Markdown from the same URL. Sites that don’t will serve HTML as usual.
- Prefer .md URLs when available. If llms.txt gives you direct Markdown links, use them. No content negotiation needed, no ambiguity about what you’ll receive.
- Budget for discovery. A full llms.txt costs tokens to ingest, but it’s a one-time cost per site that enables precise content selection. Cache it across a session. The upfront investment amortizes across follow-up queries.
And if you’re a site owner generating llms.txt:
- Keep descriptions concise, or skip them. Titles alone are sufficient for well-named content. A 60KB concise catalog performs as well as a 120KB verbose one for article selection, at half the token cost. Save the detail for the articles themselves.
The Chicken-and-Egg Problem

Same website, different experience.
As I noted in the previous article [1], almost no AI agent actually uses llms.txt or content negotiation today. Dries Buytaert found zero `Accept: text/markdown` requests in a month of logs [3]. SonicLinker analyzed 2 million AI-agent requests and found zero requests for `/llms.txt` [4].
But here’s the thing: agents don’t request Markdown because almost no website offers it. And websites don’t offer it because no agent requests it. Someone has to break the cycle on both sides.
The previous article broke it on the producer side. This one breaks it on the consumer side. Two working agents, running against a real website, with real token counts. The infrastructure works. The discovery and efficiency gains are real. And as I showed in the previous article [1], the answer quality difference is equally dramatic. The same model gave a wrong answer from HTML and a correct answer from Markdown, for the same question. What’s missing is adoption.
If you’re building agents that consume web content, try it. The code above is copy-paste ready. Point it at any site with an llms.txt file and see the difference.
💬 Are you building agents that consume web content? What’s your approach to handling HTML noise? Have you tried llms.txt or content negotiation?
Sources:
[1] Making My Website AI-Agent Friendly — Here’s What Changed: schristoph.online
[2] Strands Agents SDK — Open-source AI agent framework from AWS: strandsagents.com / GitHub
[3] Dries Buytaert — Markdown, llms.txt and AI crawlers (March 2026): dri.es
[4] SonicLinker — We analyzed 2M AI-agent requests. None asked for llms.txt. (February 2026): soniclinker.com
[5] llms.txt Specification (Jeremy Howard, Answer.AI): llmstxt.org