MCP Strategies on AWS, Part 2: Tool Design in Code

written by Stefan Christoph

June 17, 2026 - 8-minute read

TL;DR: The AWS Prescriptive Guidance paper on MCP gives the most useful, checkable rules in its tool-design section, so I wrote code to test them. A token-tax counter on a realistic 20-tool GitHub server shows minimal tool definitions cost about 92 tokens each, but the paper’s recommended form (output schemas, concrete examples, prompt-style descriptions) measures 346 tokens each, or 6,900 tokens for 20 tools, landing right in the paper’s 250-to-500 band. That is the honest reading: the band is the cost of following the guidance, and you pay it on every model call before the user types anything. A second demo shows the granularity trade-off: three granular tools force the model into three round-trips, while one coarse workflow tool does the same backend work in one. The code is runnable and uses the official MCP and Strands Agents SDKs.

In the series overview I said the tool-design section of the AWS MCP guidance was the part I would read first, because it gives hard numbers instead of platitudes. Hard numbers are testable. So I tested them.

This post is the tool-design pillar turned into code you can run. Everything here is in the companion code on GitHub, uses the official mcp and Strands Agents SDKs, and makes no network calls: the GitHub examples run against a mock backend.

The token tax is real, but the number depends on you

The guidance makes a specific claim: a typical tool definition costs roughly 250 to 500 tokens, so 20 tools cost 5,000 to 10,000 tokens on every model invocation, before any user input [1]. That is the cost of carrying tools in context.
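The arithmetic behind that claim is worth writing down, because the per-invocation cost multiplies across an agent loop. The sketch below is illustrative only: `context_overhead` is a hypothetical helper, and the ten-call loop is an assumed figure, not one from the guidance. It also assumes the full tool list is resent on every call with no caching discount.

```python
def context_overhead(n_tools: int, tokens_per_tool: int, invocations: int = 1) -> int:
    """Tokens spent on tool definitions alone, before any user input."""
    return n_tools * tokens_per_tool * invocations


# The band from the guidance: 250 to 500 tokens per tool, 20 tools.
low = context_overhead(20, 250)
high = context_overhead(20, 500)
print(f"one invocation : {low:,} to {high:,} tokens")

# Assumption for illustration: an agent loop that calls the model 10 times.
loop_low = context_overhead(20, 250, invocations=10)
loop_high = context_overhead(20, 500, invocations=10)
print(f"ten invocations: {loop_low:,} to {loop_high:,} tokens")
```

The point of the second line is that the tax scales with the number of model calls, not with the number of tools actually used.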

I wanted to measure it rather than trust it, so I built a realistic 20-tool GitHub-style MCP server in the exact JSON shape a server returns from a tools/list call, and counted the tokens with a standard BPE tokenizer.

The first result surprised me. Minimal definitions (name, one-line description, a tight input schema) came in at about 92 tokens per tool, roughly 1,800 tokens for all twenty. That is well below the paper’s band.

That is not the paper being wrong. It is the paper assuming you follow the rest of its own advice. The guidance also says to add an output schema, to write descriptions as prompts that explain when to use the tool and what errors it can return, and (in its words, the single most effective technique) to provide concrete examples with real values. Each of those costs tokens. So I enriched the same twenty tools the way the paper recommends and measured again:

The measured output of token_tax.py on the same 20 tools, minimal versus the paper’s enriched form:

[minimal JSON (lower bound)]                                mean  91.5 tok/tool  ->  20 tools = 1831 tokens / invocation
[pretty JSON (as many SDKs send)]                           mean 155.5 tok/tool  ->  20 tools = 3110 tokens / invocation
[best-practice: +outputSchema +examples +prompt-style desc] mean 345.6 tok/tool  ->  20 tools = 6913 tokens / invocation

The enriched form lands at 346 tokens per tool, 6,913 tokens for twenty: squarely inside the paper’s 5,000-to-10,000 range. That reproduces the claim and, more usefully, explains it. The 250-to-500 band is the cost of good tool definitions. Every token you spend helping the model pick the right tool and fill it correctly is sent on every single call.
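To make the difference concrete, here is a minimal sketch of one tool in both forms, in the shape of a tools/list entry. The descriptions, the example values, and the output schema are made up for illustration and are not the definitions from token_tax.py. The example is placed inside the description as an assumption about where it could live. Character counts stand in for tokens here so the snippet has no dependencies.

```python
import json

# Minimal form: name, one-line description, tight input schema.
minimal = {
    "name": "github_issue_create",
    "description": "Create an issue in a repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string"},
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["repo", "title"],
    },
}

# Enriched form: prompt-style description (when to use it, which errors to
# expect), a concrete example with real-looking values, and an output schema.
enriched = {
    **minimal,
    "description": (
        "Create an issue in a repository. Use this when the user reports a bug "
        "or requests a feature and no matching issue exists. Returns an error "
        "if the repository does not exist or the caller lacks write access. "
        'Example: repo="octo-org/demo", title="Login page returns 500".'
    ),
    "outputSchema": {
        "type": "object",
        "properties": {
            "number": {"type": "integer"},
            "url": {"type": "string"},
        },
        "required": ["number"],
    },
}


def compact_size(tool: dict) -> int:
    """Length in characters of the compact JSON serialization."""
    return len(json.dumps(tool, separators=(",", ":")))


print("minimal :", compact_size(minimal), "chars")
print("enriched:", compact_size(enriched), "chars")
```

Both forms share the same name and input schema; everything added is guidance for the model, and all of it travels with every call.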

The counter itself is small. The heart of it is serializing each tool to the JSON the model sees and counting:

From token_tax.py, measuring one tool definition’s token cost:

import json

# count_tokens(text) -> int is defined elsewhere in token_tax.py.
def measure(tools, indent=None):
    per_tool = []
    for t in tools:
        # Compact JSON is the lower bound; an indent gives the pretty form.
        serialized = (json.dumps(t, separators=(",", ":")) if indent is None
                      else json.dumps(t, indent=indent))
        per_tool.append((t["name"], count_tokens(serialized)))
    total = sum(tok for _, tok in per_tool)
    return {"per_tool": per_tool, "total": total,
            "mean": round(total / len(tools), 1)}

A caveat I want to be honest about: the tokenizer is a proxy. Different model families tokenize differently, so treat the absolute count as an order-of-magnitude estimate, not a per-vendor figure. The relative comparison (minimal versus enriched) should hold whichever tokenizer you use, because the enriched definitions are simply much longer text, and that comparison is the point. This is the same reason the paper itself gives a range rather than a single number.
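One way to convince yourself of the relative claim without any tokenizer installed is to run two deliberately crude counters over the same pair of definitions. Neither counter below is a BPE tokenizer: the characters-divided-by-four rule and the regex splitter are both rough stand-ins chosen for this sketch, and the two tool definitions are invented for it.

```python
import json
import math
import re


def count_chars_div4(text: str) -> int:
    """Rule-of-thumb proxy: about four characters per token."""
    return math.ceil(len(text) / 4)


def count_word_pieces(text: str) -> int:
    """Cruder proxy: every word and every punctuation mark is one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))


minimal = {
    "name": "github_issue_get",
    "description": "Get one issue.",
    "inputSchema": {"type": "object",
                    "properties": {"repo": {"type": "string"},
                                   "number": {"type": "integer"}}},
}
enriched = dict(
    minimal,
    description=("Get one issue by number. Use this when the user refers to a "
                 "specific issue. Returns an error if the issue does not exist. "
                 'Example: repo="octo-org/demo", number=42.'),
    outputSchema={"type": "object",
                  "properties": {"title": {"type": "string"},
                                 "state": {"type": "string"}}},
)

results = {}
for label, counter in [("chars/4", count_chars_div4),
                       ("word pieces", count_word_pieces)]:
    lo = counter(json.dumps(minimal, separators=(",", ":")))
    hi = counter(json.dumps(enriched, separators=(",", ":")))
    results[label] = (lo, hi)
    print(f"{label:12s} minimal={lo:4d} enriched={hi:4d} ratio={hi / lo:.1f}x")
```

The absolute numbers disagree between the two proxies, which is the caveat in action, while the ordering does not change.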

So what do you do with this? You do not strip your tool definitions to save tokens: that would trade a cheap context for a model that picks the wrong tool. You spend the tokens deliberately, and you keep the number of tools down. Which is the next rule.

Granular versus coarse-grained: counting round-trips

The central tool-design tension in the guidance is granularity. A granular design exposes one tool per API call. A coarse-grained design exposes one tool per workflow and hides the orchestration inside it. The paper’s rule: if a common workflow needs three or more separate calls, bundle them into one tool [1].

Figure: The granular design forces the model through three round-trips (create, label, assign). The coarse-grained tool does the same backend work in a single model call, with the sequence run deterministically inside the tool.

I built both designs for the same task (file and triage a GitHub issue) against a mock backend that counts API calls. The coarse tool runs create, then label, then assign, deterministically, inside one tool:

From github_tools.py, the coarse-grained workflow tool (Strands @tool):

@tool
def github_issue_file_and_triage(repo, title, body="", labels=None, assignee=None):
    """File an issue and triage it in one step: create, label, assign.
    The sequence runs deterministically inside the tool, so the model makes
    ONE call instead of three. Stays within the 8-parameter limit (5 params)."""
    issue = BACKEND.create_issue(repo, title, body)
    if labels:
        BACKEND.add_labels(repo, issue["number"], labels)
    if assignee:
        BACKEND.assign(repo, issue["number"], assignee)
    return issue

Running the comparison shows the difference plainly:

Measured output of github_tools.py:

GRANULAR design
  model-visible tool calls : 3
  backend API calls        : 3 ['POST /issues', 'POST /issues/labels', 'POST /issues/assignees']

COARSE-GRAINED design (paper's recommendation)
  model-visible tool calls : 1
  backend API calls        : 3 ['POST /issues', 'POST /issues/labels', 'POST /issues/assignees']

Same backend work, but the model round-trips drop from three to one. That is lower latency, lower cost, and (the part I care about most) no opportunity for the model to drop a step. A granular design lets the model create an issue and then forget to assign it. A coarse design cannot forget, because the sequence is code, not a model decision.
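For readers without the repository at hand, the comparison can be approximated in a few lines. This is a sketch under stated assumptions, not the code from github_tools.py: `MockBackend`, `run_granular`, and `run_coarse` are hypothetical names, the repo and user values are invented, and the "model" is simulated by plain function calls, with each simulated tool call counted by hand.

```python
class MockBackend:
    """In-memory stand-in for the GitHub API that records every call."""

    def __init__(self):
        self.calls = []
        self._next_number = 1

    def create_issue(self, repo, title, body=""):
        self.calls.append("POST /issues")
        issue = {"number": self._next_number, "repo": repo, "title": title}
        self._next_number += 1
        return issue

    def add_labels(self, repo, number, labels):
        self.calls.append("POST /issues/labels")

    def assign(self, repo, number, assignee):
        self.calls.append("POST /issues/assignees")


def run_granular(backend):
    """Simulate a model that must issue three separate tool calls."""
    tool_calls = 0
    issue = backend.create_issue("octo-org/demo", "Login page returns 500")
    tool_calls += 1
    backend.add_labels("octo-org/demo", issue["number"], ["bug"])
    tool_calls += 1
    backend.assign("octo-org/demo", issue["number"], "octocat")
    tool_calls += 1
    return tool_calls


def file_and_triage(backend, repo, title, body="", labels=None, assignee=None):
    """Coarse tool: the create, label, assign sequence is fixed in code."""
    issue = backend.create_issue(repo, title, body)
    if labels:
        backend.add_labels(repo, issue["number"], labels)
    if assignee:
        backend.assign(repo, issue["number"], assignee)
    return issue


def run_coarse(backend):
    """Simulate a model that issues one workflow-level tool call."""
    file_and_triage(backend, "octo-org/demo", "Login page returns 500",
                    labels=["bug"], assignee="octocat")
    return 1


granular_backend, coarse_backend = MockBackend(), MockBackend()
granular_calls = run_granular(granular_backend)
coarse_calls = run_coarse(coarse_backend)
print("granular:", granular_calls, "tool calls,", len(granular_backend.calls), "API calls")
print("coarse  :", coarse_calls, "tool call, ", len(coarse_backend.calls), "API calls")
```

In the granular path, each of the three steps is a point where a real model could stop early; in the coarse path the only decision left to the model is whether to call the tool at all.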

The nuance the paper is careful about: do not over-bundle. A tool with too many parameters or ambiguous intent is its own failure. Some APIs already encode a workflow (Amazon EC2’s RunInstances is the example), and there a one-tool-per-API design is already workflow-oriented. The judgment call is whether the workflow is deterministic. If it is, bundle it. If it genuinely needs dynamic decisions, that is when the agent-as-tool pattern earns its complexity.

The rules you can put in CI

The remaining tool-design rules are mechanical enough to lint, so I wrote a linter. It checks four things from the guidance: eight parameters or fewer per tool, domain-noun-verb naming so an alphabetical sort clusters related operations, read and write verbs kept distinct so destructive actions are explicit, and a server split once it passes fifty tools.

From naming_and_params.py, the rules as checkable constants:

import re

MAX_PARAMS = 8
MAX_TOOLS_PER_SERVER = 50
NAME_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+){2,}$")  # domain_noun_verb
WRITE_VERBS = {"create", "update", "delete", "merge", "run", "remove", "set", "add"}
READ_VERBS = {"get", "list", "search", "read", "describe"}

Run it against the twenty well-designed tools and they pass. Inject a doEverything tool with nine parameters and a non-conforming name, and the linter flags it on both the parameter limit and the naming rule. That is the kind of guard worth running in CI, because tool sprawl happens one well-intentioned addition at a time, which is exactly how the customer I mentioned in the overview ended up with forty tools and a confused agent.
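The checks themselves need very little code. What follows is a minimal sketch built on the constants above, not the linter from naming_and_params.py: `lint_tool`, `lint_server`, and `access_kind` are hypothetical names, and reading the verb from the last name segment is an assumption about how the read/write distinction could be applied.

```python
import re

MAX_PARAMS = 8
MAX_TOOLS_PER_SERVER = 50
NAME_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+){2,}$")  # domain_noun_verb
WRITE_VERBS = {"create", "update", "delete", "merge", "run", "remove", "set", "add"}
READ_VERBS = {"get", "list", "search", "read", "describe"}


def lint_tool(tool: dict) -> list[str]:
    """Return the rule violations for one tools/list entry."""
    problems = []
    name = tool["name"]
    params = tool.get("inputSchema", {}).get("properties", {})
    if len(params) > MAX_PARAMS:
        problems.append(f"{name}: {len(params)} parameters (max {MAX_PARAMS})")
    if not NAME_RE.match(name):
        problems.append(f"{name}: name does not follow domain_noun_verb")
    return problems


def lint_server(tools: list[dict]) -> list[str]:
    """Lint every tool, then check the server-level size limit."""
    problems = [p for t in tools for p in lint_tool(t)]
    if len(tools) > MAX_TOOLS_PER_SERVER:
        problems.append(f"server exposes {len(tools)} tools "
                        f"(max {MAX_TOOLS_PER_SERVER}); consider splitting")
    return problems


def access_kind(name: str) -> str:
    """Classify a tool by its final name segment (assumed to be the verb)."""
    verb = name.rsplit("_", 1)[-1]
    if verb in WRITE_VERBS:
        return "write"
    if verb in READ_VERBS:
        return "read"
    return "unclassified"


good = {"name": "github_issue_create",
        "inputSchema": {"properties": {"repo": {}, "title": {}, "body": {}}}}
bad = {"name": "doEverything",
       "inputSchema": {"properties": {f"p{i}": {} for i in range(1, 10)}}}

for problem in lint_server([good, bad]):
    print(problem)
```

In CI, a non-empty result from `lint_server` would fail the build, which turns the guidance into a gate rather than a suggestion.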

Why domain-noun-verb is more than tidiness

The naming standard looks cosmetic until you see what it does to a sorted list. Because the noun is the organizational boundary, sorting alphabetically clusters every operation on the same resource: github_issue_create, github_issue_get, github_issue_list, github_issue_update, then github_pullrequest_create, and so on. The model scans a list grouped by resource instead of a jumble, and you get name-collision protection for free. It is a small rule with a real effect on how legible the toolset is to the model, which, given the token tax, is the whole game.
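The clustering effect is easy to see with a plain sort. The sketch below uses a shuffled handful of names in the style quoted above; `resource_of` is a hypothetical helper that treats the first two segments as the resource.

```python
from itertools import groupby

# Deliberately shuffled input.
names = [
    "github_pullrequest_merge",
    "github_issue_update",
    "github_issue_create",
    "github_pullrequest_create",
    "github_issue_list",
    "github_issue_get",
]


def resource_of(name: str) -> str:
    """Return the domain_noun prefix, for example 'github_issue'."""
    domain, noun, _verb = name.split("_", 2)
    return f"{domain}_{noun}"


# A plain alphabetical sort is enough to group operations by resource.
clusters = {}
for resource, group in groupby(sorted(names), key=resource_of):
    clusters[resource] = [n.split("_", 2)[2] for n in group]

for resource, verbs in clusters.items():
    print(resource, verbs)
# github_issue ['create', 'get', 'list', 'update']
# github_pullrequest ['create', 'merge']
```

No grouping logic is needed beyond the sort itself, which is why the convention pays off even in clients that only ever list tools alphabetically.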

Running it yourself

The companion code for this post is on GitHub at stechr/schristoph-blog-samples › 2026-06-12-mcp-strategies/tool-design. It runs with uv:

uv run --with tiktoken --with strands-agents python run_demo.py

That runs all three demos: the token tax, the granular-versus-coarse comparison, and the linter. The token measurement needs tiktoken; the Strands SDK is optional, because the GitHub tools fall back to a no-op decorator without it, so the comparison still runs offline.
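An optional-dependency fallback of that kind usually looks something like the following. This is a guess at the general pattern, not the repository's actual code: the no-op `tool` below is hypothetical, and the `add` function exists only to show that the decorated function still behaves like a plain function.

```python
try:
    # Real decorator when the Strands Agents SDK is installed.
    from strands import tool
except ImportError:
    def tool(func=None, **_options):
        """No-op stand-in: returns the function unchanged.

        Supports both the bare form (@tool) and the called form (@tool(...)).
        """
        if func is None:
            return lambda f: f
        return func


@tool
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


print(add(2, 3))
```

Keeping the SDK import behind a try/except means the deterministic parts of a demo stay runnable on a machine that has nothing installed.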

Up next: Part 3, Hosting from Local to AWS

Tool design decides what your tools are. Part 3 decides where they live: hosting, from a local stdio server through a remote server bounded to a single AWS account, to a managed gateway. I deploy a real account-bounded MCP server to AWS, connect a SigV4-signing client to it, and tear it down, with the actual output. That is where the security model from the governance pillar starts to bite.

Sources

[1] Model Context Protocol strategies on AWS (PDF): the tool-design pillar, including the token math, the eight-parameter and fifty-tool limits, and the bundling rule.
[2] Anthropic, Writing effective tools for AI agents: independent guidance on tool quality and context cost.
[3] Model Context Protocol specification: the tools/list shape the token counter measures against.

About the Author

Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.

This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.

Cross-posted to LinkedIn

❤️ Created with the support of AI (Kiro)