MCP Strategies on AWS, Part 2: Tool Design in Code
written by Stefan Christoph
- 8 minutes read
In the series overview I said the tool-design section of the AWS MCP guidance was the part I would read first, because it gives hard numbers instead of platitudes. Hard numbers are testable. So I tested them.
This post is the tool-design pillar turned into code you can run. Everything here is in the companion code on GitHub, uses the official mcp and Strands Agents SDKs, and makes no network calls: the GitHub examples run against a mock backend.
The token tax is real, but the number depends on you
The guidance makes a specific claim: a typical tool definition costs roughly 250 to 500 tokens, so 20 tools cost 5,000 to 10,000 tokens on every model invocation, before any user input [1]. That is the cost of carrying tools in context.
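The arithmetic behind that claim can be sketched in a few lines. The helper name and constants below are hypothetical, built only from the 250-to-500 band quoted above; they are not part of the companion code.

```python
# Per-tool token band quoted from the guidance (250 to 500 tokens per definition).
TOKENS_PER_TOOL_LOW = 250
TOKENS_PER_TOOL_HIGH = 500


def context_overhead(num_tools: int, invocations: int = 1) -> tuple[int, int]:
    """Return the (low, high) token overhead of carrying tool definitions.

    Hypothetical helper for illustration: definitions are resent on every
    model invocation, so the cost scales with both factors.
    """
    low = num_tools * TOKENS_PER_TOOL_LOW * invocations
    high = num_tools * TOKENS_PER_TOOL_HIGH * invocations
    return low, high


print(context_overhead(20))      # (5000, 10000): one invocation, 20 tools
print(context_overhead(20, 10))  # (50000, 100000): a ten-invocation agent loop
```

The second call is the reason the per-invocation figure matters: an agent loop pays the overhead once per model invocation, not once per session.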
I wanted to measure it rather than trust it, so I built a realistic 20-tool GitHub-style MCP server in the exact JSON shape a server returns from a tools/list call, and counted the tokens with a standard BPE tokenizer.
The first result surprised me. Minimal definitions (name, one-line description, a tight input schema) came in at about 92 tokens per tool, roughly 1,800 tokens for all twenty. That is well below the paper’s band.
That is not the paper being wrong. It is the paper assuming you follow the rest of its own advice. The guidance also says to add an output schema, to write descriptions as prompts that explain when to use the tool and what errors it can return, and (in its words, the single most effective technique) to provide concrete examples with real values. Each of those costs tokens. So I enriched the same twenty tools the way the paper recommends and measured again:
The measured output of token_tax.py on the same 20 tools, minimal versus the paper’s enriched form:
[minimal JSON (lower bound)] mean 91.5 tok/tool -> 20 tools = 1831 tokens / invocation
[pretty JSON (as many SDKs send)] mean 155.5 tok/tool -> 20 tools = 3110 tokens / invocation
[best-practice: +outputSchema +examples +prompt-style desc] mean 345.6 tok/tool -> 20 tools = 6913 tokens / invocation
The enriched form lands at 346 tokens per tool, 6,913 tokens for twenty: squarely inside the paper’s 5,000-to-10,000 range. That reproduces the claim and, more usefully, explains it. The 250-to-500 band is the cost of good tool definitions. Every token you spend helping the model pick the right tool and fill it correctly is sent on every single call.
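To make the enrichment concrete, here is a rough sketch of one tool in both forms. The tool, its wording, and its schemas are invented for illustration and are not taken from the companion code. Character length of the compact JSON stands in for tokens, which is cruder than the BPE count used for the measurements above.

```python
import json

# Hypothetical tool, minimal form: name, one-line description, tight input schema.
minimal = {
    "name": "github_issue_get",
    "description": "Get one issue by number.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string"},
            "number": {"type": "integer"},
        },
        "required": ["repo", "number"],
    },
}

# Same tool enriched: prompt-style description, example values, output schema.
enriched = {
    **minimal,
    "description": (
        "Get one issue by number. Use this when the user refers to a specific "
        "issue; use github_issue_list to browse. Returns a not-found error if "
        "the issue does not exist."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string", "examples": ["octocat/hello-world"]},
            "number": {"type": "integer", "examples": [42]},
        },
        "required": ["repo", "number"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {
            "number": {"type": "integer"},
            "title": {"type": "string"},
            "state": {"type": "string", "enum": ["open", "closed"]},
        },
        "required": ["number", "title", "state"],
    },
}


def compact_size(tool: dict) -> int:
    """Character length of the compact JSON form, a crude stand-in for tokens."""
    return len(json.dumps(tool, separators=(",", ":")))


for label, tool in (("minimal", minimal), ("enriched", enriched)):
    print(f"{label:9s} {compact_size(tool)} characters")
```

Every addition in the enriched form is something the model can use to choose or fill the tool correctly, and every addition grows the serialized definition.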
The counter itself is small. The heart of it is serializing each tool to the JSON the model sees and counting:
From token_tax.py, measuring one tool definition’s token cost:
def measure(tools, indent=None):
    per_tool = []
    for t in tools:
        serialized = (json.dumps(t, separators=(",", ":")) if indent is None
                      else json.dumps(t, indent=indent))
        per_tool.append((t["name"], count_tokens(serialized)))
    total = sum(tok for _, tok in per_tool)
    return {"per_tool": per_tool, "total": total,
            "mean": round(total / len(tools), 1)}
A caveat I want to be honest about: the tokenizer is a proxy. Different model families tokenize differently, so treat the absolute count as an order-of-magnitude estimate, not a per-vendor figure. The relative comparison (minimal versus enriched) holds regardless of tokenizer, and that comparison is the point. This is the same reason the paper itself gives a range rather than a single number.
So what do you do with this? You do not strip your tool definitions to save tokens: that would trade a cheap context for a model that picks the wrong tool. You spend the tokens deliberately, and you keep the number of tools down. Which is the next rule.
Granular versus coarse-grained: counting round-trips
The central tool-design tension in the guidance is granularity. A granular design exposes one tool per API call. A coarse-grained design exposes one tool per workflow and hides the orchestration inside it. The paper’s rule: if a common workflow needs three or more separate calls, bundle them into one tool [1].
The granular design forces the model through three round-trips (create, label, assign). The coarse-grained tool does the same backend work in a single model call, with the sequence run deterministically inside the tool.
I built both designs for the same task (file and triage a GitHub issue) against a mock backend that counts API calls. The coarse tool runs create, then label, then assign, deterministically, inside one tool:
From github_tools.py, the coarse-grained workflow tool (Strands @tool):
@tool
def github_issue_file_and_triage(repo, title, body="", labels=None, assignee=None):
    """File an issue and triage it in one step: create, label, assign.
    The sequence runs deterministically inside the tool, so the model makes
    ONE call instead of three. Stays within the 8-parameter limit (5 params)."""
    issue = BACKEND.create_issue(repo, title, body)
    if labels:
        BACKEND.add_labels(repo, issue["number"], labels)
    if assignee:
        BACKEND.assign(repo, issue["number"], assignee)
    return issue
Running the comparison shows the difference plainly:
Measured output of github_tools.py:
GRANULAR design
  model-visible tool calls : 3
  backend API calls : 3 ['POST /issues', 'POST /issues/labels', 'POST /issues/assignees']
COARSE-GRAINED design (paper's recommendation)
  model-visible tool calls : 1
  backend API calls : 3 ['POST /issues', 'POST /issues/labels', 'POST /issues/assignees']
Same backend work, but the model round-trips drop from three to one. That is lower latency, lower cost, and (the part I care about most) no opportunity for the model to drop a step. A granular design lets the model create an issue and then forget to assign it. A coarse design cannot forget, because the sequence is code, not a model decision.
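The comparison can be approximated without any SDK. What follows is a minimal sketch, assuming a hypothetical MockBackend class that records each API call; the companion code's backend may look different. No model is involved, so the round-trip counts are written by hand: one per step the model would have to choose.

```python
class MockBackend:
    """Hypothetical in-memory stand-in for the issue API; records each call."""

    def __init__(self):
        self.calls = []
        self._next_number = 1

    def create_issue(self, repo, title, body=""):
        self.calls.append("POST /issues")
        issue = {"repo": repo, "number": self._next_number, "title": title}
        self._next_number += 1
        return issue

    def add_labels(self, repo, number, labels):
        self.calls.append("POST /issues/labels")

    def assign(self, repo, number, assignee):
        self.calls.append("POST /issues/assignees")


def run_granular(backend):
    """Each statement stands for one tool call the model must choose to make."""
    issue = backend.create_issue("octo/demo", "Login fails")
    backend.add_labels("octo/demo", issue["number"], ["bug"])
    backend.assign("octo/demo", issue["number"], "alice")
    return 3  # model round-trips, counted by hand


def run_coarse(backend):
    """One model-visible call; the three backend calls happen inside it."""
    def file_and_triage(repo, title, labels=None, assignee=None):
        issue = backend.create_issue(repo, title)
        if labels:
            backend.add_labels(repo, issue["number"], labels)
        if assignee:
            backend.assign(repo, issue["number"], assignee)
        return issue

    file_and_triage("octo/demo", "Login fails", ["bug"], "alice")
    return 1  # model round-trips, counted by hand


for name, run in (("granular", run_granular), ("coarse", run_coarse)):
    backend = MockBackend()
    trips = run(backend)
    print(f"{name}: {trips} model call(s), {len(backend.calls)} backend call(s)")
```

In the granular version, deleting the assign line models a forgetful model. In the coarse version that line lives inside the tool, where the model cannot skip it.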
The nuance the paper is careful about: do not over-bundle. A tool with too many parameters or ambiguous intent is its own failure. Some APIs already encode a workflow (Amazon EC2’s RunInstances is the example), and there a tool-per-API is already workflow-oriented. The judgment call is whether the workflow is deterministic. If it is, bundle it. If it genuinely needs dynamic decisions, that is when the agent-as-tool pattern earns its complexity.
The rules you can put in CI
The remaining tool-design rules are mechanical enough to lint, so I wrote a linter. It checks four things from the guidance: eight parameters or fewer per tool, domain-noun-verb naming so an alphabetical sort clusters related operations, read and write verbs kept distinct so destructive actions are explicit, and a server split once it passes fifty tools.
From naming_and_params.py, the rules as checkable constants:
MAX_PARAMS = 8
MAX_TOOLS_PER_SERVER = 50
NAME_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+){2,}$") # domain_noun_verb
WRITE_VERBS = {"create", "update", "delete", "merge", "run", "remove", "set", "add"}
READ_VERBS = {"get", "list", "search", "read", "describe"}
Run it against the twenty well-designed tools and they pass. Inject a doEverything tool with nine parameters and a non-conforming name, and the linter flags it on both the parameter limit and the naming rule. That is the kind of guard worth running in CI, because tool sprawl happens one well-intentioned addition at a time, which is exactly how the customer I mentioned in the overview ended up with forty tools and a confused agent.
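A linter along these lines could look roughly like the sketch below. The function names and the tool dictionaries are hypothetical, and only the parameter, naming, and server-size checks are shown; how the companion code applies the read and write verb sets is not shown in this post, so that check is left out here.

```python
import re

MAX_PARAMS = 8
MAX_TOOLS_PER_SERVER = 50
NAME_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+){2,}$")  # domain_noun_verb


def lint_tool(tool: dict) -> list[str]:
    """Return rule violations for one tool definition (hypothetical helper)."""
    problems = []
    name = tool["name"]
    params = tool.get("inputSchema", {}).get("properties", {})
    if len(params) > MAX_PARAMS:
        problems.append(f"{name}: {len(params)} parameters exceeds {MAX_PARAMS}")
    if not NAME_RE.match(name):
        problems.append(f"{name}: name is not domain_noun_verb")
    return problems


def lint_server(tools: list[dict]) -> list[str]:
    """Lint every tool, then check the server-level tool count."""
    problems = [p for t in tools for p in lint_tool(t)]
    if len(tools) > MAX_TOOLS_PER_SERVER:
        problems.append(f"server: {len(tools)} tools exceeds {MAX_TOOLS_PER_SERVER}")
    return problems


good = {"name": "github_issue_create",
        "inputSchema": {"properties": {"repo": {}, "title": {}, "body": {}}}}
bad = {"name": "doEverything",
       "inputSchema": {"properties": {f"p{i}": {} for i in range(9)}}}

for finding in lint_server([good, bad]):
    print(finding)
```

In CI, a non-empty result would fail the build, for example by exiting with a non-zero status.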
Why domain-noun-verb is more than tidiness
The naming standard looks cosmetic until you see what it does to a sorted list. Because the noun is the organizational boundary, sorting alphabetically clusters every operation on the same resource: github_issue_create, github_issue_get, github_issue_list, github_issue_update, then github_pullrequest_create, and so on. The model scans a list grouped by resource instead of a jumble, and you get name-collision protection for free. It is a small rule with a real effect on how legibly the toolset presents to the model, which, given the token tax, is the whole game.
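The clustering effect is easy to check. This sketch uses an invented, deliberately shuffled list of names; resource_of is a hypothetical helper that assumes every name has at least three underscore-separated parts.

```python
from itertools import groupby

# Hypothetical tool names in domain_noun_verb form, deliberately shuffled.
names = [
    "github_pullrequest_create",
    "github_issue_update",
    "github_issue_create",
    "github_pullrequest_merge",
    "github_issue_list",
    "github_issue_get",
]


def resource_of(name: str) -> str:
    """Return the domain_noun prefix, e.g. 'github_issue'."""
    domain, noun, _verb = name.split("_", 2)
    return f"{domain}_{noun}"


# A plain alphabetical sort is enough to make each resource contiguous,
# because all names for one resource share the same prefix.
clusters = {res: list(group)
            for res, group in groupby(sorted(names), key=resource_of)}

for resource, tools in clusters.items():
    print(resource, "->", ", ".join(tools))
```

With verb-first names such as create_issue and create_pullrequest, the same sort would group by action instead and scatter each resource across the list.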
Running it yourself
The companion code for this post is on GitHub at stechr/schristoph-blog-samples › 2026-06-12-mcp-strategies/tool-design. It runs with uv:
uv run --with tiktoken --with strands-agents python run_demo.py
That runs all three demos: the token tax, the granular-versus-coarse comparison, and the linter. The token measurement needs tiktoken; the Strands SDK is optional, because the GitHub tools fall back to a no-op decorator without it, so the comparison still runs offline.
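One common way to write such a fallback is sketched below. It assumes the Strands SDK exposes the decorator as `from strands import tool`; the example tool itself is hypothetical, and the companion code may structure its fallback differently.

```python
try:
    from strands import tool  # Strands Agents SDK, if installed
except ImportError:
    def tool(fn):
        """Fallback: return the function unchanged so the demo runs offline."""
        return fn


@tool
def github_repo_describe(repo: str) -> str:
    """Describe a repository (hypothetical example tool)."""
    return f"repo: {repo}"


print(github_repo_describe("octo/demo"))
```

With the fallback in place, the decorated function stays an ordinary Python callable, so the comparison can exercise it directly without an agent.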
Up next: Part 3, Hosting from Local to AWS
Tool design decides what your tools are. Part 3 decides where they live: hosting, from a local stdio server through a remote server bounded to a single AWS account, to a managed gateway. I deploy a real account-bounded MCP server to AWS, connect a SigV4-signing client to it, and tear it down, with the actual output. That is where the security model from the governance pillar starts to bite.
Sources
- Model Context Protocol strategies on AWS (PDF): the tool-design pillar, including the token math, the eight-parameter and fifty-tool limits, and the bundling rule.
- Anthropic, Writing effective tools for AI agents: independent guidance on tool quality and context cost.
- Model Context Protocol specification: the tools/list shape the token counter measures against.
About the Author
Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.
This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.
❤️ Created with the support of AI (Kiro)