AI-Assisted Talk Prep: The Recording-Analysis-Iterate Loop

written by Stefan Christoph

May 28, 2026 - 19-minute read

TL;DR: I prepared my AWS Summit Hamburg talk through 6 recorded dry runs, each analyzed by a structured AI skill. The skill transcribes via Amazon Transcribe, counts fillers with word-level timestamps, checks customer references against a checklist, flags pronunciation issues, identifies filler-dense windows, and proposes tighter alternatives. This post shows the complete skill file with every step explained, the real results (um: 80→27, customer refs: 0/6→4/6), and the key discovery: the skill and the talk improved together. Each dry run made both better. Total cost: $3.60 for 6 analyses.

Why This Exists

I have always done dry runs before conference talks. Record myself, run through the slides, get into the moment. But here is the thing: I never listened to the recordings. The recording was a focus tool. It put me in presentation mode, made me commit to the flow. But the actual audio file? Deleted without playback.

Then there are peer reviews. A colleague watches your dry run and gives feedback. This is genuinely valuable: you get high-level observations (“the middle section drags”), and sometimes you get truly out-of-the-box human feedback that no automated system would produce. “Your body language changed when you hit the cost slide, you looked uncertain.” That kind of insight is irreplaceable.

But peer reviews have two blind spots.

First, they cannot give you the detailed mechanical feedback. No human will count your filler words. No one will tell you that you said “um” 80 times, that your fillers cluster between minutes 13 and 22, or that you mispronounced “Anthropic” twice in the second half when fatigue set in. That level of granularity is invisible to a live observer. They are focused on content and flow, not on counting fillers. And it matters: one study found that interviewers notice filler words in 71% of candidates and interpret them as lack of preparation [7], while speakers with fewer fillers are perceived as more credible and authoritative [6].

Second, they lack your full context. A peer reviewer does not know which customer references you planned to include, which session codes you intended to mention, or what you actually wanted to say on slide 37. They cannot tell you that you forgot to reference Deutsche Bahn in the reliability section because they did not know it was supposed to be there. They evaluate what you said, not the gap between what you said and what you intended.

I wanted both: the irreplaceable human feedback from peer reviews, and the quantified mechanical feedback that only a system with full context can provide. So I built a skill that knows my talk — the speaker notes, the customer references, the pronunciation traps — and analyzes every recording against that intention.

Context Is Everything: The Preparation Before the Skill

The analysis skill did not appear in isolation. Before the first dry run, I had already co-developed two critical context artifacts with my AI assistant:

1. Speaker notes. We jointly wrote the speaker notes for all 69 slides — extracting the re:Invent source material, adapting it for a Summit audience, and scripting the key transitions. The AI knew exactly what I intended to say on each slide.

2. Customer reference mapping. We analyzed the full AI track session catalog — every session, every customer, every time slot — and selected 6 customer references that I could naturally weave into my talk as “teasers” for other sessions [2]. The AI knew which customers belonged to which pillar and which session codes to mention.

This context is what makes the analysis skill powerful. It does not analyze a generic talk. It analyzes my talk against my intentions. It knows that “Delivery Hero” should appear in the cost optimization section, that “HUK-Coburg” belongs in the data privacy section, and that “Infineon” should be pronounced correctly — not as “Finian” or “Infinium.”

The Pipeline: 8 Steps, One Skill File

Here is the complete pipeline. Each step runs sequentially, building on the outputs of previous steps:

Figure: Talk analysis pipeline: Recording → Transcribe → Fillers → References → Pronunciation → Distribution → Timing → Analysis Note. The diagram shows the seven numbered steps; with the speaking pace sub-step (6b) the skill has eight. Each run produces a versioned markdown note that compares against all prior runs.

The entire skill is a single markdown file (~120 lines). Let me walk through each step with the actual instructions.

Step 1: Transcribe

The skill instructions for transcription — what the agent executes when it processes a recording:

# Step 1: Transcribe

Constraints:
- Upload the recording to s3://{bucket}/ via AWS CLI
- Start an Amazon Transcribe job with:
    --language-code en-US
    --output-bucket-name {bucket}
- Poll for completion (15s intervals, max 20 attempts)
- Download the JSON result and extract both the plain transcript
  and the word-level timestamps
- Save the transcript to /tmp/ for analysis
- Report the total duration from ffprobe before transcription starts
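
The polling constraint above (15-second intervals, at most 20 attempts) bounds the wait at roughly five minutes. In the skill the agent does this through use_aws. The following is only a minimal Python sketch of the same loop, assuming a caller-supplied get_status function (hypothetical; in practice it would wrap aws transcribe get-transcription-job and return the job status string). The example simulates the job, so no AWS call is made.

```python
import time
from typing import Callable

def poll_until_done(get_status: Callable[[], str],
                    interval_s: float = 15.0,
                    max_attempts: int = 20,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call get_status until it returns a terminal state or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        # COMPLETED and FAILED are the terminal states of a transcription job.
        if status in ("COMPLETED", "FAILED"):
            return status
        if attempt < max_attempts:
            sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_attempts} attempts")

# Simulated job: two in-progress checks, then completion.
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
waits = []  # records the requested sleeps instead of actually sleeping
result = poll_until_done(lambda: next(fake_statuses), sleep=waits.append)
print(result, waits)
```

Injecting the sleep function keeps the sketch testable without waiting; a real run would use the default time.sleep.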

Why Amazon Transcribe and not a local model? Word-level timestamps. The JSON output includes start and end times for every single word. This is what enables Step 5 (filler distribution by time window) and Step 6b (words-per-minute analysis). Without per-word timing, you can count fillers but you cannot locate them.
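
To make the per-word timing concrete, here is a minimal Python sketch that pulls words and timestamps out of a Transcribe result. It assumes the standard output layout (results.items, with start_time and end_time as strings on spoken words and no timing on punctuation); verify against your own output file. The function name extract_words and the sample data are mine, not part of the skill.

```python
def extract_words(transcribe_result: dict) -> list[dict]:
    """Return spoken words with start/end times in seconds, skipping punctuation."""
    words = []
    for item in transcribe_result["results"]["items"]:
        # Punctuation items carry no start_time/end_time, so skip them.
        if item.get("type") != "pronunciation":
            continue
        words.append({
            "word": item["alternatives"][0]["content"],
            "start": float(item["start_time"]),
            "end": float(item["end_time"]),
        })
    return words

# Tiny hand-written sample in the shape of a Transcribe result (not real data).
sample = {
    "results": {
        "transcripts": [{"transcript": "Um, hello."}],
        "items": [
            {"type": "pronunciation", "start_time": "0.5", "end_time": "0.9",
             "alternatives": [{"confidence": "0.98", "content": "Um"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": ","}]},
            {"type": "pronunciation", "start_time": "1.2", "end_time": "1.6",
             "alternatives": [{"confidence": "0.99", "content": "hello"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": "."}]},
        ],
    }
}

words = extract_words(sample)
print(words)
```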

Tools used: shell for ffprobe (duration check), use_aws for S3 upload and Transcribe API calls (see Tools Under the Hood).

Step 2: Filler Word Count

The filler detection rules — each one added after we spotted the pattern in real transcripts:

# Step 2: Filler Word Count

Constraints:
- Count these fillers as standalone words (not inside other words):
    - "um" (including "Um")
    - "uh" (including "Uh")
    - "right?" (rhetorical — followed by question mark or pause)
    - "so" (sentence starter — after period or at start)
    - "basically"
    - "things like that" / "and things like that"
- Report each count individually AND a combined um+uh total
- Compare against prior runs if analyses exist in the talk folder

The key design decision: “so” only counts as a filler when it starts a sentence (after a period or at the beginning). Mid-sentence “so” is normal connective tissue. This distinction matters: without it, you get false positives that make the count meaningless.
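
The skill leaves this counting to the LLM. As an illustration of the same rules, here is a hedged Python sketch using regular expressions. The patterns are my own reading of the constraints, not taken from the skill file, and I treat "!" and "?" as sentence ends in addition to the period the skill names.

```python
import re

# Patterns mirror the Step 2 constraints; the regexes themselves are illustrative.
FILLER_PATTERNS = {
    "um": r"\bum\b",
    "uh": r"\buh\b",
    "right?": r"\bright\?",
    # "so" counts only at the start of the text or right after a sentence end.
    "so (starter)": r"(?:^|[.!?]\s+)so\b",
    "basically": r"\bbasically\b",
    "things like that": r"\bthings like that\b",
}

def count_fillers(transcript: str) -> dict[str, int]:
    """Count each filler case-insensitively and add the combined um+uh total."""
    counts = {name: len(re.findall(pattern, transcript, flags=re.IGNORECASE))
              for name, pattern in FILLER_PATTERNS.items()}
    counts["um+uh"] = counts["um"] + counts["uh"]
    return counts

# Made-up sample text, not from a real transcript.
text = ("So, um, this is the agent. It is, uh, basically a loop, right? "
        "So we add tools, and so it can plan. Um, and things like that.")
counts = count_fillers(text)
print(counts)
```

In the sample, the mid-sentence "and so it can plan" is not counted, while the two sentence-opening "So" occurrences are.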

This filler list was not designed upfront. I explicitly asked the AI to check for “um” and “uh” after the first run. But when we reviewed the v1 transcript together, we noticed other patterns: “right?” used rhetorically after every other statement, “basically” as a crutch when simplifying, “and things like that” as a vague closer. Each one got added to the checklist after we spotted it in real data. The skill grew with the practice.

The comparison against prior runs is what makes this a progression tool, not a one-shot analysis. Each run builds on the history.

Tools used: LLM reasoning over the transcript text (see Tools Under the Hood). The agent reads the transcript file and counts patterns — no external service needed.

Step 3: Customer References / Checklist

The reference checker — verifying I mentioned the right customers in the right sections:

# Step 3: Customer References / Checklist

Constraints:
- Search the transcript for each item in the checklist
- Also search for common garbled variants:
    "Infineon" → "Finian", "Infinium"
    "HUK-Coburg" → "Hoek", "co-work", "Kubberg"
- Report: found clearly / found but garbled / missing
- Compare against prior runs

The garbled variants are critical. Amazon Transcribe does its best with proper nouns, but “HUK-Coburg” consistently transcribes as “Hoek Kubberg.” Without the variant list, the skill would report it as missing when I actually said it. A live audience would understand what I said. The transcription service does not.

One factor worth noting: I am a German speaker presenting in English to a predominantly German audience. Some of the “garbled” pronunciations are simply my German accent interacting with English proper nouns. “Infineon” is a German company — I pronounce it the German way, which Transcribe interprets as “Infinium.” The audience in Hamburg would understand perfectly. The skill needs to account for this gap between how the transcription model hears me and how a German-speaking audience would hear me.
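
The three-way report (found clearly / found but garbled / missing) can be sketched in a few lines of Python. The variants are the ones listed in the Step 3 constraints; the matching logic (simple case-insensitive substring search) and the sample sentence are my own assumptions for illustration, and the skill itself does this by LLM reasoning.

```python
# Variants come from the Step 3 constraints; the matching logic is a sketch.
CHECKLIST = {
    "Infineon": ["Finian", "Infinium"],
    "HUK-Coburg": ["Hoek", "co-work", "Kubberg"],
    "Delivery Hero": [],
}

def check_references(transcript: str, checklist: dict[str, list[str]]) -> dict[str, str]:
    """Classify each checklist item as found clearly, found but garbled, or missing."""
    lowered = transcript.lower()
    report = {}
    for name, variants in checklist.items():
        if name.lower() in lowered:
            report[name] = "found clearly"
        elif any(variant.lower() in lowered for variant in variants):
            report[name] = "found but garbled"
        else:
            report[name] = "missing"
    return report

# Made-up transcript fragment, not real Transcribe output.
transcript = ("If you want to see how Delivery Hero cut costs, join AIM302. "
              "Hoek Kubberg will cover data privacy.")
report = check_references(transcript, CHECKLIST)
print(report)
```

Short variants such as "Hoek" can match by accident inside other words, so a stricter version might require word boundaries.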

The default checklist for my talk:

customers:
  - Delivery Hero (AIM302, 11:30)
  - HUK-Coburg (AIM310, 14:30)
  - Infineon (AIM311, 14:30)
  - Deutsche Bahn (AIM301, 13:30)
  - N26 (AIM303, 15:30)
  - Siemens (AIM202, 12:30)
pronunciation:
  - Strands Agents (not "strengths")
  - Anthropic (not "Enthropic")
  - AgentCore (not "aging core")
  - Werner Vogels (not "Vernavous")
  - Bedrock (not "patchwork")
  - Infineon (not "Finian")

Each customer maps to a specific session code and time slot. The idea: I reference them in my talk as “if you want to see how Delivery Hero cut their LLM costs by 90%, join AIM302 at 11:30.” This gives the audience actionable next steps and fills other sessions.

Step 4: Pronunciation Check

The pronunciation watchlist — catching mispronunciations that would embarrass on stage:

# Step 4: Pronunciation Check

Constraints:
- Search for known pronunciation issues:
    "strengths agents" (should be "Strands")
    "engentic" (should be "agentic")
    "Enthropic" (should be "Anthropic")
    "aging core" / "Edincore" (should be "AgentCore")
    "Vernavous" (should be "Vogels")
    "batchrock" / "patchwork" (should be "Bedrock")
- Report each occurrence with the correct pronunciation
- Flag new pronunciation issues not seen in prior runs

This step exists because of a discovery in v2: I was consistently saying “strengths agents” instead of “Strands Agents.” The transcription service faithfully captured it. Without this check, I would have said it wrong on stage in front of 300 people.

The “flag new issues” constraint is important. Each run might reveal a new mispronunciation that was not in the original list. The skill adapts.

Step 5: Filler Distribution Analysis

The locator — not just how many fillers, but where and why they cluster:

# Step 5: Filler Distribution Analysis

Constraints:
- Use the word-level timestamps from the Transcribe JSON
  to bucket fillers into 5-minute intervals
- Identify the worst 2-minute windows (highest filler density)
- Extract the actual text from the worst windows
- Identify the anti-pattern (usually: enumerating lists in prose)
- Propose a tighter alternative for each worst window
- State the universal fix:
  "Point at slide → state the principle → pause → move on"

This is the most valuable step. Counting fillers tells you how many. Distribution tells you where and why.

My consistent pattern across 6 runs: fillers cluster in minutes 13-22 (the middle third) where I enumerate AWS features and walk through architecture diagrams. The opening (storytelling, scripted) and closing (call-to-action, scripted) are clean.

The anti-pattern the skill identifies every time: narrating bullet points in prose form, with “um” as the separator between items. Instead of pointing at the slide and stating the principle, I describe what the diagram shows, which requires improvisation, which produces fillers.

Tools used: LLM reasoning over the Transcribe JSON (see Tools Under the Hood). The agent parses the JSON, buckets words by time window, and identifies density peaks — all in-context, no scripts.
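
The skill does the bucketing in-context. For readers who prefer a deterministic version, here is a minimal Python sketch of the same idea: 5-minute buckets plus the densest 2-minute window. It assumes a list of words with start times in seconds (the shape is my assumption), tracks only "um" and "uh" to stay short, and simplifies by only considering windows that begin at a filler.

```python
from collections import Counter

FILLERS = {"um", "uh"}  # subset of the Step 2 list, kept small for the sketch

def filler_times(words: list[dict]) -> list[float]:
    """Start times (seconds) of filler words, ignoring case and punctuation."""
    return [w["start"] for w in words
            if w["word"].lower().strip(".,?!") in FILLERS]

def bucket_counts(times: list[float], bucket_s: int = 300) -> dict[int, int]:
    """Count fillers per bucket; each key is the bucket's starting minute."""
    counts = Counter(int(t // bucket_s) * bucket_s // 60 for t in times)
    return dict(sorted(counts.items()))

def worst_window(times: list[float], window_s: int = 120) -> tuple[float, int]:
    """Densest window that starts at a filler: (start_seconds, filler_count)."""
    times = sorted(times)
    best_start, best_count, right = 0.0, 0, 0
    for left, start in enumerate(times):
        # Advance the right edge while fillers still fall inside the window.
        while right < len(times) and times[right] < start + window_s:
            right += 1
        if right - left > best_count:
            best_start, best_count = start, right - left
    return best_start, best_count

# Synthetic data, not from a real recording.
sample = [
    {"word": "Um", "start": 30.0}, {"word": "agents", "start": 31.0},
    {"word": "um", "start": 800.0}, {"word": "uh", "start": 810.0},
    {"word": "um", "start": 850.0}, {"word": "Um", "start": 900.0},
    {"word": "uh", "start": 1300.0},
]
times = filler_times(sample)
print(bucket_counts(times))
print(worst_window(times))
```

Once the worst window is known, the text to rewrite is simply every word whose start time falls inside it.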

Step 6: Timing Analysis

The time optimizer — finding sections where I over-explain and proposing tighter alternatives:

# Step 6: Timing Analysis

Constraints:
- Identify key section boundaries using transcript text + timestamps
- Flag sections over 3 minutes as potential time sinks
- Extract what was said in long sections and propose tighter alternatives
- Calculate: if all tighter alternatives were used,
  what would the total duration be?

Timing was my biggest challenge. The talk covers four pillars with customer references, architecture diagrams, and live demo explanations — all in 40 minutes. The original re:Invent version was 60 minutes. Cutting 20 minutes of content while keeping the narrative coherent meant every section had to be tight. I genuinely needed to increase my presentation speed while keeping clarity, and the skill helped me find where I was wasting time on over-explanation rather than where I needed to talk faster.

Example from my v6 analysis — the spending report explanation:

What I said (3.6 minutes): Step-by-step narration of both diagram versions — “it has a tool called get current date, which the agent can use to figure out… it would first plan… then invoke the tool… get the data… create a report… 4 LLM invocations… 12 seconds…”

What the skill proposed (1.5 minutes):

“Same agent, same question: ‘create a spending report for next month.’ Version one gives the agent a tool to figure out the current date — 4 LLM calls, 12 seconds. Version two just passes the current date as context — 3 calls, 9 seconds. Same result, 25% faster, cheaper, and one fewer failure point. The lesson: every piece of context you can provide deterministically is one less thing the agent can get wrong.”

That single rewrite saves 2 minutes. The skill found three such opportunities in one recording.
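
The "what would the total duration be" constraint is plain arithmetic. A small Python sketch, using the v6 duration of 43:19 from the results table and the one rewrite quantified above (3.6 minutes spoken, 1.5 minutes proposed). The other two opportunities are not quantified in this post, so they are left out rather than guessed.

```python
def projected_duration(total_min: float, rewrites: list[tuple[float, float]]) -> float:
    """Total duration if each (spoken_min, proposed_min) rewrite were adopted."""
    saved = sum(spoken - proposed for spoken, proposed in rewrites)
    return total_min - saved

def fmt(minutes: float) -> str:
    """Format fractional minutes as M:SS."""
    whole, seconds = divmod(round(minutes * 60), 60)
    return f"{whole}:{seconds:02d}"

v6_total = 43 + 19 / 60      # 43:19 from the results table
rewrites = [(3.6, 1.5)]      # the spending-report rewrite only
projected = fmt(projected_duration(v6_total, rewrites))
print(projected)
```

Adding further (spoken, proposed) pairs to the list extends the projection to every proposed rewrite.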

Step 6b: Speaking Pace Analysis

The pace detector — catching the “slow down without cutting” trap:

# Step 6b: Speaking Pace Analysis

Constraints:
- Calculate words-per-minute for each minute of the talk
  using word-level timestamps
- Calculate overall average WPM and flag minutes that are
  >20 WPM above or below average
- Compare first-half vs second-half average pace
  to detect deliberate slowdowns
- Calculate 5-minute rolling averages to identify pace phases
- Identify the pattern: if the speaker slowed down without
  cutting content, that's the time sink
- Report: "Your natural pace of ~170 wpm is fine.
  Use silent pauses between sections instead of
  slowing down mid-section."

This step caught a subtle mistake I was making: consciously slowing down in the second half to “not rush.” But slowing down without cutting content just makes the talk longer. My natural pace of ~170 WPM is fine for a technical audience. The fix is 2-3 second silent pauses between sections, not slower delivery within sections.
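
A hedged Python sketch of the pace calculation: words per minute from word start times, outlier minutes beyond the 20 WPM threshold, and the first-half versus second-half comparison. The function names and the sample numbers are illustrative; the skill performs this by LLM reasoning over the timestamps, and the final, partial minute of a real recording would need special handling.

```python
from collections import Counter

def wpm_by_minute(word_starts: list[float]) -> dict[int, int]:
    """Words spoken in each minute of the talk, keyed by minute index."""
    return dict(sorted(Counter(int(t // 60) for t in word_starts).items()))

def flag_outliers(per_minute: dict[int, int], threshold: int = 20) -> dict[int, str]:
    """Minutes more than `threshold` WPM above or below the average."""
    average = sum(per_minute.values()) / len(per_minute)
    flags = {}
    for minute, wpm in per_minute.items():
        if wpm > average + threshold:
            flags[minute] = "fast"
        elif wpm < average - threshold:
            flags[minute] = "slow"
    return flags

def half_averages(per_minute: dict[int, int]) -> tuple[float, float]:
    """Average WPM of the first and second half of the talk."""
    minutes = sorted(per_minute)
    mid = len(minutes) // 2
    first = [per_minute[m] for m in minutes[:mid]]
    second = [per_minute[m] for m in minutes[mid:]]
    return sum(first) / len(first), sum(second) / len(second)

# Synthetic per-minute counts, not from a real recording.
pace = {0: 170, 1: 175, 2: 200, 3: 135}
print(flag_outliers(pace))
print(half_averages(pace))
```

A second-half average clearly below the first-half average, with no drop in total duration, is the "slowed down without cutting" pattern described above.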

Step 7: Write Analysis Note

The output format — a versioned note with full progression tracking:

# Step 7: Write Analysis Note

Constraints:
- Write the analysis to the talk folder as
  aim201-dryrun-v{N}-analysis.md (increment version)
- Include:
    - Date, duration, verdict (🟢 good / 🟡 needs work / 🔴 regression)
    - Full progression table (all prior versions + current)
    - Filler word counts with trends
    - Customer reference scorecard
    - Pronunciation issues
    - Filler distribution table + worst windows + tighter alternatives
    - Timing analysis with time-saving proposals
    - Event readiness assessment

The output is a structured markdown note that I can review immediately after the dry run. The progression table is the most valuable artifact — it shows trends across all runs at a glance.
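
The "increment version" rule can be sketched as follows. This is an illustration in Python, assuming the file names in the talk folder follow the aim201-dryrun-v{N}-analysis.md pattern from the constraints; the function name is hypothetical, and the agent itself simply reads the folder and picks the next number.

```python
import re

# Matches the naming scheme from the Step 7 constraints.
PATTERN = re.compile(r"aim201-dryrun-v(\d+)-analysis\.md$")

def next_analysis_name(existing: list[str]) -> str:
    """Return the file name for the next analysis, one past the highest version found."""
    versions = [int(match.group(1)) for name in existing
                if (match := PATTERN.search(name))]
    return f"aim201-dryrun-v{max(versions, default=0) + 1}-analysis.md"

files = ["aim201-dryrun-v1-analysis.md", "aim201-dryrun-v2-analysis.md", "notes.md"]
next_name = next_analysis_name(files)
print(next_name)
```

Taking the maximum rather than counting files keeps the numbering correct even if an earlier analysis was deleted.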

The Results: v1 → v6

Metric	v1	v2	v3	v4	v5	v6	Change
Duration	40:44	45:53	40:10	41:46	43:36	43:19	Stable
“um”	~80	64	59	25	32	27	-66%
“uh”	~10	21	12	7	16	15	Stable
“right?”	~25	24	8	10	18	11	-56%
“so” (starter)	~20	59	66	42	57	73	+265% 🔴
Customer refs	0/6	1/6	2/6	3/6	2/6	4/6	0 → 4
um+uh combined	~90	85	71	32	48	42	-53%

The “so” increase is the filler substitution effect — the most surprising finding. You cannot just suppress one filler. You need to replace it with silence. That requires deliberate practice of pausing, which is a different skill than “try not to say um.”
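
The substitution effect shows up directly in the arithmetic. A short Python sketch that recomputes the Change column from the v1 and v6 counts in the table and adds the change in the sum of the four tracked fillers. The v1 values are the approximate ones from the table, so the percentages are approximate too.

```python
# v1 and v6 counts copied from the results table (v1 values are approximate).
v1 = {"um": 80, "uh": 10, "right?": 25, "so (starter)": 20}
v6 = {"um": 27, "uh": 15, "right?": 11, "so (starter)": 73}

def pct_change(before: int, after: int) -> int:
    """Percentage change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)

changes = {name: pct_change(v1[name], v6[name]) for name in v1}
total_change = pct_change(sum(v1.values()), sum(v6.values()))
print(changes)
print(total_change)
```

On these numbers "um" falls by about 66%, but the sum of the four fillers falls by only about 7% (from roughly 135 to 126), because sentence-starting "so" absorbs most of the gain.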

The AI also proposed specific exercises: practice the worst 2-minute windows in isolation, deliberately inserting 2-second pauses where fillers appeared. Record just those segments, not the full talk. This targeted practice is more effective than running the entire 40 minutes again — and it is something no human coach would have the data to suggest with this precision.

Eight Insights From Six Runs

These accumulated in the “Key Insights” section of the skill file — each one discovered through the data, not through intuition:

1. Filler substitution is real. Reducing “um” increases “so” or “uh.” Net filler load stays the same until you practice silent pauses specifically. This aligns with the academic recommendation to “render pauses silent instead of verbal” through content chunking [4].
2. Fatigue causes regression. The 5th run of the day regresses to v2 levels. Three runs per day maximum, with breaks between.
3. Fillers cluster where you enumerate. Storytelling sections are clean. Feature enumeration sections are filler-dense. The fix is structural (fewer items, stated as principles) not behavioral (try harder).
4. The anti-pattern: narrating bullet points. “Um” appears as a separator between items when you describe what the slide shows instead of stating the insight. Fix: point at slide, state principle, pause, move on.
5. Customer references need scripting. References that fit naturally into content flow work (Delivery Hero in cost section). Deliberate transition references need to be on speaker notes.
6. Pronunciation regresses with fatigue. Fresh = fine. Fifth run = “Enthropic” returns. Do not over-practice.
7. Over-explaining the slide is the #1 time sink. The audience can read. State the insight, not the diagram.
8. Slowing down adds time, it does not save it. Fewer words at natural pace (170 WPM) plus silent pauses between sections. Not slower delivery within sections.

The Cost

Component	Per Run	6 Runs
Amazon Transcribe	~$0.50	$3.00
LLM analysis (Bedrock)	~$0.10	$0.60
Total	$0.60	$3.60

For less than a coffee, you get quantified, comparable, actionable feedback on every practice run. No human rehearsal audience will give you filler counts bucketed by 5-minute intervals with proposed rewrites for the worst windows.

If You Are Running This on AWS

The setup is lightweight. I recorded and reviewed everything on my laptop. The heavy lifting — transcription and LLM inference — runs in the cloud. The split:

Local (your laptop):

Recording the dry run (any audio app, I used QuickTime)
Reading the analysis note
Iterating on the talk based on findings
Running the Kiro CLI agent that orchestrates everything

Cloud (AWS):

Amazon Transcribe — speech-to-text with word-level timestamps
S3 — upload recordings, store transcription results
Amazon Bedrock (Claude) — the analysis step: skill prompt + transcript + prior analyses as context

The skill file is ~120 lines of markdown. No code, no deployment, no infrastructure beyond what AWS provides out of the box. The agent reads the skill, follows the constraints, and produces the analysis note. Total latency: about 3 minutes from “analyze this recording” to reading the finished analysis.

Beyond Conference Talks

The pattern generalizes to any performance that can be recorded:

Customer presentations — did you hit all value propositions? Did you mention the next steps?
Sales calls — talk-to-listen ratio, question frequency, competitor mentions
Team standups — which topics expand? Who dominates airtime?
Training sessions — pace variation, jargon density, audience engagement markers

The key is always the same: define what to measure in a skill file, run it consistently, and track progression over time. The AI does not replace practice. It makes practice measurable.

The Skill Evolved With the Practice

One thing I want to emphasize: the skill you see in this post is not the skill I started with. Version 1 had three steps: transcribe, count “um”, check timing. That was it.

Every insight in the “Key Insights” section was discovered during a dry run and immediately encoded back into the skill. The pronunciation check was added after v2 revealed “strengths agents.” The filler distribution step was added after v3 when I wanted to know where the fillers were, not just how many. The speaking pace analysis was added after v4 when I noticed I was slowing down but the talk was getting longer.

The skill and the talk improved together. Each dry run made the talk better and made the skill better. By v6, the skill was catching things I would never have thought to check in v1. This co-evolution is the real pattern — you do not design the perfect analysis upfront. You discover what matters through practice.

What’s Next

Two things I want to try:

1. Record the actual stage performance. I analyzed 6 dry runs but not the real thing. The Summit talk itself — with audience energy, adrenaline, and the pressure of 300 people watching — would be the most valuable recording to analyze. How does stage performance compare to solo practice? Do the filler patterns hold? Does the timing improve (audience laughter creates natural pauses) or worsen (ad-libbing under pressure)? Next conference, I am recording the live version.

2. Apply this to customer meetings. I already record customer meetings (with consent) for meeting minutes. The recording exists. The context exists — I prepare meeting notes with objectives, talking points, and expected outcomes before every customer call. That is the same setup: context + recording + structured analysis. After the meeting, I could run the same pattern: did I cover all the points? What was my talk-to-listen ratio? Did I ask enough questions? Where did I over-explain? This is the natural extension — not just conference talks, but every important conversation where I want to improve.

Tools Under the Hood

The skill is a markdown file. It contains no code — only structured instructions. But it orchestrates real infrastructure through the agent’s built-in tools:

Step	Tool	What It Does
Duration check	`shell` → `ffprobe`	Measures recording length before uploading
S3 upload	`use_aws` → `s3 cp`	Uploads the .m4a to the transcription bucket
Start transcription	`use_aws` → `transcribe start-transcription-job`	Kicks off Amazon Transcribe with word-level timestamps
Poll for completion	`use_aws` → `transcribe get-transcription-job`	Checks status every 15s until complete
Download result	`use_aws` → `s3 cp`	Pulls the JSON transcript back to local
Analysis (Steps 2-6)	LLM reasoning	Pattern matching, counting, bucketing — all done by the model reading the transcript
Write note	`write`	Saves the structured analysis to the talk folder
Read prior runs	`read`	Loads previous analysis notes for comparison

What is use_aws? It is a built-in tool in Kiro CLI, the command-line agent that accompanies Kiro, Amazon’s agentic IDE. It wraps the AWS CLI, letting the agent make any AWS API call using your configured credentials (~/.aws/credentials, SSO, or IAM roles). It is not custom or specific to my setup: anyone with Kiro CLI installed has access to it. The agent calls aws transcribe start-transcription-job the same way you would from a terminal.

What is shell? Another Kiro CLI built-in. It executes terminal commands — in this case, ffprobe for measuring audio duration. Available to everyone.

What is read / write? Built-in file system tools. Read files, write files. The agent uses them to load prior analysis notes (for comparison) and save the new analysis.

The key insight: Steps 2-6 require no external tools at all. Once the transcript JSON is local, the entire analysis — filler counting, timestamp bucketing, checklist matching, pronunciation detection, timing proposals — is done by the LLM reading the data and following the skill’s constraints. The “intelligence” is in the skill file, not in custom scripts.

What you need to replicate this:

Kiro CLI (free)
AWS account with Amazon Transcribe access
An S3 bucket for recordings
ffprobe installed locally (comes with ffmpeg)
The skill file (120 lines of markdown)

That is it. No MCP servers, no custom code, no deployment pipeline.

Sources

[1] Amazon Transcribe documentation — word-level timestamps and speaker identification. https://docs.aws.amazon.com/transcribe/latest/dg/how-input.html

[2] Stefan Christoph, “The AI Track at AWS Summit Hamburg 2026: From Demo to Deployment,” schristoph.online, April 2026. https://schristoph.online/blog/ai-track-summit-hamburg-2026/ — The blog post that maps all customer sessions and informed the reference checklist.

[3] AWS Summit Hamburg 2026 — Session Catalog and Agenda. https://aws.amazon.com/events/summits/hamburg/agenda/

[4] Douglas R. Seals and McKinley E. Coppock, “We, um, have, like, a problem: excessive use of fillers in scientific speech,” Advances in Physiology Education, Vol. 46, No. 4, 2022. https://journals.physiology.org/doi/full/10.1152/advan.00110.2022 — Peer-reviewed paper on filler word causes and reduction strategies: self-awareness, reinforcing feedback, and “chunking” content to render pauses silent.

[5] Toastmasters International, “Prep Talk” — on the importance of practicing out loud because “your mind and mouth need to get used to working as partners.” https://www.toastmasters.org/magazine/magazine-issues/2019/oct/prep-talk

[6] Brigham Young University research (cited in Hyperbound, 2026): listeners perceive speakers who use fewer filler words as more credible and authoritative.

[7] University of Michigan study (cited in Resumly, 2026): interviewers notice filler words in 71% of candidates and interpret them as lack of preparation.

[8] Stefan Christoph, “From demo to deployment: solving agentic AI’s toughest challenges,” AWS Summit Hamburg 2026 (AIM201). The talk this skill was built to prepare.

[9] Stefan Christoph, “Intelligence Is Collective, Not Artificial,” schristoph.online, May 2026. https://schristoph.online/blog/intelligence-is-collective-not-artificial/

[10] Stefan Christoph, “Architecting Skills: How Code Makes AI Agents More Reliable Over Time,” schristoph.online (upcoming). https://schristoph.online/blog/architecting-skills-reliability/

About the Author

Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.

This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.

Cross-posted to LinkedIn

❤️ Created with the support of AI (Kiro)