From a Generic Voice to My Own: Self-Hosting a TTS Model on Amazon SageMaker

written by Stefan Christoph

June 10, 2026 - 14 minutes read

TL;DR: Last time, the demo video for my agentic-payments post was narrated by Amazon Polly: a clean, managed, recognizably synthetic voice. This time the same demo is narrated in my own voice, cloned from a 30-second recording by an open-weights Qwen3-TTS model I deployed myself on Amazon SageMaker. The post walks through the async, scale-to-zero endpoint that hosts it, the real deploy and invoke code, the one gotcha that cost me 20 minutes (the autoscaler will not wake a scaled-to-zero endpoint on a single queued request unless you add a second policy), and an honest look at the cost trade-off versus Polly and the ethics of cloning a voice, even your own.

Quick disclaimer: I’m a solutions architect who builds things to understand them. This is a builder’s field report, not authoritative guidance, and definitely not legal or ethics advice on synthetic voice. If I’ve got something wrong, tell me.

The Voice Was the Tell

A couple of weeks ago I built a research agent that pays for content on its own. The post did well, but a few people made the same offhand comment about the demo video: the narration sounded like a robot. They were right. It was Amazon Polly, the Matthew neural voice. Amazon Polly is genuinely good, but a managed TTS gives you a fixed set of voices, and a fixed voice is never your voice.

So I asked a different question: what would it take to narrate the exact same demo in my own voice, without sending my voice to anyone else’s API? The answer turned into a small project. I deployed an open-weights text-to-speech model on Amazon SageMaker, cloned my voice from a 30-second clip, and re-rendered the demo. Same pixels, same walkthrough, different narrator.

This post is the build log. It covers the architecture, the code that actually runs, what surprised me, and the part that is tempting to skip: whether cloning a voice is something you should do at all.

The Model: Qwen3-TTS, Open Weights You Control

The model is Qwen3-TTS, an Apache-2.0 release from Alibaba’s Qwen team [1]. I picked it for one reason: the weights are open, so I can run it on my own AWS account and feed it my own voice without a third party in the loop. Two checkpoints do the work:

Qwen3-TTS-12Hz-1.7B-CustomVoice for named speakers with instruct-style control (“calm, professional”).
Qwen3-TTS-12Hz-1.7B-Base for cloning a voice from a short reference clip.

Both are small (1.7B parameters, bfloat16) and fit together on a single 24 GB GPU, which keeps the hosting decision simple: one instance, one endpoint, both models loaded.
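A rough back-of-the-envelope check on that claim, counting weights only. It assumes 2 bytes per parameter for bfloat16 and ignores activations, the audio codec, and CUDA runtime overhead, so treat the remaining headroom as approximate rather than measured.

```python
# Weights-only memory estimate. Activations, codec buffers, and CUDA
# overhead are NOT included, so real usage will be higher than this.
PARAMS_PER_MODEL = 1.7e9   # 1.7B parameters per checkpoint
BYTES_PER_PARAM = 2        # bfloat16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 24         # single 24 GB GPU


def weights_gb(num_models: int) -> float:
    """Approximate weight memory in decimal gigabytes for num_models checkpoints."""
    return num_models * PARAMS_PER_MODEL * BYTES_PER_PARAM / 1e9


both = weights_gb(2)
print(f"both checkpoints: ~{both:.1f} GB of weights on a {GPU_MEMORY_GB} GB GPU")
```

Even with generous allowance for runtime overhead, two checkpoints of this size leave most of the card free, which is why a single instance is enough.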

Deploying It on Amazon SageMaker Async Inference

First, the two services in play. Amazon SageMaker AI is AWS’s fully managed machine-learning service for building, training, and deploying models on managed, scalable infrastructure without running your own servers. Deploying a model on it gives you an endpoint: an HTTPS API that serves your model on instances Amazon SageMaker AI provisions and scales for you. It offers several inference options: real-time (always-on, synchronous, lowest latency), serverless (CPU-only, pay-per-request), batch transform (offline bulk jobs), and the one I used here, asynchronous inference.

Here is the choice that shaped everything: asynchronous inference, not a real-time endpoint.

Amazon SageMaker async inference queues incoming requests and processes them by reading the payload from Amazon S3 and writing the result back to S3 [2]. AWS positions it for large payloads (up to 1 GB), long processing times (up to one hour), and near-real-time rather than interactive latency [2]. The feature I actually cared about: an async endpoint can autoscale its instance count down to zero when idle, so you pay nothing for compute between runs [2].

That fits voice generation for a demo almost perfectly. I render a video maybe once a week. Between renders, a GPU instance sitting idle would cost real money. With async and min-zero scaling, the endpoint is defined and ready, costs nothing while I’m not using it, and spins an instance up when a request lands.

Figure: Amazon SageMaker async inference architecture. The client uploads input JSON to S3 and calls InvokeEndpointAsync, the endpoint reads the payload and writes the wav result back to S3, and the client polls for it. The async flow: S3 in, S3 out, and a GPU endpoint that scales to zero between requests.

The endpoint configuration is what makes it async. You attach an AsyncInferenceConfig that names an S3 output path (and a failure path) when you create the endpoint:

From sagemaker/deploy.py — the deploy that makes the endpoint asynchronous:

from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig

model = HuggingFaceModel(
    model_data=model_data,          # model.tar.gz with code/inference.py
    role=role,
    transformers_version="4.49.0",
    pytorch_version="2.6.0",
    py_version="py312",
    env=env,
    sagemaker_session=sess,
)

async_cfg = AsyncInferenceConfig(
    output_path=f"s3://{bucket}/output/",
    failure_path=f"s3://{bucket}/error/",
    max_concurrent_invocations_per_instance=2,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",   # 1x NVIDIA A10G, 24 GB
    endpoint_name="qwen3-tts-async",
    async_inference_config=async_cfg,
    wait=False,
)

The presence of an AsyncInferenceConfig on the endpoint configuration is the switch: an endpoint with it set only accepts asynchronous invocations [2]. To get the scale-to-zero behavior, you register the endpoint variant with Application Auto Scaling and set MinCapacity=0, then attach a target-tracking policy on the ApproximateBacklogSizePerInstance metric, which is the metric AWS recommends for async endpoints [3]:

From sagemaker/deploy.py — registering scale-to-zero:

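import boto3

# `name` (the endpoint name) and `region` are assumed to be defined earlier in the script
aas = boto3.client("application-autoscaling", region_name=region)
rid = f"endpoint/{name}/variant/AllTraffic"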

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=rid,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,          # <-- scale all the way down to zero
    MaxCapacity=1,
)

aas.put_scaling_policy(
    PolicyName="qwen3-tts-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=rid,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 2.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)

The handler that runs inside the container loads both models once at startup and routes each request on a mode field. The clone path is a single call:

From sagemaker/inference.py — the voice-clone branch of predict_fn:

if mode == "voice_clone":
    ref_path = _materialize_ref_audio(data["ref_audio"])   # b64, s3://, or https
    wavs, sr = models["base"].generate_voice_clone(
        text,
        language=language,
        ref_audio=ref_path,
        ref_text=data.get("ref_text", ""),
    )

That ref_text argument matters more than it looks, and it’s where I lost the most time. More on that below.
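For completeness, here is a minimal sketch of the client side of the async flow: upload the request JSON to S3, queue it with InvokeEndpointAsync, then poll S3 for the result. This is not the code from my repo. The payload keys mirror the fields the handler reads (mode, text, language, ref_audio, ref_text); the `input/` prefix, the `"English"` language value, the timeouts, and all function names are assumptions for illustration. The clients are passed in so the functions work with `boto3.client("s3")` and `boto3.client("sagemaker-runtime")`.

```python
import json
import time
import uuid
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def submit_clone_request(s3, runtime, endpoint_name, bucket,
                         text, ref_audio, ref_text, language="English"):
    """Upload the request JSON to S3 and queue it on the async endpoint.

    Returns the S3 URI where SageMaker will write the result.
    """
    payload = {
        "mode": "voice_clone",
        "text": text,
        "language": language,     # assumed value format; check the model docs
        "ref_audio": ref_audio,   # b64, s3://, or https, per the handler
        "ref_text": ref_text,
    }
    key = f"input/{uuid.uuid4()}.json"   # hypothetical input prefix
    s3.put_object(Bucket=bucket, Key=key,
                  Body=json.dumps(payload).encode("utf-8"))
    resp = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/json",
    )
    return resp["OutputLocation"]


def wait_for_output(s3, output_uri, timeout_s=1800, poll_s=10, sleep=time.sleep):
    """Poll S3 until the result object exists, then return its bytes."""
    bucket, key = split_s3_uri(output_uri)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
        if listing.get("KeyCount", 0) > 0:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        sleep(poll_s)
    raise TimeoutError(f"no result at {output_uri} after {timeout_s}s")
```

The generous default timeout is deliberate: with a scale-to-zero endpoint, the first request after an idle period has to wait for an instance launch and the model loads before any synthesis happens. A fuller client would also watch the failure path configured on the endpoint, so errors surface as errors instead of timeouts.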

What Async Inference Actually Costs You

Scale-to-zero is not free in every sense. It trades money for latency, and it’s worth being honest about both sides before you copy this pattern.

Dimension	Async, scale-to-zero	Real-time endpoint	Serverless inference
Idle cost	Zero when scaled to zero	Pay per hour, always on	Zero (pay per request)
Cold start	Seconds to minutes (instance launch + container + model load)	None while warm	Seconds; no GPU option
Large payloads / long jobs	Up to 1 GB, up to ~1 hour, via S3 [2]	Capped (6 MB sync invoke)	Smaller limits, short timeouts
Request handling	Queued in S3, processed when capacity exists [2]	Synchronous, immediate	Synchronous, immediate
GPU support	Yes	Yes	No (CPU only today)

For my workload (bursty, GPU-bound, latency-tolerant, runs a few times a week) async wins clearly. For an interactive product where a user is waiting on each utterance, the cold start would be unacceptable and a real-time endpoint or a managed service is the better call. Serverless inference would be attractive for the zero-idle-cost property, but it has no GPU option today, and a 1.7B TTS model on CPU is too slow to be useful.

The Gotcha: A Scaled-to-Zero Endpoint Did Not Wake Up

Here is the part the docs warned about and I learned the hard way. With my endpoint scaled to zero, I fired the first synthesis request and watched it sit. Queued. Twenty minutes later my client gave up with a timeout, and the instance count was still zero.

The cause is documented behavior, not a bug. The target-tracking policy I attached scales on ApproximateBacklogSizePerInstance with a target value of 2. A single queued request produces a backlog of 1, which never exceeds the target, so the policy had no reason to scale up from zero. The Amazon SageMaker docs say this directly: if you don’t add the optional step-scaling policy that scales up from zero on new requests, “your endpoint only initiates scaling up from zero after the number of backlog requests exceeds the target tracking value” [3].

In other words: backlog-target scaling alone is enough to scale down to zero, but to reliably wake up from zero on the very first request you need a second policy. I forced the instance up by hand (setting the desired count to 1) to finish the render, and the synthesis completed in about three minutes once the instance was live. The fix for a real deployment is the optional scale-up-from-zero policy the docs describe. Worth knowing before you build a scale-to-zero endpoint and assume the first caller will wake it.
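A minimal sketch of what that second policy could look like, following the pattern in the autoscaling guide [3] as I understand it: a step-scaling policy that adds one instance, triggered by a CloudWatch alarm on the HasBacklogWithoutCapacity metric. The policy name, alarm name, and the cooldown and evaluation values are assumptions to tune, not values from my deployment. The functions take the clients as arguments, so they work with `boto3.client("application-autoscaling")` and `boto3.client("cloudwatch")`.

```python
def scale_from_zero_policy(resource_id: str) -> dict:
    """Step-scaling policy arguments: add one instance when the alarm fires."""
    return {
        "PolicyName": "qwen3-tts-scale-from-zero",   # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Average",
            "Cooldown": 300,   # assumed; seconds between scaling activities
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}
            ],
        },
    }


def backlog_alarm(endpoint_name: str, policy_arn: str) -> dict:
    """Alarm arguments: fire when requests are queued but no instance exists."""
    return {
        "AlarmName": f"{endpoint_name}-has-backlog-without-capacity",  # hypothetical
        "MetricName": "HasBacklogWithoutCapacity",
        "Namespace": "AWS/SageMaker",
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 2,
        "DatapointsToAlarm": 2,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "AlarmActions": [policy_arn],
    }


def attach_scale_from_zero(aas, cloudwatch, endpoint_name, variant="AllTraffic"):
    """Create the step policy, then wire the alarm to it. Returns the policy ARN."""
    rid = f"endpoint/{endpoint_name}/variant/{variant}"
    policy_arn = aas.put_scaling_policy(**scale_from_zero_policy(rid))["PolicyARN"]
    cloudwatch.put_metric_alarm(**backlog_alarm(endpoint_name, policy_arn))
    return policy_arn
```

The two policies split the work: the step policy handles the zero-to-one wake-up, and the backlog target-tracking policy handles everything after that, including the scale-in back to zero.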

Last verified: Amazon SageMaker async inference behavior (S3 queueing, 1 GB / 1-hour limits, scale-to-zero) and the scale-up-from-zero autoscaling caveat were checked against the “Asynchronous inference” and “Autoscale an asynchronous endpoint” AWS docs [2][3] on 2026-06-09.

Self-Hosting vs a Managed Voice: An Honest Trade

Replacing Amazon Polly with a model I host myself sounds like a clear win until you list what Amazon Polly was quietly doing for me.

Dimension	Self-hosted Qwen3-TTS	Amazon Polly
Voice control	Any voice, including a clone of mine	Fixed catalog of voices
Data path	Voice stays in my account	Text (and reference audio, for some features) goes to a managed API
Ops burden	I own the endpoint, the GPU, the scaling policy, the cold start	None; fully managed
Latency	Cold start unless kept warm	Low, consistent
Cost shape	GPU-hour while active, zero when idle	Per character, no idle cost, no floor
Failure modes	Mine to debug (see the gotcha above)	AWS’s to operate

Amazon Polly is the right default for most narration. The moment self-hosting earns its keep is when you need something the managed service does not offer, and “narrate this in my own voice” is exactly that. If all I wanted was a pleasant generic voice, I would have stayed on Amazon Polly and saved myself a GPU endpoint and an afternoon of debugging.

The Demo, in My Voice

Here is the same demo from the first post, re-narrated in my cloned voice:

The agentic-payments demo, re-narrated in my own voice cloned by a self-hosted Qwen3-TTS model. The walkthrough is identical to the first post; only the narrator changed.

The interesting part is not the model. It’s the pipeline that puts the voice back onto the screen recording without the narration drifting out of sync.

Figure: Demo recording pipeline. A completion-gated Playwright capture produces the screen video and scene boundaries, the cloned voice synthesizes seven narration segments, each is measured, then scenes are reconciled and stitched with audio laid back-to-back. The recording pipeline: capture once, measure the audio, then reconcile each scene to its narration so the voice can’t drift.

The naive approach (record the screen, then overlay narration with fixed delays) drifts, because a cloned voice produces segments of different lengths than the voice the recording was timed for, and the error accumulates across segments. The pipeline I settled on inverts the problem: it captures the demo completion-gated (waiting for the UI to settle, not for a fixed clock), measures each narration segment’s real duration, then re-times each scene to match its segment and lays the audio back-to-back. Every segment’s audio starts exactly at its scene boundary, so the narration cannot drift no matter which voice I use.
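The measure-and-reconcile step can be sketched in a few lines. This is an illustration of the idea, not the pipeline's actual code: it assumes each scene is re-timed by changing its playback speed (a real pipeline might instead pad or freeze frames), and the `ScenePlan` and `reconcile` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScenePlan:
    audio_start: float   # seconds into the final video where this segment starts
    video_speed: float   # playback speed applied to the captured scene
    duration: float      # final scene length, equal to its narration length


def reconcile(scene_durations, audio_durations):
    """Re-time each captured scene to the measured length of its narration.

    Audio is laid back-to-back, so every segment starts exactly where the
    previous one ends and timing errors cannot accumulate across scenes.
    """
    if len(scene_durations) != len(audio_durations):
        raise ValueError("need exactly one narration segment per scene")
    plans = []
    cursor = 0.0
    for scene, audio in zip(scene_durations, audio_durations):
        if scene <= 0 or audio <= 0:
            raise ValueError("durations must be positive")
        # speed > 1 compresses a scene that ran long; speed < 1 stretches a short one
        plans.append(ScenePlan(audio_start=cursor,
                               video_speed=scene / audio,
                               duration=audio))
        cursor += audio
    return plans


# Two captured scenes of 10 s and 8 s, narrated in 12.5 s and 6 s.
for plan in reconcile([10.0, 8.0], [12.5, 6.0]):
    print(plan)
```

Because the scene length is derived from the audio rather than the other way round, a slower or faster voice changes the video's total length but never the alignment.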

That last property is the point. The recording stack is voice-agnostic. Swapping Amazon Polly for a named speaker like “Aiden”, or for my cloned voice, is a one-flag change, and the sync holds every time. That is what makes “re-narrate the same demo in a different voice” a 3-minute job instead of a re-edit.

Should You Clone a Voice At All?

This is the part I want to be careful about, because voice cloning is genuinely double-edged and pretending otherwise would be dishonest.

The case for it, here

I cloned my own voice, from my own recording, with my own consent, to narrate my own demo. That is about as clean as the consent question gets. The voice stays in my AWS account; nothing is sent to a third party. The output is labelled as a synthetic narration of my voice, not passed off as a live recording.

There’s also a point worth being plain about: my voice is already public. I’ve given recorded talks that live on the internet as video and audio, and anyone with those clips and a capability like this one could already clone it. Publishing this demo adds no exposure that wasn’t there the moment I spoke on a recorded stage. That doesn’t make voice cloning harmless in general — it just means that, in my specific case, the incremental risk of this particular post is essentially zero. If anything, the realistic threat model for most people with any public recording is the same: the raw material is already out there.

The case against it, in general

The same 30-second-clip capability that lets me narrate a demo lets someone clone a voice they have no right to. Voice is biometric and personal. A convincing clone can be used for fraud, for putting words in someone’s mouth, for impersonation that bypasses voice-based authentication. The barrier to entry is now a short audio clip and a GPU, and that is a real societal problem, not a hypothetical one.

I don’t think “the tech exists, so it’s fine” is a serious position, and neither is “it can be misused, so nobody should touch it.” The line I’m comfortable with, and the one I’d argue for:

Consent is the dividing line. Cloning your own voice, or a voice you have explicit permission to use, is categorically different from cloning someone else’s.
Disclose synthetic audio. If a voice is machine-generated, say so. This post does; the video is described as a cloned-voice narration, not a recording of me at a microphone.
Keep the data in your control. Self-hosting means my voice sample never leaves my account. That doesn’t solve the societal problem, but it does mean I’m not handing my biometric to anyone.

None of that makes the misuse risk go away. It’s the reason consent, provenance, and disclosure matter more as the technology gets easier, not less. I built this in the open, on my own voice, and I’m telling you exactly how it works, because the alternative (pretending the capability isn’t widely available) helps no one.

The Cost Reality

For a demo workload, this is close to free:

Component	Cost
`ml.g5.xlarge` (A10G 24 GB), us-east-1	~$1.41 / hour, billed per second while an instance is running
Idle compute (scaled to zero)	$0
S3 async input/output	Negligible (KB-MB per request)
Model weights	$0 (Apache-2.0, no token, pulled at container start)

A full demo render is a handful of short synthesis calls, a few minutes of GPU time, so it costs cents. The endpoint sits scaled-to-zero the rest of the time and costs nothing for compute. The trade is the cold start: the price of paying zero while idle is waiting for an instance (and two model loads) when you come back.

Scope: the hourly rate is a planning figure from the AWS Price List API for Amazon SageMaker ml.g5.xlarge in us-east-1 (effective 2026-05-01). Confirm against the Amazon SageMaker pricing page for your region before you build a budget on it.
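The arithmetic behind “it costs cents”, as a small sketch. The hourly rate is the planning figure from the table; the minute counts and the 730-hour month are illustrative assumptions, and the active time should include whatever the instance spends starting up and waiting out the scale-in cooldown, not only the synthesis itself.

```python
HOURLY_RATE_USD = 1.41    # planning figure for ml.g5.xlarge in us-east-1
HOURS_PER_MONTH = 730     # common approximation for an always-on month


def render_cost(active_minutes: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Compute cost of one render, given total minutes the instance is up."""
    return hourly_rate * active_minutes / 60


def always_on_monthly(hourly_rate: float = HOURLY_RATE_USD,
                      hours: float = HOURS_PER_MONTH) -> float:
    """Compute cost of keeping one instance running all month."""
    return hourly_rate * hours


synthesis_only = render_cost(3)        # about three minutes of synthesis
with_overhead = render_cost(3 + 10)    # plus an assumed 10 min of startup and cooldown
print(f"synthesis only:  ${synthesis_only:.2f}")
print(f"with overhead:   ${with_overhead:.2f}")
print(f"always-on month: ${always_on_monthly():.2f}")
```

Even with pessimistic overhead, a weekly render stays well under the cost of a single always-on day, which is the whole argument for scale-to-zero on this workload.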

If You’re Running This on AWS

The whole thing is four moving parts:

Service	Role	Why
Amazon SageMaker (async inference)	Hosts the model, queues via S3, scales to zero	Pay only while synthesizing; GPU when you need it
Amazon S3	Async request/response transport	Handles payloads the 6 MB sync invoke limit can’t
Application Auto Scaling	`MinCapacity=0` + backlog target tracking	Scale-to-zero (plus a scale-up-from-zero policy, per the gotcha)
ml.g5.xlarge (A10G 24 GB)	The compute	Fits both 1.7B models in bf16 on one GPU

Deploy is a quota precheck, a role and bucket, a packaged model.tar.gz, and the AsyncInferenceConfig shown above. The endpoint is IAM-authenticated and invoked through the Amazon SageMaker API; there’s no public URL in front of it, and the S3 I/O bucket has public access blocked. Tag everything for cost attribution and tear the endpoint down (or leave it scaled to zero) when you’re done.

Where This Goes Next

I now have a voice-agnostic demo-recording stack: capture a walkthrough once, narrate it in any voice (managed or cloned), and the audio stays locked to the action. That’s a small piece of infrastructure, but it turns “make a narrated demo” from a half-day edit into a script.

If there’s interest, I’d like to write up the recording pipeline itself as its own piece (the completion-gated capture, the measure-and-reconcile sync, and the voice-parameterized renderer), because the sync problem is more broadly useful than TTS. If that’s something you’d want to read or use, tell me in the comments and I’ll prioritize it.

This builds on Part 2 of the Agentic Commerce series, I Built the Agent That Pays, whose demo this re-narrates. Part 1: HTTP 402: The 30-Year Placeholder That AI Agents Finally Activated.

Sources

[1] Qwen3-TTS — open-weights (Apache-2.0) text-to-speech model family from Alibaba’s Qwen team. https://github.com/QwenLM/Qwen3-TTS

[2] Asynchronous inference — Amazon SageMaker AI Developer Guide (S3 queueing, 1 GB payload / 1-hour limits, scale-to-zero, AsyncInferenceConfig). https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html

[3] Autoscale an asynchronous endpoint — Amazon SageMaker AI Developer Guide (MinCapacity=0, ApproximateBacklogSizePerInstance target tracking, and the scale-up-from-zero caveat). https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-autoscale.html

About the Author

Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.

This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.

Cross-posted to LinkedIn

❤️ Created with the support of AI (Kiro)