Your AI Models Have an Expiry Date — A Practical Guide to Model Lifecycle Management
Introduction — The Promise I Made
In my previous article [1], I explored the maintenance trap in IT — how software systems are more like plants than stones, requiring constant care. I ended with a cliffhanger: “What is open from the article is how to specifically test and evaluate models — something to be picked up in the next article.”
This is that article.
Since publishing the first piece, something happened that made this topic very real for many of my customers. Anthropic announced the deprecation of Claude 3.5 Sonnet — a model that had become the backbone of countless production applications. Teams that had built their systems around a specific model version suddenly faced a hard deadline to migrate. Some were prepared. Most were not.
The reactions I observed ranged from mild annoyance to genuine panic. And this is just the beginning. As AI models become critical infrastructure components, the question is no longer if your model will be deprecated, but when — and whether you’ll be ready.
The Lifecycle Compression Problem
Let’s put numbers on this. Traditional enterprise software operates on generous support timelines. Microsoft provides five years of mainstream support followed by five years of extended support. Cisco gives six months advance notice before End of Sale dates. Server hardware follows five to seven-year lifecycle patterns.
AI models? A completely different story.
Amazon Bedrock guarantees a minimum of 12 months from launch to End-of-Life. Azure OpenAI provides 12 months of Generally Available support plus six months of extended access. Google Cloud sets retirement dates with new access blocked one month before. OpenAI retired its first multimodal model in February 2026 with minimal notice.
We went from a decade to a year. That’s a 6-10x compression in lifecycle duration.
And it’s not just one model. Modern applications typically use three to five different AI models [2], each on its own deprecation timeline. The combinatorial complexity of managing multiple model lifecycles simultaneously is something most organizations haven’t even begun to think about.
Understanding the Lifecycle — Bedrock as an Example
Amazon Bedrock provides a structured lifecycle that serves as a good reference model for how managed AI services handle deprecation. Every model on Bedrock exists in one of three states:
Active — The model provider is actively working on this version. Full access to inference, fine-tuning, and provisioned throughput.
Legacy — The model will be retired. It remains available for at least six months before the EOL date. You can continue using existing deployments, but you cannot create new fine-tuning jobs or new provisioned throughput endpoints.
End-of-Life — The model is gone. API calls return errors.
The key policy: once a model launches on Bedrock, it will remain available for at least 12 months before the EOL date [3]. This is your planning horizon.
What makes this manageable is that Bedrock exposes the lifecycle programmatically. The GetFoundationModel and ListFoundationModels APIs return a modelLifecycle field with the current state. You can — and should — build automated monitoring around this. An EventBridge rule that alerts your team when a model you depend on transitions to Legacy gives you a six-month head start on migration planning.
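As a minimal polling alternative to the EventBridge rule, the check can be sketched in a few lines. The `modelLifecycle` field is the documented one; the helper names and the example model IDs are my own:

```python
def flag_non_active(model_summaries, dependent_ids):
    """Return (modelId, status) pairs for dependencies that are no longer ACTIVE."""
    alerts = []
    for summary in model_summaries:
        status = summary.get("modelLifecycle", {}).get("status", "ACTIVE")
        if summary["modelId"] in dependent_ids and status != "ACTIVE":
            alerts.append((summary["modelId"], status))
    return alerts

def check_dependencies(dependent_ids, region="us-east-1"):
    import boto3  # imported here so the filter above stays unit-testable offline
    bedrock = boto3.client("bedrock", region_name=region)
    summaries = bedrock.list_foundation_models()["modelSummaries"]
    return flag_non_active(summaries, dependent_ids)
```

Run `check_dependencies({"anthropic.claude-3-5-sonnet-20240620-v1:0"})` on a schedule and page your team on any non-empty result.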
The Architecture That Saves You — Abstraction Layers

Abstraction layers decouple your application from specific model versions
The single most important architectural decision for model lifecycle management is one you need to make before you deploy your first model to production: decouple your business logic from specific models.
AWS Prescriptive Guidance puts it clearly: “Use model selection abstraction layers — decouple business logic from specific models to enable dynamic routing, fallback, or cost-performance tuning over time.” [4]
In practice, this means three things:
1. Use the Converse API
If you’re on Amazon Bedrock, the Converse API [5] provides a consistent interface across all models. Your application code talks to one API. Switching models means changing a model ID string — not rewriting your integration layer. This is the minimum viable abstraction.
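A minimal sketch of what that looks like, assuming the model ID lives in configuration rather than code (the helper names are illustrative; `converse` is the documented API):

```python
MODEL_ID = "anthropic.claude-sonnet-4-5-20250929-v1:0"  # the only thing a migration changes

def build_messages(prompt):
    """Converse-format message list for a single user turn."""
    return [{"role": "user", "content": [{"text": prompt}]}]

def ask(prompt, model_id=MODEL_ID, region="us-east-1"):
    import boto3
    runtime = boto3.client("bedrock-runtime", region_name=region)
    response = runtime.converse(
        modelId=model_id,
        messages=build_messages(prompt),
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```

The same `ask()` call works unchanged whether the string points at a Claude, Nova, or Llama model.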
2. Build Model Fallback Chains
Model fallback chains are backup models that activate automatically when primary models fail or become unavailable. The mechanism detects errors, rate limits, or outages and routes requests to pre-configured alternatives [6]. Think of it as a load balancer for intelligence — if Model A starts returning errors because it’s been retired, Model B takes over seamlessly.
Salesforce’s Agentforce platform demonstrates what mature failover looks like: gateway-level failovers, soft failovers, and circuit breakers in a multi-layered design [7]. You don’t need to be Salesforce to implement this pattern, but you do need to think about it before your primary model disappears.
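A bare-bones version of the pattern, with the provider call injected so the chain logic itself is testable (function names are my own, not from any of the cited platforms):

```python
def with_fallback(prompt, chain, call):
    """Try each model ID in order; call(model_id, prompt) raises on failure."""
    last_error = None
    for model_id in chain:
        try:
            return model_id, call(model_id, prompt)
        except Exception as exc:  # retired model, throttling, outage — move on
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error

def bedrock_call(model_id, prompt, region="us-east-1"):
    """Concrete call target for the chain, using the Converse API."""
    import boto3
    runtime = boto3.client("bedrock-runtime", region_name=region)
    response = runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

In production you would call `with_fallback(question, ["primary-id", "secondary-id"], bedrock_call)` and log which model actually answered.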
3. Leverage Intelligent Prompt Routing
Amazon Bedrock provides a native routing capability that dynamically routes requests between models within the same family based on query complexity [8]. It predicts response quality per request and routes to the best model — optimizing for both quality and cost. The service can reduce costs by up to 30% without compromising accuracy.
The lifecycle benefit? When a new model version becomes available, the router can incorporate it automatically. When an old version approaches EOL, traffic naturally shifts away. This is model lifecycle management happening at the infrastructure level, without application code changes.
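From the application side, using a router is almost invisible: you pass the router's ARN where a model ID would normally go, and the Converse response trace reports which model was actually invoked. A sketch (the ARN is a placeholder for your own router):

```python
ROUTER_ARN = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/anthropic.claude:1"

def routed_model(response):
    """Which underlying model the router invoked, taken from the Converse trace."""
    return response.get("trace", {}).get("promptRouter", {}).get("invokedModelId")

def ask_router(prompt, router_arn=ROUTER_ARN, region="us-east-1"):
    import boto3
    runtime = boto3.client("bedrock-runtime", region_name=region)
    response = runtime.converse(
        modelId=router_arn,  # a router ARN goes where a model ID normally would
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"], routed_model(response)
```

Logging `routed_model()` per request gives you a running picture of how traffic shifts between versions over time.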
The Evaluation Imperative — Going Beyond Vibes
Here’s where most teams fall short. They have an abstraction layer. They know a model is being deprecated. They even have a candidate replacement model identified. But they don’t have a systematic way to verify that the replacement actually works for their use case.
When Andrej Karpathy coined the term “vibe coding” in February 2025 — describing a workflow where you “fully give in to the vibes and forget that the code even exists” — he captured something real about how we interact with AI. The AWS Public Sector team applied the same concept to evaluation, calling it “vibe testing” — validating AI solutions based on emotional reactions to model outputs instead of quantitative criteria [9]. “It seems to work fine” is not an evaluation strategy. It’s a prayer.
The fix is to establish evaluation baselines before you need to migrate. When a model transitions to Legacy, you should already have the data to compare against.
Building Your Evaluation Harness
Amazon Bedrock Evaluations provides three complementary approaches:
Programmatic evaluations compute scores using built-in metrics — accuracy, robustness, toxicity. Run the same evaluation dataset against your current model and the candidate replacement. Compare scores. This is your first gate.
LLM-as-a-Judge uses a second model to score the generator model’s responses on metrics like correctness, completeness, helpfulness, and responsible AI criteria. This delivers human-like evaluation quality with up to 98% cost savings compared to human evaluation [10]. You can define custom metrics for your specific business requirements — brand voice alignment, domain accuracy, regulatory compliance.
Human evaluations bring subject-matter experts into the loop for high-stakes use cases where automated metrics aren’t sufficient. Healthcare, legal, financial services — anywhere the cost of a wrong answer exceeds the cost of human review.
The critical feature for lifecycle management: Bring Your Own Inference (BYOI). You can evaluate models running anywhere — on Bedrock, other clouds, or on-premises [11]. This means your evaluation harness isn’t locked to a single provider’s ecosystem.
The Evaluation Workflow for Model Migration

The evaluation workflow for model migration: Baseline → Candidate → Compare → Shadow → Canary → Rollout
Based on the patterns I’ve seen work in practice, here’s the workflow:
1. Baseline — Run your evaluation suite against the current production model. Programmatic metrics, LLM-as-a-Judge, and domain-specific custom metrics. Store the results. This is your reference point.
2. Candidate — Run the exact same suite against the replacement model. Same dataset, same metrics, same guardrail configuration. No shortcuts.
3. Compare — Side-by-side analysis. Bedrock Evaluations lets you view multiple evaluation jobs together. Look for regressions, improvements, and edge cases.
4. Shadow — Deploy the candidate in shadow mode against production traffic. Log responses without serving them to users. Compare quality, latency, and cost in real-world conditions.
5. Canary — Route a small percentage of traffic to the candidate. Monitor for regressions that didn’t surface in evaluation.
6. Rollout — Full migration with rollback capability. Keep the old model available as a fallback during the transition period.
This isn’t theoretical. The AWS blog on migrating from Claude 3.5 Sonnet to Claude 4 Sonnet [12] walks through exactly this pattern — including prompt engineering adjustments, new capability enablement, and phased deployment strategies.
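The Compare gate reduces to a simple threshold check once you have aggregate scores per metric. A minimal sketch (metric names and the tolerance value are illustrative):

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Metrics where the candidate scores worse than baseline by more than tolerance.

    baseline, candidate: dicts mapping metric name -> mean score.
    Returns a dict of regressed metrics -> (baseline_score, candidate_score).
    """
    regressions = {}
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        if cand_score is not None and cand_score < base_score - tolerance:
            regressions[metric] = (base_score, cand_score)
    return regressions
```

Wire a non-empty return value to a failing CI check and the migration pipeline stops itself before the Shadow stage.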
Evaluating Agents — A Harder Problem
For agentic AI systems, evaluation gets more complex. You’re not just testing a model’s response to a prompt — you’re testing a chain of reasoning, tool invocations, and multi-step workflows.
Two approaches stand out:
Ragas with LLM-as-a-Judge [13] provides an open-source framework for evaluating Bedrock Agents on RAG, text-to-SQL, and chain-of-thought capabilities. It evaluates the full agent trajectory, not just the final output.
Lightweight evals without a framework — Gunnar Grosch published an excellent Builder article [14] with practical evaluation patterns that don’t require a full eval framework. It comes with a demo repository you can fork and adapt. The key insight: you don’t need a massive evaluation infrastructure to start. You need a curated set of test cases and a systematic way to run them.
SageMaker — When You Own the Model
Everything above applies to foundation models consumed via APIs. But many organizations also train, fine-tune, and deploy their own models. This is where Amazon SageMaker’s lifecycle management tooling comes in.
SageMaker Model Registry provides a central catalog for model versions — think of it as a package manager for ML models. Each model group contains versioned model packages with metadata, approval status, and lineage tracking. The staging construct lets you define custom lifecycle stages (Development → Testing → Production) with IAM-enforced permissions at each transition [15].
The power is in the automation. EventBridge integration means you can trigger deployment pipelines when a model moves to “Approved” status. Model Monitor detects data drift and concept drift in real-time, alerting you when a model’s performance degrades — whether from changing data patterns or because a newer model version would perform better.
For organizations using both Bedrock and SageMaker — which is increasingly common — the SageMaker Model Registry can track which Bedrock foundation model version is used in each deployment, creating a unified lifecycle view across provider-owned and customer-owned models.
The Governance Dimension
Model lifecycle management doesn’t exist in a vacuum. It sits within a broader governance framework that’s becoming a regulatory requirement, not just a best practice.
The EU AI Act entered into force on 1 August 2024, with obligations phasing in over 36 months. The timeline matters for model lifecycle management specifically:
- February 2025: Prohibited AI practices banned (social scoring, manipulative AI). Violations: up to €35 million or 7% of global annual revenue.
- August 2025: General-purpose AI (GPAI) model provider obligations apply — technical documentation, training data summaries, copyright compliance. Violations: up to €15 million or 3% of global turnover.
- August 2026: High-risk AI system obligations fully apply — risk management, data governance, human oversight, and critically: automatic event logging (Article 12). Violations: up to €15 million or 3% of global turnover.
Article 12 is the one that directly connects to model lifecycle management. It mandates automatic event recording for high-risk AI systems, requiring logs to be retained for at least 180 days and enabling reconstruction of decisions and system behavior. In practice, this means you need to know which model version produced which output, when it was evaluated, and who approved the transition. This isn’t a best practice — it’s a legal requirement for high-risk systems operating in the EU.
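In practice, "know which model version produced which output" boils down to one structured log line per request. A minimal sketch of such a record — the field names are my own, not mandated by the Act:

```python
import datetime
import json

def audit_record(model_id, prompt_sha256, output_sha256, eval_job_arn, approved_by):
    """One JSON log line linking an output to the model version that produced it."""
    return json.dumps({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_id": model_id,              # exact model version, not an alias
        "prompt_sha256": prompt_sha256,    # hash rather than raw text, for privacy
        "output_sha256": output_sha256,
        "evaluation_job_arn": eval_job_arn,  # the baseline this version was approved on
        "approved_by": approved_by,
    })
```

Shipped to CloudWatch with a retention policy of at least 180 days, records like this let you reconstruct any decision chain an auditor asks about.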
Beyond the EU, the Colorado AI Act takes effect in 2026, and ISO/IEC 42001 provides the first certifiable AI management system standard [17] — giving organizations a structured framework to demonstrate compliance.
Despite this regulatory momentum, only about 24% of organizations have AI governance programs in place [18]. The gap between adoption and governance is staggering — and it’s a gap that model lifecycle management directly helps close, because the evaluation baselines, migration records, and version tracking you build for operational reasons double as the audit trail regulators will ask for.
SageMaker Model Cards provide immutable records of intended model uses, risk ratings, training details, and evaluation results. Bedrock’s logging and monitoring capabilities feed into CloudWatch and CloudTrail. Together, they create the documentation trail that satisfies both operational and compliance requirements.
Hands-On: Try It Yourself
Theory is great, but builders want to get their hands dirty. Here are three concrete paths you can follow today, from simplest to most comprehensive.
Path 1: Compare Two Models in 15 Minutes (Bedrock Console)
This is the fastest way to experience model evaluation. No code required.
Step 1 — Prepare your dataset. Create a JSONL file with prompts that represent your use case. Each line is a JSON object with a prompt key. You can have up to 1,000 prompts per evaluation job. Start with 20-50 that cover your critical scenarios.
```json
{"prompt": "Summarize the key risks of deploying AI in healthcare."}
{"prompt": "Write a Python function that validates an email address."}
{"prompt": "Explain the difference between RAG and fine-tuning to a non-technical stakeholder."}
```
If you’re evaluating for a model migration, use real prompts from your production system — not synthetic ones.
Step 2 — Upload to S3. Put your .jsonl file in an S3 bucket in the same region as your Bedrock models. Enable CORS on the bucket (required for console-based evaluations) [19].
Step 3 — Create the evaluation job. Open the Bedrock console → Evaluations → Create evaluation job. Select “LLM-as-a-Judge” as the evaluation type. Pick your current production model as the generator. Select an evaluator model (Amazon Nova Pro or Claude work well). Choose metrics — start with correctness, completeness, and helpfulness.
Step 4 — Run a second job with the candidate model. Same dataset, same evaluator, same metrics — just swap the generator model to the one you’re considering migrating to.
Step 5 — Compare. Use the Bedrock Evaluations compare feature to view both jobs side by side. Look at the histogram of scores and read the evaluator’s explanations for the first few prompts. You’ll immediately see where the candidate model improves, matches, or regresses.
That’s it. You now have a quantitative comparison between two models on your actual use case. No framework, no infrastructure, no code.
Path 2: Automate Evaluation with Python SDK
When you need to run evaluations programmatically — for example, as part of a CI/CD pipeline that triggers when a model transitions to Legacy — use the Bedrock Python SDK.
```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create an automated evaluation job using built-in (programmatic) metrics
response = bedrock.create_evaluation_job(
    jobName="claude-migration-eval-2026-03",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "my-production-prompts",
                        "datasetLocation": {
                            "s3Uri": "s3://my-eval-bucket/prompts.jsonl"
                        }
                    },
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Robustness"
                    ]
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-sonnet-4-5-20250929-v1:0",
                    "inferenceParams": "{\"inferenceConfig\": {\"maxTokens\": 512, \"temperature\": 0.0}}"
                }
            }
        ]
    },
    outputDataConfig={
        "s3Uri": "s3://my-eval-bucket/results/"
    }
)

eval_job_arn = response["jobArn"]
print(f"Evaluation job started: {eval_job_arn}")
```
Run this for both your current and candidate model, then compare the results in S3. You can parse the output JSONL to compute aggregate scores and flag regressions automatically.
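The parsing step can be sketched as a small aggregation function. The exact output schema can vary by job type, so the field names used here (`automatedEvaluationResult`, `scores`) are assumptions — inspect one output file from your own job and adjust accordingly:

```python
import json
from collections import defaultdict

def aggregate_scores(jsonl_lines):
    """Average each metric's score across the records of one eval output file."""
    totals, counts = defaultdict(float), defaultdict(int)
    for line in jsonl_lines:
        record = json.loads(line)
        for score in record.get("automatedEvaluationResult", {}).get("scores", []):
            totals[score["metricName"]] += score["result"]
            counts[score["metricName"]] += 1
    return {name: totals[name] / counts[name] for name in totals}
```

Feed the baseline and candidate aggregates into a regression check and you have the automated gate described in the migration workflow above.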
Tip: Wrap this in a Lambda function triggered by an EventBridge rule that fires when GetFoundationModel returns modelLifecycle: LEGACY for a model you depend on. Fully automated migration readiness assessment.
Path 3: Full Lifecycle with SageMaker Model Registry
For organizations that train or fine-tune their own models, SageMaker provides the complete lifecycle management infrastructure.
Step 1 — Create a Model Group. This is your container for all versions of a model solving a specific problem.
```python
import boto3

sm = boto3.client("sagemaker")

response = sm.create_model_package_group(
    ModelPackageGroupName="my-text-classifier",
    ModelPackageGroupDescription="Production text classifier — all versions"
)
```
Step 2 — Register model versions. Each trained model becomes a versioned package in the group.
```python
response = sm.create_model_package(
    ModelPackageGroupName="my-text-classifier",
    ModelPackageDescription="v3 — retrained on Q1 2026 data",
    InferenceSpecification={
        "Containers": [{
            "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:v3",
            "ModelDataUrl": "s3://my-models/v3/model.tar.gz"
        }],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"]
    },
    ModelApprovalStatus="PendingManualApproval"
)
```
Step 3 — Set up the staging construct. Define lifecycle stages with permissions — only ML engineers can move models to Testing, only the ML lead can approve for Production.
```python
response = sm.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelLifeCycle={
        "Stage": "Development",
        "StageStatus": "InProgress",
        "StageDescription": "Initial training complete, pending evaluation"
    }
)
```
Step 4 — Automate with EventBridge. When a model’s stage changes, trigger your evaluation pipeline, deployment workflow, or notification.
An EventBridge rule pattern for lifecycle transitions:

```json
{
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelLifeCycle": {
            "Stage": ["Production"],
            "StageStatus": ["Approved"]
        }
    }
}
```
Step 5 — Monitor in production. SageMaker Model Monitor detects data drift and quality degradation. When it fires an alert, that’s your signal to evaluate whether a newer model version or a different foundation model would perform better.
Which Path Should You Start With?
| Your situation | Start with |
|---|---|
| Using Bedrock foundation models, no custom training | Path 1 (console) → Path 2 (automate) |
| Training/fine-tuning your own models | Path 3 (SageMaker) |
| Using both Bedrock and custom models | Path 2 + Path 3 combined |
| Just want to understand the concept | Path 1 — takes 15 minutes |
The key insight: start with Path 1 today, even if you think you need Path 3 eventually. Having any evaluation baseline is infinitely better than having none when the deprecation email arrives.
A Practical Checklist
If you take away one thing from this article, let it be this checklist:
- Use the Converse API (or equivalent abstraction) — never hardcode model-specific integrations
- Monitor model lifecycle states — automate alerts when models you depend on transition to Legacy
- Build evaluation baselines now — don’t wait for a deprecation announcement
- Implement fallback chains — at minimum, configure a secondary model for every primary model
- Version-control your prompts — prompts are code, treat them accordingly
- Run shadow tests before migrating — production traffic reveals what evaluation datasets miss
- Document everything — model versions, evaluation results, migration decisions. Your future auditor will thank you.
Conclusion — The New Normal
AI model lifecycle management is not a one-time project. It’s a continuous discipline, like security or performance optimization. The models you deploy today will be deprecated. The question is whether you’ve built systems that can evolve gracefully or systems that will break suddenly.
The good news: the tooling is maturing fast. Bedrock’s lifecycle states, Converse API, and Intelligent Prompt Routing provide the infrastructure layer. SageMaker’s Model Registry and staging constructs provide the governance layer. Bedrock Evaluations provides the quality assurance layer. The pieces are there.
The challenge is organizational, not technical. It requires treating model lifecycle management as a first-class concern from day one — not as an afterthought when the deprecation email arrives.
As I wrote in the first article: software systems are more like plants than stones. AI models? They’re more like cut flowers. Beautiful, powerful, and temporary. Plan accordingly.

Stones need no maintenance. Plants thrive with care. Cut flowers are beautiful but temporary.
What’s your approach to model lifecycle management? Are you already running automated evaluations, or still in “vibe testing” mode? I’d love to hear your experiences — drop a comment or reach out directly.
Sources
[1] IT System Maintenance in the Age of AI: https://schristoph.online/blog/it-system-maintenance-in-the-age-of-ai/
[2] Advanced AI Cost Optimization Strategies: https://www.aicosts.ai/blog/advanced-ai-cost-optimization-strategies-2025-enterprise-guide
[3] Amazon Bedrock Model Lifecycle: https://docs.aws.amazon.com/bedrock/latest/userguide/model-lifecycle.html
[4] AWS Prescriptive Guidance — Prompt, Agent, and Model Lifecycle Management: https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-serverless/prompt-agent-and-model.html
[5] Amazon Bedrock Converse API: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html
[6] Model Fallback Chains: https://www.operion.io/learn/component/model-fallback-chains
[7] Salesforce Agentforce Failover Design: https://www.salesforce.com/blog/failover-design/
[8] Amazon Bedrock Intelligent Prompt Routing: https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-routing.html
[9] Going Beyond Vibes — Evaluating Bedrock Workloads for Production: https://aws.amazon.com/blogs/publicsector/going-beyond-vibes-evaluating-your-amazon-bedrock-workloads-for-production/
[10] LLM-as-a-Judge on Amazon Bedrock: https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/
[11] Bedrock Evaluations GA with BYOI: https://aws.amazon.com/blogs/machine-learning/evaluate-models-or-rag-systems-using-amazon-bedrock-evaluations-now-generally-available/
[12] Migrate from Claude 3.5 Sonnet to Claude 4 Sonnet on Amazon Bedrock: https://aws.amazon.com/blogs/machine-learning/migrate-from-anthropics-claude-3-5-sonnet-to-claude-4-sonnet-on-amazon-bedrock/
[13] Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-Judge: https://aws.amazon.com/blogs/machine-learning/evaluate-amazon-bedrock-agents-with-ragas-and-llm-as-a-judge/
[14] Evaluating Agent Output Quality — Lightweight Evals Without a Framework: https://builder.aws.com/content/3ARUgrl7zf386B8vucsOyASQRty/evaluating-agent-output-quality-lightweight-evals-without-a-framework
[15] SageMaker Model Registry Staging Construct: https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-staging-construct.html
[16] EU AI Act Compliance: https://www.rock.law/international-ai-regulations-compliance-eu-ai-act-gdpr-global-governance/
[17] ISO/IEC 42001 AI Management System Standard: https://www.surecloud.com/blog-hub/building-trustworthy-ai-iso-42001-global-ai-laws-2026
[18] AI Governance Regulations 2025-2026: https://learn.g2.com/ai-regulations
[19] Required CORS Permissions for Bedrock Evaluations: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-security-cors.html
[20] Bedrock Evaluation Prompt Datasets: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html
Cross-posted to LinkedIn