A Model Deprecation Validated My Platform Design

I built an invoice extraction platform on AWS — not a proof of concept, but a working MVP with a four-zone data lake, event-driven orchestration, failure handling, and an analytics layer.

Then I used it to answer a question I care about as an engineer: can a foundation model compete with a purpose-built extraction service when both are measured against the same contract, the same corpus, and the same operational standards?

The original challenger was Claude 3 Haiku via Bedrock. It showed promise. Then Anthropic announced the model was being deprecated on Bedrock. What felt like a setback turned into the most useful validation of the platform’s architecture: the champion/challenger lane absorbed a full model swap without a rewrite.

The replacement, Amazon Nova Lite, didn’t just keep pace. On the 250-document benchmark it was roughly 28× cheaper per invoice and 6× faster on average than the Textract baseline.

No vibes. No slides. Just data.

The platform

Before getting into the benchmark, the platform matters. The comparison is only credible because both extraction paths run through the same architecture.

Common flow: S3 (ingest) → SQS → Step Functions → Extraction → DynamoDB → Glue → Athena

Textract extraction: S3 → Step Functions → Lambda → Textract (AnalyzeExpense) → DynamoDB → … Purpose-built for structured document extraction.

Challenger extraction: S3 → Lambda → Bedrock (Nova Lite) → DynamoDB → … A foundation model handling the extraction directly, producing the same target JSON through a reusable challenger lane.

Same input. Same canonical JSON contract. Same downstream pipeline. The only variable is the extraction engine.

What’s underneath

S3 data lake with four zones: raw, processed, curated, quarantine
SQS for ingestion buffering and backpressure, with a DLQ for messages that repeatedly fail processing
Step Functions orchestrating the extraction workflow with explicit failure classification — transient failures retry, permanent failures quarantine
DynamoDB for status tracking, idempotency, and content-hash duplicate detection
Glue for JSON → Parquet curation, on schedule, deliberately decoupled from extraction
Athena for analytics queries against the curated layer
CloudWatch alarms, dashboards, and SNS alerting
IAM with least-privilege roles scoped per service boundary
12 reusable Terraform modules composed through a single MVP root
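
The retry-versus-quarantine split in the Step Functions item above can be sketched in a few lines. This is an illustration, not the platform's code: the error names, the helper functions, and the retry limit are all assumptions made for the example.

```python
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"   # worth retrying
    PERMANENT = "permanent"   # send to the quarantine zone


# Hypothetical error names; a real deployment would list the error codes
# its own extractors and runtime actually emit.
TRANSIENT_ERRORS = {"ThrottlingException", "ServiceUnavailable", "Timeout"}


def classify_failure(error_name: str) -> FailureClass:
    """Treat known capacity/runtime errors as transient, everything else as permanent."""
    if error_name in TRANSIENT_ERRORS:
        return FailureClass.TRANSIENT
    return FailureClass.PERMANENT


def next_action(error_name: str, attempt: int, max_attempts: int = 3) -> str:
    """Retry transient failures until attempts run out; quarantine everything else."""
    if classify_failure(error_name) is FailureClass.TRANSIENT and attempt < max_attempts:
        return "retry"
    return "quarantine"


print(next_action("ThrottlingException", attempt=1))  # retry
print(next_action("ValidationError", attempt=1))      # quarantine
```

In this sketch a transient failure that exhausts its attempts is also quarantined, so no document is dropped silently. That behavior is an assumption of the example.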

Why the architecture matters for the benchmark

A benchmark without operational context is just a cost calculator. The platform gives both extractors the same conditions:

Quarantine zone — documents that fail validation don’t disappear silently. They’re preserved with metadata for triage and replay. Most MVPs skip this entirely.
Idempotency — content-hash duplicate detection in DynamoDB ensures the same invoice doesn’t get processed twice, regardless of which extractor runs it.
Failure classification — separates “the model couldn’t do the work” from “the runtime wasn’t ready.” This distinction is critical when comparing extractors at scale.
Reusable challenger lane — isolated S3 buckets and DynamoDB tracking per contender, so swapping models doesn’t pollute prior benchmark data.
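
As a rough sketch of the idempotency idea above, the check amounts to hashing the file contents and claiming that hash exactly once. The in-memory tracker stands in for the DynamoDB table, and every name here is hypothetical.

```python
import hashlib


class InMemoryTracker:
    """Stand-in for the DynamoDB tracking table (hypothetical interface)."""

    def __init__(self):
        self._seen = {}  # content hash -> first object key that claimed it

    def claim(self, content_hash: str, object_key: str) -> bool:
        """Return True if this hash is new, False if it was already claimed."""
        if content_hash in self._seen:
            return False
        self._seen[content_hash] = object_key
        return True


def content_hash(document_bytes: bytes) -> str:
    """Hash the file contents, so a renamed copy of an invoice still matches."""
    return hashlib.sha256(document_bytes).hexdigest()


def should_process(tracker: InMemoryTracker, object_key: str, document_bytes: bytes) -> bool:
    """Process a document only if its content has not been seen before."""
    return tracker.claim(content_hash(document_bytes), object_key)


tracker = InMemoryTracker()
print(should_process(tracker, "invoices/a.pdf", b"same bytes"))       # True
print(should_process(tracker, "invoices/a-copy.pdf", b"same bytes"))  # False
```

Against DynamoDB, the check-and-claim would normally be a single conditional write (a put with an `attribute_not_exists` condition on the key) rather than a read followed by a write, so two concurrent workers cannot both claim the same hash.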

The deprecation that became a validation

The original challenger was Claude 3 Haiku. It showed credible signal on a 60-document comparison — faster and cheaper than Textract, worth evaluating further.

Then came the notice that Claude 3 Haiku was being deprecated on Bedrock. I evaluated the upgrade options:

Claude 3.5 Haiku — not available on Bedrock
Claude 4.5 Haiku — available, but slower than Claude 3 Haiku for this use case and roughly 5× more expensive

Neither was a clean replacement. So I looked at what else was on Bedrock and landed on Amazon Nova Lite — a lightweight multimodal model built for high-volume, cost-sensitive workloads.

Because the challenger lane was designed as a reusable path with a pluggable model ID, the swap was a configuration change. The S3 buckets, DynamoDB tracking, canonical JSON contract, prompt discipline, and scoring criteria all stayed the same. Only the model changed.
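
One way to picture "the swap was a configuration change" is a small lane object built entirely from configuration. The following is a sketch under assumptions: the class, the environment variable names, the resource naming scheme, and the placeholder model IDs are invented for illustration and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengerLane:
    """Everything that varies per contender; the rest of the pipeline is shared."""
    name: str       # short lane label, used to isolate storage per contender
    model_id: str   # model identifier, passed through to the extraction call unchanged

    @property
    def results_bucket(self) -> str:
        return f"invoice-challenger-{self.name}"          # hypothetical naming scheme

    @property
    def tracking_table(self) -> str:
        return f"invoice-challenger-{self.name}-status"   # hypothetical naming scheme


def lane_from_env(env: dict) -> ChallengerLane:
    """Build the lane from configuration instead of hard-coding a model."""
    return ChallengerLane(name=env["CHALLENGER_NAME"], model_id=env["CHALLENGER_MODEL_ID"])


# Placeholder values: swapping the contender changes configuration only.
old_lane = lane_from_env({"CHALLENGER_NAME": "haiku", "CHALLENGER_MODEL_ID": "placeholder-model-a"})
new_lane = lane_from_env({"CHALLENGER_NAME": "nova-lite", "CHALLENGER_MODEL_ID": "placeholder-model-b"})
print(new_lane.results_bucket)  # invoice-challenger-nova-lite
```

Deriving bucket and table names from the lane label is what keeps one contender's results from mixing with another's in this sketch.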

That’s the design paying off. A forced migration with zero architectural disruption.

What I measured

Four dimensions:

Required field coverage — did the output include the fields I need?
JSON completeness — was the rendered structure well-formed and complete?
Latency — how long per document?
Cost — what does extraction actually cost per successful invoice?
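
To make two of these dimensions concrete, here is a minimal sketch of how required field coverage and cost per successful invoice could be computed. The field names and the exact definitions are assumptions for illustration; the project's scoring code may differ.

```python
# Hypothetical required fields; the real canonical schema is not reproduced here.
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "vendor_name", "total_amount")


def required_field_coverage(extracted: dict, required=REQUIRED_FIELDS) -> float:
    """Fraction of required fields that are present and neither null nor empty."""
    present = sum(1 for name in required if extracted.get(name) not in (None, ""))
    return present / len(required)


def cost_per_successful_invoice(total_cost: float, successes: int) -> float:
    """Spread the whole run's extraction spend over the documents that succeeded."""
    if successes == 0:
        raise ValueError("no successful invoices to divide cost across")
    return total_cost / successes


sample = {
    "invoice_number": "A-1001",
    "invoice_date": None,        # model returned null rather than guessing
    "vendor_name": "",
    "total_amount": "10.00",
}
print(required_field_coverage(sample))  # 0.5
```

Dividing by successes rather than by attempts means failed documents still count against the lane's cost, which is one reasonable reading of "cost per successful invoice".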

Customer-facing conclusions are scoped to two completed, defensible runs: a 60-doc comparison and a 250-doc benchmark slice. Larger 1,500-doc experiments are excluded because they mixed extraction behavior with unfinished runtime architecture. Choosing what to exclude is as important as choosing what to measure.

60-document comparison

First signal that Nova Lite deserved serious attention.

Metric	Textract	Nova Lite
Avg latency/doc	30.55s	6.63s
Cost/invoice	~$0.01	~$0.00036

Nova Lite was roughly 4.6× faster and 28× cheaper per invoice on the small slice. That doesn’t prove production readiness. It proves the model is worth benchmarking seriously.

250-document benchmark

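The larger slice reinforced the signal: per-invoice cost was unchanged and Nova Lite’s average latency stayed near 6.6s, while Textract’s rose from 30.55s to 41.85s.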

Lane	Success	Failed	Avg latency	P50	P95	Cost/invoice
Nova Lite	242	8	6.60s	6.58s	8.98s	$0.00036
Textract	238	12	41.85s	33.55s	91.61s	$0.01

That puts Nova Lite at roughly 28× cheaper on extraction cost per successful invoice and about 6.3× faster on average latency (41.85s vs. 6.60s).
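
The headline ratios follow directly from the 250-document table. A quick arithmetic check, using only the figures reported there (the per-invoice costs are the rounded values as published):

```python
# Figures copied from the 250-document benchmark table.
nova = {"success": 242, "failed": 8, "avg_latency_s": 6.60, "cost_per_invoice": 0.00036}
textract = {"success": 238, "failed": 12, "avg_latency_s": 41.85, "cost_per_invoice": 0.01}

cost_ratio = textract["cost_per_invoice"] / nova["cost_per_invoice"]
speed_ratio = textract["avg_latency_s"] / nova["avg_latency_s"]


def success_rate(lane: dict) -> float:
    """Share of attempted documents that completed successfully."""
    return lane["success"] / (lane["success"] + lane["failed"])


print(f"cost ratio:  {cost_ratio:.1f}x")    # 27.8x
print(f"speed ratio: {speed_ratio:.1f}x")   # 6.3x
print(f"success: nova {success_rate(nova):.1%}, textract {success_rate(textract):.1%}")
# success: nova 96.8%, textract 95.2%
```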

The failure pattern

On the 250-doc run, Nova Lite had 8 failures vs. Textract’s 12. In both lanes the failures were operational (throttling and runtime behavior), not extraction-quality failures. Among the documents that completed, neither lane produced output that had to be quarantined.

The tight latency profile matters too. Nova Lite’s P95 was 8.98s against Textract’s 91.61s. Not just faster on average, but more predictable. For a production pipeline, latency variance matters as much as mean latency.
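
For readers less familiar with P50 and P95, here is a small sketch of how such figures can be derived from per-document latencies. It uses the nearest-rank method, which is an assumption; the benchmark harness may compute percentiles differently. The latency lists are made up to show a tight and a wide tail, and are not benchmark data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up latencies in seconds: similar medians, very different tails.
tight = [6.1, 6.4, 6.5, 6.6, 6.7, 6.8, 7.0, 7.2, 8.1, 9.0]
wide = [6.0, 6.3, 6.5, 6.6, 6.9, 7.5, 12.0, 30.0, 61.0, 90.0]

print(percentile(tight, 50), percentile(tight, 95))  # 6.7 9.0
print(percentile(wide, 50), percentile(wide, 95))    # 6.9 90.0
```

The two lists have nearly the same median, yet their P95 values differ by a factor of ten, which is why the tail is worth reporting alongside the average.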

The tradeoff

Textract is the operational baseline. Purpose-built for document extraction, integrated into the AWS data platform stack, production-shaped today. If the priority is operational maturity and platform integration, that’s the defensible choice.

Nova Lite is the leading challenger. 28× cheaper, 6× faster, tighter latency distribution, fewer failures on the 250-doc benchmark. It’s no longer speculative — it’s the strongest extraction alternative measured on this platform.

This is exactly the kind of outcome I wanted from an MVP: not a forced winner, but a real, measurable tradeoff that informs the next decision.

Design decisions worth calling out

No VPC. Every service is AWS-managed and reached through its public, IAM-authenticated API endpoint. NAT gateways and VPC endpoints would add cost and complexity with little security benefit for this workload. Scoping decision, not a shortcut.
Decoupled curation. Glue runs on schedule, not inline. Extraction and analytics scale independently; a Glue failure doesn’t block extraction.
Single MVP root. One Terraform composition root, not a premature multi-environment layout. The module structure supports expansion; the MVP doesn’t pay for complexity it doesn’t need yet.
Conservative field inference. The canonical schema preserves raw values alongside normalized ones and returns null rather than guessing. For invoice data, a wrong value is worse than a missing one.
Reusable challenger lane. When the deprecation hit, the architecture absorbed it. That wasn’t accidental.
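
The conservative field inference rule above (keep the raw value, normalize only when the value is unambiguous, otherwise return null) might look roughly like this for a monetary amount. The function name, the accepted formats, and the output shape are assumptions for illustration, not the project's canonical schema.

```python
import re
from decimal import Decimal

# Accepts an optional currency symbol, then either comma-grouped thousands
# ("1,234.50") or plain digits ("1234.50"). Anything else is left unparsed.
_AMOUNT = re.compile(
    r"^\s*[$€£]?\s*(-?\d{1,3}(?:,\d{3})*(?:\.\d+)?|-?\d+(?:\.\d+)?)\s*$"
)


def normalize_amount(raw):
    """Return the raw value plus a normalized Decimal, or None when the format is not recognized."""
    field = {"raw": raw, "normalized": None}
    if not isinstance(raw, str):
        return field
    match = _AMOUNT.match(raw)
    if match is None:
        return field  # unrecognized format: keep the raw text, do not guess
    field["normalized"] = Decimal(match.group(1).replace(",", ""))
    return field


print(normalize_amount("$1,234.50"))  # {'raw': '$1,234.50', 'normalized': Decimal('1234.50')}
print(normalize_amount("1.234,50"))   # {'raw': '1.234,50', 'normalized': None}
```

The second call shows the point of the rule: a European-style amount is ambiguous under this sketch's accepted formats, so the normalized value stays null and the raw text survives for later review.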

On using AI tooling

I used Codex throughout this project for implementation, benchmarking, analysis, and writing. It materially accelerated delivery.

But the work that mattered most was still judgment: what to measure, what to exclude, what constitutes valid evidence, whether a failure was a model problem or a harness problem, and what I could say publicly without overstating the results.

AI tooling makes an engineer faster. It doesn’t make the engineering decisions for you.

Source

GitLab repository — Terraform modules, Lambda functions, Glue jobs, canonical schema, test fixtures
Project wiki — Architecture overview, design decisions, operational runbook, champion/challenger evaluation