Project

Invoice Intelligence Platform

Serverless AWS · Champion/Challenger ML Eval

A four-zone data lake with event-driven extraction. Benchmarked Amazon Textract against Bedrock Nova Lite on the same canonical contract — Nova Lite came in 28× cheaper and 6× faster.

Read Article → Project Wiki → GitLab Repo →

Outcomes

At a glance

28×

cheaper

6×

faster

TF modules

Stack

Built with

AWS
S3
SQS
Step Functions
Lambda
Bedrock
Textract
DynamoDB
Glue
Athena
CloudWatch
Terraform

Detail

Case study

The problem

I wanted to answer a real engineering question: can a foundation model compete with a purpose-built extraction service when both are measured against the same contract, the same corpus, and the same operational standards? Slideware doesn’t answer that. A working platform does.

What I built

A fully serverless invoice extraction MVP on AWS. Common flow:

S3 (raw)  →  SQS  →  Step Functions  →  Extraction  →  DynamoDB  →  Glue  →  Athena

Two extraction lanes share that pipeline:

Textract lane — Lambda → Textract.AnalyzeExpense → DynamoDB (the AWS-native baseline)
Challenger lane — Lambda → Bedrock (Nova Lite) → DynamoDB (foundation-model challenger)

Same input corpus. Same canonical JSON contract. Same downstream pipeline. The only variable is the extraction engine.

Architecture choices worth calling out

Four-zone data lake — raw (source-of-truth), processed (normalized JSON), curated (Parquet for analytics), quarantine (failed docs preserved with metadata for triage and replay). Most MVPs skip quarantine. Mine treats it as load-bearing.
Failure classification — transient errors retry with backoff; schema violations quarantine. This separates “the model couldn’t do the work” from “the runtime wasn’t ready,” which matters when you’re comparing extractors at scale.
Reusable challenger lane — isolated S3 buckets, contender-prefixed DynamoDB keys, a pluggable model ID. Swapping models doesn’t disturb the baseline or pollute prior benchmark data.
Decoupled curation — Glue runs on schedule, not inline after each extraction. A Glue failure can’t block extraction.
No VPC — every service is AWS-managed and publicly accessible. Adding a VPC would have meant NAT gateways, endpoint costs, and complexity with no security benefit for this workload. That’s a scoping decision, not a shortcut.
12 reusable Terraform modules composed through a single MVP root, with a shared S3 state backend.

The deprecation that became a validation

Original challenger: Claude 3 Haiku via Bedrock. Showed promise on a 60-document slice. Then Anthropic announced Claude 3 Haiku was being deprecated on Bedrock.

Neither upgrade was clean: Claude 3.5 Haiku isn’t on Bedrock; Claude 4.5 Haiku is, but slower and ~5× more expensive than Claude 3 Haiku for this workload. So I looked at what else was available and landed on Amazon Nova Lite.

Because the challenger lane was designed as a reusable path with a pluggable model ID, the swap was a configuration change. The S3 buckets, DynamoDB tracking, canonical JSON contract, prompt discipline, and scoring criteria all stayed the same. Only the model changed.

That’s the architecture paying off. A forced migration with zero structural disruption.

What I measured

Two completed, defensible runs informed customer-facing conclusions: a 60-doc comparison and a 250-doc benchmark. Larger 1,500-doc experiments were excluded — they mixed extraction behavior with unfinished runtime work. Choosing what to leave out was as important as choosing what to measure.

250-doc benchmark

Lane	Success	Failed	Avg latency	P50	P95	Cost/invoice
Nova Lite	242	8	6.60s	6.58s	8.98s	$0.00036
Textract	238	12	41.85s	33.55s	91.61s	$0.01

Nova Lite landed at roughly 28× cheaper and 6× faster on the larger slice — and the P95 advantage (8.98s vs. 91.61s) means it’s not just faster on average, it’s more predictable.

The tradeoff

Textract is the operational baseline — purpose-built, integrated, production-shaped today. If the priority is operational maturity, it’s the defensible choice.

Nova Lite is the leading challenger — 28× cheaper, 6× faster, tighter latency distribution, fewer failures on the 250-doc benchmark. It’s no longer speculative.

This is exactly what I wanted from an MVP: not a forced winner, but a real measurable tradeoff that informs the next decision.

What’s next

Textract: keep refining normalization, tune throughput for larger batches.
Nova Lite: inspect the 8 failed docs, hold prompt discipline, broaden the corpus.
Platform: the reusable challenger lane is proven. Future evaluations follow the same path without architectural changes.