sema.cloud
Project

Invoice Intelligence Platform

Serverless AWS · Champion/Challenger ML Eval

A four-zone data lake with event-driven extraction. Benchmarked Amazon Textract against Bedrock Nova Lite on the same canonical contract — Nova Lite came in 28× cheaper and 6× faster.

Outcomes

At a glance

  • 28× cheaper
  • 6× faster
  • 12 Terraform modules
Stack

Built with

  • AWS
  • S3
  • SQS
  • Step Functions
  • Lambda
  • Bedrock
  • Textract
  • DynamoDB
  • Glue
  • Athena
  • CloudWatch
  • Terraform
Detail

Case study

The problem

I wanted to answer a real engineering question: can a foundation model compete with a purpose-built extraction service when both are measured against the same contract, the same corpus, and the same operational standards? Slideware doesn’t answer that. A working platform does.

What I built

A fully serverless invoice extraction MVP on AWS. Common flow:

S3 (raw)  →  SQS  →  Step Functions  →  Extraction  →  DynamoDB  →  Glue  →  Athena
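
As a rough illustration of the first two hops, the sketch below pulls bucket and key pairs out of an SQS-delivered batch. It assumes S3 event notifications land on the queue directly and that a Lambda-style consumer reads them before starting a Step Functions execution. That wiring and the function name are assumptions; the source does not say how the queue hands off to the state machine.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_event(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an SQS-triggered event (hypothetical helper)."""
    objects = []
    for record in event.get("Records", []):
        # Each SQS message body is the S3 notification serialized as JSON.
        body = json.loads(record["body"])
        # S3 sends a one-off "s3:TestEvent" message without "Records"
        # when the notification is first configured; skip it quietly.
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (a space becomes "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects
```

Each returned pair would become the input to one state machine execution, so a poison message can fail on its own without holding up the rest of the batch.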

Two extraction lanes share that pipeline:

  • Textract lane: Lambda → Textract.AnalyzeExpense → DynamoDB (the AWS-native baseline)
  • Challenger lane: Lambda → Bedrock (Nova Lite) → DynamoDB (foundation-model challenger)

Same input corpus. Same canonical JSON contract. Same downstream pipeline. The only variable is the extraction engine.
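
One way to picture "same contract, different engine" is a single validator that every lane must pass before anything is written downstream. This is a minimal sketch, not the project's schema: the field names in REQUIRED_FIELDS, the SchemaViolation exception, and run_lane are all hypothetical, and each extractor is stubbed as a plain callable rather than a real Textract or Bedrock call.

```python
from typing import Callable

# Hypothetical canonical contract; the real field list is not given in the source.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "vendor_name": str,
    "total_amount": float,
    "currency": str,
}


class SchemaViolation(Exception):
    """Raised when an extractor's output does not match the contract."""


def validate_contract(doc: dict) -> dict:
    """Check presence and type of every required field, then return the doc."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            raise SchemaViolation(f"missing field: {name}")
        if not isinstance(doc[name], expected):
            raise SchemaViolation(f"{name} should be {expected.__name__}")
    return doc


# A lane is anything that turns raw document bytes into a candidate dict.
Extractor = Callable[[bytes], dict]


def run_lane(lane: str, extract: Extractor, raw: bytes) -> dict:
    """Run one lane and tag the validated record with the lane name."""
    record = validate_contract(extract(raw))
    return {"lane": lane, **record}
```

Because both lanes go through validate_contract, a scoring script only ever sees one record shape, whichever engine produced it.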

Architecture choices worth calling out

  • Four-zone data lake: raw (source-of-truth), processed (normalized JSON), curated (Parquet for analytics), quarantine (failed docs preserved with metadata for triage and replay). Most MVPs skip quarantine. Mine treats it as load-bearing.
  • Failure classification — transient errors retry with backoff; schema violations quarantine. This separates “the model couldn’t do the work” from “the runtime wasn’t ready,” which matters when you’re comparing extractors at scale.
  • Reusable challenger lane — isolated S3 buckets, contender-prefixed DynamoDB keys, a pluggable model ID. Swapping models doesn’t disturb the baseline or pollute prior benchmark data.
  • Decoupled curation — Glue runs on schedule, not inline after each extraction. A Glue failure can’t block extraction.
  • No VPC: every service is AWS-managed and reached through its public AWS API endpoint. Adding a VPC would have meant NAT gateways, endpoint costs, and complexity with no security benefit for this workload. That’s a scoping decision, not a shortcut.
  • 12 reusable Terraform modules composed through a single MVP root, with a shared S3 state backend.
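
The failure-classification rule in the list above can be sketched in a few lines. This is an illustration under assumptions, not the platform's code: in practice the retry policy may well live in the Step Functions state machine rather than in Python, the exception names are hypothetical, and quarantining a document once retries run out (tagged as a runtime failure) is an assumed policy the source does not spell out.

```python
import time


class TransientError(Exception):
    """The runtime was not ready: throttling, timeouts, and similar."""


class SchemaViolation(Exception):
    """The extractor answered, but the output broke the canonical contract."""


def process(doc_id, extract, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff; quarantine bad output."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = extract(doc_id)
            return {"doc_id": doc_id, "zone": "processed",
                    "attempts": attempt, "result": result}
        except SchemaViolation as exc:
            # Retrying will not fix a bad extraction, so quarantine at once.
            return {"doc_id": doc_id, "zone": "quarantine",
                    "failure_class": "extraction",
                    "attempts": attempt, "reason": str(exc)}
        except TransientError as exc:
            if attempt == max_attempts:
                # Assumed policy: keep the doc for replay, but label it as a
                # runtime failure so it never counts against the model.
                return {"doc_id": doc_id, "zone": "quarantine",
                        "failure_class": "runtime",
                        "attempts": attempt, "reason": str(exc)}
            # Backoff doubles each time: 0.5s, 1.0s, 2.0s, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Keeping failure_class on the quarantine record is what lets a benchmark count only extraction failures when comparing engines.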

The deprecation that became a validation

Original challenger: Claude 3 Haiku via Bedrock. Showed promise on a 60-document slice. Then Anthropic announced Claude 3 Haiku was being deprecated on Bedrock.

Neither upgrade was clean: Claude 3.5 Haiku isn’t on Bedrock; Claude 4.5 Haiku is, but slower and ~5× more expensive than Claude 3 Haiku for this workload. So I looked at what else was available and landed on Amazon Nova Lite.

Because the challenger lane was designed as a reusable path with a pluggable model ID, the swap was a configuration change. The S3 buckets, DynamoDB tracking, canonical JSON contract, prompt discipline, and scoring criteria all stayed the same. Only the model changed.
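
A sketch of what "the swap was a configuration change" could look like in code. The environment variable names, the ChallengerConfig type, the bucket names, and the "contender#doc" key format are all hypothetical. The two model IDs are given as example values and should be checked against the Bedrock model catalog before use.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChallengerConfig:
    contender: str   # short label used to prefix tracking keys
    model_id: str    # Bedrock model identifier passed to the extraction call
    bucket: str      # isolated S3 bucket for this contender's output


def load_config(env: Mapping[str, str]) -> ChallengerConfig:
    """Build the lane config from environment-style settings (names assumed)."""
    return ChallengerConfig(
        contender=env["CHALLENGER_NAME"],
        model_id=env["CHALLENGER_MODEL_ID"],
        bucket=env["CHALLENGER_BUCKET"],
    )


def tracking_key(cfg: ChallengerConfig, doc_id: str) -> str:
    """Contender-prefixed key, so runs from different models never collide."""
    return f"{cfg.contender}#{doc_id}"


# Before and after the migration: only the settings differ, not the code.
old = load_config({
    "CHALLENGER_NAME": "claude3-haiku",
    "CHALLENGER_MODEL_ID": "anthropic.claude-3-haiku-20240307-v1:0",
    "CHALLENGER_BUCKET": "invoices-challenger-claude3-haiku",
})
new = load_config({
    "CHALLENGER_NAME": "nova-lite",
    "CHALLENGER_MODEL_ID": "amazon.nova-lite-v1:0",
    "CHALLENGER_BUCKET": "invoices-challenger-nova-lite",
})
```

In a Lambda, load_config(os.environ) would pick these up at cold start. Because the key prefix changes with the contender, earlier benchmark rows stay intact alongside the new ones.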

That’s the architecture paying off. A forced migration with zero structural disruption.

What I measured

Two completed, defensible runs informed customer-facing conclusions: a 60-doc comparison and a 250-doc benchmark. Larger 1,500-doc experiments were excluded — they mixed extraction behavior with unfinished runtime work. Choosing what to leave out was as important as choosing what to measure.

250-doc benchmark

Lane        Success   Failed   Avg latency   P50      P95      Cost/invoice
Nova Lite   242       8        6.60s         6.58s    8.98s    $0.00036
Textract    238       12       41.85s        33.55s   91.61s   $0.01

Nova Lite landed at roughly 28× cheaper and 6× faster on the larger slice — and the P95 advantage (8.98s vs. 91.61s) means it’s not just faster on average, it’s more predictable.
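
The headline ratios follow directly from the benchmark figures. The short check below recomputes them, along with the P95 ratio and per-lane success rates. Only the numbers come from the benchmark; the dictionary layout and helper names are illustrative.

```python
# Figures copied from the 250-doc benchmark table.
runs = {
    "nova_lite": {"success": 242, "failed": 8, "avg_s": 6.60,
                  "p95_s": 8.98, "cost_usd": 0.00036},
    "textract": {"success": 238, "failed": 12, "avg_s": 41.85,
                 "p95_s": 91.61, "cost_usd": 0.01},
}


def ratio(metric: str) -> float:
    """How many times larger the Textract figure is than the Nova Lite one."""
    return runs["textract"][metric] / runs["nova_lite"][metric]


def success_rate(lane: str) -> float:
    r = runs[lane]
    return r["success"] / (r["success"] + r["failed"])


print(f"cost ratio:        {ratio('cost_usd'):.1f}x")   # 27.8x
print(f"avg latency ratio: {ratio('avg_s'):.1f}x")      # 6.3x
print(f"P95 latency ratio: {ratio('p95_s'):.1f}x")      # 10.2x
print(f"Nova Lite success: {success_rate('nova_lite'):.1%}")  # 96.8%
print(f"Textract success:  {success_rate('textract'):.1%}")   # 95.2%
```

The gap is wider at the tail than at the mean, about 10× at P95 against about 6× on average, which is the arithmetic behind the predictability point.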

The tradeoff

Textract is the operational baseline — purpose-built, integrated, production-shaped today. If the priority is operational maturity, it’s the defensible choice.

Nova Lite is the leading challenger — 28× cheaper, 6× faster, tighter latency distribution, fewer failures on the 250-doc benchmark. It’s no longer speculative.

This is exactly what I wanted from an MVP: not a forced winner, but a real measurable tradeoff that informs the next decision.

What’s next

  • Textract: keep refining normalization, tune throughput for larger batches.
  • Nova Lite: inspect the 8 failed docs, hold prompt discipline, broaden the corpus.
  • Platform: the reusable challenger lane is proven. Future evaluations follow the same path without architectural changes.