Invoice Intelligence Platform
A four-zone data lake with event-driven extraction. Benchmarked Amazon Textract against Bedrock Nova Lite on the same canonical contract — Nova Lite came in 28× cheaper and 6× faster.
At a glance
Built with
- AWS
- S3
- SQS
- Step Functions
- Lambda
- Bedrock
- Textract
- DynamoDB
- Glue
- Athena
- CloudWatch
- Terraform
Case study
The problem
I wanted to answer a real engineering question: can a foundation model compete with a purpose-built extraction service when both are measured against the same contract, the same corpus, and the same operational standards? Slideware doesn’t answer that. A working platform does.
What I built
A fully serverless invoice extraction MVP on AWS. Common flow:
S3 (raw) → SQS → Step Functions → Extraction → DynamoDB → Glue → Athena
Two extraction lanes share that pipeline:
- Textract lane —
Lambda → Textract.AnalyzeExpense → DynamoDB(the AWS-native baseline) - Challenger lane —
Lambda → Bedrock (Nova Lite) → DynamoDB(foundation-model challenger)
Same input corpus. Same canonical JSON contract. Same downstream pipeline. The only variable is the extraction engine.
Architecture choices worth calling out
- Four-zone data lake —
raw(source-of-truth),processed(normalized JSON),curated(Parquet for analytics),quarantine(failed docs preserved with metadata for triage and replay). Most MVPs skip quarantine. Mine treats it as load-bearing. - Failure classification — transient errors retry with backoff; schema violations quarantine. This separates “the model couldn’t do the work” from “the runtime wasn’t ready,” which matters when you’re comparing extractors at scale.
- Reusable challenger lane — isolated S3 buckets, contender-prefixed DynamoDB keys, a pluggable model ID. Swapping models doesn’t disturb the baseline or pollute prior benchmark data.
- Decoupled curation — Glue runs on schedule, not inline after each extraction. A Glue failure can’t block extraction.
- No VPC — every service is AWS-managed and publicly accessible. Adding a VPC would have meant NAT gateways, endpoint costs, and complexity with no security benefit for this workload. That’s a scoping decision, not a shortcut.
- 12 reusable Terraform modules composed through a single MVP root, with a shared S3 state backend.
The deprecation that became a validation
Original challenger: Claude 3 Haiku via Bedrock. Showed promise on a 60-document slice. Then Anthropic announced Claude 3 Haiku was being deprecated on Bedrock.
Neither upgrade was clean: Claude 3.5 Haiku isn’t on Bedrock; Claude 4.5 Haiku is, but slower and ~5× more expensive than Claude 3 Haiku for this workload. So I looked at what else was available and landed on Amazon Nova Lite.
Because the challenger lane was designed as a reusable path with a pluggable model ID, the swap was a configuration change. The S3 buckets, DynamoDB tracking, canonical JSON contract, prompt discipline, and scoring criteria all stayed the same. Only the model changed.
That’s the architecture paying off. A forced migration with zero structural disruption.
What I measured
Two completed, defensible runs informed customer-facing conclusions: a 60-doc comparison and a 250-doc benchmark. Larger 1,500-doc experiments were excluded — they mixed extraction behavior with unfinished runtime work. Choosing what to leave out was as important as choosing what to measure.
250-doc benchmark
| Lane | Success | Failed | Avg latency | P50 | P95 | Cost/invoice |
|---|---|---|---|---|---|---|
| Nova Lite | 242 | 8 | 6.60s | 6.58s | 8.98s | $0.00036 |
| Textract | 238 | 12 | 41.85s | 33.55s | 91.61s | $0.01 |
Nova Lite landed at roughly 28× cheaper and 6× faster on the larger slice — and the P95 advantage (8.98s vs. 91.61s) means it’s not just faster on average, it’s more predictable.
The tradeoff
Textract is the operational baseline — purpose-built, integrated, production-shaped today. If the priority is operational maturity, it’s the defensible choice.
Nova Lite is the leading challenger — 28× cheaper, 6× faster, tighter latency distribution, fewer failures on the 250-doc benchmark. It’s no longer speculative.
This is exactly what I wanted from an MVP: not a forced winner, but a real measurable tradeoff that informs the next decision.
What’s next
- Textract: keep refining normalization, tune throughput for larger batches.
- Nova Lite: inspect the 8 failed docs, hold prompt discipline, broaden the corpus.
- Platform: the reusable challenger lane is proven. Future evaluations follow the same path without architectural changes.