sema.cloud
Post
ML / RAG

Building a Multi-Tenant RAG Platform on AWS

I built Tenant Lens to answer a question that kept coming up in my work: what does it actually take to build a RAG system that’s production-shaped — not just a retrieval demo, but a platform with tenant isolation, failure handling, authorization boundaries, and measurable retrieval quality?

The answer turned out to be mostly architecture, not model selection.

The Problem

DevOps and SRE teams accumulate operational knowledge in runbooks, postmortems, design documents, incident reports, and troubleshooting guides. That knowledge is valuable, but it’s fragmented — scattered across file shares, wikis, and team-specific repositories. When an engineer needs an answer at 2 AM during an incident, searching across those sources is the last thing they want to deal with.

Tenant Lens makes that knowledge searchable in plain language. Ask “how do we restart the payment service?” or “who owns the fleet ingestion pipeline?” and get a grounded answer with citations pointing back to the source documents.

The hard part isn’t generating an answer. The hard part is everything around it: getting documents in reliably, keeping tenants isolated, retrieving the right evidence, and making the whole system observable and auditable.

The Platform

Ingestion flow: S3 > SQS > Lambda > Embed + Index > OpenSearch Query flow: API Gateway > Lambda > OpenSearch > Retrieve + Generate > Response

The architecture is fully serverless on AWS:

  • S3 — a single versioned ingestion bucket with event-driven triggers to SQS
  • SQS for ingestion buffering with a dead-letter queue for unprocessable messages
  • Lambda functions for ingestion (starter, worker) and query handling
  • DynamoDB for ingestion status tracking, idempotency state, and content-hash duplicate detection
  • Amazon Bedrock for embeddings (Titan Text Embeddings V2, 1024 dimensions) and answer generation (Claude Sonnet via inference profile)
  • Amazon OpenSearch for per-tenant vector indexing and hybrid retrieval
  • API Gateway with JWT authorization for the query endpoint
  • Cognito for authentication, user management, and tenant-to-group mapping
  • CloudFront with WAF for the React frontend
  • CloudWatch alarms, dashboards, and structured logging for operational visibility

The infrastructure is modular Terraform with a shared S3 state backend, composed through a single MVP root.

Ingestion: Getting Documents In Reliably

The ingestion pipeline is event-driven and fully asynchronous. Documents uploaded to S3 emit object-created events to SQS. A starter Lambda consumes the queue and kicks off processing. The ingestion worker Lambda validates the document, extracts and normalizes text, chunks it, generates embeddings via Bedrock, and indexes the chunks into the tenant’s OpenSearch index.

Idempotency Is Not Optional

This was a first-class design requirement, not an afterthought.

Every document gets a deterministic identity: doc_id derived from tenant ID and S3 key, source_fingerprint from the object’s version ID or content hash. Every chunk gets a deterministic chunk_id from its parent document and ordinal position.

The rules are simple:

  • Same document, same fingerprint — already indexed. Skip it. Record the duplicate delivery in DynamoDB and move on.
  • Same document, new fingerprint — living document update. Delete the old chunks from OpenSearch, index the new ones, update DynamoDB.
  • New document — full ingestion: validate, extract, chunk, embed, index, record.

Without this, re-uploading a batch of documents would create duplicate chunks in the vector store. Retrieval quality would degrade silently. Every RAG system needs this; most demos skip it.

Failure Handling

Failures are tracked, not dropped.

When ingestion fails, the status is recorded in DynamoDB with failure context — what went wrong, at which stage, for which document. Messages that can’t be processed after retries land in the SQS dead-letter queue for triage.

This means you can query DynamoDB to see what failed and why, and you can redrive the DLQ to retry. No silent data loss. Most MVPs skip this entirely.

Multi-Tenancy: Isolation That Isn’t Cosmetic

Tenant isolation was the architectural decision I spent the most time on, and I documented the reasoning in an ADR.

Per-Tenant Indexes

Each tenant gets its own OpenSearch index. Not a shared index with a tenant_id metadata filter — a dedicated index.

This is more expensive to operate. It’s also unambiguously isolated. There’s no retrieval path where a query accidentally crosses a tenant boundary because a filter was misconfigured or missing. The isolation is structural, not conditional.

Authorization Before Retrieval

The query Lambda resolves the user’s tenant scope from their JWT claims (Cognito groups) before the query ever touches OpenSearch. The flow:

  1. Extract Cognito groups from the token
  2. Map groups to allowed tenant IDs
  3. Validate that the requested scope is a subset of the allowed scope
  4. Build the target index list from the validated scope
  5. Execute retrieval only against authorized indexes

You can’t retrieve what you’re not authorized to see. This isn’t post-filtering — it’s pre-scoping. The distinction matters for compliance and for trust.

Power-User Model

The architecture supports cross-tenant queries for elevated users (admins, platform teams). These are modeled as fan-out across multiple authorized indexes with merged results — not as a bypass of the isolation model. Cross-tenant queries are audited separately.

Simple KNN search against embeddings gets you started. It doesn’t get you to useful answers for operational questions. Tenant Lens uses hybrid retrieval with multiple signals:

Vector similarity — KNN search on the embedding vector. This handles semantic matching: “how do we handle payment failures” finds documents about error handling even if they don’t use the word “failure.”

Full-text search — multi-match queries across title, section headings, and content. This catches exact terminology that vector search might miss.

Lexical boosting — the query Lambda extracts key terms from the question and boosts matches in filenames and source keys. If someone asks about “the runbook for fleet ingestion,” the word “fleet” in a filename matters.

Query-time preference inference — the system analyzes the question text and infers which document types and categories are most relevant. A troubleshooting question boosts runbooks. An architecture question boosts design docs. This happens at query time, not at index time.

The result is a ranked set of evidence chunks that feed into the answer generation prompt.

Answer Generation: Grounded, Not Hallucinated

When retrieval returns evidence, the query Lambda calls Claude Sonnet via Bedrock with a prompt that includes the source chunks, their citations, and instructions to ground the answer in the retrieved evidence.

The response includes:

  • A generated answer
  • Citations pointing to specific source documents and sections
  • A list of source evidence with confidence scores and excerpts
  • Operational guidance (next steps the user might take)

When retrieval returns nothing — no relevant evidence for the question — the system refuses to answer. It returns a refusal with guidance on how to refine the question or where to look. No evidence means no answer. That’s a deliberate design choice.

If the answer model itself fails (timeout, throttle), the system falls back to a retrieval-only response: here are the most relevant sources, even though I couldn’t generate a synthesized answer. The user still gets value. The fallback is logged and metered.

Evaluation: Measuring What Matters

I built a ground-truth evaluation framework with 35 test questions per tenant, covering:

  • Ownership questions — “Who owns the payment service?”
  • Procedural questions — “How do we restart the ingestion pipeline?”
  • Historical questions — “What happened during the March outage?”
  • Architectural questions — “How does the fleet tracking system handle GPS data?”

Each question has expected behavior (should the system answer or refuse?) and expected sources (which documents should appear in retrieval?). The evaluation tracks two metrics separately:

  • Behavior pass rate — did the system answer when it should have and refuse when it should have? Currently at 94.29%.
  • Retrieval pass rate — did the right sources appear in the top results? Currently at 48.57%.

The behavior pass rate tells me the answer model is solid. The retrieval pass rate tells me where the active engineering work is. These are very different problems, and measuring them separately keeps the diagnosis clear.

I track changes in a tuning diary: what I adjusted (chunk size, overlap, boost weights, index mappings), what moved, and what didn’t. This is the engineering notebook approach — measure, change one thing, measure again.

Design Decisions Worth Calling Out

Per-tenant indexes over shared indexes. More expensive. Unambiguously isolated. Documented in an ADR with the tradeoff analysis.

Authorization before retrieval. Scope narrowing from JWT claims, not post-retrieval filtering. The query never touches data the user isn’t authorized to see.

DynamoDB status tracking + DLQ over silent failure. Every ingestion attempt is recorded with status and context. Unprocessable messages land in the dead-letter queue. Nothing fails silently.

Conservative answer generation. No evidence means no answer. The system refuses rather than hallucinating. Temperature set to 0.2 for low variance.

Living document support. The same document can be re-ingested with updated content. Old chunks are replaced, not duplicated. The vector store stays clean.

The Corpus

Each tenant in the MVP has its own corpus of 300+ synthetic documents spread across operational document categories: runbooks, postmortems, design documents, SOPs, incident reports, architecture notes, troubleshooting guides, service catalog entries, and oncall handoffs.

The three simulated tenants are:

  • Forgecraft Supply — e-commerce supply chain operations
  • Roadkeep Fleet — fleet management and vehicle tracking
  • Northbeam Creative — creative agency project management

Each corpus was designed to be realistic and domain-specific, with enough depth to test cross-document retrieval, ambiguous questions, and edge cases where multiple tenants have similar but distinct operational knowledge.

On Using AI Tooling in the Build

I used Codex throughout this project for implementation, infrastructure, corpus generation, and evaluation tooling. It materially accelerated delivery — particularly on the ingestion worker, the hybrid retrieval logic, and the ground-truth evaluation framework.

But the decisions that shaped the platform were still judgment calls: the tenant isolation model, the authorization boundary, what to measure, how to handle failures, and where to draw the line between MVP scope and future-state ambition.

AI tooling makes an engineer faster. It doesn’t replace the engineering thinking that determines whether you’re building the right thing.

Source Code and Documentation

The platform, infrastructure, and architecture documentation are all open:

  • GitLab Repository — Terraform modules, Lambda functions, OpenSearch schema, and evaluation tooling
  • Project Wiki — Architecture overview, ADRs, operational runbooks, and tuning diary

The platform is live at tenant-lens.sema.cloud.

Assumptions

  • The current MVP supports markdown and plain-text documents only (PDF and multimodal ingestion is architecturally ready but not implemented)
  • OpenSearch is provisioned (not serverless) as a cost decision for MVP
  • The synthetic corpus is designed for realistic retrieval testing, not production operational use
  • Retrieval pass rate reflects an active tuning frontier, not a finished system
  • All evaluation metrics are from the most recent ground-truth run