sema.cloud

RAG Retrieval Tuning

One of the more useful lessons from building Tenant Lens was that the language model was not the primary bottleneck. Tenant Lens is a multi-tenant GenAI support system for DevOps and SRE teams. It ingests operational documents, indexes them by tenant, retrieves relevant evidence, and asks a language model to produce grounded answers with citations.

In our first full ground-truth run, answer behavior passed at 94.29%, but retrieval passed at only 48.57%. The model usually behaved reasonably when it had the right evidence. The system failed when it retrieved plausible but wrong evidence. After one focused retrieval tuning pass, the overall pass rate moved from 48.57% to 94.29%, and retrieval passed at 97.14%.

That improvement did not come from swapping the answer model. It came from treating retrieval as an engineering problem.

Tenant Lens overall pass rate improved from 48.57 percent to 94.29 percent after retrieval tuning.

Tenant Lens did not get better because we changed the LLM. It got better because the right evidence started reaching the LLM.

The Pattern In Early RAG Systems

Many early retrieval-augmented generation systems start with the same basic pattern:

  • Ingest documents
  • Split them into chunks
  • Generate embeddings
  • Store vectors
  • Run similarity search
  • Hand the top results to the model

That pattern is a reasonable starting point. It is not a complete retrieval strategy.
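For reference, the basic pattern fits in a few lines. The sketch below is a toy, not Tenant Lens code: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), and a plain list stands in for the vector store.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, max_words: int = 40) -> list[str]:
    """Naive fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Ingest, chunk, embed, and store every document."""
    index = []
    for doc_id, text in documents.items():
        for chunk in split_into_chunks(text):
            index.append((doc_id, chunk, toy_embed(chunk)))
    return index


def top_k(index: list[tuple[str, str, Counter]], question: str, k: int = 3) -> list[tuple[str, str]]:
    """Similarity search: return the k chunks closest to the question."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, chunk) for doc_id, chunk, _ in ranked[:k]]
```

Nothing in this loop knows what kind of document a chunk came from or whether an exact identifier matched, which is the gap the rest of this post is about.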

Modern language models are already capable at summarization, synthesis, and conversational response generation once they are given the right evidence. The harder problem is often getting the system to retrieve the correct evidence in the first place.

That sounds obvious, but it is easy to miss because weak retrieval can still produce fluent answers. A model can generate a polished response from incomplete context. It can sound useful while being grounded in the wrong document. In production RAG systems, that distinction matters.

A RAG pipeline showing documents moving through ingestion, embeddings, hybrid retrieval, generation, and citation.

Garbage In, Garbage Retrieval

A surprising number of RAG issues begin long before generation. They start during ingestion.

If the ingestion pipeline produces inconsistent chunking, weak metadata, poor document normalization, incomplete provenance, or noisy embeddings, the retrieval layer inherits those problems until the corpus is re-ingested. The language model cannot compensate for operational context that never made it into the index.

This becomes especially visible in enterprise environments where documents contain exact identifiers, operational procedures, service ownership, configuration values, incident references, compliance artifacts, vendor terminology, and organizational shorthand.

Semantic similarity alone is often insufficient. A retrieval system may find documents that are conceptually related while still missing the exact operational evidence required to answer correctly. In many environments, that difference is the difference between a useful copilot and a risky one.

Embeddings Are Powerful, But Incomplete

Embeddings are extremely effective for semantic recall. An embedding model converts text into a vector, which is a list of numbers that represents the text in a way that preserves useful semantic relationships. Text with similar meaning tends to land near other text with similar meaning in vector space.

That is why embeddings work well for questions where the user’s wording differs from the source document. In an operational corpus, phrases like payment failures, checkout degradation, Stripe outage, and transaction retries may all be related. A good embedding model can recognize those as similar operational themes even when the words are not identical.

Tenant Lens uses Amazon Titan Text Embeddings V2 for document chunks and user questions. The OpenSearch index stores each chunk’s embedding in a 1024-dimension vector field. At query time, the query Lambda embeds the user’s question and searches for nearby chunks.
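A minimal sketch of what that path can look like, assuming a boto3 `bedrock-runtime` client is passed in and the index uses the OpenSearch k-NN plugin. The function names are illustrative, not taken from the Tenant Lens codebase; the field name `embedding_vector` and the 1024 dimension come from the description above.

```python
import json

TITAN_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDING_DIM = 1024


def embed_text(bedrock_runtime, text: str) -> list[float]:
    """Embed text with Titan Text Embeddings V2.

    `bedrock_runtime` is assumed to be a boto3 "bedrock-runtime" client.
    """
    body = json.dumps({"inputText": text, "dimensions": EMBEDDING_DIM, "normalize": True})
    response = bedrock_runtime.invoke_model(modelId=TITAN_MODEL_ID, body=body)
    return json.loads(response["body"].read())["embedding"]


def knn_index_body() -> dict:
    """Index settings and mapping for the vector field (other fields omitted)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding_vector": {"type": "knn_vector", "dimension": EMBEDDING_DIM}
            }
        },
    }


def knn_query(question_vector: list[float], k: int = 5) -> dict:
    """Vector-only search body: the k chunks nearest to the question embedding."""
    return {
        "size": k,
        "query": {"knn": {"embedding_vector": {"vector": question_vector, "k": k}}},
    }
```

The same `embed_text` call serves both ingestion and query time, which keeps chunk vectors and question vectors in the same space.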

That capability is foundational to modern RAG systems. But embeddings are not magic.

Operational and enterprise content frequently depends on lexical precision: incident IDs, hostnames, filenames, service names, ticket numbers, error signatures, configuration keys, and procedural document names. Pure vector retrieval often struggles to distinguish nearby context from the correct source. It may understand that a question is about incidents while missing the specific postmortem. It may understand that a question is about ownership while failing to prefer the service catalog. It may retrieve a generally relevant runbook while missing the SOP that actually answers the procedure question.
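One way to give lexical precision something to act on is to pull exact identifiers out of the question before retrieval. The sketch below is illustrative only: the patterns are hypothetical (the `INC-1234` incident format, for instance, is an assumption, not a Tenant Lens convention), and a real corpus would need its own.

```python
import re

# Hypothetical identifier formats; adjust to the conventions of the actual corpus.
IDENTIFIER_PATTERNS = {
    "incident_id": re.compile(r"\bINC-\d+\b"),
    "filename": re.compile(r"\b[\w-]+\.(?:md|txt|pdf|yaml|yml|json)\b"),
    "config_key": re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b"),
}


def extract_identifiers(question: str) -> dict[str, list[str]]:
    """Return the exact identifiers found in a question, grouped by kind."""
    found: dict[str, list[str]] = {}
    for kind, pattern in IDENTIFIER_PATTERNS.items():
        matches = pattern.findall(question)
        if matches:
            found[kind] = matches
    return found
```

Anything this returns is a candidate for exact matching against fields such as the filename, where "nearby in vector space" is not good enough.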

That is why production retrieval systems usually evolve toward hybrid retrieval approaches that combine vector similarity, lexical search, metadata-aware ranking, category-aware boosting, and provenance-aware filtering. The most successful systems rarely rely on embeddings alone.

A layered retrieval signals stack showing embeddings surrounded by lexical precision, category boosts, source provenance, and evaluation.

Metadata Quality Matters More Than Most Teams Expect

One of the most overlooked aspects of RAG design is metadata preservation. If the index stores only chunk text and embeddings, the retrieval layer has very little operational structure available. It can search for related language, but it cannot easily reason about document category, ownership, source lineage, tenant scope, or whether a chunk came from a runbook, postmortem, service catalog, or SOP.

High-quality retrieval systems usually preserve source provenance, document category, section hierarchy, ownership context, timestamps, tenant or organizational scope, operational classification, and stable identifiers.

Tenant Lens indexes each chunk as its own OpenSearch document and preserves fields such as:

  • tenant_id
  • doc_id
  • chunk_id
  • source_key
  • filename
  • document_category
  • document_type
  • title
  • section_title
  • heading_path
  • content
  • embedding_vector

The canonical source key follows this shape:

{tenant_id}/{document_category}/{filename}

That structure matters. It gives the retrieval layer a stable category signal, gives citations enough provenance to point back to source material, and gives the evaluator enough information to verify whether the expected evidence was retrieved.
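Because the key has a fixed shape, building and parsing it is mechanical. The helper below is a sketch under two assumptions: the key always has exactly three slash-separated parts, and only the filename part may itself contain a slash. The class and function names are illustrative, and the tenant `acme` in the usage line is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceKey:
    tenant_id: str
    document_category: str
    filename: str

    def __str__(self) -> str:
        return f"{self.tenant_id}/{self.document_category}/{self.filename}"


def parse_source_key(source_key: str) -> SourceKey:
    """Split a canonical source key into tenant, category, and filename."""
    parts = source_key.split("/", 2)  # at most three parts
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a canonical source key: {source_key!r}")
    return SourceKey(*parts)
```

For example, `parse_source_key("acme/postmortems/checkout-outage.md").document_category` gives the retrieval layer and the evaluator the same category signal without a second lookup.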

Good metadata lets retrieval behave less like searching generic text and more like querying operational records.

Chunking Is Not A Cosmetic Decision

Chunking strategy directly impacts retrieval behavior. Chunks that are too small lose context. Chunks that are too large dilute relevance scoring and introduce noisy retrieval. Poor chunk boundaries can separate procedures from prerequisites, detach incidents from remediation steps, or split operational context across unrelated embeddings.

In technical corpora, chunking often needs to align with document hierarchy, operational procedures, section semantics, runbook flow, and troubleshooting sequences. This is not just a formatting concern. It materially affects retrieval accuracy.

For Tenant Lens, the important design decision was not just “split the document.” It was to preserve enough source structure around each chunk so retrieval, citations, and evaluation could still understand what the chunk represented.
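As an illustration of structure-preserving chunking, the sketch below splits a document on headings and carries the heading trail along with each chunk. It assumes Markdown-style `#` headings, which is an assumption about the source format, and it is not the Tenant Lens chunker; the output field names mirror the index fields listed earlier.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    title: str
    section_title: str
    heading_path: str
    content: str


def chunk_by_headings(markdown_text: str) -> list[Chunk]:
    """Emit one chunk per section, keeping the heading trail that leads to it."""
    chunks: list[Chunk] = []
    path: list[str] = []  # current heading stack, outermost first
    body: list[str] = []  # lines collected for the current section

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append(Chunk(
                title=path[0] if path else "",
                section_title=path[-1] if path else "",
                heading_path=" > ".join(path),
                content=content,
            ))
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading trail
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks
```

A chunk that says only "Promote the replica." is hard to rank or cite on its own; with a heading path such as "Checkout Runbook > Failover > Steps" attached, both the lexical side of retrieval and the citation have something to work with.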

Evaluation Changes Everything

The biggest shift in RAG engineering happens when teams begin evaluating retrieval quality separately from answer quality. Without evaluation, systems often appear better than they are because modern models are good at producing plausible responses from incomplete evidence. That can hide weak retrieval behavior for a long time.

Tenant Lens uses a ground-truth evaluation set with expected source evidence. The evaluator scores:

  • behavior_pass: Did the system answer or refuse appropriately?
  • retrieval_pass: Did the retrieved sources match the expected evidence?
  • overall_pass: Did both behavior and retrieval pass?
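A per-case scorer along these lines is small. The sketch below makes assumptions the description above does not settle: a case passes retrieval if at least one expected source appears among the retrieved sources, and a case with no expected sources (an expected refusal) passes retrieval trivially. The class and field names other than the three score names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    question: str
    expect_refusal: bool
    expected_sources: set[str] = field(default_factory=set)


@dataclass
class SystemResult:
    refused: bool
    retrieved_sources: list[str]


def score_case(case: EvalCase, result: SystemResult) -> dict[str, bool]:
    """Score behavior and retrieval independently, then combine them."""
    behavior_pass = result.refused == case.expect_refusal
    if case.expected_sources:
        # Assumption: one matching expected source is enough to pass.
        retrieval_pass = bool(case.expected_sources & set(result.retrieved_sources))
    else:
        # Assumption: nothing was expected, so there is nothing to miss.
        retrieval_pass = True
    return {
        "behavior_pass": behavior_pass,
        "retrieval_pass": retrieval_pass,
        "overall_pass": behavior_pass and retrieval_pass,
    }
```

The important property is that a fluent answer built on the wrong document scores `behavior_pass=True` and `retrieval_pass=False`, so it cannot hide inside a single blended score.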

The baseline result made the problem obvious:

  • Questions: 35
  • Overall Pass: 17
  • Behavior Pass: 33
  • Retrieval Pass: 17
  • Overall Pass Rate: 48.57%
  • Behavior Pass Rate: 94.29%
  • Retrieval Pass Rate: 48.57%

The model was not the main issue. Retrieval was.

Once retrieval was measured independently, the failure patterns became clear:

  • Ownership questions missed service-catalog documents
  • Procedural questions missed knowledge-base and sops documents
  • Historical questions missed incidents and postmortems
  • Known-error questions missed known-errors documents
  • Troubleshooting questions sometimes preferred generic operational context over targeted support material

That changed where engineering effort went.

What We Changed In Tenant Lens

The tuning pass had four practical parts.

First, we kept tenant scope hard-bounded. Tenant Lens uses one OpenSearch index per tenant, and the query Lambda resolves the caller’s allowed tenant scope before retrieval. The system does not retrieve broadly and filter later.
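The shape of that check can be sketched in a few lines. The index naming scheme and the exception type here are hypothetical; the point is only that the index is chosen from the caller's resolved scope before any search body is built.

```python
class TenantScopeError(PermissionError):
    """Raised when a caller asks for a tenant outside their allowed scope."""


def index_for_tenant(tenant_id: str) -> str:
    # Hypothetical naming scheme for the per-tenant index.
    return f"tenant-lens-{tenant_id}"


def resolve_search_index(requested_tenant: str, allowed_tenants: set[str]) -> str:
    """Pick the index to search, refusing anything outside the caller's scope."""
    if requested_tenant not in allowed_tenants:
        raise TenantScopeError(f"tenant {requested_tenant!r} is not in the caller's scope")
    return index_for_tenant(requested_tenant)
```

With one index per tenant, a query that passes this check cannot return another tenant's chunks, because those chunks are not in the index being searched.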

Second, we moved from vector-only retrieval to hybrid retrieval. The tuned query path combines vector search over embedding_vector, full-text matching across fields such as title, section_title, heading_path, and content, and lexical matching against source_key and filename.
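One plausible shape for such a query body in OpenSearch is a `bool` query whose `should` clauses cover the three signals. This is a sketch, not the Tenant Lens query: the boost values are invented for illustration, and how the production path combines scores is not described here.

```python
def hybrid_query(question: str, question_vector: list[float], k: int = 8) -> dict:
    """Combine vector, full-text, and source-identifier matching in one request."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # Semantic recall over the chunk embedding.
                    {"knn": {"embedding_vector": {"vector": question_vector, "k": k}}},
                    # Full-text matching over structural and body fields.
                    {"multi_match": {
                        "query": question,
                        "fields": ["title^2", "section_title^2", "heading_path", "content"],
                    }},
                    # Lexical matching against provenance fields (boosts are illustrative).
                    {"match": {"source_key": {"query": question, "boost": 1.5}}},
                    {"match": {"filename": {"query": question, "boost": 1.5}}},
                ],
                "minimum_should_match": 1,
            }
        },
    }
```

One caveat with this shape: k-NN scores and BM25 scores are on different scales, so summing them inside a `bool` query needs tuning against an evaluation set. Recent OpenSearch versions also offer a dedicated `hybrid` query with score normalization, which may be a better fit depending on the deployment.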

Third, we added category-aware boosting. The query path infers lightweight retrieval preferences from the user’s question:

  • Ownership questions boost service-catalog
  • Procedural questions boost knowledge-base and sops
  • Historical questions boost incidents and postmortems
  • Known-error questions boost known-errors and troubleshooting
  • Operational readiness questions boost sops, runbooks, and oncall-handoffs
  • Troubleshooting questions boost troubleshooting, runbooks, and known-errors
  • Architecture questions boost architecture and service-catalog
  • Runbook, outage, and failover questions boost runbooks
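A lightweight version of that inference can be a table of trigger phrases. In the sketch below the category names come from the list above, but the trigger phrases, the boost value, and the assumption that `document_category` is an exact-match keyword field are all illustrative; only a subset of the rules is shown.

```python
# Category names follow the list above; the trigger phrases are hypothetical.
CATEGORY_RULES: list[tuple[tuple[str, ...], tuple[str, ...]]] = [
    (("who owns", "owner", "ownership"), ("service-catalog",)),
    (("how do i", "procedure", "steps to"), ("knowledge-base", "sops")),
    (("last time", "what happened", "postmortem"), ("incidents", "postmortems")),
    (("known error", "known issue"), ("known-errors", "troubleshooting")),
    (("troubleshoot", "debug", "not working"), ("troubleshooting", "runbooks", "known-errors")),
    (("architecture", "depends on"), ("architecture", "service-catalog")),
]


def infer_boosted_categories(question: str) -> list[str]:
    """Return the categories to boost, in rule order, without duplicates."""
    text = question.lower()
    categories: list[str] = []
    for phrases, boosted in CATEGORY_RULES:
        if any(phrase in text for phrase in phrases):
            for category in boosted:
                if category not in categories:
                    categories.append(category)
    return categories


def category_boost_clauses(categories: list[str], boost: float = 2.0) -> list[dict]:
    """Build OpenSearch `term` clauses that raise the score of matching categories."""
    return [
        {"term": {"document_category": {"value": category, "boost": boost}}}
        for category in categories
    ]
```

Added as extra `should` clauses, these boosts change ranking without excluding anything: a question that matches no rule still gets plain hybrid retrieval, and a misclassified question can still surface the right document from another category.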

Fourth, we returned richer source identifiers. That improved citation quality, debugging clarity, and evaluation fidelity.

None of this replaced embeddings. The system still uses embeddings for semantic recall. The tuning added lexical precision, metadata structure, and domain-aware ranking around the embeddings.

The Result

After the first tuning pass, the same full evaluation set produced:

  • Questions: 35
  • Overall Pass: 33
  • Behavior Pass: 34
  • Retrieval Pass: 34
  • Overall Pass Rate: 94.29%
  • Behavior Pass Rate: 97.14%
  • Retrieval Pass Rate: 97.14%
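The percentages follow directly from the counts. The short check below uses only the numbers reported in the baseline and tuned runs; the helper name is illustrative.

```python
QUESTIONS = 35

# Pass counts reported for the two runs.
baseline = {"overall": 17, "behavior": 33, "retrieval": 17}
tuned = {"overall": 33, "behavior": 34, "retrieval": 34}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100 * passed / total, 2)


for metric in baseline:
    before = pass_rate(baseline[metric], QUESTIONS)
    after = pass_rate(tuned[metric], QUESTIONS)
    print(f"{metric}: {before}% -> {after}%")
```

The counts also say something the percentages hide. In the baseline, overall and retrieval both passed 17 cases, so every case that passed retrieval also passed behavior. In the tuned run, one behavior failure and one retrieval failure add up to exactly the two overall failures, so they must be two different questions.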

The largest improvements came from the categories that had been weak:

  • Ownership questions started retrieving service catalog entries correctly
  • Procedural questions started retrieving knowledge-base and SOP documents reliably
  • Historical questions started retrieving incidents and postmortems
  • Known-error questions started retrieving known-error documents
  • Source identifiers became clearer for citations and evaluation

Two evaluation cases still failed after the pass. That is worth saying explicitly. A 94.29% result does not mean the system is done. It means the retrieval layer became strong enough for the current MVP and evaluation set, with remaining gaps narrow enough to tune later.

That is a healthier claim than “the RAG works.” It says what was measured, what changed, and what remains.

Retrieval Is An Engineering Discipline

A lot of AI discussion still focuses primarily on models. But in production RAG systems, retrieval engineering often has a larger impact on quality than swapping foundation models.

Better ingestion pipelines, cleaner metadata, stronger provenance, improved chunking, hybrid retrieval strategies, and measurable evaluation frameworks frequently produce larger gains than changing the LLM itself. That is especially true in operational and enterprise environments where correctness matters more than conversational fluency.

The retrieval layer is not just plumbing around the model. It determines whether generation has a chance to be correct.

Practical Takeaways

  • Start with evaluation before tuning. If you do not have a ground-truth set, you are guessing.
  • Score retrieval separately from answer quality. A model can produce a reasonable answer from weak evidence.
  • Treat ingestion quality as retrieval quality. Chunking, normalization, provenance, and metadata all affect what retrieval can do later.
  • Use embeddings for semantic recall, not as the entire retrieval strategy.
  • Use hybrid retrieval by default for technical and operational content.
  • Preserve metadata as first-class index data, not as an afterthought.
  • Treat tenant, user, or organizational scope as part of retrieval, not just authorization.
  • Make refusal a valid success case when the corpus does not contain enough evidence.
  • Do not treat retrieval scores as answer confidence. A retrieval score is a ranking signal, not proof that the generated answer is correct.

Closing Thought

RAG quality is often retrieval quality wearing a generation interface.

Embeddings are a powerful foundation because they let a system search by meaning. But production-grade RAG, especially for operational and enterprise domains, usually needs more structure: metadata, lexical matching, tenant-aware scope, category-aware ranking, citations, refusals, and a repeatable evaluation harness.

The Tenant Lens overall pass rate improved from 48.57% to 94.29% because we stopped treating retrieval as “find semantically nearby chunks” and started treating it as an engineered evidence-selection problem.