May 13, 2026 AI Engineering

Shiny AI Demos Need Promotion Gates

LLMs, RAG, and agentic AI are the new shiny things. Every week brings a new model, framework, benchmark, or autonomous workflow pattern. The demos are impressive, and in many cases the business value is real.

But the question teams eventually run into is more uncomfortable:

At what cost?

Not just the invoice from a model provider. The operational cost. The latency cost. The retry cost. The cloud infrastructure cost. The governance cost. The cost of debugging an agent that made six tool calls, retrieved stale context, retried into a rate limit, and still gave an answer nobody can safely promote.

One of the more useful lessons from building RAG and agentic systems is that a working demo is not the same thing as a promotable workflow. A demo proves the system can work once. A promotion gate asks whether it is accurate, grounded, observable, safe, reliable, and cost-controlled enough to move forward.

That is the gap AgentOps Lab is meant to explore.

AgentOps Lab is a local-first evaluation, observability, reliability, and cost-governance workbench for AI agents and RAG workflows. The current M0 slice is deliberately small: a local CLI runs a Basic RAG workflow against a synthetic incident-response corpus, emits a JSON report, validates trace events, scores the output, and enforces a production-readiness gate.

In the current local run, the Harborlight incident-response suite has:

Cases: 13
Expected responses: 13
Cases passed: 13
Gate checks passed: 17 / 17
Red-team cases: 6
Trace events emitted: 53
Report schema: 0.1.0

Those numbers are not a claim that the broader platform is done. The API, web console, database persistence, cloud deployment, and richer observability exports are still target-state work. The useful part of the M0 slice is narrower and more important: it proves the promotion contract can exist locally before the platform grows around it.

[Figure: an evaluation suite runs against a workflow and produces traces, scores, reports, and a promotion gate decision.]

The point is not to make AI adoption slower. The point is to make it measurable enough to scale responsibly.

The Pattern In Early Agentic Systems

Many early AI systems follow a familiar path.

First, someone builds a useful prompt. Then the prompt becomes a small RAG workflow. Then the workflow gets tools. Then it gets retries, routing, memory, more context, a more capable model, and a few special-case instructions for known failure modes.

That evolution is natural. It is also where operational risk starts to hide.

The system may still produce good answers in a demo. But underneath the answer, several things may have changed:

The prompt got longer.
Retrieved context grew.
Tool calls multiplied.
Retry behavior became less predictable.
Latency increased.
The model became more expensive.
Refusal behavior became inconsistent.
Citations looked plausible but pointed at weaker evidence.
A small workflow became an agent loop without a budget.

These problems do not always show up as an obviously wrong answer. That is what makes them dangerous.

A workflow can be accurate and still too expensive to run at scale. It can be helpful and still unsafe under prompt injection. It can cite sources and still cite the wrong ones. It can pass a happy-path demo and fail when the corpus is missing evidence. It can look reliable until provider capacity limits trigger a retry storm.

If the only release question is “did the answer look good?”, then the team is flying mostly by feel.

Quality Is Not The Only Promotion Criterion

Most teams understand that AI workflows need quality evaluation. Fewer treat cost, latency, retries, and tool fan-out as first-class promotion criteria.

In traditional software, performance regressions, error-rate increases, and resource blowups can block a release even when the feature technically works. AI workflows need the same discipline. A prompt change that improves groundedness by one point but doubles cost per case is not automatically a win. A model change that improves answer fluency but increases p95 latency beyond the product budget may not be promotable. An agent loop that solves a task by calling every available tool is not reliable just because it eventually gets there.

AgentOps Lab treats cost as part of the release decision.

The M0 gate can fail on quality signals:

groundedness
retrieval_relevance
citation_quality
refusal_behavior
secrets_disclosure
tool_call_correctness

It can also fail on operational and cost signals:

p95_latency_ms
failure_rate
estimated_cost_per_case_usd
total_estimated_run_cost_usd
input_tokens_per_case
tool_calls_per_case
retry_rate

The workflow is judged not only by whether it answered correctly, but also by whether the whole run stayed inside the quality, safety, latency, and budget envelope.
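As a rough illustration of how per-case measurements might roll up into the operational signals listed above, here is a minimal sketch. The `CaseMetrics` type, its field names, the nearest-rank p95 method, and the definition of `retry_rate` are all assumptions made for illustration, not AgentOps Lab's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class CaseMetrics:
    # Hypothetical per-case measurements; field names are illustrative.
    latency_ms: float
    estimated_cost_usd: float
    retries: int
    tool_calls: int
    failed: bool


def p95_nearest_rank(values: list[float]) -> float:
    """Return the 95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(cases: list[CaseMetrics]) -> dict[str, float]:
    """Aggregate per-case measurements into run-level operational metrics."""
    n = len(cases)
    total_cost = sum(c.estimated_cost_usd for c in cases)
    return {
        "p95_latency_ms": p95_nearest_rank([c.latency_ms for c in cases]),
        "failure_rate": sum(1 for c in cases if c.failed) / n,
        "estimated_cost_per_case_usd": total_cost / n,
        "total_estimated_run_cost_usd": total_cost,
        "tool_calls_per_case": sum(c.tool_calls for c in cases) / n,
        # Assumption: retry_rate is the share of cases that retried at least once.
        "retry_rate": sum(1 for c in cases if c.retries > 0) / n,
    }
```

Keeping the aggregation in one small function makes it easy to see which per-case fields a workflow adapter has to report before any operational gate check can run.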

A balanced gate diagram showing quality thresholds and cost controls both feeding a production-readiness decision.

What The Local Slice Actually Does

The current AgentOps Lab slice is intentionally local and deterministic.

That is a feature, not a limitation, for this stage. A local deterministic slice makes the evaluation contract testable without live cloud credentials, provider availability, Cognito sessions, model latency variance, or external spend. It lets the project prove the shape of the suite, workflow adapter, report, trace contract, and gate policy before adding distributed infrastructure.

The CLI command looks like this:

PYTHONPATH=packages/evaluator/src:packages/cli/src python3 -m agentops_cli.main eval \
  --suite configs/suites/incident-response.yaml \
  --workflow configs/workflows/basic-rag-v1.yaml \
  --gate configs/gates/prod-readiness-v1.yaml \
  --fail-under-groundedness 0.85 \
  --fail-under-citation-quality 0.90 \
  --max-p95-latency-ms 9000 \
  --max-estimated-cost-per-case-usd 0.01 \
  --report-out reports/evalrun-local.json

That command does a few concrete things.

It loads a curated evaluation suite. The Harborlight suite contains synthetic incident-response cases for groundedness, citation quality, missing evidence, conflicting evidence, prompt injection, unsafe remediation, ambiguous questions, tool-call fan-out, fake citation injection, stale documentation, and multi-document conflict.

It loads a workflow definition. In M0, that is a local Basic RAG workflow. The same boundary is meant to support agentic workflows and external workflow adapters later.

It runs every case and captures a normalized response contract: answer, citations, retrieved documents, latency, estimated cost, token usage, tool-call count, retry count, refusal status, scores, and trace events.
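One way to picture that normalized contract is as a single typed record per case. The sketch below is an assumption about shape only: the class name `CaseResponse`, the `normalize` helper, the split of token usage into input and output counts, and the raw dictionary keys are hypothetical and may not match the project's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResponse:
    # Fields follow the list in the prose; names and types are assumptions.
    case_id: str
    answer: str
    citations: list[str]
    retrieved_documents: list[str]
    latency_ms: float
    estimated_cost_usd: float
    input_tokens: int
    output_tokens: int
    tool_call_count: int
    retry_count: int
    refused: bool
    scores: dict[str, float] = field(default_factory=dict)
    trace_events: list[dict] = field(default_factory=list)


def normalize(case_id: str, raw: dict) -> CaseResponse:
    """Map a raw workflow output dict onto the contract, defaulting optional fields."""
    return CaseResponse(
        case_id=case_id,
        answer=str(raw["answer"]),  # required: a KeyError here is a contract violation
        citations=list(raw.get("citations", [])),
        retrieved_documents=list(raw.get("retrieved_documents", [])),
        latency_ms=float(raw.get("latency_ms", 0.0)),
        estimated_cost_usd=float(raw.get("estimated_cost_usd", 0.0)),
        input_tokens=int(raw.get("input_tokens", 0)),
        output_tokens=int(raw.get("output_tokens", 0)),
        tool_call_count=int(raw.get("tool_call_count", 0)),
        retry_count=int(raw.get("retry_count", 0)),
        refused=bool(raw.get("refused", False)),
        trace_events=list(raw.get("trace_events", [])),
    )
```

The value of a record like this is that scorers and gate checks can depend on one shape, regardless of which workflow produced the output.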

It validates the trace events before scoring. That matters because observability is not useful if traces are optional, malformed, or inconsistent.

It evaluates the output against the expected response contract. The initial scorers are deterministic. They are not meant to be magical judges. They are meant to catch known failure modes repeatably.
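To make "deterministic scorer" concrete, here is a minimal sketch of what a citation-quality check could look like. The scoring rule (fraction of expected sources cited, with any citation outside the corpus scoring zero) and the document IDs in the usage note are assumptions for illustration, not the project's actual scorer.

```python
def citation_quality(produced: list[str], expected: set[str], corpus_ids: set[str]) -> float:
    """Score citations in [0, 1] with no model in the loop.

    Assumed rule: a citation to a document that is not in the corpus is
    disqualifying; otherwise the score is the fraction of expected sources cited.
    """
    if any(doc_id not in corpus_ids for doc_id in produced):
        return 0.0  # fabricated citation
    if not expected:
        # Cases that expect no citations (for example, refusals) pass only when none are given.
        return 1.0 if not produced else 0.0
    return len(expected & set(produced)) / len(expected)
```

With a hypothetical corpus containing `runbook-checkout-api` and `incident-1042`, citing both expected documents scores 1.0, citing only one scores 0.5, and citing an invented ID scores 0.0. The same inputs always give the same score, which is what makes regressions comparable across runs.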

Finally, it applies the gate. If the quality metrics fall below threshold, the run fails. If latency or cost exceeds threshold, the run fails. If the workflow passes every case but becomes too expensive or too slow, the gate can still block promotion.
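The gate logic itself can be very small. The sketch below assumes each check is either a minimum or a maximum threshold on one summary metric; the `Check` type and `evaluate_gate` function are hypothetical names, and only the four threshold values are taken from the CLI flags shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    metric: str
    kind: str  # "min": value must be >= threshold; "max": value must be <= threshold
    threshold: float


def evaluate_gate(summary: dict[str, float], checks: list[Check]) -> tuple[bool, list[str]]:
    """Return (passed, failure messages) for a run summary."""
    failures = []
    for check in checks:
        value = summary.get(check.metric)
        if value is None:
            # A missing metric fails the gate rather than passing silently.
            failures.append(f"{check.metric}: missing from report")
        elif check.kind == "min" and value < check.threshold:
            failures.append(f"{check.metric}: {value} < {check.threshold}")
        elif check.kind == "max" and value > check.threshold:
            failures.append(f"{check.metric}: {value} > {check.threshold}")
    return (not failures, failures)


# Thresholds copied from the CLI flags in the command above.
CHECKS = [
    Check("groundedness", "min", 0.85),
    Check("citation_quality", "min", 0.90),
    Check("p95_latency_ms", "max", 9000),
    Check("estimated_cost_per_case_usd", "max", 0.01),
]
```

Under this sketch, a run with perfect quality scores but an estimated cost of 0.02 USD per case still fails, with a single message naming the cost metric. That is the behavior the paragraph above describes: quality alone does not carry a run through the gate.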

The Evaluation Suite Is The Product Boundary

A common failure mode in AI projects is treating evaluation as something bolted on after the product is mostly built.

AgentOps Lab takes the opposite view. The evaluation suite is part of the product boundary.

For the Harborlight incident-response slice, each case has a job. One case checks that the assistant grounds triage guidance in the checkout-api runbook. Another checks whether it cites both an incident record and a runbook. Another expects refusal when the corpus does not contain a notification-worker runbook. Several red-team cases test prompt injection, fake citations, tool-output injection, and unsafe remediation.

Generic “does this answer look good?” evaluation does not catch enough.

A useful incident assistant must do more than summarize. It needs to know when evidence is missing. It needs to prefer current policy over deprecated knowledge-base content. It needs to refuse unsafe actions. It needs to ignore hostile instructions embedded in retrieved context. It needs to avoid fake citations. It needs to scope broad requests instead of expanding into unbounded tool calls.

Once the suite names those requirements explicitly, the workflow has something real to pass.
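For example, the missing-runbook requirement can be written down as an expectation that the workflow refuses, so a refusal counts as a pass rather than a failure. This is a minimal sketch: `ExpectedResponse`, its fields, and the case IDs in the usage note are hypothetical, not the suite's real fixture format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedResponse:
    # Hypothetical fixture shape for one evaluation case.
    case_id: str
    must_refuse: bool
    required_citations: frozenset[str] = frozenset()


def refusal_behavior(expected: ExpectedResponse, refused: bool) -> float:
    """Return 1.0 when the refusal decision matches the case's expectation.

    Both directions are penalized: answering when evidence is missing,
    and refusing when the corpus does support an answer.
    """
    return 1.0 if refused == expected.must_refuse else 0.0
```

A case such as `ExpectedResponse("missing-notification-worker-runbook", must_refuse=True)` scores 1.0 only when the workflow refuses, while a triage case with `must_refuse=False` scores 0.0 if the workflow refuses anyway. Scoring over-refusal as a failure keeps a workflow from passing the suite by declining everything.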

Reports Should Be Reviewable Artifacts

One reason AI failures are hard to debug is that the answer is often the only artifact people look at.

That is not enough.

AgentOps Lab emits a versioned JSON report with run metadata, workflow metadata, evaluation profile, summary metrics, gate checks, and per-case results. The current local report uses schema version 0.1.0.

The summary fields below come from the local deterministic workflow. The millisecond latency and sub-cent cost are what a stub returns, not what live provider billing or production latency look like. Their job in M0 is to prove the report and gate can carry those metrics. Once live models and external workflows are plugged in, the same fields become much more consequential.

{
  "case_count": 13,
  "passed_count": 13,
  "failed_count": 0,
  "pass_rate": 1.0,
  "p95_latency_ms": 1.0,
  "estimated_cost_per_case_usd": 0.000027,
  "total_estimated_run_cost_usd": 0.00035,
  "total_tokens": 753,
  "total_tool_calls": 0,
  "retry_rate": 0.0,
  "red_team_case_count": 6,
  "red_team_pass_rate": 1.0,
  "trace_event_count": 53,
  "groundedness": 1.0,
  "citation_quality": 1.0,
  "refusal_behavior": 1.0
}

The important design point is that the report is a review artifact. It can be stored in CI. It can be compared across runs. It can explain why a gate failed. It can show which case failed, which metric failed, what evidence was retrieved, which citations were produced, and what trace events were emitted.
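Comparing reports across runs can start as a plain diff over the summary block. The sketch below assumes two summaries loaded as dictionaries with the same flat keys as the example above; `diff_summaries` is a hypothetical helper, and the candidate values in the usage note are invented for illustration.

```python
def diff_summaries(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate-minus-baseline deltas for numeric metrics that changed."""
    deltas = {}
    for key in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[key], candidate[key]
        both_numeric = isinstance(before, (int, float)) and isinstance(after, (int, float))
        if both_numeric and before != after:
            deltas[key] = after - before
    return deltas
```

Given a baseline with `total_tokens` of 753 and a hypothetical candidate run with 900, the diff reports `{"total_tokens": 147}` and omits every metric that did not move. A reviewer then sees only what changed between two runs, which is easier to reason about than two full reports side by side.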

That is much healthier than asking someone to inspect a single generated answer and decide whether the workflow feels safe.

Trace Data Is Not Just For Dashboards

Observability is often introduced late, after a system is already difficult to reason about.

For AI workflows, that is backwards. Trace data needs to be part of evaluation from the beginning because many failures are behavioral, not just textual.

The current local Basic RAG workflow emits lightweight trace events such as:

workflow.started
retrieval.started
retrieval.filtered
retrieval.completed
workflow.completed

The planned incident-assistant workflow expands that shape with agent and tool events:

agent.plan.created
tool_call.started
tool_call.completed
agent.synthesis.completed

This is intentionally simpler than full distributed tracing. The local trace contract only requires stable event types and case IDs today. But it is designed so it can later map to OpenTelemetry spans for evaluation runs, retrieval operations, model calls, tool calls, scoring operations, and gate evaluation.
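A validator for that kind of contract might look like the sketch below. It checks the two things the text says are required today, known event types and matching case IDs. The event field names `event_type` and `case_id`, and the extra rule that a trace starts with `workflow.started` and ends with `workflow.completed`, are assumptions added for illustration.

```python
ALLOWED_EVENT_TYPES = {
    "workflow.started",
    "retrieval.started",
    "retrieval.filtered",
    "retrieval.completed",
    "workflow.completed",
}


def validate_trace(events: list[dict], case_id: str) -> list[str]:
    """Return a list of problems; an empty list means the trace is valid."""
    problems = []
    for index, event in enumerate(events):
        event_type = event.get("event_type")
        if event_type not in ALLOWED_EVENT_TYPES:
            problems.append(f"event {index}: unknown event_type {event_type!r}")
        if event.get("case_id") != case_id:
            problems.append(f"event {index}: case_id mismatch")
    types = [event.get("event_type") for event in events]
    # Assumption: a valid trace is bracketed by workflow.started and workflow.completed.
    if not types or types[0] != "workflow.started":
        problems.append("trace does not begin with workflow.started")
    if not types or types[-1] != "workflow.completed":
        problems.append("trace does not end with workflow.completed")
    return problems
```

Returning a list of problems instead of raising on the first one means a single run can report every malformed event, which is more useful in a stored report than a stack trace.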

First, make local traces useful and valid. Then export them to whatever observability backend the deployment needs. If the trace contract is not useful locally, sending it to a bigger dashboard will not fix the underlying problem.

Gates Make Tradeoffs Visible

The most important benefit of a gate is not that it says “pass” or “fail.” It is that it forces tradeoffs into the open.

Consider a prompt change that improves citation quality but increases input tokens per case by 4x. Is that acceptable? Maybe. But it should be a conscious decision, not an invisible side effect.

Consider a model upgrade that improves groundedness but pushes p95 latency above the product budget. Is the quality improvement worth the latency cost? Maybe. But that is a release decision, not a hidden implementation detail.

Consider an agent loop that handles ambiguous questions by calling every tool it has. The final answer may look polished, but the workflow is behaving like an unbounded search process. That needs a budget.
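One simple form such a budget can take is a per-case counter that the agent loop must charge before every tool call. This is a sketch under assumptions: `ToolBudget`, `ToolBudgetExceeded`, and the tool names in the usage note are hypothetical, and the planned incident-assistant workflow may enforce limits differently.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when a case tries to make more tool calls than its budget allows."""


class ToolBudget:
    """Counts tool calls for one case and refuses calls beyond the cap."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.calls = 0

    def charge(self, tool_name: str) -> None:
        # Check before counting so `calls` never exceeds `max_calls`.
        if self.calls >= self.max_calls:
            raise ToolBudgetExceeded(
                f"{tool_name}: budget of {self.max_calls} tool calls already spent"
            )
        self.calls += 1
```

With `ToolBudget(max_calls=2)`, two calls such as `charge("search_runbooks")` and `charge("get_incident")` succeed and a third raises `ToolBudgetExceeded`. The loop can catch that exception and either answer with what it has or ask the user to narrow the request, and the final `calls` value feeds the `tool_calls_per_case` metric.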

The gate does not remove engineering judgment. It gives engineering judgment something concrete to inspect.

What The M0 Slice Pins Down

The first AgentOps Lab slice did not try to build the whole platform. The useful work was getting the local contracts to line up:

Evaluation suites are defined as YAML.
Expected responses live in structured fixtures.
Workflow definitions are versioned.
The runner normalizes workflow output.
Deterministic evaluators score known behaviors.
Trace events are validated before scoring.
The report schema captures summary and per-case results.
Gate policies enforce quality, latency, and cost thresholds.
The CLI exits non-zero when the gate fails.
CI can run the same local evaluation and store the report as an artifact.
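The last two items can be pictured as a small mapping from a stored report to a process exit code. The sketch below is not the AgentOps Lab CLI: it assumes the report carries gate results as a list of objects with `name` and `passed` fields under a `gate_checks` key, which may not match the real 0.1.0 schema.

```python
import json
import sys
from pathlib import Path


def exit_code_for(report: dict) -> int:
    """Map a report to an exit code: 0 pass, 1 gate failure, 2 no gate evidence."""
    # Assumption: gate results live under "gate_checks" as [{"name": ..., "passed": ...}].
    checks = report.get("gate_checks", [])
    if not checks:
        return 2  # a report without gate checks is treated as an error, not a pass
    return 0 if all(check.get("passed") is True for check in checks) else 1


def check_report(path: str) -> int:
    """Load a JSON report, print failed checks to stderr, and return the exit code."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    for check in report.get("gate_checks", []):
        if check.get("passed") is not True:
            print(f"gate check failed: {check.get('name')}", file=sys.stderr)
    return exit_code_for(report)
```

A CI step could call `sys.exit(check_report("reports/evalrun-local.json"))` after the evaluation run, so a failed gate fails the job while the report stays available as an artifact for review.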

That is a small vertical slice, but it is the right vertical slice. It proves the system can make a promotion decision from one command.

The broader platform can grow from there: external workflow adapters, agentic tool traces, stronger regulated-environment evaluators, PostgreSQL-backed run storage, a web console, OpenTelemetry export, and an AWS reference architecture.

But those pieces should not come before the local evaluation contract works. Otherwise the project risks becoming a dashboard around an unclear decision model.

The Result

The current local Basic RAG run passes the Harborlight incident-response suite:

Cases: 13
Passed: 13
Failed: 0
Gate status: passed
Gate checks: 17 / 17
Red-team pass rate: 1.0
Groundedness: 1.0
Citation quality: 1.0
Refusal behavior: 1.0
Secrets disclosure safety: 1.0

That result should be interpreted carefully.

It does not mean the system is production-ready. It does not mean the future API, web console, cloud deployment, or external workflow adapters are done. It does not mean deterministic local cost estimates are the same as provider billing data.

It means the M0 contract works: a workflow can be evaluated against a curated suite, produce traceable evidence, emit a report, and pass or fail a promotion gate.

That is a better claim than “the agent works.”

It says what was measured, what passed, and what remains.

The Bigger Point

The next AI engineering challenge is not getting agents to do more. It is proving they keep doing useful work under real operational constraints. That means evaluation has to cover retrieval, grounding, citations, refusals, safety, tool use, retries, latency, tokens, and cost, and it has to turn that behavior into reviewable artifacts that fit normal engineering discipline.

AI workflows need release engineering.

The polished answer is only one part. The more important question is whether a workflow can explain itself: what evidence it used, what it refused, what it cost, how long it took, which tools it called, which risks it avoided, and why it should be promoted.

Production AI starts when quality and cost can fail the build.

Practical Takeaways

Treat a working demo as the beginning, not the promotion decision.
Evaluate quality and cost together. Accuracy alone is not enough.
Make refusal a first-class success case when evidence is missing or the request is unsafe.
Score retrieval, citations, grounding, tool use, and safety separately.
Put latency, token usage, tool calls, retries, and estimated cost in the report.
Make gate policies explicit and versioned.
Store evaluation reports as CI artifacts so failures are reviewable.
Validate trace events before scoring. Bad traces make bad debugging.
Keep the first slice local and deterministic until the contract is stable.
Do not build a dashboard before you know what decision the dashboard supports.

That is the difference between a shiny demo and an AI workflow that engineering teams can operate.