Evaluation & Regression Testing¶
AGRO ships with a full evaluation loop for the RAG pipeline. The goal isn’t to chase a single “accuracy” number – it’s to make it easy to:
- Build and maintain a local golden dataset
- Run repeatable evaluations against your current config
- Compare runs over time as you tweak retrieval, models, and prompts
- Drill into failures and traces when something regresses
This page focuses on how the evaluation system actually works in the current codebase and how to use it effectively.
Where evaluation data lives¶
All of the evaluation artifacts are plain JSON files on disk:
- `data/golden.json` – primary golden dataset (questions + expected answers / contexts)
- `data/evaluation_dataset.json` – additional evaluation questions (often larger / noisier)
- `data/evals/*.json` – individual evaluation runs
- `data/backups/evals_/*.json` – older baselines and snapshots
- `data/tracking/evals_latest.json` – pointer to the most recent run
On top of that, the web UI and CLI both talk to the same HTTP API to:
- Trigger an evaluation run
- Read/write golden questions
- Inspect historical runs and traces
What an evaluation run actually does¶
At a high level, an evaluation run is:
- Load a set of questions + expected answers from disk
- For each question:
    - Run it through the same RAG pipeline as the normal `/rag` or chat endpoints
    - Capture retrieval stats, model outputs, and traces
- Compute metrics (per-question and aggregate)
- Write a self-contained JSON report under `data/evals/`
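Putting those steps together, a minimal sketch of the loop might look like the following. Everything here is a stand-in rather than AGRO's actual runner: the `answer_question` helper is hypothetical, and the sketch assumes the golden file is a flat JSON list of entries.

```python
import json
import time
from pathlib import Path

def answer_question(question: str) -> dict:
    """Hypothetical stand-in for AGRO's real RAG code path (retrieval + answer + trace)."""
    return {"answer": "", "retrieval": {}, "trace": None}

def run_eval(dataset_path: str = "data/golden.json") -> Path:
    # Assumption: the dataset is a JSON list of entries with "id", "question", "answer".
    questions = json.loads(Path(dataset_path).read_text())

    per_question = []
    for q in questions:
        result = answer_question(q["question"])
        per_question.append({
            "id": q.get("id"),
            "question": q["question"],
            "expected": q.get("answer"),
            "actual": result["answer"],
            "retrieval": result["retrieval"],
            "trace": result["trace"],
        })

    report = {
        "summary": {"total": len(per_question)},
        "questions": per_question,
    }

    # Each run is written as a self-contained JSON file under data/evals/.
    out_dir = Path("data/evals")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"eval_{time.strftime('%Y%m%d_%H%M%S')}.json"
    out_path.write_text(json.dumps(report, indent=2))
    return out_path
```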
The important part is that evaluation uses the same code paths as production queries:
- Retrieval goes through `retrieval.hybrid_search.search_routed_multi`
- RAG answers go through `server/services/rag.py` (and `langgraph_app` if enabled)
- Configuration is read from the same registry (`server/services/config_registry.py`)
That means any change you make to:
- `agro_config.json`
- `.env`
- Indexing parameters
- Model configuration
…will be reflected in the next evaluation run.
Golden dataset format¶
The golden dataset is intentionally simple JSON so you can edit it by hand or via the UI.
A minimal entry looks like this:
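Each entry in `data/golden.json` is a small object. The values in this sketch are made up; only the field names described in the table below carry meaning.

```json
{
  "id": "bm25-weighting-001",
  "question": "Where are the BM25 weights applied during hybrid retrieval?",
  "answer": "In retrieval.hybrid_search, when dense and sparse scores are merged.",
  "tags": ["bm25", "retrieval"],
  "metadata": {"target_file": "retrieval/hybrid_search.py"}
}
```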
Typical fields:
| Field | Description |
|---|---|
| `id` | Stable identifier for the question (used for diffing runs) |
| `question` | Natural language query to send into the RAG pipeline |
| `answer` | Expected answer or reference explanation |
| `tags` | Free-form labels (e.g. `"bm25"`, `"langgraph"`, `"config"`) |
| `metadata` | Optional extra info (e.g. target file, function, or scenario description) |
AGRO doesn’t force a particular schema beyond what the evaluation runner expects. If you want to add extra fields for your own tooling, you can – they’ll be preserved in the run outputs.
How configuration is captured per run¶
One of the more important design choices: every evaluation run captures the effective configuration that was used.
Internally, everything goes through the configuration registry:
- `.env` is loaded first via `python-dotenv`
- `agro_config.json` is parsed into `AgroConfigRoot` (Pydantic)
- Defaults live in the Pydantic model
- The registry exposes type-safe getters (`get_int`, `get_float`, `get_bool`, `get_str`)
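As a rough sketch of how code reads settings through the registry: the module path and getter names come from this page, but the exact import location, getter signatures, and the `USE_LEARNING_RERANKER` key are assumptions.

```python
# Assumption: get_config_registry() lives in server/services/config_registry.py.
from server.services.config_registry import get_config_registry

registry = get_config_registry()

# Type-safe reads; whether these getters accept a default value is an assumption.
final_k = registry.get_int("FINAL_K", 10)
use_reranker = registry.get_bool("USE_LEARNING_RERANKER", False)  # hypothetical key name
```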
When an evaluation run starts, AGRO snapshots the relevant parts of this registry and writes them into the eval JSON. That way you can later answer questions like:
- "What was `FINAL_K` when this baseline was recorded?"
- "Were we using the learning reranker or pure dense search?"
- "Which provider/model pair was active for answers?"
This is why you’ll see a lot of `data/backups/evals_/*.json` – they’re just frozen views of “config + metrics + outputs” at a point in time.
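For example, the frozen config portion of one of those snapshots might look roughly like this. The key names are illustrative; only `FINAL_K`, the reranker/provider/model questions above, and the `"repo"` tag appear elsewhere on this page.

```json
{
  "repo": "agro",
  "config": {
    "FINAL_K": 10,
    "reranker": "learning",
    "answer_provider": "<provider>",
    "answer_model": "<model>"
  }
}
```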
Running evaluations¶
You can run evaluations from three places:
- CLI (`cli/commands/eval.py`)
- Web UI (Evaluation tab)
- HTTP API (for automation)
The CLI is usually the most direct when you’re iterating on retrieval.
```bash
# Run evaluation against the default repo (REPO or agro)
python -m cli.agro eval run

# Run against a specific repo/profile
python -m cli.agro eval run --repo my-project

# Use a specific dataset file
python -m cli.agro eval run --dataset data/golden.json
```
Each run will:
- Use the current index for the selected repo
- Use whatever models and retrieval settings are active in `agro_config.json` / `.env`
- Write a new `data/evals/eval_YYYYMMDD_HHMMSS.json`
- Update `data/tracking/evals_latest.json` to point at it
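If you want to locate the newest run from a script, something like the sketch below works. The exact shape of the tracking file isn't documented here, so it falls back to sorting the timestamped filenames.

```python
import json
from pathlib import Path

# data/tracking/evals_latest.json points at the most recent run; its exact
# shape is an assumption, so also sort data/evals/ by filename as a fallback.
tracking = Path("data/tracking/evals_latest.json")
if tracking.exists():
    print(json.loads(tracking.read_text()))

# eval_YYYYMMDD_HHMMSS.json sorts chronologically by name.
latest = max(Path("data/evals").glob("eval_*.json"), default=None)
print(latest)
```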
What’s inside an eval run file¶
A typical eval run JSON has three main sections:
- `config` – snapshot of relevant AGRO configuration
- `questions` – per-question results and metrics
- `summary` – aggregated metrics and counts
A heavily simplified sketch:
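(The keys below are illustrative rather than exact; actual run files may name things differently.)

```json
{
  "repo": "agro",
  "config": {
    "FINAL_K": 10,
    "reranker": "learning"
  },
  "questions": [
    {
      "id": "bm25-weighting-001",
      "question": "Where are the BM25 weights applied during hybrid retrieval?",
      "expected": "...",
      "actual": "...",
      "metrics": {"correct": true},
      "trace": "out/agro/traces/<trace-file>.json"
    }
  ],
  "summary": {"total": 42, "correct": 38}
}
```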
The exact metric set is still evolving – the important part is that each question keeps a pointer to its trace file (see below) and enough retrieval metadata to debug failures.
Traces and drill‑down¶
Evaluation runs are tightly integrated with the tracing system under `out/<repo>/traces/`.
The HTTP layer exposes a small helper in `server/services/traces.py`:
- `list_traces(repo)` – list recent trace files for a repo
- `latest_trace(repo)` – fetch the most recent trace (via `server.tracing.latest_trace_path`)
During evaluation, each question is typically run with tracing enabled, and the resulting trace path is stored in the eval JSON. That gives you a direct path from “metric dropped” → “show me exactly what the pipeline did for this question.”
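For example, a small script can walk from a run file to the traces of its failing questions. The field names used here (`questions`, `metrics.correct`, `trace`) are assumptions of this sketch, not a documented schema.

```python
import json
from pathlib import Path

run = json.loads(Path("data/evals/eval_baseline.json").read_text())

# Print the trace path for every question that did not pass.
for q in run.get("questions", []):
    if not q.get("metrics", {}).get("correct", True):
        print(q.get("id"), "->", q.get("trace"))
```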
In the web UI, the Evaluation tab wires this up to the Trace Viewer component so you can:
- Click into a failing question
- Inspect the retrieval candidates, scores, and reranker decisions
- See the exact prompts and model responses used
Comparing runs¶
Because each run is a single JSON file, comparing runs is just diffing JSON:
- `data/evals/latest.json` – often a symlink or copy of the most recent run
- `data/backups/evals_/*.json` – older baselines you can diff against
A typical workflow when tuning retrieval:
- Run an eval with your current settings → `eval_baseline.json`
- Tweak `agro_config.json` (e.g. adjust BM25 weights, enable learning reranker)
- Re-index if necessary (e.g. changed chunking or embeddings)
- Run eval again → `eval_new.json`
- Diff the two files:
```bash
jq '.summary' data/evals/eval_baseline.json
jq '.summary' data/evals/eval_new.json

# Or a full diff
diff -u data/evals/eval_baseline.json data/evals/eval_new.json | less
```
Because the per‑question id is stable, you can also script more detailed comparisons (e.g. “which questions flipped from correct → incorrect?”).
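A minimal sketch of that kind of comparison is below. It again assumes a top-level `questions` list and a per-question `metrics.correct` flag, which may differ from the actual run schema.

```python
import json
from pathlib import Path

def load_results(path: str) -> dict:
    run = json.loads(Path(path).read_text())
    # Assumed shape: {"questions": [{"id": ..., "metrics": {"correct": bool}}, ...]}
    return {q["id"]: q.get("metrics", {}).get("correct") for q in run.get("questions", [])}

baseline = load_results("data/evals/eval_baseline.json")
new = load_results("data/evals/eval_new.json")

regressions = [qid for qid, ok in baseline.items() if ok and new.get(qid) is False]
improvements = [qid for qid, ok in new.items() if ok and baseline.get(qid) is False]

print("regressed:", regressions)
print("improved:", improvements)
```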
UI: Evaluation tab¶
The web UI has a dedicated Evaluation section built from several components:
- `EvaluationRunner.tsx` – start/stop runs, pick datasets
- `HistoryViewer.tsx` – browse past runs from `data/evals/`
- `EvalDrillDown.tsx` – per-question view with metrics and traces
- `FeedbackPanel.tsx` – record manual feedback on answers
- `QuestionManager.tsx` – edit / add golden questions
- `SystemPromptsSubtab.tsx` – inspect and tweak system prompts used during eval
- `TraceViewer.tsx` – render the trace JSON referenced by each question
All of these talk to the same FastAPI endpoints that the CLI uses. There’s no separate “UI‑only” evaluation path.
Working with multiple repos¶
AGRO is repo‑aware across the whole stack. Evaluation follows the same pattern:
- The active repo is determined by `REPO` (env) or request payload
- Indexes live under `out/<repo>/...`
- Traces live under `out/<repo>/traces/`
- Eval runs are tagged with `"repo": "<name>"`
When you call the evaluation endpoints or CLI with `--repo my-project`, the server:
- Uses `get_config_registry()` to resolve `REPO` and other settings
- Reads/writes eval files under the same data directory but with the correct repo tag
- Uses the correct Qdrant collection for retrieval
This matters if you’re running AGRO against multiple codebases from a single instance.
Rough edges & things to know¶
A few honest caveats about the current evaluation implementation:
- Metrics are intentionally simple – they’re good for regression, not leaderboard‑style benchmarking
- The golden dataset format is stable but not “versioned” – if you change the shape, keep your own migration scripts
- Some older eval files under `data/backups/evals_` predate newer metrics; don't expect them all to have identical schemas
- If you change low-level config (e.g. chunking) without re-indexing, evaluation will happily run against a stale index – it won't try to be clever about that
Extending the evaluation pipeline¶
Because everything is just Python + JSON, extending the evaluation loop is straightforward:
- Add new metrics to the eval runner and write them into the per-question `metrics` block (see the sketch below)
- Store extra per-question metadata (e.g. which MCP tools were used, which reranker checkpoint was active)
- Add new UI panels that read from the same eval JSON files
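For example, a custom metric can be computed alongside the existing ones and merged into each question's result. The function and field names here are placeholders, not AGRO's actual hooks.

```python
def exact_citation_metric(expected_paths: list[str], retrieved_paths: list[str]) -> dict:
    """Placeholder metric: did retrieval surface all files the golden entry expects?"""
    hits = [p for p in expected_paths if p in retrieved_paths]
    return {
        "citation_recall": len(hits) / len(expected_paths) if expected_paths else 1.0,
        "missing_paths": [p for p in expected_paths if p not in hits],
    }

# Inside the eval runner, wherever per-question results are assembled
# (field name "metrics" is an assumption):
# result["metrics"].update(exact_citation_metric(expected, retrieved))
```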
If you’re not sure where to hook in, open the AGRO chat tab and ask it about the evaluation code – the engine is indexed on itself, and it will point you at the relevant modules and functions.