# Evaluation Guide

- **Goals**: Detect regressions, compare configs, and track latency.
- **Datasets**: Use `EvalDatasetItem` with expected file paths.
- **Metrics**: MRR, Recall@K, NDCG@10, p50/p95 latency.
- **Match prod**: Align `eval_final_k` and `eval_multi` with production to avoid misleading results.
- **Compare runs**: Use `/reranker/train/diff` or the evaluation comparison endpoints to see deltas and compatibility.
- **Small samples**: Use small samples for iteration, but run full suites before shipping changes.
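The ranking metrics listed above are straightforward to compute by hand. The sketch below assumes each query's results are a list of binary relevance flags in rank order — an illustrative format, not the project's actual evaluation schema:

```python
import math

def mrr(ranked_lists):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranking in ranked_lists:  # each ranking: 0/1 relevance flags in rank order
        for i, rel in enumerate(ranking, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranking, num_relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(ranking[:k]) / num_relevant

def ndcg_at_k(ranking, k):
    """NDCG@k with binary relevance: DCG divided by the ideal DCG."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ranking[:k], start=1))
    ideal = sorted(ranking, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

# Two queries: first relevant hit at rank 2, then at rank 1
print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
```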
## Typical Flow

```mermaid
flowchart TB
    PREP["Prepare eval dataset"] --> RUN["Run evaluation"]
    RUN --> ANALYZE["Analyze metrics"]
    ANALYZE --> TUNE["Tune config"]
    TUNE --> RUN
```

```python
import httpx

base = "http://localhost:8000"

# Trigger evaluation
print(httpx.post(f"{base}/reranker/evaluate", json={"corpus_id": "tribrid"}).json())
```
```bash
BASE=http://localhost:8000
curl -sS -X POST "$BASE/reranker/evaluate" \
  -H 'Content-Type: application/json' \
  -d '{"corpus_id":"tribrid"}' | jq .
```
```js
// Load eval results and render charts
const result = await (
  await fetch('/reranker/evaluate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ corpus_id: 'tribrid' }),
  })
).json();
```
| Knob | Where | Default |
|---|---|---|
| `evaluation.eval_multi_m` | `TriBridConfig.evaluation` | 10 |
| `retrieval.eval_final_k` | `TriBridConfig.retrieval` | 5 |
| `retrieval.eval_multi` | `TriBridConfig.retrieval` | 1 (on) |
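Because mismatched eval knobs are an easy way to get misleading numbers, it can help to check alignment with production before each run. The sketch below uses the knob names from the table; the plain-dict config shape and the `check_alignment` helper are illustrative assumptions, not the project's actual API:

```python
# Hypothetical production retrieval settings to mirror in evaluation.
prod = {"final_k": 5, "multi": 1}

eval_config = {
    "retrieval.eval_final_k": prod["final_k"],  # keep aligned with production
    "retrieval.eval_multi": prod["multi"],      # keep aligned with production
    "evaluation.eval_multi_m": 10,              # default from the table above
}

def check_alignment(config, prod):
    """Return the names of eval knobs that drift from production settings."""
    mismatches = []
    if config["retrieval.eval_final_k"] != prod["final_k"]:
        mismatches.append("eval_final_k")
    if config["retrieval.eval_multi"] != prod["multi"]:
        mismatches.append("eval_multi")
    return mismatches

print(check_alignment(eval_config, prod))  # []
```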
## Prompt analysis

Use `system_prompts.eval_analysis` to generate a skeptical post-hoc analysis comparing two runs.
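Before handing two runs to prompt-based analysis, a simple per-metric delta pass can flag candidate regressions. The metric dicts below are illustrative, not the API's actual response schema:

```python
def diff_runs(baseline, candidate, regression_tol=0.01):
    """Return per-metric deltas and flag drops larger than the tolerance."""
    deltas = {}
    regressions = []
    for metric, base_val in baseline.items():
        delta = candidate.get(metric, 0.0) - base_val
        deltas[metric] = round(delta, 4)
        if delta < -regression_tol:
            regressions.append(metric)
    return deltas, regressions

baseline = {"mrr": 0.72, "recall@5": 0.81, "ndcg@10": 0.69}
candidate = {"mrr": 0.74, "recall@5": 0.78, "ndcg@10": 0.70}
deltas, regressions = diff_runs(baseline, candidate)
print(regressions)  # ['recall@5']
```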