
Evaluation Guide

  • Goals: Detect regressions, compare configs, and track latency.

  • Datasets: Use EvalDatasetItem with expected file paths.

  • Metrics: MRR, Recall@K, NDCG@10, p50/p95 latency.
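The retrieval metrics above are easy to sanity-check locally. A minimal sketch, assuming each eval item yields a ranked list of document IDs and a set of relevant IDs (the helper names are illustrative, not part of the API):

```python
import math

def mrr(ranked_ids, relevant):
    # Reciprocal rank of the first relevant hit (0.0 if none found)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant, k):
    # Fraction of relevant items that appear in the top k results
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked_ids, relevant, k=10):
    # Binary-relevance NDCG: DCG over the top k, normalised by the ideal DCG
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(mrr(ranked, relevant))             # 0.5 (first relevant hit at rank 2)
print(recall_at_k(ranked, relevant, 3))  # 0.5
```

p50/p95 latency is just the 50th/95th percentile over per-query timings, e.g. via `statistics.quantiles`.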


Match prod

Align eval_final_k and eval_multi with production to avoid misleading results.
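A minimal pre-flight check sketch, assuming both configs can be read as plain dicts (the key names here are illustrative, not the real TriBridConfig fields):

```python
# Hypothetical config snapshots; real values would come from TriBridConfig
# and the production deployment.
prod = {"final_k": 5, "multi": True}
eval_cfg = {"eval_final_k": 5, "eval_multi": True}

# Collect any eval knobs that drift from production
mismatches = []
if eval_cfg["eval_final_k"] != prod["final_k"]:
    mismatches.append("final_k")
if eval_cfg["eval_multi"] != prod["multi"]:
    mismatches.append("multi")

assert not mismatches, f"eval/prod config drift: {mismatches}"
```

Running a check like this before each evaluation keeps reported metrics comparable to what production would actually serve.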

Compare Runs

Use /reranker/train/diff or evaluation comparison endpoints to see deltas and compatibility.
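Whichever endpoint returns the two runs, the deltas themselves are simple to compute. A sketch over two metric payloads (the metric names are examples, not a fixed schema):

```python
def metric_deltas(baseline, candidate):
    # Per-metric delta (candidate minus baseline) over the shared keys
    return {key: round(candidate[key] - baseline[key], 4)
            for key in baseline.keys() & candidate.keys()}

baseline = {"mrr": 0.62, "recall@5": 0.71, "p95_ms": 180.0}
candidate = {"mrr": 0.65, "recall@5": 0.74, "p95_ms": 210.0}
print(metric_deltas(baseline, candidate))
```

Note that a quality win (MRR, Recall@5 up) can arrive together with a latency regression (p95 up), which is exactly the trade-off the comparison endpoints are meant to surface.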

Small samples

Use small samples for iteration, but run full suites before shipping changes.
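For iteration, a deterministic subsample keeps runs fast and reproducible. A sketch with a fixed seed (the helper is hypothetical, not part of the eval API):

```python
import random

def sample_items(items, n, seed=42):
    # Deterministic subsample for quick iteration; always rerun the
    # full suite on `items` before shipping a change.
    rng = random.Random(seed)
    return rng.sample(items, min(n, len(items)))

dataset = [f"query-{i}" for i in range(500)]
subset = sample_items(dataset, 50)
print(len(subset))  # 50
```

Fixing the seed means successive config tweaks are measured on the same queries, so deltas reflect the change rather than sampling noise.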

Typical Flow

```mermaid
flowchart TB
    PREP["Prepare eval dataset"] --> RUN["Run evaluation"]
    RUN --> ANALYZE["Analyze metrics"]
    ANALYZE --> TUNE["Tune config"]
    TUNE --> RUN
```
```python
import httpx

base = "http://localhost:8000"

# Trigger an evaluation run for the given corpus
resp = httpx.post(f"{base}/reranker/evaluate", json={"corpus_id": "tribrid"})
print(resp.json())
```
```bash
BASE=http://localhost:8000
curl -sS -X POST "$BASE/reranker/evaluate" \
  -H 'Content-Type: application/json' \
  -d '{"corpus_id":"tribrid"}' | jq .
```
```javascript
// Trigger an evaluation and load the results for charting
const result = await (await fetch('/reranker/evaluate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ corpus_id: 'tribrid' }),
})).json();
```
| Knob | Where | Default |
| --- | --- | --- |
| `evaluation.eval_multi_m` | `TriBridConfig.evaluation` | 10 |
| `retrieval.eval_final_k` | `TriBridConfig.retrieval` | 5 |
| `retrieval.eval_multi` | `TriBridConfig.retrieval` | 1 (on) |
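These knobs can be mirrored in a small stand-in for experimentation. A sketch assuming a dataclass-like layout (the field names follow the knob table; the constructor shape is an assumption, so consult TriBridConfig itself):

```python
from dataclasses import dataclass

# Stand-ins for TriBridConfig's evaluation/retrieval sections;
# defaults mirror the knob table, the dataclass shape is assumed.
@dataclass
class EvaluationCfg:
    eval_multi_m: int = 10

@dataclass
class RetrievalCfg:
    eval_final_k: int = 5
    eval_multi: bool = True  # 1 (on)

# Override a knob to match a hypothetical production final_k
retrieval = RetrievalCfg(eval_final_k=10)
print(retrieval.eval_final_k, retrieval.eval_multi)  # 10 True
```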
Prompt analysis

Use system_prompts.eval_analysis to generate skeptical post-hoc analysis comparing two runs.