# Evaluation Guide

- **Goals**: Detect regressions, compare configs, and track latency.
- **Datasets**: Use `EvalDatasetItem` with expected file paths.
- **Metrics**: MRR, Recall@K, NDCG@10, p50/p95 latency.
- **Match prod**: Align `eval_final_k` and `eval_multi` with production to avoid misleading results.
- **Compare runs**: Use `/reranker/train/diff` or the evaluation comparison endpoints to see deltas and compatibility.
- **Small samples**: Use small samples for iteration, but run full suites before shipping changes.
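The ranking metrics listed above are straightforward to compute by hand. The sketch below assumes each query's results are a list of binary relevance flags in rank order — an illustrative format, not the project's actual evaluation schema:

```python
import math

def mrr(ranked_lists):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranking in ranked_lists:  # each ranking: 0/1 relevance flags in rank order
        for i, rel in enumerate(ranking, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranking, num_relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(ranking[:k]) / num_relevant

def ndcg_at_k(ranking, k):
    """NDCG@k with binary relevance: DCG divided by the ideal DCG."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ranking[:k], start=1))
    ideal = sorted(ranking, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

# Two queries: first relevant hit at rank 2, then at rank 1
print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
```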
## Typical Flow

```mermaid
flowchart TB
    PREP["Prepare eval dataset"] --> RUN["Run evaluation"]
    RUN --> ANALYZE["Analyze metrics"]
    ANALYZE --> TUNE["Tune config"]
    TUNE --> RUN
```

```python
import httpx

base = "http://localhost:8000"

# Trigger evaluation
print(httpx.post(f"{base}/reranker/evaluate", json={"corpus_id": "tribrid"}).json())
```
```bash
BASE=http://localhost:8000
curl -sS -X POST "$BASE/reranker/evaluate" \
  -H 'Content-Type: application/json' \
  -d '{"corpus_id":"tribrid"}' | jq .
```
```js
// Load eval results and render charts
const result = await (
  await fetch('/reranker/evaluate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ corpus_id: 'tribrid' }),
  })
).json();
```
| Knob | Where | Default |
|---|---|---|
| `evaluation.eval_multi_m` | `TriBridConfig.evaluation` | 10 |
| `retrieval.eval_final_k` | `TriBridConfig.retrieval` | 5 |
| `retrieval.eval_multi` | `TriBridConfig.retrieval` | 1 (on) |
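Because mismatched eval knobs are an easy way to get misleading numbers, it can help to check alignment with production before each run. The sketch below uses the knob names from the table; the plain-dict config shape and the `check_alignment` helper are illustrative assumptions, not the project's actual API:

```python
# Hypothetical production retrieval settings to mirror in evaluation.
prod = {"final_k": 5, "multi": 1}

eval_config = {
    "retrieval.eval_final_k": prod["final_k"],  # keep aligned with production
    "retrieval.eval_multi": prod["multi"],      # keep aligned with production
    "evaluation.eval_multi_m": 10,              # default from the table above
}

def check_alignment(config, prod):
    """Return the names of eval knobs that drift from production settings."""
    mismatches = []
    if config["retrieval.eval_final_k"] != prod["final_k"]:
        mismatches.append("eval_final_k")
    if config["retrieval.eval_multi"] != prod["multi"]:
        mismatches.append("eval_multi")
    return mismatches

print(check_alignment(eval_config, prod))  # []
```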
## Prompt analysis

Use `system_prompts.eval_analysis` to generate a skeptical post-hoc analysis comparing two runs.
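Before handing two runs to prompt-based analysis, a simple per-metric delta pass can flag candidate regressions. The metric dicts below are illustrative, not the API's actual response schema:

```python
def diff_runs(baseline, candidate, regression_tol=0.01):
    """Return per-metric deltas and flag drops larger than the tolerance."""
    deltas = {}
    regressions = []
    for metric, base_val in baseline.items():
        delta = candidate.get(metric, 0.0) - base_val
        deltas[metric] = round(delta, 4)
        if delta < -regression_tol:
            regressions.append(metric)
    return deltas, regressions

baseline = {"mrr": 0.72, "recall@5": 0.81, "ndcg@10": 0.69}
candidate = {"mrr": 0.74, "recall@5": 0.78, "ndcg@10": 0.70}
deltas, regressions = diff_runs(baseline, candidate)
print(regressions)  # ['recall@5']
```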