Bundles
A bundle is the durable artifact tree for one completed round. It is the product output reviewers and tools inspect.
Write path: src/searchbench-go/internal/adapters/bundle/fs

Models: src/searchbench-go/internal/pure/report, src/searchbench-go/internal/pure/score
First files to inspect
Open report.json, then report.txt. These are the canonical human/agent summaries (mode, freshness, pass/fail, failure counts, attempt aggregates). Detailed evidence lives in round-report.json and related files.
Example tree
Path: configs/rounds/local-ic-vs-jcodemunch/evidence/bundle/games/code-localization/rounds/round-001/
The same shape appears under {bundle-root}/games/code-localization/rounds/<round-id>/ after a repo-owned Buck run.
COMPLETE
report.json
report.txt
resolved-round.json
round-report.json
round-report.txt
evidence.pkl
objective.json
decision.json
metadata.json
continuation.json
continuation.pkl
policies/challenger_policy.py

| File | Role |
|---|---|
| COMPLETE | Marker that the round finished |
| report.json / report.txt | Canonical summary; inspect first |
| resolved-round.json | Fully resolved manifest + config snapshot |
| round-report.json / .txt | Detailed comparison report (evidence-level) |
| evidence.pkl | Evidence document for Pkl scoring |
| objective.json | Result of localization-objective.pkl |
| decision.json | PROMOTE_CHALLENGER / REVIEW / REJECT (or NO_DECISION for stability probes) |
| metadata.json | Bundle ids, hashes, provenance |
| continuation.json / .pkl | Survivor state for the next round manifest |
| policies/ | Staged challenger (and related) policy files |
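Before inspecting a bundle, a script can confirm that the files from the table above are present. The following is a minimal sketch and not part of the repository: the helper name `missing_bundle_files` is hypothetical, and treating every listed file as required is an assumption. Only the file names come from the table.

```python
from pathlib import Path

# File names taken from the table above. Treating all of them as required
# is an assumption of this sketch; some rounds may legitimately omit files.
EXPECTED_FILES = [
    "COMPLETE",
    "report.json",
    "report.txt",
    "resolved-round.json",
    "round-report.json",
    "round-report.txt",
    "evidence.pkl",
    "objective.json",
    "decision.json",
    "metadata.json",
    "continuation.json",
    "continuation.pkl",
]


def missing_bundle_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means every listed file exists. A missing COMPLETE marker is the most useful signal, since it suggests the round did not finish.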
evaluate_n aggregate bundle
Multi-attempt evaluation (buck2 run //configs/rounds/live-ic-vs-jcodemunch:evaluate_n) writes one aggregate checked-in bundle. Attempt count comes from the Buck target (evaluate_attempts); aggregation policy and provider determinism come from Pkl (evaluator.aggregation, evaluator.determinism).
report.json # human summary + attempt pass rates
aggregate-report.json # per-metric mean/median/stddev/best/worst
round-report.json # aggregate comparisons (not last attempt)
evidence.pkl # aggregate evidence (e.g. goldHop.challenger.mean)
objective.json # Pkl objective over aggregate evidence
metadata.json # evaluation / aggregation / determinism provenance
COMPLETE

Raw per-attempt bundles are optional local debug output under .searchbench/attempts/<round-id>/<run-id>/attempt-NNN/ (gitignored). They are not required in the published bundle.
Metric rows in aggregate-report.json include count, mean, median, stddev, best, worst, and optional per-attempt values when enabled in Pkl.
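To make the metric row fields concrete, here is a minimal sketch of how one row could be derived from per-attempt values. It is an illustration under stated assumptions, not the repository's aggregation code: the function name `summarize_metric` is hypothetical, stddev is taken as the population standard deviation, and "best" is assumed to mean the maximum unless the metric is marked lower-is-better. The real behavior is governed by evaluator.aggregation in Pkl.

```python
import statistics


def summarize_metric(values, higher_is_better=True):
    """Summarize per-attempt values for one metric.

    Assumptions of this sketch: stddev is the population standard deviation,
    and "best" is the max when higher_is_better is True (otherwise the min).
    """
    if not values:
        raise ValueError("need at least one attempt value")
    best, worst = max(values), min(values)
    if not higher_is_better:
        best, worst = worst, best
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stddev": statistics.pstdev(values),
        "best": best,
        "worst": worst,
    }
```

For example, `summarize_metric([1.0, 2.0, 3.0])` gives a count of 3, a mean and median of 2.0, and best/worst of 3.0 and 1.0.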
Short excerpts
decision.json:
{
  "decision": "PROMOTE_CHALLENGER",
  "reason": "challenger improves the composite score in local fake comparison"
}

continuation.json (start):
{
  "schema_version": "searchbench.continuation.v1",
  "bundle_id": "round-001",
  "game": { "id": "code-localization", "kind": "code_localization" }
}

Next round input: configs/rounds/optimize-ic/round.pkl amends continuation.pkl from this bundle.
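A tool that consumes decision.json may want to reject unexpected values early. The sketch below parses the excerpt shown above and checks the decision against the values this page lists. The helper name `parse_decision` is hypothetical, and treating "reason" as optional is an assumption.

```python
import json

# Decision values listed on this page for decision.json.
KNOWN_DECISIONS = {"PROMOTE_CHALLENGER", "REVIEW", "REJECT", "NO_DECISION"}


def parse_decision(text):
    """Parse decision.json text and return (decision, reason).

    Raises ValueError if the decision is not one of the known values.
    """
    doc = json.loads(text)
    decision = doc.get("decision")
    if decision not in KNOWN_DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")
    # "reason" is treated as optional here; that is an assumption.
    return decision, doc.get("reason", "")
```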
Round 002 bundle
Optimizer continuation example:
configs/rounds/optimize-ic/evidence/bundle/games/code-localization/rounds/round-002/
This bundle adds, for example, policies/next_challenger_policy.round-002.py, produced by game.fakeOptimizer().
Optimizer rounds may also include attempts/attempt-NNN-prompt.txt and attempts/attempt-NNN-result.json for policy-generation retries (separate from live evaluate_n attempt trees).
Observability (observability/)
Optional artifacts emitted alongside canonical bundle files:
| File | Enable | Purpose |
|---|---|---|
| observability/eino-evaluator-<role>-<run-id>.ndjson | SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 | NDJSON mirror of Eino callbacks.Handler timings (on_start / on_end / on_error) for offline debugging next to stdout logs. |
Structured evaluator diagnostics also emit as evaluator.eino.callback events on the round StructuredLog stream (typically stderr JSON alongside provider traces), keyed by match_id / run_id without any hosted tracing dependency.
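For offline debugging, the NDJSON trace can be summarized with a few lines of code. This is a rough sketch only: the record schema is not documented here, so the field name "event" is a hypothetical placeholder for whichever key carries on_start / on_end / on_error in a real trace file, and `count_trace_events` is an illustrative name.

```python
import json
from collections import Counter


def count_trace_events(lines, key="event"):
    """Count NDJSON records by the value stored under `key`.

    The key name "event" is hypothetical; inspect a real trace file to find
    the field that carries on_start / on_end / on_error.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    return counts
```

Passing an open file object works, since iterating over it yields one line at a time. A count of on_start records that exceeds on_end plus on_error would hint at callbacks that never completed.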
Immutability
Bundles are not rewritten after completion. New rounds get new directories and ids.
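Tooling that writes bundles can enforce this rule with a simple guard on the COMPLETE marker. The following is a sketch of that idea, not the repository's write path: the function name `ensure_writable_round_dir` is hypothetical, and using the marker as the only completion signal is an assumption.

```python
from pathlib import Path


def ensure_writable_round_dir(round_dir):
    """Create round_dir for a new round, refusing to reuse a completed one."""
    root = Path(round_dir)
    # A COMPLETE marker means the round finished, so the tree must not change.
    if (root / "COMPLETE").exists():
        raise FileExistsError(f"{root} is a completed bundle; use a new round id")
    root.mkdir(parents=True, exist_ok=True)
    return root
```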