Bundles

A bundle is the durable artifact tree for one completed round. It is the product output reviewers and tools inspect.

Write path: src/searchbench-go/internal/adapters/bundle/fs

Models: src/searchbench-go/internal/pure/report, src/searchbench-go/internal/pure/score

First files to inspect

Open report.json, then report.txt. These are the canonical human/agent summaries (mode, freshness, pass/fail, failure counts, attempt aggregates). Detailed evidence lives in round-report.json and related files.
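A minimal sketch of that reading order, assuming only that both files sit directly in the round directory. The helper name `load_summary` is hypothetical, and no particular keys inside report.json are assumed:

```python
import json
from pathlib import Path


def load_summary(round_dir):
    """Return (parsed report.json, raw report.txt) for one round directory."""
    root = Path(round_dir)
    # report.json is the machine-readable summary; read it first.
    summary = json.loads((root / "report.json").read_text(encoding="utf-8"))
    # report.txt is the plain-text rendering of the same summary.
    text = (root / "report.txt").read_text(encoding="utf-8")
    return summary, text
```

Only after these two files look plausible is it worth opening round-report.json for the evidence-level detail.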

Example tree

Path: configs/rounds/local-ic-vs-jcodemunch/evidence/bundle/games/code-localization/rounds/round-001/

The same shape appears under {bundle-root}/games/code-localization/rounds/<round-id>/ after a repo-owned Buck run.

```text
COMPLETE
report.json
report.txt
resolved-round.json
round-report.json
round-report.txt
evidence.pkl
objective.json
decision.json
metadata.json
continuation.json
continuation.pkl
policies/challenger_policy.py
```

| File | Role |
| --- | --- |
| `COMPLETE` | Marker that the round finished |
| `report.json` / `report.txt` | Canonical summary; inspect first |
| `resolved-round.json` | Fully resolved manifest + config snapshot |
| `round-report.json` / `.txt` | Detailed comparison report (evidence-level) |
| `evidence.pkl` | Evidence document for Pkl scoring |
| `objective.json` | Result of `localization-objective.pkl` |
| `decision.json` | `PROMOTE_CHALLENGER` / `REVIEW` / `REJECT` (or `NO_DECISION` for stability probes) |
| `metadata.json` | Bundle ids, hashes, provenance |
| `continuation.json` / `.pkl` | Survivor state for the next round manifest |
| `policies/` | Staged challenger (and related) policy files |
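A tool that consumes a bundle can check the layout before trusting it. The sketch below is hypothetical (`inspect_bundle` is not part of SearchBench) and assumes every file in the table is required for a single-round bundle, which may not hold for every round kind:

```python
from pathlib import Path

# File names copied from the table above. Treating all of them as required
# is an assumption made for this sketch.
EXPECTED_FILES = [
    "report.json", "report.txt", "resolved-round.json", "round-report.json",
    "round-report.txt", "evidence.pkl", "objective.json", "decision.json",
    "metadata.json", "continuation.json", "continuation.pkl",
]


def inspect_bundle(round_dir):
    """Return (is_complete, missing_files) for one round directory."""
    root = Path(round_dir)
    # The COMPLETE marker signals that the round finished writing.
    is_complete = (root / "COMPLETE").is_file()
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return is_complete, missing
```

A directory without `COMPLETE` is best treated as an unfinished round, whatever other files it contains.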

`evaluate_n` aggregate bundle

Multi-attempt evaluation (buck2 run //configs/rounds/live-ic-vs-jcodemunch:evaluate_n) writes one aggregate checked-in bundle. Attempt count comes from the Buck target (evaluate_attempts); aggregation policy and provider determinism come from Pkl (evaluator.aggregation, evaluator.determinism).

```text
report.json              # human summary + attempt pass rates
aggregate-report.json    # per-metric mean/median/stddev/best/worst
round-report.json        # aggregate comparisons (not last attempt)
evidence.pkl             # aggregate evidence (e.g. goldHop.challenger.mean)
objective.json           # Pkl objective over aggregate evidence
metadata.json            # evaluation / aggregation / determinism provenance
COMPLETE
```

Raw per-attempt bundles are optional local debug output under .searchbench/attempts/<round-id>/<run-id>/attempt-NNN/ (gitignored). They are not required in the published bundle.
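To locate one attempt's debug output, the path template can be filled in as below. This is a sketch: `attempt_dir` is a hypothetical helper, and it assumes `NNN` is a three-digit zero-padded attempt number. Whether numbering starts at 0 or 1 is not stated here.

```python
from pathlib import Path


def attempt_dir(round_id, run_id, attempt, base=".searchbench/attempts"):
    """Build <base>/<round-id>/<run-id>/attempt-NNN for one attempt."""
    if attempt < 0:
        raise ValueError("attempt number must be non-negative")
    # Assumption: NNN is zero-padded to three digits.
    return Path(base) / round_id / run_id / f"attempt-{attempt:03d}"
```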

Metric rows in aggregate-report.json include count, mean, median, stddev, best, worst, and optional per-attempt values when enabled in Pkl.
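As a rough illustration of how one such row can be derived from per-attempt values, consider the sketch below. It is not the SearchBench implementation: the population standard deviation and the `higher_is_better` switch are assumptions, since neither the stddev variant nor the direction of each metric is stated here.

```python
import statistics


def aggregate_metric(values, higher_is_better=True):
    """Summarize per-attempt values for one metric as an aggregate row."""
    if not values:
        raise ValueError("need at least one attempt value")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Assumption: population stddev; a sample stddev would differ.
        "stddev": statistics.pstdev(values),
        "best": max(values) if higher_is_better else min(values),
        "worst": min(values) if higher_is_better else max(values),
        # Per-attempt values are optional in the real report.
        "values": list(values),
    }
```

For a lower-is-better metric such as a hop count, passing `higher_is_better=False` swaps which extreme is reported as `best`.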

Short excerpts

decision.json:

```json
{
  "decision": "PROMOTE_CHALLENGER",
  "reason": "challenger improves the composite score in local fake comparison"
}
```
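A consumer can validate the decision value against the four values listed in the file table. The helper below is a hypothetical sketch and assumes only the two keys shown in the excerpt:

```python
import json

# Values listed for decision.json in the file table.
KNOWN_DECISIONS = {"PROMOTE_CHALLENGER", "REVIEW", "REJECT", "NO_DECISION"}


def parse_decision(raw):
    """Return (decision, reason) from the text of a decision.json file."""
    doc = json.loads(raw)
    decision = doc.get("decision")
    if decision not in KNOWN_DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")
    # Treating "reason" as optional is an assumption of this sketch.
    return decision, doc.get("reason", "")
```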

continuation.json (start):

```json
{
  "schema_version": "searchbench.continuation.v1",
  "bundle_id": "round-001",
  "game": { "id": "code-localization", "kind": "code_localization" }
}
```

Next round input: configs/rounds/optimize-ic/round.pkl amends continuation.pkl from this bundle.

Round 002 bundle

Optimizer continuation example:

configs/rounds/optimize-ic/evidence/bundle/games/code-localization/rounds/round-002/

Compared with round 001, this bundle adds files such as policies/next_challenger_policy.round-002.py, produced by game.fakeOptimizer().

Optimizer rounds may also include attempts/attempt-NNN-prompt.txt and attempts/attempt-NNN-result.json for policy-generation retries (separate from live evaluate_n attempt trees).

Observability (`observability/`)

Optional artifacts emitted alongside canonical bundle files:

| File | Enable | Purpose |
| --- | --- | --- |
| `observability/eino-evaluator-<role>-<run-id>.ndjson` | `SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1` | NDJSON mirror of Eino `callbacks.Handler` timings (`on_start` / `on_end` / `on_error`) for offline debugging next to stdout logs. |

Structured evaluator diagnostics also emit as evaluator.eino.callback events on the round StructuredLog stream (typically stderr JSON alongside provider traces), keyed by match_id / run_id without any hosted tracing dependency.
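For offline inspection, an NDJSON trace can be read one JSON object per line. The sketch below is generic: the record schema of the trace file is not documented here, so the `phase` key in the sample is made up for illustration and the grouping key is left as a parameter.

```python
import json
from collections import Counter


def read_ndjson(text):
    """Parse NDJSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by(records, key):
    """Count records by the value under `key` (missing keys count as None)."""
    return Counter(record.get(key) for record in records)


# Made-up sample; the real field names in the trace file may differ.
sample = '{"phase": "on_start"}\n{"phase": "on_end"}\n{"phase": "on_start"}\n'
counts = count_by(read_ndjson(sample), "phase")
```

Comparing the start and end counts is a quick way to spot callbacks that never finished, provided the real records carry a comparable field.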

Immutability

Bundles are not rewritten after completion. New rounds get new directories and ids.
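One way a writer could enforce this rule is to refuse any directory that already carries the `COMPLETE` marker. The guard below is a hypothetical sketch, not the behavior of the actual bundle writer under src/searchbench-go:

```python
from pathlib import Path


def new_round_dir(rounds_root, round_id):
    """Create and return a round directory, refusing completed bundles."""
    target = Path(rounds_root) / round_id
    if (target / "COMPLETE").exists():
        # A completed bundle is never rewritten; the caller needs a new id.
        raise FileExistsError(f"{target} is a completed bundle")
    target.mkdir(parents=True, exist_ok=True)
    return target
```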
