Bundles

A bundle is the durable artifact tree for one completed round. It is the product output that reviewers and tools inspect.

Write path: src/searchbench-go/internal/adapters/bundle/fs
Models: src/searchbench-go/internal/pure/report, src/searchbench-go/internal/pure/score

First files to inspect

Open report.json, then report.txt. These are the canonical human/agent summaries (mode, freshness, pass/fail, failure counts, attempt aggregates). Detailed evidence lives in round-report.json and related files.

Example tree

Path: configs/rounds/local-ic-vs-jcodemunch/evidence/bundle/games/code-localization/rounds/round-001/

The same shape appears under {bundle-root}/games/code-localization/rounds/<round-id>/ after a repo-owned Buck run.
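The path shape above can be sketched as a small helper. This is an illustration only: the function name `round_dir` and the idea of passing the game id as a parameter are assumptions, not part of the searchbench codebase.

```python
from pathlib import PurePosixPath


def round_dir(bundle_root: str, round_id: str,
              game: str = "code-localization") -> PurePosixPath:
    """Build <bundle_root>/games/<game>/rounds/<round_id>.

    Hypothetical helper: only the path layout comes from the docs.
    """
    return PurePosixPath(bundle_root) / "games" / game / "rounds" / round_id


example = round_dir(
    "configs/rounds/local-ic-vs-jcodemunch/evidence/bundle", "round-001"
)
print(example)
```

Using `PurePosixPath` keeps the result identical on every platform, which is convenient when the path is compared against checked-in locations.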

COMPLETE
report.json
report.txt
resolved-round.json
round-report.json
round-report.txt
evidence.pkl
objective.json
decision.json
metadata.json
continuation.json
continuation.pkl
policies/challenger_policy.py

| File | Role |
| --- | --- |
| COMPLETE | Marker that the round finished |
| report.json / report.txt | Canonical summary (inspect first) |
| resolved-round.json | Fully resolved manifest + config snapshot |
| round-report.json / .txt | Detailed comparison report (evidence-level) |
| evidence.pkl | Evidence document for Pkl scoring |
| objective.json | Result of localization-objective.pkl |
| decision.json | PROMOTE_CHALLENGER / REVIEW / REJECT (or NO_DECISION for stability probes) |
| metadata.json | Bundle ids, hashes, provenance |
| continuation.json / .pkl | Survivor state for the next round manifest |
| policies/ | Staged challenger (and related) policy files |
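A tool that consumes bundles might first check which of the listed files are present. The sketch below assumes every top-level file in the example tree is required, which this page does not state as a contract. `EXPECTED_FILES` and `missing_files` are hypothetical names.

```python
from pathlib import Path

# File names copied from the example tree. Treating all of them as
# mandatory is an assumption of this sketch.
EXPECTED_FILES = (
    "COMPLETE",
    "report.json",
    "report.txt",
    "resolved-round.json",
    "round-report.json",
    "round-report.txt",
    "evidence.pkl",
    "objective.json",
    "decision.json",
    "metadata.json",
    "continuation.json",
    "continuation.pkl",
)


def missing_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means every file from the example tree is present. The contents of policies/ are left out because they vary by round.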

evaluate_n aggregate bundle

Multi-attempt evaluation (buck2 run //configs/rounds/live-ic-vs-jcodemunch:evaluate_n) writes one aggregate checked-in bundle. Attempt count comes from the Buck target (evaluate_attempts); aggregation policy and provider determinism come from Pkl (evaluator.aggregation, evaluator.determinism).

report.json              # human summary + attempt pass rates
aggregate-report.json    # per-metric mean/median/stddev/best/worst
round-report.json        # aggregate comparisons (not last attempt)
evidence.pkl             # aggregate evidence (e.g. goldHop.challenger.mean)
objective.json           # Pkl objective over aggregate evidence
metadata.json            # evaluation / aggregation / determinism provenance
COMPLETE

Raw per-attempt bundles are optional local debug output under .searchbench/attempts/<round-id>/<run-id>/attempt-NNN/ (gitignored). They are not required in the published bundle.

Metric rows in aggregate-report.json include count, mean, median, stddev, best, worst, and optional per-attempt values when enabled in Pkl.
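To make the row fields concrete, here is a minimal sketch of how such a row could be derived from per-attempt values. It is not the project's aggregator: the function name `aggregate`, the use of population standard deviation, and the `higher_is_better` flag that decides best/worst are all assumptions.

```python
import statistics


def aggregate(values, higher_is_better=True):
    """Summarize per-attempt metric values into one row.

    Assumptions: stddev is the population standard deviation, and
    best/worst depend on a caller-supplied direction flag.
    """
    if not values:
        raise ValueError("need at least one attempt value")
    lo, hi = min(values), max(values)
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stddev": statistics.pstdev(values),
        "best": hi if higher_is_better else lo,
        "worst": lo if higher_is_better else hi,
    }
```

For a metric where smaller values are preferable, pass `higher_is_better=False` so that `best` reports the minimum.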

Short excerpts

decision.json:

{
  "decision": "PROMOTE_CHALLENGER",
  "reason": "challenger improves the composite score in local fake comparison"
}
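A consumer can validate the decision value against the set listed for decision.json. The sketch below assumes only the two keys shown in the excerpt. `parse_decision` and `KNOWN_DECISIONS` are hypothetical names.

```python
import json

# Values listed for decision.json on this page.
KNOWN_DECISIONS = {"PROMOTE_CHALLENGER", "REVIEW", "REJECT", "NO_DECISION"}


def parse_decision(text):
    """Return (decision, reason) from decision.json content.

    Raises ValueError when the decision is not a known value.
    """
    doc = json.loads(text)
    decision = doc.get("decision")
    if decision not in KNOWN_DECISIONS:
        raise ValueError(f"unexpected decision value: {decision!r}")
    # "reason" is treated as optional here; that is an assumption.
    return decision, doc.get("reason", "")
```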

continuation.json (start):

{
  "schema_version": "searchbench.continuation.v1",
  "bundle_id": "round-001",
  "game": { "id": "code-localization", "kind": "code_localization" }
}

Next round input: configs/rounds/optimize-ic/round.pkl amends continuation.pkl from this bundle.
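Before using a continuation, a reader may want to gate on the schema version. The following sketch relies only on the three fields shown in the excerpt; anything past the start of the file is unknown here. `read_continuation_header` and `SUPPORTED_SCHEMA` are hypothetical names.

```python
import json

SUPPORTED_SCHEMA = "searchbench.continuation.v1"


def read_continuation_header(text):
    """Extract bundle id and game info after checking schema_version."""
    doc = json.loads(text)
    version = doc.get("schema_version")
    if version != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported continuation schema: {version!r}")
    game = doc.get("game") or {}
    return {
        "bundle_id": doc.get("bundle_id"),
        "game_id": game.get("id"),
        "game_kind": game.get("kind"),
    }
```

Rejecting unknown versions early avoids misreading a file whose layout may have changed.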

Round 002 bundle

Optimizer continuation example:

configs/rounds/optimize-ic/evidence/bundle/games/code-localization/rounds/round-002/

This bundle adds, for example, policies/next_challenger_policy.round-002.py, which comes from game.fakeOptimizer().

Optimizer rounds may also include attempts/attempt-NNN-prompt.txt and attempts/attempt-NNN-result.json for policy-generation retries (separate from live evaluate_n attempt trees).

Observability (observability/)

Optional artifacts emitted alongside canonical bundle files:

| File | Enable | Purpose |
| --- | --- | --- |
| observability/eino-evaluator-<role>-<run-id>.ndjson | SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 | NDJSON mirror of Eino callbacks.Handler timings (on_start / on_end / on_error) for offline debugging next to stdout logs. |

Structured evaluator diagnostics are also emitted as evaluator.eino.callback events on the round StructuredLog stream (typically stderr JSON alongside provider traces), keyed by match_id / run_id, with no hosted tracing dependency.
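For offline debugging, an NDJSON trace can be read one line at a time. The record layout is not documented on this page, so the `event` key used below is purely an assumption, as is the function name `count_callback_events`. Adjust the key to whatever the trace actually contains.

```python
import json
from collections import Counter


def count_callback_events(lines, key="event"):
    """Count NDJSON records by the value stored under `key`.

    `key` defaults to a hypothetical field name; records that lack
    it are counted under "<missing>".
    """
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between records
        record = json.loads(raw)
        counts[record.get(key, "<missing>")] += 1
    return counts
```

An open file object can be passed directly as `lines`, since iterating a text file yields one line per record.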

Immutability

Bundles are not rewritten after completion. New rounds get new directories and ids.
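One way a writer could respect this rule is to refuse to reuse an existing round directory. This is a sketch of the idea, not the behavior of the actual write path. `create_round_dir` is a hypothetical name.

```python
from pathlib import Path


def create_round_dir(rounds_root, round_id):
    """Create a fresh directory for round_id under rounds_root.

    exist_ok=False makes mkdir raise FileExistsError if the round
    directory already exists, so an earlier bundle is never reused.
    """
    target = Path(rounds_root) / round_id
    target.mkdir(parents=True, exist_ok=False)
    return target
```

Calling it twice with the same round id fails on the second call, which forces the caller to pick a new id.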