Live MCP evaluation (IC vs jCodeMunch)
Real end-to-end round with a Cerebras evaluator, the jCodeMunch incumbent, and the Iterative Context challenger.
Evidence root (SearchBenchRelease.artifact_root / Buck targets): configs/rounds/live-ic-vs-jcodemunch/evidence/ (matrix + cost siblings under this tree).
Published bundle: configs/rounds/live-ic-vs-jcodemunch/evidence/bundle/games/code-localization/rounds/live-ic-vs-jcodemunch-001/
Not in //:check — live targets require secrets and network.
Buck-only interface
Repo-owned live work uses config-local Buck targets only. See run-entrypoints.md and configs/rounds/live-ic-vs-jcodemunch/README.md.
# Deterministic (no network)
buck2 test //configs/rounds/live-ic-vs-jcodemunch:validate
buck2 test //configs/rounds/live-ic-vs-jcodemunch:validate_bundle
# Dataset export only
buck2 run //configs/rounds/live-ic-vs-jcodemunch:materialize_dataset
# Synthetic/local wiring smoke (MCP + Cerebras + bundle; may use local README row)
buck2 test //configs/rounds/live-ic-vs-jcodemunch:live_smoke
# Real LCA benchmark proof (HF row + materialization + strict graph + provenance)
SEARCHBENCH_RUN_LIVE_E2E=1 buck2 test //configs/rounds/live-ic-vs-jcodemunch:real_lca_smoke
# Repeated live evaluation
buck2 run //configs/rounds/live-ic-vs-jcodemunch:evaluate_n
buck2 run //configs/rounds/live-ic-vs-jcodemunch:stability_probe
Inspect report.json first in the published bundle directory.
Truth model
| Mode | Command | Freshness | Proves |
|---|---|---|---|
| validate_bundle | buck2 test | archive | A checked-in completed bundle still validates (no MCP, no model) |
| live_smoke | buck2 test | fresh_live_run | MCP startup, Cerebras call, bundle write, report validation; not real LCA benchmark proof |
| real_lca_smoke | buck2 test | fresh_live_run | HF LCA row → JSONL → materialized repo at base_sha → strict tree-sitter graph → LocalizationScorer → aligned objective/evidence |
| evaluate_n | buck2 run | fresh_live_run | N stochastic attempts (count from Buck evaluate_attempts) aggregated into one top-level bundle; objective uses aggregate evidence (.mean) |
| evaluate_matrix_10 / _50 | buck2 run | fresh_live_run | Many LCA rows with per-match bundles under <evidence root>/matrix/matrix-001/match-<id>/attempt-001/... and matrix aggregate-report.json |
| preflight_matrix_* | buck2 run | n/a | BXL cost_plan + Go estimate_cost → <evidence root>/cost/cost-report.txt (no model calls) |
| estimate_cost | buck2 run | n/a | Token/USD estimate from searchbench.cost_plan.v1 JSON; exits non-zero on budget BLOCK |
| approve_cost | buck2 run | n/a | Records human approval in evidence/cost/approval.json (required before parallel/live execute targets) |
| stability_probe | buck2 run | fresh_live_run | Repeated same-input attempts; variance metrics only (decision = no promotion) |
| materialize_dataset | buck2 run | n/a | Export LCA JSONL only (no round execution) |
Deterministic replay is not live proof. Promotion decisions belong in evaluate_n consolidated reports, not in smoke paths.
Historical cost grounding (estimate_cost)
Preflight merges evidence roots from historical_sources, artifact_root, <artifact_root>/bundle, and bundle_path. Roots are deduplicated by on-disk path, then nested roots are pruned so that artifact_root and artifact_root/bundle do not double-count the same files. Preflight then walks **/round-report.json under each remaining root and sums execution.usage.cost_usd per round when allowed. To qualify, a parent report.json must classify the run as a live mode (evaluate_matrix, evaluate_n, live_smoke, real_lca_smoke, stability_probe, or round_run) with freshness fresh_live_run or fresh, not archive. The estimate scales historical billed USD to the planned matrix matches and attempts_per_match, using the sampled round's spec.matches length (or resolved run lengths) and the canonical attempts.count when present. If no qualifying billed samples exist, the tool falls back to token statistics plus optional pricing.pkl from cost_plan.pricing_config.path.
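The root handling and scaling described above can be sketched as follows. This is a minimal illustration, not the Go estimate_cost implementation: the helper names (prune_nested_roots, sum_billed_usd, scale_estimate) are hypothetical, the linear per-match, per-attempt scaling is one plausible reading of the rule, and the mode/freshness check against the parent report.json is left out.

```python
import json
import tempfile
from pathlib import Path


def prune_nested_roots(roots):
    """Dedupe roots on resolved paths, then drop roots nested inside a kept root."""
    unique = sorted({Path(r).resolve() for r in roots}, key=lambda p: len(p.parts))
    kept = []
    for root in unique:  # shortest paths first, so parents are seen before children
        if not any(root.is_relative_to(parent) for parent in kept):
            kept.append(root)
    return kept


def sum_billed_usd(roots):
    """Return (total_usd, sample_count) over every round-report.json under the roots."""
    total, samples = 0.0, 0
    for root in prune_nested_roots(roots):
        for report in sorted(root.rglob("round-report.json")):
            data = json.loads(report.read_text())
            cost = data.get("execution", {}).get("usage", {}).get("cost_usd")
            if cost is not None:
                total += float(cost)
                samples += 1
    return total, samples


def scale_estimate(billed_usd, sampled_matches, sampled_attempts,
                   planned_matches, attempts_per_match):
    """Assumed linear scaling: USD per (match, attempt) times the planned volume."""
    unit_cost = billed_usd / (sampled_matches * sampled_attempts)
    return unit_cost * planned_matches * attempts_per_match


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        artifact_root = Path(tmp) / "evidence"
        bundle = artifact_root / "bundle"
        bundle.mkdir(parents=True)
        (bundle / "round-report.json").write_text(
            json.dumps({"execution": {"usage": {"cost_usd": 0.25}}})
        )
        # artifact_root and artifact_root/bundle overlap; the file is counted once.
        return sum_billed_usd([artifact_root, bundle, artifact_root])
```

Sorting by path depth before pruning guarantees a parent root is considered before any of its children, which is what keeps the overlapping artifact_root and artifact_root/bundle entries from contributing the same report twice.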
After preflight, run approve_cost so evidence/cost/approval.json matches the digest of cost-estimate.json — parallel matrix and live execute targets refuse to start otherwise (tests-only escape: SEARCHBENCH_SKIP_COST_SPEND_APPROVAL=1).
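A digest gate of this kind could look like the sketch below. It is an assumption-laden illustration only: the approval.json field name estimate_digest and the choice of SHA-256 are hypothetical, since the text does not specify the approval schema or hash. The environment variable is the tests-only escape named above.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def spend_approved(cost_dir, env=None):
    """Return True when approval.json records the digest of cost-estimate.json."""
    env = os.environ if env is None else env
    if env.get("SEARCHBENCH_SKIP_COST_SPEND_APPROVAL") == "1":
        return True  # tests-only escape named in the text
    estimate = Path(cost_dir) / "cost-estimate.json"
    approval = Path(cost_dir) / "approval.json"
    if not (estimate.is_file() and approval.is_file()):
        return False
    digest = hashlib.sha256(estimate.read_bytes()).hexdigest()  # SHA-256 is an assumption
    recorded = json.loads(approval.read_text()).get("estimate_digest")  # hypothetical field
    return recorded == digest


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cost_dir = Path(tmp)
        estimate = cost_dir / "cost-estimate.json"
        estimate.write_text('{"estimated_usd": 1.5}')
        digest = hashlib.sha256(estimate.read_bytes()).hexdigest()
        (cost_dir / "approval.json").write_text(json.dumps({"estimate_digest": digest}))
        fresh = spend_approved(cost_dir, env={})
        estimate.write_text('{"estimated_usd": 9.0}')  # estimate changed after approval
        stale = spend_approved(cost_dir, env={})
        skipped = spend_approved(
            cost_dir, env={"SEARCHBENCH_SKIP_COST_SPEND_APPROVAL": "1"}
        )
        return fresh, stale, skipped
```

Binding the approval to a digest rather than to a timestamp means any later change to the estimate invalidates the approval, so a re-run of preflight requires a new approve_cost.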
Provenance artifacts
Real-LCA and live modes record audit fields in report.json, metadata.json, and evidence.pkl:
- dataset: kind, HF config/split, source (huggingface or huggingface_reuse)
- materialization: per-match base_sha, head_sha (must match), repo_root, cache dir
- scoring: localization_scorer, tree_sitter, fake_scorer_used, graph_fallback_used
real_lca_smoke rejects the synthetic becker63/searchbench-go + README.md local row.
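As a rough sketch of how these audit fields might be checked, the function below lists reasons a run would not count as real LCA proof. Only the synthetic-row rejection is stated in the text. Treating fake_scorer_used and graph_fallback_used as disqualifying is an assumption, and the row keys repo and path are hypothetical names.

```python
SYNTHETIC_REPO = "becker63/searchbench-go"
SYNTHETIC_PATH = "README.md"


def real_lca_violations(row, scoring):
    """List reasons a run would not count as real LCA proof (illustrative rules)."""
    problems = []
    # Documented rule: the synthetic local row is rejected.
    if row.get("repo") == SYNTHETIC_REPO and row.get("path") == SYNTHETIC_PATH:
        problems.append("synthetic local row")
    # Assumed rules: a fake scorer or a graph fallback weakens the proof.
    if scoring.get("fake_scorer_used"):
        problems.append("fake scorer used")
    if scoring.get("graph_fallback_used"):
        problems.append("graph fallback used")
    return problems
```

An empty list means none of the illustrative checks fired; it does not by itself establish that the run satisfies the real target's validation.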
Secrets
Only secrets belong in repo-root .env:
CEREBRAS_API_KEY=...
HF_TOKEN=... # optional; Hugging Face dataset export
Non-secret defaults (manifest path, artifact root, materialize cache, MCP launchers, LCA export skip/max) come from Buck targets and internal/pure/liveconfig.
Reuse an existing exported JSONL without re-downloading:
SEARCHBENCH_SKIP_HF_EXPORT=1 buck2 test //configs/rounds/live-ic-vs-jcodemunch:real_lca_smoke
Workspace seed
The live challenger uses buck_descriptor → //src/iterative-context:optimizable_backend in round.pkl (repo-owned default). Do not use local_path for repo-owned live/eval configs.
Bundle interface
Each completed bundle includes canonical report.json and report.txt — inspect these first — plus round-report.json, evidence, objective, and COMPLETE.
For evaluate_n, raw per-attempt bundles are written under .searchbench/attempts/<round-id>/<run-id>/ (gitignored). The checked-in bundle is aggregate-only (aggregate-report.json, aggregate evidence.pkl, objective from means). See bundles.md.
evaluator.retry.maxAttempts retries failed evaluator executions inside one attempt. evaluate_n intentionally runs multiple full attempts; evaluator.aggregation and evaluator.determinism are configured in Pkl.
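The difference between retries inside one attempt and repeated full attempts can be sketched as two nested loops. This is an illustrative model only: run_attempt, evaluate_n, and the RuntimeError stand-in for an evaluator failure are hypothetical, and the real aggregation is whatever evaluator.aggregation configures in Pkl. The mean is used here because the table above says the objective uses aggregate evidence (.mean).

```python
from statistics import fmean


def run_attempt(evaluate, max_attempts):
    """One attempt: retry a failing evaluator call up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return evaluate()
        except RuntimeError as err:  # stand-in for an evaluator execution failure
            last_error = err
    raise RuntimeError("evaluator failed after retries") from last_error


def evaluate_n(evaluate, attempts, max_attempts):
    """N full attempts; the aggregate is the mean of per-attempt scores."""
    scores = [run_attempt(evaluate, max_attempts) for _ in range(attempts)]
    return {"scores": scores, "mean": fmean(scores)}


def demo():
    calls = {"count": 0}
    outcomes = iter([RuntimeError("transient"), 0.5, 1.0])

    def flaky_evaluator():
        calls["count"] += 1
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    # Attempt 1 fails once and succeeds on retry; attempt 2 succeeds immediately.
    report = evaluate_n(flaky_evaluator, attempts=2, max_attempts=2)
    return report, calls["count"]
```

In this model a retried failure never adds a score: only the two completed attempts contribute to the mean, even though the evaluator was called three times.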