Live MCP evaluation (IC vs jCodeMunch)

Real end-to-end round with Cerebras evaluator, jCodeMunch incumbent, and Iterative Context challenger.

Evidence root (SearchBenchRelease.artifact_root / Buck targets): configs/rounds/live-ic-vs-jcodemunch/evidence/ (matrix + cost siblings under this tree).

Published bundle: configs/rounds/live-ic-vs-jcodemunch/evidence/bundle/games/code-localization/rounds/live-ic-vs-jcodemunch-001/

These targets are not part of //:check, because live targets require secrets and network access.

Buck-only interface

Repo-owned live work uses config-local Buck targets only. See run-entrypoints.md and configs/rounds/live-ic-vs-jcodemunch/README.md.

```bash
# Deterministic (no network)
buck2 test //configs/rounds/live-ic-vs-jcodemunch:validate
buck2 test //configs/rounds/live-ic-vs-jcodemunch:validate_bundle

# Dataset export only
buck2 run //configs/rounds/live-ic-vs-jcodemunch:materialize_dataset

# Synthetic/local wiring smoke (MCP + Cerebras + bundle; may use local README row)
buck2 test //configs/rounds/live-ic-vs-jcodemunch:live_smoke

# Real LCA benchmark proof (HF row + materialization + strict graph + provenance)
SEARCHBENCH_RUN_LIVE_E2E=1 buck2 test //configs/rounds/live-ic-vs-jcodemunch:real_lca_smoke

# Repeated live evaluation
buck2 run //configs/rounds/live-ic-vs-jcodemunch:evaluate_n
buck2 run //configs/rounds/live-ic-vs-jcodemunch:stability_probe
```

Inspect report.json first in the published bundle directory.

Truth model

| Mode | Command | Freshness | Proves |
| --- | --- | --- | --- |
| validate_bundle | buck2 test | archive | A checked-in completed bundle still validates (no MCP, no model) |
| live_smoke | buck2 test | fresh_live_run | MCP startup, Cerebras call, bundle write, report validation; not real LCA benchmark proof |
| real_lca_smoke | buck2 test | fresh_live_run | HF LCA row → JSONL → materialized repo at base_sha → strict tree-sitter graph → LocalizationScorer → aligned objective/evidence |
| evaluate_n | buck2 run | fresh_live_run | N stochastic attempts (count from Buck evaluate_attempts) aggregated into one top-level bundle; objective uses aggregate evidence (.mean) |
| evaluate_matrix_10 / _50 | buck2 run | fresh_live_run | Many LCA rows with per-match bundles under <evidence root>/matrix/matrix-001/match-<id>/attempt-001/... and matrix aggregate-report.json |
| preflight_matrix_* | buck2 run | | BXL cost_plan + Go estimate_cost → <evidence root>/cost/cost-report.txt (no model calls) |
| estimate_cost | buck2 run | | Token/USD estimate from searchbench.cost_plan.v1 JSON; exits non-zero on budget BLOCK |
| approve_cost | buck2 run | | Records human approval in evidence/cost/approval.json (required before parallel/live execute targets) |
| stability_probe | buck2 run | fresh_live_run | Repeated same-input attempts; variance metrics only (decision = no promotion) |
| materialize_dataset | buck2 run | | Export LCA JSONL only (no round execution) |

Deterministic replay is not live proof. Promotion decisions belong in evaluate_n consolidated reports, not in smoke paths.
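The distinction between archive replay, live proof, and promotion-grade evidence can be sketched as a small lookup. This is an illustrative sketch only: the freshness values are copied from the table, while the function names and the use of `evaluate_matrix` as a single mode name for the `_10` / `_50` targets are assumptions.

```python
# Freshness values copied from the Truth model table. Modes that carry no
# freshness value in the table (cost and dataset helpers) are omitted.
FRESHNESS_BY_MODE = {
    "validate_bundle": "archive",
    "live_smoke": "fresh_live_run",
    "real_lca_smoke": "fresh_live_run",
    "evaluate_n": "fresh_live_run",
    "evaluate_matrix": "fresh_live_run",  # assumed name for evaluate_matrix_10 / _50
    "stability_probe": "fresh_live_run",
}


def is_live_proof(mode: str) -> bool:
    """A mode counts as live proof only when it is a fresh live run."""
    return FRESHNESS_BY_MODE.get(mode) == "fresh_live_run"


def may_drive_promotion(mode: str) -> bool:
    """Promotion decisions belong in evaluate_n consolidated reports only."""
    return mode == "evaluate_n"
```

Under this sketch, `validate_bundle` is never live proof, and `stability_probe` is live but cannot drive promotion.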

Historical cost grounding (estimate_cost)

Preflight merges evidence roots from historical_sources, artifact_root, <artifact_root>/bundle, and bundle_path. Roots are deduplicated by on-disk path, and nested roots are then pruned so that artifact_root and artifact_root/bundle do not count the same files twice. Preflight then walks **/round-report.json under each root and sums execution.usage.cost_usd per round for rounds that qualify.

To qualify, the parent report.json must classify the run as a live-ish mode (evaluate_matrix, evaluate_n, live_smoke, real_lca_smoke, stability_probe, round_run) with a freshness of fresh_live_run or fresh, not archive.

The estimate scales historical billed USD to the planned matrix matches and attempts_per_match, using the sampled round's spec.matches length (or resolved run lengths) and the canonical attempts.count when present. If no qualifying billed samples exist, the tool falls back to token statistics plus an optional pricing.pkl from cost_plan.pricing_config.path.
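The root pruning and the scaling step can be sketched as follows. This is not the Go estimate_cost implementation: the function names are hypothetical, paths are compared lexically rather than resolved on disk, and the per-unit scaling formula (billed USD divided by sampled matches times sampled attempts) is an assumption about what "scales" means here.

```python
from pathlib import PurePosixPath


def prune_nested_roots(roots):
    """Deduplicate roots, then drop any root that lies inside another one."""
    unique = sorted(
        {PurePosixPath(r) for r in roots},
        key=lambda p: (len(p.parts), str(p)),  # shortest first, then by name
    )
    kept = []
    for root in unique:
        # A root is nested when an already-kept, shorter root is one of its parents.
        if not any(parent in root.parents for parent in kept):
            kept.append(root)
    return [str(p) for p in kept]


def scale_historical_cost(billed_usd, sampled_matches, sampled_attempts,
                          planned_matches, attempts_per_match):
    """Scale one sampled round's billed USD to a planned matrix size (assumed formula)."""
    sampled_units = sampled_matches * sampled_attempts
    if sampled_units <= 0:
        raise ValueError("sampled round must contain at least one match attempt")
    per_unit = billed_usd / sampled_units
    return per_unit * planned_matches * attempts_per_match
```

For example, `prune_nested_roots(["evidence", "evidence/bundle", "history/old"])` keeps only `evidence` and `history/old`, so files under `evidence/bundle` are walked once.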

After preflight, run approve_cost so that evidence/cost/approval.json matches the digest of cost-estimate.json. Parallel matrix and live execute targets refuse to start otherwise (tests-only escape: SEARCHBENCH_SKIP_COST_SPEND_APPROVAL=1).
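A digest gate of this kind can be sketched as below. The details are assumptions made for illustration: SHA-256 over the raw file bytes, cost-estimate.json living next to approval.json, and the `estimate_sha256` field name are all hypothetical and may differ from what approve_cost actually records.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def approval_is_current(cost_dir):
    """True when approval.json records the digest of the current cost-estimate.json."""
    cost_dir = Path(cost_dir)
    approval_path = cost_dir / "approval.json"
    estimate_path = cost_dir / "cost-estimate.json"
    if not approval_path.is_file() or not estimate_path.is_file():
        return False
    approval = json.loads(approval_path.read_text(encoding="utf-8"))
    # "estimate_sha256" is a hypothetical field name used only in this sketch.
    return approval.get("estimate_sha256") == file_digest(estimate_path)
```

The point of tying approval to a digest is that any change to the estimate after approval invalidates it, so a stale approval cannot authorize a larger spend.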

Provenance artifacts

Real-LCA and live modes record audit fields in report.json, metadata.json, and evidence.pkl:

  • dataset: kind, HF config/split, source (huggingface or huggingface_reuse)
  • materialization: per-match base_sha, head_sha (must match), repo_root, cache dir
  • scoring: localization_scorer, tree_sitter, fake_scorer_used, graph_fallback_used

real_lca_smoke rejects the synthetic becker63/searchbench-go + README.md local row.
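A reviewer-side check over these audit fields might look like the sketch below. The field names come from the list above, but the JSON layout (top-level `dataset`, `scoring`, and a `materialization` list of per-match records), the reading of "must match" as head_sha equal to base_sha, and the rule that both fallback flags must be false are assumptions, not the project's validator.

```python
def provenance_problems(report):
    """Return reasons a report would fail the assumed real-LCA provenance checks."""
    problems = []
    dataset = report.get("dataset", {})
    scoring = report.get("scoring", {})

    if dataset.get("source") not in ("huggingface", "huggingface_reuse"):
        problems.append("dataset.source is not a Hugging Face source")
    # Missing flags are treated as failures so that absent data never passes.
    if scoring.get("fake_scorer_used", True):
        problems.append("fake scorer was used")
    if scoring.get("graph_fallback_used", True):
        problems.append("graph fallback was used")

    for index, match in enumerate(report.get("materialization", [])):
        if match.get("base_sha") != match.get("head_sha"):
            problems.append(f"match {index}: head_sha does not match base_sha")
    return problems
```

An empty list means the report passed every check in this sketch; it does not replace the checks that real_lca_smoke itself performs.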

Secrets

Only secrets belong in repo-root .env:

```bash
CEREBRAS_API_KEY=...
HF_TOKEN=...   # optional; Hugging Face dataset export
```

Non-secret defaults (manifest path, artifact root, materialize cache, MCP launchers, LCA export skip/max) come from Buck targets and internal/pure/liveconfig.
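One way to keep non-secret settings out of .env is a small lint over its keys. The sketch below is hypothetical and not part of the repo; it assumes the two keys shown above are the complete set of allowed secrets.

```python
# Assumption: these two keys are the only secrets the repo expects in .env.
ALLOWED_SECRET_KEYS = {"CEREBRAS_API_KEY", "HF_TOKEN"}


def unexpected_env_keys(env_text):
    """Return keys in .env-style text that are not known secrets."""
    unexpected = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        if key not in ALLOWED_SECRET_KEYS:
            unexpected.append(key)
    return unexpected
```

Any key this returns is a candidate to move into a Buck target or internal/pure/liveconfig instead.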

Reuse an existing exported JSONL without re-downloading:

```bash
SEARCHBENCH_SKIP_HF_EXPORT=1 buck2 test //configs/rounds/live-ic-vs-jcodemunch:real_lca_smoke
```

Workspace seed

The live challenger uses buck_descriptor//src/iterative-context:optimizable_backend in round.pkl (repo-owned default). Do not use local_path for repo-owned live/eval configs.

Bundle interface

Each completed bundle includes canonical report.json and report.txt (inspect these first), plus round-report.json, evidence, objective, and COMPLETE.
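Before reading a bundle, it can help to confirm that it is complete. The following is a minimal sketch, assuming the named files sit directly in the bundle directory and that COMPLETE is a marker file; evidence and objective are left out because their exact file names are not stated here.

```python
from pathlib import Path

# File names taken from the text above; the flat layout is an assumption.
REQUIRED_BUNDLE_FILES = ("report.json", "report.txt", "round-report.json", "COMPLETE")


def missing_bundle_files(bundle_dir):
    """Return the required bundle files that are absent from bundle_dir."""
    bundle_dir = Path(bundle_dir)
    return [name for name in REQUIRED_BUNDLE_FILES if not (bundle_dir / name).is_file()]
```

An empty result suggests the bundle finished writing; this sketch checks presence only, not contents.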

For evaluate_n, raw per-attempt bundles are written under .searchbench/attempts/<round-id>/<run-id>/ (gitignored). The checked-in bundle is aggregate-only (aggregate-report.json, aggregate evidence.pkl, objective from means). See bundles.md.

evaluator.retry.maxAttempts retries failed evaluator executions inside one attempt. evaluate_n intentionally runs multiple full attempts; evaluator.aggregation and evaluator.determinism are configured in Pkl.
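The difference between retries inside one attempt and repeated full attempts can be sketched as follows. This is an illustration rather than the evaluator's code: the function names are hypothetical, `RuntimeError` stands in for an evaluator failure, and mean aggregation is assumed from the `.mean` objective described for evaluate_n.

```python
from statistics import mean


def run_attempt(evaluate, max_attempts):
    """Run one attempt, retrying a failed evaluator call up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return evaluate()
        except RuntimeError as error:  # stand-in for an evaluator failure
            last_error = error
    raise RuntimeError("evaluator failed on every retry") from last_error


def aggregate_attempts(evaluate, attempts, max_attempts):
    """Run several full attempts and aggregate their scores by mean."""
    scores = [run_attempt(evaluate, max_attempts) for _ in range(attempts)]
    return mean(scores)
```

Retries absorb transient failures and contribute one score per attempt, while each full attempt contributes its own independent score to the aggregate.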