Live MCP evaluation (IC vs jCodeMunch)

Real end-to-end round with Cerebras evaluator, jCodeMunch incumbent, and Iterative Context challenger.

Evidence root (SearchBenchRelease.artifact_root / Buck targets): configs/rounds/live-ic-vs-jcodemunch/evidence/ (matrix + cost siblings under this tree).

Published bundle: configs/rounds/live-ic-vs-jcodemunch/evidence/bundle/games/code-localization/rounds/live-ic-vs-jcodemunch-001/

These targets are not in //:check because live targets require secrets and network access.

Buck-only interface

Repo-owned live work uses config-local Buck targets only. See run-entrypoints.md and configs/rounds/live-ic-vs-jcodemunch/README.md.

```bash
# Deterministic (no network)
buck2 test //configs/rounds/live-ic-vs-jcodemunch:validate
buck2 test //configs/rounds/live-ic-vs-jcodemunch:validate_bundle

# Dataset export only
buck2 run //configs/rounds/live-ic-vs-jcodemunch:materialize_dataset

# Synthetic/local wiring smoke (MCP + Cerebras + bundle; may use local README row)
buck2 test //configs/rounds/live-ic-vs-jcodemunch:live_smoke

# Real LCA benchmark proof (HF row + materialization + strict graph + provenance)
SEARCHBENCH_RUN_LIVE_E2E=1 buck2 test //configs/rounds/live-ic-vs-jcodemunch:real_lca_smoke

# Repeated live evaluation
buck2 run //configs/rounds/live-ic-vs-jcodemunch:evaluate_n
buck2 run //configs/rounds/live-ic-vs-jcodemunch:stability_probe
```

Inspect report.json first in the published bundle directory.

Truth model

| Mode | Command | Freshness | Proves |
| --- | --- | --- | --- |
| `validate_bundle` | `buck2 test` | `archive` | A checked-in completed bundle still validates (no MCP, no model) |
| `live_smoke` | `buck2 test` | `fresh_live_run` | MCP startup, Cerebras call, bundle write, report validation; not real LCA benchmark proof |
| `real_lca_smoke` | `buck2 test` | `fresh_live_run` | HF LCA row → JSONL → materialized repo at `base_sha` → strict tree-sitter graph → `LocalizationScorer` → aligned objective/evidence |
| `evaluate_n` | `buck2 run` | `fresh_live_run` | N stochastic attempts (count from Buck `evaluate_attempts`) aggregated into one top-level bundle; objective uses aggregate evidence (`.mean`) |
| `evaluate_matrix_10` / `_50` | `buck2 run` | `fresh_live_run` | Many LCA rows with per-match bundles under `<evidence root>/matrix/matrix-001/match-<id>/attempt-001/...` and matrix `aggregate-report.json` |
| `preflight_matrix_*` | `buck2 run` | n/a | BXL `cost_plan` + Go `estimate_cost` → `<evidence root>/cost/cost-report.txt` (no model calls) |
| `estimate_cost` | `buck2 run` | n/a | Token/USD estimate from `searchbench.cost_plan.v1` JSON; exits non-zero on budget `BLOCK` |
| `approve_cost` | `buck2 run` | n/a | Records human approval in `evidence/cost/approval.json` (required before parallel/live execute targets) |
| `stability_probe` | `buck2 run` | `fresh_live_run` | Repeated same-input attempts; variance metrics only (`decision` = no promotion) |
| `materialize_dataset` | `buck2 run` | n/a | Export LCA JSONL only (no round execution) |

Deterministic replay is not live proof. Promotion decisions belong in evaluate_n consolidated reports, not in smoke paths.
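
The Freshness column and the promotion rule above can be read as a small lookup. The following is a minimal Python sketch of that reading, not code from the repository: the dictionary and both function names are illustrative, and `preflight_matrix_*` is left out because it is a wildcard rather than a single mode.

```python
# Freshness per mode, copied from the truth-model table.
# None means the table records no freshness for that mode.
FRESHNESS = {
    "validate_bundle": "archive",
    "live_smoke": "fresh_live_run",
    "real_lca_smoke": "fresh_live_run",
    "evaluate_n": "fresh_live_run",
    "evaluate_matrix_10": "fresh_live_run",
    "evaluate_matrix_50": "fresh_live_run",
    "stability_probe": "fresh_live_run",
    "estimate_cost": None,
    "approve_cost": None,
    "materialize_dataset": None,
}


def is_fresh_live_run(mode: str) -> bool:
    """True only when the mode produces a fresh live run, never an archive replay."""
    return FRESHNESS.get(mode) == "fresh_live_run"


def may_promote(mode: str) -> bool:
    """Promotion decisions belong to evaluate_n consolidated reports only."""
    return mode == "evaluate_n"
```

Note that `is_fresh_live_run` says nothing about what a mode proves: `live_smoke` is fresh but is still not real LCA benchmark proof.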

Historical cost grounding (`estimate_cost`)

Preflight merges evidence roots from historical_sources, artifact_root, <artifact_root>/bundle, and bundle_path. Roots are deduplicated by on-disk path, and nested roots are then pruned so that artifact_root and artifact_root/bundle do not double-count the same files. Preflight then walks **/round-report.json and sums execution.usage.cost_usd per round when allowed. For a round to count, a parent report.json must classify the run as a live-ish mode (evaluate_matrix, evaluate_n, live_smoke, real_lca_smoke, stability_probe, round_run) with a freshness of fresh_live_run or fresh (not archive).

The estimate scales historical billed USD to the planned matrix matches and attempts_per_match, using the sampled round's spec.matches length (or resolved run lengths) and the canonical attempts.count when present. If no qualifying billed samples exist, the tool falls back to token statistics plus an optional pricing.pkl from cost_plan.pricing_config.path.
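
The root pruning and the scaling step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Go implementation: the function names are hypothetical, paths are compared lexically rather than resolved on disk, and the scaling is assumed to be linear in matches × attempts.

```python
from pathlib import PurePosixPath


def prune_nested_roots(roots: list[str]) -> list[str]:
    """Deduplicate roots, then drop any root that sits inside another kept root."""
    unique = sorted(
        {PurePosixPath(r) for r in roots},
        key=lambda p: (len(p.parts), str(p)),  # shallowest first, stable order
    )
    kept: list[PurePosixPath] = []
    for root in unique:
        # A root nested under an already-kept root would be walked twice.
        if not any(parent in root.parents for parent in kept):
            kept.append(root)
    return [str(p) for p in kept]


def scale_cost(sample_usd: float, sample_matches: int, sample_attempts: int,
               planned_matches: int, planned_attempts: int) -> float:
    """Assumed linear scaling: cost per match-attempt times planned match-attempts."""
    per_unit = sample_usd / (sample_matches * sample_attempts)
    return per_unit * planned_matches * planned_attempts
```

For example, a sampled round that billed 0.50 USD for 5 matches at 1 attempt each would scale to about 3.00 USD for 10 planned matches at 3 attempts per match under this linear assumption.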

After preflight, run approve_cost so that evidence/cost/approval.json matches the digest of cost-estimate.json. Parallel matrix and live execute targets refuse to start otherwise (test-only escape hatch: SEARCHBENCH_SKIP_COST_SPEND_APPROVAL=1).
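
A gate of this kind could look like the Python sketch below. It is only an illustration of the idea: the hash algorithm (SHA-256 over the raw file bytes) and the `estimate_digest` key are assumptions, since the actual approval.json schema is not described here, and `approval_matches` is a hypothetical name.

```python
import hashlib
import json
import os
from pathlib import Path


def approval_matches(cost_dir: Path) -> bool:
    """Return True when approval.json records the digest of cost-estimate.json.

    Assumptions: SHA-256 over the raw file bytes, stored under a hypothetical
    "estimate_digest" key. The real schema may differ.
    """
    if os.environ.get("SEARCHBENCH_SKIP_COST_SPEND_APPROVAL") == "1":
        return True  # test-only escape named in the text above
    estimate = cost_dir / "cost-estimate.json"
    approval = cost_dir / "approval.json"
    if not (estimate.is_file() and approval.is_file()):
        return False
    digest = hashlib.sha256(estimate.read_bytes()).hexdigest()
    recorded = json.loads(approval.read_text()).get("estimate_digest")
    return recorded == digest
```

Binding the approval to a digest means that re-running preflight with a different plan invalidates an earlier approval, so a stale sign-off cannot authorize new spend.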

Provenance artifacts

Real-LCA and live modes record audit fields in report.json, metadata.json, and evidence.pkl:

dataset — kind, HF config/split, source (huggingface or huggingface_reuse)
materialization — per-match base_sha, head_sha (must match), repo_root, cache dir
scoring — localization_scorer, tree_sitter, fake_scorer_used, graph_fallback_used

real_lca_smoke rejects the synthetic becker63/searchbench-go + README.md local row.
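
As a rough illustration of how these audit fields might be checked, the Python sketch below collects reasons a report would not count as real-LCA proof. The section names come from the list above; everything else is an assumption made for the example, including the flat dictionary layout, the `repo` and `file` keys, and the choice to treat `fake_scorer_used` and `graph_fallback_used` as failures.

```python
SYNTHETIC_REPO = "becker63/searchbench-go"
SYNTHETIC_FILE = "README.md"


def provenance_problems(report: dict) -> list[str]:
    """Collect reasons a report would not count as real-LCA proof (hypothetical layout)."""
    problems = []
    for section in ("dataset", "materialization", "scoring"):
        if section not in report:
            problems.append(f"missing section: {section}")
    scoring = report.get("scoring", {})
    if scoring.get("fake_scorer_used"):
        problems.append("fake scorer was used")
    if scoring.get("graph_fallback_used"):
        problems.append("graph fallback was used")
    dataset = report.get("dataset", {})
    # "repo" and "file" are illustrative keys for the synthetic local row.
    if dataset.get("repo") == SYNTHETIC_REPO and dataset.get("file") == SYNTHETIC_FILE:
        problems.append("synthetic local README row")
    return problems
```

An empty list means no problem was found by these particular checks; it does not replace the validation that the Buck targets perform.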

Secrets

Only secrets belong in repo-root .env:

```bash
CEREBRAS_API_KEY=...
HF_TOKEN=...   # optional; Hugging Face dataset export
```

Non-secret defaults (manifest path, artifact root, materialize cache, MCP launchers, LCA export skip/max) come from Buck targets and internal/pure/liveconfig.
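
One way to keep that split honest is to lint the .env file for keys that are not secrets. The Python sketch below is a hypothetical helper, not part of the repository; it assumes simple `KEY=value` lines and treats only the two keys shown above as allowed.

```python
ALLOWED_SECRET_KEYS = {"CEREBRAS_API_KEY", "HF_TOKEN"}


def unexpected_env_keys(env_text: str) -> list[str]:
    """Return keys in .env text that are not one of the documented secrets."""
    unexpected = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        if key not in ALLOWED_SECRET_KEYS:
            unexpected.append(key)
    return unexpected
```

Any key this reports is a candidate for moving into a Buck target or internal/pure/liveconfig instead.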

Reuse an existing exported JSONL without re-downloading:

```bash
SEARCHBENCH_SKIP_HF_EXPORT=1 buck2 test //configs/rounds/live-ic-vs-jcodemunch:real_lca_smoke
```

Workspace seed

The live challenger uses buck_descriptor → //src/iterative-context:optimizable_backend in round.pkl (repo-owned default). Do not use local_path for repo-owned live/eval configs.

Bundle interface

Each completed bundle includes the canonical report.json and report.txt (inspect these first), plus round-report.json, evidence, objective, and COMPLETE.

For evaluate_n, raw per-attempt bundles are written under .searchbench/attempts/<round-id>/<run-id>/ (gitignored). The checked-in bundle is aggregate-only (aggregate-report.json, aggregate evidence.pkl, objective from means). See bundles.md.

evaluator.retry.maxAttempts retries failed evaluator executions inside one attempt. evaluate_n intentionally runs multiple full attempts; evaluator.aggregation and evaluator.determinism are configured in Pkl.
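
The difference between retries inside an attempt and repeated full attempts can be sketched in Python as below. This is a simplified model with hypothetical names: `execute` stands in for one evaluator execution, `max_tries` is assumed to mean total tries per attempt, `RuntimeError` stands in for an evaluator failure, and aggregation is assumed to be a plain mean.

```python
from statistics import mean
from typing import Callable, Optional


def run_attempt(execute: Callable[[], float], max_tries: int) -> float:
    """One attempt: re-run the evaluator on failure, up to max_tries total tries."""
    last_error: Optional[Exception] = None
    for _ in range(max_tries):
        try:
            return execute()
        except RuntimeError as err:  # stand-in for an evaluator failure
            last_error = err
    raise RuntimeError("attempt failed after retries") from last_error


def evaluate_n(execute: Callable[[], float], attempts: int, max_tries: int) -> float:
    """Run N full attempts and aggregate their scores by mean."""
    return mean(run_attempt(execute, max_tries) for _ in range(attempts))
```

A retry replaces a failed execution and contributes nothing to the aggregate, whereas every completed attempt contributes one sample.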
