Live matrix debugging (operations note)
Operational notes from running evaluate_matrix_* Buck targets (configs/rounds/live-ic-vs-jcodemunch) with structured evaluator logging: what “stalled” logs usually mean, which greps shorten triage, and repeated failure signatures (GitHub MCP, ReAct max iterations, retries).
Audience: engineers tailing stderr JSON logs (--log-format=json) and optional SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 NDJSON bundles. See bundles.md observability (observability/eino-evaluator-*-*.ndjson).
Targets and rerun semantics
| Buck target | Intent |
|---|---|
| evaluate_matrix_5 | Slice of five dataset rows (--dataset-max-items=5): conservative parallelism (see buck/defs/searchbench_round.bzl). |
| evaluate_matrix_5_force | Same slice with --matrix-force: re-execute attempts even when prior attempt dirs look complete. Use when you need fresh evaluator traffic without deleting evidence. |
Reusing completed attempts without force skips work and yields evaluate_matrix.match_skipped_cached — stderr will stay quiet aside from coarse matrix lines (no meaningful new evaluator.eino.callback lines).
Higher presets: evaluate_matrix_{10,20,50}.
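A minimal sketch for confirming that a quiet run was a cache skip rather than a stall. It assumes stderr was captured to a file such as run.log and that each JSON line carries the message name under a top-level msg key (as the greps in this note imply); the helper names are hypothetical.

```python
import json
from collections import Counter


def count_matrix_events(lines):
    """Count evaluate_matrix.* messages in an iterable of JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise interleaved on stderr
        if not isinstance(record, dict):
            continue
        msg = record.get("msg", "")
        if isinstance(msg, str) and msg.startswith("evaluate_matrix."):
            counts[msg] += 1
    return counts


def looks_fully_cached(counts):
    """True when matches were skipped from cache and none actually finished."""
    return (counts["evaluate_matrix.match_skipped_cached"] > 0
            and counts["evaluate_matrix.match_finished"] == 0)


# Illustrative lines only; real records carry more fields.
sample = [
    '{"msg":"evaluate_matrix.run_started","force_rerun":false}',
    '{"msg":"evaluate_matrix.match_skipped_cached","match_id":"m-1"}',
    '{"msg":"evaluate_matrix.match_skipped_cached","match_id":"m-2"}',
    'plain text line that is not JSON',
]
counts = count_matrix_events(sample)
print(dict(counts), looks_fully_cached(counts))
```

Against a real capture, pass an open file handle instead of the sample list (for example `count_matrix_events(open("run.log"))`). If the result says "fully cached" and you wanted fresh traffic, rerun with the _force target.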
Turning on observability
- Structured evaluator events on StructuredLog (stderr JSON lines with evaluator.eino.callback by default alongside matrix and HTTP logs).
- Bundle NDJSON mirror (observability/ in each attempt bundle):

    SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 nix develop -c buck2 run '//configs/rounds/live-ic-vs-jcodemunch:evaluate_matrix_5_force'

Why tail -f can look idle (no “new logs”)
- evaluate_matrix.match_finished includes elapsed_seconds. One match commonly spans minutes (~10–13+ observed) while others finish faster; stderr may go quiet simply because agents are executing tools or LLM rounds without flushing new lines continuously.
- Very large JSON log lines. Tool payloads (e.g. list_repos returning tens of repos) inflate line length massively; terminals and editors choke or appear frozen while buffering.
- Parallel matches (parallel_matches) interleave bursts from distinct match_id / run_id; follow a slice with jq / grep:

    grep 'evaluate_matrix\.match_finished' run.log | jq .
    grep '"msg":"evaluator\.eino\.callback"' run.log \
      | jq -r 'select(.match_id | test("<your slice>"; "l"))'
    # Faster than reading raw:
    grep 'evaluate_matrix\.' run.log | tail -40
    grep 'timing":"on_error' run.log
    grep -E 'exhausted|retries_exhausted|evaluate_matrix\.run_finished' run.log

High-signal log fields vs noise
Prefer these msg prefixes first:
| Prefix | Insight |
|---|---|
| evaluate_matrix.run_started / run_finished | force_rerun, match_total, promotion_verdict. |
| evaluate_matrix.match_started / match_finished | elapsed_seconds, failures_incumbent / failures_challenger, sample_*_failure. |
| evaluator.eino.callback | timing: on_start / on_end / on_error; component (Graph, Chain, ToolsNode, ChatModel…); summaries for truncation. |
| provider.http.round_trip | Slow or failed model HTTP (HTML 400, timeouts). |
| run.failed / comparison.completed | Rolled-up failure explanations; REVIEW/REJECT_*/PROMOTE posture. |
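To see which matches dominate wall-clock time, the match_finished rows can be ranked by elapsed_seconds. A minimal sketch, assuming elapsed_seconds and match_id are top-level numeric/string keys on the evaluate_matrix.match_finished line (an assumption about the log shape; the function name is hypothetical):

```python
import json


def slowest_matches(lines, top=5):
    """Return (elapsed_seconds, match_id) pairs, slowest first."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines
        if not isinstance(record, dict):
            continue
        if record.get("msg") != "evaluate_matrix.match_finished":
            continue
        elapsed = record.get("elapsed_seconds")
        # Accept real numbers only; bool is an int subclass, so exclude it.
        if isinstance(elapsed, (int, float)) and not isinstance(elapsed, bool):
            rows.append((float(elapsed), record.get("match_id", "?")))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]


# Illustrative lines only.
sample = [
    '{"msg":"evaluate_matrix.match_finished","match_id":"m-1","elapsed_seconds":42.5}',
    '{"msg":"evaluate_matrix.match_finished","match_id":"m-2","elapsed_seconds":731}',
    '{"msg":"evaluator.eino.callback","match_id":"m-2","timing":"on_end"}',
    'not json',
]
for elapsed, match_id in slowest_matches(sample):
    print(f"{elapsed:8.1f}s  {match_id}")
```

Reading line by line keeps memory flat even when individual callback lines are very large.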
evaluate_matrix.run_finished may include first_infrastructure_error when the aggregate cannot promote — start there before reading tens of evaluator.eino.callback lines.
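The "start there" step can be sketched as follows: find the last evaluate_matrix.run_finished record and report first_infrastructure_error if present. Field placement (top-level keys) is an assumption, the sample values are invented for illustration, and the helper names are hypothetical.

```python
import json


def run_finished_summary(lines):
    """Return the last evaluate_matrix.run_finished record, or None."""
    last = None
    for line in lines:
        if "evaluate_matrix.run_finished" not in line:
            continue  # cheap substring filter before parsing large lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("msg") == "evaluate_matrix.run_finished":
            last = record
    return last


def triage_hint(summary):
    """Turn a run_finished record into a one-line starting point."""
    if summary is None:
        return "no run_finished line yet: the run may still be in progress"
    error = summary.get("first_infrastructure_error")
    if error:
        return f"infrastructure error: {error}"
    return f"no infrastructure error; promotion_verdict={summary.get('promotion_verdict')}"


# Illustrative lines only; the error text and verdict are made up.
sample = [
    '{"msg":"evaluate_matrix.run_started","match_total":5}',
    '{"msg":"evaluate_matrix.run_finished","promotion_verdict":"REVIEW",'
    '"first_infrastructure_error":"mcp index_repo failed"}',
]
print(triage_hint(run_finished_summary(sample)))
```

Only when the hint is empty or unhelpful is it worth dropping down to the evaluator.eino.callback lines.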
Recurring signatures we saw in the wild
These are hypotheses to validate per run; always anchor on timestamps and match_id from your own run.log.
A) MCP index_repo: GitHub 301 Moved Permanently
When evaluator tool summaries show Indexing failed and a 301 on api.github.com/...repos/.../git/trees/HEAD, the MCP path is likely treating a canonical repo URL redirect as failure (no redirect follow).
Impact: incumbent may loop on “index jax again”; combines badly with evaluator turn limits.
Fix direction: follow redirects / use canonical numeric repo URLs in MCP; or ensure local materialize/checkout satisfies indexing without noisy remote tree scans.
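To check whether signature A is present in your own run, one option is to count callback lines that carry both markers, grouped by match_id. This is a sketch: the literal strings "Indexing failed" and "301 Moved Permanently" are assumed to appear verbatim somewhere in the line, and the summary field in the sample is hypothetical.

```python
import json
from collections import Counter

# Assumed literal markers; adjust to what your tool summaries actually contain.
SIGNATURE = ("Indexing failed", "301 Moved Permanently")


def index_repo_301_by_match(lines):
    """Count evaluator callback lines showing the 301 indexing failure, per match_id."""
    counts = Counter()
    for line in lines:
        if not all(token in line for token in SIGNATURE):
            continue  # substring check avoids parsing unrelated large lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("msg") == "evaluator.eino.callback":
            counts[record.get("match_id", "?")] += 1
    return counts


# Illustrative lines only.
sample = [
    '{"msg":"evaluator.eino.callback","match_id":"m-1","summary":"Indexing failed: 301 Moved Permanently"}',
    '{"msg":"evaluator.eino.callback","match_id":"m-1","summary":"Indexing failed: 301 Moved Permanently"}',
    '{"msg":"evaluator.eino.callback","match_id":"m-2","summary":"ok"}',
]
print(index_repo_301_by_match(sample).most_common())
```

A high count on a single match_id is consistent with the "index again" loop described above, but confirm against timestamps before concluding.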
B) exceeds max iterations (NodeRunError, run node[ChatModel] pre processor fail)
Seen repeatedly on evaluator.eino.callback lines with timing":"on_error (Graph / ReAct / searchbench_evaluator), for both challenger and incumbent.
Interpretation: the ReAct/agent graph repeatedly hit its iteration cap, which surfaces as a ChatModel preprocessor failure. Combined with evaluator retry exhaustion (evaluator retries exhausted after 6 attempt(s) surfaced in summaries), comparisons may end REVIEW/REJECT_*/INSUFFICIENT without a clear winner.
Mitigation knobs (conceptual) — adjust per product/policy; not logging-only fixes:
- Manifest runtime / evaluator bounds (maxSteps, evaluator.bounds, timeouts) so honest long M+C runs can finish rather than starving mid-graph.
- Prompt/tool policy: avoid spirals on flaky tools (above).
- Backend/MCP correctness before raising caps blindly.
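Before touching any of these knobs, it helps to measure how concentrated the signature is. A minimal sketch that counts on_error callbacks mentioning "exceeds max iterations" per match_id and notes which matches also mention "retries exhausted". It assumes timing and match_id are top-level keys and that both phrases appear verbatim in the line; the error and summary fields in the sample are hypothetical.

```python
import json
from collections import Counter


def iteration_cap_report(lines):
    """Return (cap hits per match_id, set of match_ids with retry exhaustion)."""
    cap_hits = Counter()
    retries_exhausted = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        match_id = record.get("match_id", "?")
        is_callback_error = (record.get("msg") == "evaluator.eino.callback"
                             and record.get("timing") == "on_error")
        if is_callback_error and "exceeds max iterations" in line:
            cap_hits[match_id] += 1
        if "retries exhausted" in line:
            retries_exhausted.add(match_id)
    return cap_hits, retries_exhausted


# Illustrative lines only.
sample = [
    '{"msg":"evaluator.eino.callback","timing":"on_error","match_id":"m-1","error":"exceeds max iterations"}',
    '{"msg":"evaluator.eino.callback","timing":"on_error","match_id":"m-1","error":"exceeds max iterations"}',
    '{"msg":"run.failed","match_id":"m-1","summary":"evaluator retries exhausted after 6 attempt(s)"}',
    '{"msg":"provider.http.round_trip","match_id":"m-2","status":400}',
]
cap_hits, exhausted = iteration_cap_report(sample)
print(dict(cap_hits), sorted(exhausted))
```

Matches that show cap hits but no HTTP failures point at agent/tool behaviour rather than provider outages, which is the distinction the checklist at the end of this note asks for.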
C) “Only one completed_both row” summaries
A concurrent matrix run may report asymmetric completion (completed_both, challenger vs incumbent counters) when challenger paths fail earlier under the same infra errors. Use match_finished per row as ground truth for latency and failure excerpts.
D) Cost / spend gates
Spend approval is orthogonal to stderr content. Live execute targets refuse to run without evidence/cost/cost-estimate.json (non-BLOCK) plus matching approve_cost (approval.json digest). Only automated tests may use SEARCHBENCH_SKIP_COST_SPEND_APPROVAL=1 (see AGENTS.md).
Diagnostic pairwise rounds (isolate layers)
Minimal manifests under configs/rounds/diagnostic-* intentionally change one knob versus the live-ic-vs-jcodemunch baseline so you can A/B report.json + stderr prefixes without matrix-scale noise.
| Round package | Switch | Compared with |
|---|---|---|
| diagnostic-fake-eval-live-players | Fake Evaluator (no Cerebras/evaluator.eino.callback traffic on Comparison), real Incumbent + Challenger. | //configs/rounds/live-ic-vs-jcodemunch:live_smoke (same default dataset flags): isolates evaluator vs Evidence capture. |
| diagnostic-incumbent-fake-eval-ic | Fake IncumbentPolicy (stub-incumbent) + real IC Challenger + fake Evaluator. | diagnostic-fake-eval-live-players (both real agents, fake evaluator): separates IC-only MCP churn from incumbent + challenger interplay. |
See each README.md for interpretation notes (configs/rounds/diagnostic-fake-eval-live-players/README.md, configs/rounds/diagnostic-incumbent-fake-eval-ic/README.md).
GitHub epic repro rounds (#114 stack — #115–#119)
Each numbered issue owns a configs/rounds/diagnostic-issue-* Buck package (validate, live_smoke, evaluate_n, real_lca_smoke, evaluate_matrix_*). Each README.md spells out targets, expected proof, canonical bundles, plus buck2 … :validate_bundle for offline completeness (AGENTS.md).
| GitHub issue | Round package | Canonical single-plane …/games/code-localization/rounds/ basename |
|---|---|---|
| #115 | diagnostic-issue-115-evaluator-scope (prefer :evaluate_n) | diagnostic-issue-115-eval-scope-001/ |
| #116 | diagnostic-issue-116-round-bundle-id (:live_smoke / :evaluate_n) | diagnostic-issue-116-round-bundle-001/ |
| #117 | diagnostic-issue-117-matrix-observability (matrix + traces) | diagnostic-issue-117-matrix-trace-001/ + matrix artefacts under evidence/matrix/ |
| #118 | diagnostic-issue-118-ic-anchor (:evaluate_n) | diagnostic-issue-118-ic-anchor-001/ |
| #119 | diagnostic-issue-119-proof-modes (modes on one manifest) | diagnostic-issue-119-proof-modes-001/ ⚠ rewired between runs |
Related docs & agents
| Resource | Topic |
|---|---|
| bundles.md | Bundle observability/ NDJSON contract. |
| run-entrypoints.md | Spend gates and Buck entry semantics. |
| configs/rounds/live-ic-vs-jcodemunch/README.md | Target table including evaluate_matrix_5* presets. |
| configs/rounds/diagnostic-fake-eval-live-players/README.md | Pair live_smoke vs fake-eval comparator. |
| configs/rounds/diagnostic-incumbent-fake-eval-ic/README.md | IC-only MCP isolate (stub incumbent). |
Cursor IDE: .cursor/ is intentionally gitignored in this repo. If you use a Cursor subagent markdown that mirrors these greps/workflows, duplicate it privately under ~/.cursor/agents/ or keep prompts in src/docs (this file).
Quick checklist before filing an infra issue
- Grepped evaluate_matrix.run_finished and match_finished (elapsed_seconds, first_infrastructure_error).
- Confirmed --matrix-force if you intended a fresh rerun (else cache skips).
- Separated on_error / max iterations bursts from intermittent HTTP outages (provider.http.round_trip).
- Redacted list_repos / MCP payloads when sharing excerpts (they enumerate local workspaces).
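For the redaction step, an allowlist is usually safer than trying to spot sensitive payloads: keep only the routing fields and drop everything else. A minimal sketch, assuming the field names used throughout this note are top-level keys; the tool_output field in the sample and the function name are hypothetical.

```python
import json

# Fields assumed safe to share; extend deliberately, not by default.
SAFE_FIELDS = ("msg", "match_id", "run_id", "timing", "component", "elapsed_seconds")


def shareable_excerpt(lines, keep=SAFE_FIELDS):
    """Return JSON lines reduced to allowlisted keys, noting how many were dropped."""
    out = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop non-JSON lines rather than risk leaking them
        if not isinstance(record, dict):
            continue
        kept = {key: record[key] for key in keep if key in record}
        dropped = len(record) - len(kept)
        if dropped:
            kept["redacted_fields"] = dropped
        out.append(json.dumps(kept))
    return out


# Illustrative line only.
sample = [
    '{"msg":"evaluator.eino.callback","match_id":"m-1","timing":"on_end",'
    '"component":"ToolsNode","tool_output":"list_repos: /home/me/workspace/jax"}',
]
for excerpt in shareable_excerpt(sample):
    print(excerpt)
```

Review the output by hand before pasting it into an issue; an allowlist reduces exposure but does not prove that the kept fields are free of local paths.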