Live matrix debugging (operations note)

Operational notes from running evaluate_matrix_* Buck targets (configs/rounds/live-ic-vs-jcodemunch) with structured evaluator logging: what “stalled” logs usually mean, which greps shorten triage, and repeated failure signatures (GitHub MCP, ReAct max iterations, retries).

Audience: engineers tailing stderr JSON logs (--log-format=json) and, optionally, SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 NDJSON bundles. See bundles.md for the observability/ NDJSON contract (observability/eino-evaluator-*-*.ndjson).

Targets and rerun semantics

Buck target	Intent
`evaluate_matrix_5`	Slice of five dataset rows (`--dataset-max-items=5`): conservative parallelism (see `buck/defs/searchbench_round.bzl`).
`evaluate_matrix_5_force`	Same slice with `--matrix-force`: re-execute attempts even when prior attempt dirs look complete. Use when you need fresh evaluator traffic without deleting evidence.

Without force, completed attempts are reused: the work is skipped and evaluate_matrix.match_skipped_cached is logged. Stderr then stays quiet apart from coarse matrix lines, with no meaningful new evaluator.eino.callback lines.
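
One quick way to tell cache reuse from a hang is to compare the number of cached skips with the number of evaluator callbacks in a captured log. The following is a minimal sketch, assuming stderr was saved to a file such as run.log and that each event is a single JSON line with the msg string appearing verbatim; the function name matrix_cache_summary is hypothetical.

```bash
# Hypothetical helper: report cached skips vs. evaluator callbacks in a captured log.
# Assumes one JSON object per line, with the msg value appearing verbatim.
matrix_cache_summary() {
  local log="$1"
  local skipped callbacks
  # grep -c prints 0 but exits non-zero when nothing matches, so ignore the status.
  skipped=$(grep -c 'evaluate_matrix\.match_skipped_cached' "$log" || true)
  callbacks=$(grep -c 'evaluator\.eino\.callback' "$log" || true)
  echo "skipped_cached=${skipped} evaluator_callbacks=${callbacks}"
}

# Usage: matrix_cache_summary run.log
```

A non-zero skipped_cached count next to zero callbacks suggests the run reused prior attempts and that the `_force` target is the one you wanted.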

Higher presets: evaluate_matrix_{10,20,50}.

Turning on observability

Structured evaluator events are emitted by default on StructuredLog: stderr JSON lines with msg evaluator.eino.callback, alongside matrix and HTTP logs.
Bundle NDJSON mirror (observability/ in each attempt bundle): set SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 for the run:

bash

SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 nix develop -c buck2 run '//configs/rounds/live-ic-vs-jcodemunch:evaluate_matrix_5_force'
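
To confirm the mirror was actually written, the trace files can be listed per attempt. This is a minimal sketch, assuming you pass the directory that contains the attempt bundles as the first argument; the function name list_evaluator_traces is hypothetical, and the file pattern is the one quoted in the audience note above.

```bash
# Hypothetical helper: list evaluator trace files below a bundle directory.
# $1 = directory to search (its location is an assumption of this sketch).
list_evaluator_traces() {
  find "$1" -type f -path '*/observability/eino-evaluator-*-*.ndjson' | sort
}

# Usage: list_evaluator_traces path/to/attempt-bundles
```

An empty result after a forced run may mean the environment variable did not reach the evaluator process.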

Why `tail -f` can look idle (no “new logs”)

Long-running matches. evaluate_matrix.match_finished includes elapsed_seconds. One match commonly takes minutes (~10–13+ observed) while others finish faster, so stderr may go quiet simply because agents are executing tools or LLM rounds that do not emit new lines continuously.
Very large JSON log lines. Tool payloads (e.g. list_repos returning tens of repos) inflate line length massively; terminals and editors choke or appear frozen while buffering.
Parallel matches (parallel_matches) interleave bursts from distinct match_id / run_id — follow a slice with jq/grep:

bash

grep 'evaluate_matrix\.match_finished' run.log | jq .

grep '"msg":"evaluator\.eino\.callback"' run.log \
  | jq -r 'select(.match_id | test("<your slice>"; "l"))'

# Faster than reading raw:
grep 'evaluate_matrix\.' run.log | tail -40
grep 'timing":"on_error' run.log
grep -E 'exhausted|retries_exhausted|evaluate_matrix\.run_finished' run.log
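
When lines are too long to read comfortably, it can help to extract only the fields you need and to cap the printed line width. The helpers below are a sketch, not part of the repo: the function names are hypothetical, and the awk extraction assumes match_id is a flat string field and elapsed_seconds a flat numeric field, with no whitespace after the colon.

```bash
# Hypothetical helper: print "match_id elapsed_seconds" for each match_finished line.
# Field order on the line does not matter; missing fields are shown as "?".
match_durations() {
  awk '
    /"msg":"evaluate_matrix\.match_finished"/ {
      id = "?"; secs = "?"
      if (match($0, /"match_id":"[^"]*"/))
        id = substr($0, RSTART + 12, RLENGTH - 13)    # skip the 12-char key prefix and closing quote
      if (match($0, /"elapsed_seconds":[0-9.]+/))
        secs = substr($0, RSTART + 18, RLENGTH - 18)  # skip the 18-char key prefix
      print id, secs
    }' "$1"
}

# Hypothetical helper: show the last N lines (default 40), each cut to W characters
# (default 300), so huge tool payloads cannot flood the terminal.
tail_trimmed() {
  tail -n "${2:-40}" "$1" | cut -c1-"${3:-300}"
}

# Usage:
#   match_durations run.log
#   tail_trimmed run.log 40 300
```

The awk version avoids parsing whole lines, which keeps it quick on logs that contain very large payloads; switch back to jq when you need nested fields.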

High-signal log fields vs noise

Prefer these msg prefixes first:

Prefix	Insight
`evaluate_matrix.run_started` / `run_finished`	`force_rerun`, `match_total`, `promotion_verdict`.
`evaluate_matrix.match_started` / `match_finished`	`elapsed_seconds`, `failures_incumbent` / `failures_challenger`, `sample__failure`.
`evaluator.eino.callback`	`timing`: `on_start` / `on_end` / `on_error`; `component` (Graph, Chain, ToolsNode, ChatModel…); summaries for truncation.
`provider.http.round_trip`	Slow or failed model HTTP calls (HTTP 400, timeouts).
`run.failed` / `comparison.completed`	Rolled-up failure explanations; `REVIEW`/`REJECT_*`/`PROMOTE` posture.

evaluate_matrix.run_finished may include first_infrastructure_error when the aggregate cannot promote — start there before reading tens of evaluator.eino.callback lines.
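
A sketch of pulling that field out directly, assuming first_infrastructure_error is a flat string field on the run_finished line (the function name is hypothetical, and the simple pattern stops at the first double quote, so a value containing escaped quotes will be cut short):

```bash
# Hypothetical helper: print first_infrastructure_error from run_finished lines.
# Exits non-zero when no run_finished line carries the field.
first_infra_error() {
  awk '
    /"msg":"evaluate_matrix\.run_finished"/ {
      if (match($0, /"first_infrastructure_error":"[^"]*"/)) {
        # The key prefix including the opening quote is 30 characters long.
        print substr($0, RSTART + 30, RLENGTH - 31)
        found = 1
      }
    }
    END { exit (found ? 0 : 1) }' "$1"
}

# Usage: first_infra_error run.log || echo "no infrastructure error recorded"
```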

Recurring signatures we saw in the wild

These are hypotheses to validate per run; always anchor on timestamps and match_id from your own run.log.

A) MCP `index_repo`: GitHub 301 Moved Permanently

When evaluator tool summaries show Indexing failed together with a 301 on api.github.com/...repos/.../git/trees/HEAD, the MCP path is likely treating a redirect to the canonical repo URL as a failure (redirects are not followed).

Impact: incumbent may loop on “index jax again”; combines badly with evaluator turn limits.

Fix direction: follow redirects / use canonical numeric repo URLs in MCP; or ensure local materialize/checkout satisfies indexing without noisy remote tree scans.

B) `exceeds max iterations` (`NodeRunError`, `run node[ChatModel] pre processor fail`)

Seen repeatedly on evaluator.eino.callback lines with "timing":"on_error", across Graph / ReAct / searchbench_evaluator, for both challenger and incumbent.

Interpretation: the ReAct/agent graph repeatedly hit its loop iteration cap, which surfaces as a ChatModel pre-processor failure. Combined with evaluator retry exhaustion (summaries show "evaluator retries exhausted after 6 attempt(s)"), comparisons may end REVIEW/REJECT_*/INSUFFICIENT without a crisp winner.

Mitigation knobs (conceptual) — adjust per product/policy; not logging-only fixes:

Manifest runtime / evaluator bounds (maxSteps, evaluator.bounds, timeouts) so honest long M+C runs can finish rather than starving mid-graph.
Prompt/tool policy: avoid spirals on flaky tools (above).
Backend/MCP correctness before raising caps blindly.
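
Before touching any of these knobs, it helps to see which matches actually hit the cap. This is a rough sketch that counts max-iteration errors per match_id, assuming the error text "exceeds max iterations" and a flat match_id string both appear on the on_error line; the function name is hypothetical.

```bash
# Hypothetical helper: count "exceeds max iterations" on_error lines per match_id,
# printed as "<count> <match_id>", highest count first.
max_iter_by_match() {
  awk '
    /"timing":"on_error"/ && /exceeds max iterations/ {
      id = "unknown"
      if (match($0, /"match_id":"[^"]*"/))
        id = substr($0, RSTART + 12, RLENGTH - 13)
      count[id]++
    }
    END { for (id in count) print count[id], id }' "$1" | sort -rn
}

# Usage: max_iter_by_match run.log
```

If the counts cluster on matches that also show the 301 signature from section A, fixing the tool path is probably a better first step than raising caps.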

C) “Only one `completed_both` row” summaries

A concurrent matrix run may report asymmetric completion (completed_both, challenger vs incumbent counters) when challenger paths fail earlier under the same infra errors. Use match_finished per row as the ground truth for latency and failure excerpts.

D) Cost / spend gates

Spend approval is orthogonal to stderr content. Live execute targets refuse to run without evidence/cost/cost-estimate.json (non-BLOCK) plus a matching approve_cost (approval.json digest). Only automated tests may use SEARCHBENCH_SKIP_COST_SPEND_APPROVAL=1 (see AGENTS.md).
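
A pre-flight check can save a wasted invocation. The sketch below only verifies that the two files exist and are non-empty; it does not validate the BLOCK status or the digest, which the real gate enforces. The function name, the idea of a round directory argument, and the location of approval.json are assumptions for illustration.

```bash
# Hypothetical pre-flight: check that the cost evidence files are present.
# $1 = directory that contains evidence/ (assumed layout)
# $2 = path to approval.json (location is an assumption; pass it explicitly)
cost_gate_preflight() {
  local round_dir="$1" approval="$2"
  local estimate="${round_dir}/evidence/cost/cost-estimate.json"
  [ -s "$estimate" ] || { echo "missing: $estimate"; return 1; }
  [ -s "$approval" ] || { echo "missing: $approval"; return 1; }
  echo "ok"
}

# Usage: cost_gate_preflight path/to/round path/to/approval.json
```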

Diagnostic pairwise rounds (isolate layers)

Minimal manifests under configs/rounds/diagnostic-* intentionally change one knob versus the live live-ic-vs-jcodemunch baseline so you can A/B report.json + stderr prefixes without matrix-scale noise.

Round package	Switch	Compared with
`diagnostic-fake-eval-live-players`	Fake `Evaluator` (no Cerebras/`evaluator.eino.callback` traffic on Comparison) — real Incumbent + Challenger.	`//configs/rounds/live-ic-vs-jcodemunch:live_smoke` (same default dataset flags): isolates evaluator vs Evidence capture.
`diagnostic-incumbent-fake-eval-ic`	Fake `IncumbentPolicy` (stub-incumbent) + real IC Challenger + fake `Evaluator`.	`diagnostic-fake-eval-live-players` (both real agents, fake evaluator): separates IC-only MCP churn from incumbent + challenger interplay.

See each README.md for interpretation notes (configs/rounds/diagnostic-fake-eval-live-players/README.md, configs/rounds/diagnostic-incumbent-fake-eval-ic/README.md).

GitHub epic repro rounds (#114 stack — #115–#119)

Each numbered issue owns a configs/rounds/diagnostic-issue-* Buck package (validate, live_smoke, evaluate_n, real_lca_smoke, evaluate_matrix_*). Each package's README.md spells out targets, expected proof, and canonical bundles, plus buck2 … :validate_bundle for offline completeness (see AGENTS.md).

GitHub issue	Round package	Canonical single-plane `…/games/code-localization/rounds/` basename
#115	`diagnostic-issue-115-evaluator-scope` (prefer `:evaluate_n`)	`diagnostic-issue-115-eval-scope-001/`
#116	`diagnostic-issue-116-round-bundle-id` (`:live_smoke` / `:evaluate_n`)	`diagnostic-issue-116-round-bundle-001/`
#117	`diagnostic-issue-117-matrix-observability` (matrix + traces)	`diagnostic-issue-117-matrix-trace-001/` + matrix artefacts under `evidence/matrix/`
#118	`diagnostic-issue-118-ic-anchor` (`:evaluate_n`)	`diagnostic-issue-118-ic-anchor-001/`
#119	`diagnostic-issue-119-proof-modes` (modes on one manifest)	`diagnostic-issue-119-proof-modes-001/` ⚠ rewired between runs

Related docs & agents

Resource	Topic
bundles.md	Bundle `observability/` NDJSON contract.
run-entrypoints.md	Spend gates and Buck entry semantics.
`configs/rounds/live-ic-vs-jcodemunch/README.md`	Target table including `evaluate_matrix_5` presets.
`configs/rounds/diagnostic-fake-eval-live-players/README.md`	Pair `live_smoke` vs fake-eval comparator.
`configs/rounds/diagnostic-incumbent-fake-eval-ic/README.md`	IC-only MCP isolate (stub incumbent).

Cursor IDE: .cursor/ is intentionally gitignored in this repo. If you use a Cursor subagent markdown that mirrors these greps/workflows, duplicate it privately under ~/.cursor/agents/ or keep prompts in src/docs (this file).

Quick checklist before filing an infra issue

Grepped evaluate_matrix\.run_finished (first_infrastructure_error) and match_finished (elapsed_seconds).
Confirmed --matrix-force if you intended a fresh rerun (else cache skips).
Separated on_error / max iterations bursts from intermittent HTTP outages (provider.http.round_trip).
Redacted list_repos / MCP payloads when sharing excerpts (they enumerate local workspaces).
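
For the redaction step, a small stream filter is often enough for excerpts. This sketch masks only the user segment of /home/... and /Users/... paths and assumes the payloads contain such absolute paths; repo names and other identifying strings still need a manual pass. The function name is hypothetical.

```bash
# Hypothetical redaction filter: mask the user segment of home-directory paths.
# Reads log lines on stdin and writes redacted lines to stdout.
redact_home_paths() {
  sed -E 's#/(home|Users)/[^/" ]+#/\1/REDACTED#g'
}

# Usage: grep 'list_repos' run.log | redact_home_paths | cut -c1-400
```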
