
Live matrix debugging (operations note)

Operational notes from running evaluate_matrix_* Buck targets (configs/rounds/live-ic-vs-jcodemunch) with structured evaluator logging: what “stalled” logs usually mean, which greps shorten triage, and repeated failure signatures (GitHub MCP, ReAct max iterations, retries).

Audience: engineers tailing stderr JSON logs (--log-format=json) and, optionally, the NDJSON bundles written when SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 is set. See bundles.md for the observability/ NDJSON contract (observability/eino-evaluator-*-*.ndjson).

Targets and rerun semantics

| Buck target | Intent |
| --- | --- |
| evaluate_matrix_5 | Slice of five dataset rows (--dataset-max-items=5): conservative parallelism (see buck/defs/searchbench_round.bzl). |
| evaluate_matrix_5_force | Same slice with --matrix-force: re-execute attempts even when prior attempt dirs look complete. Use when you need fresh evaluator traffic without deleting evidence. |

Without --matrix-force, completed attempts are reused: the run skips their work and logs evaluate_matrix.match_skipped_cached. stderr then stays quiet apart from coarse matrix lines, with no meaningful new evaluator.eino.callback lines.
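To check whether a quiet run was mostly cache hits, a small helper can count both message types. This is a sketch: cached_vs_executed is a hypothetical name, and it assumes msg is serialized as "msg":"..." without whitespace, the same assumption the grep patterns later in this note make.

```bash
# cached_vs_executed LOGFILE
# Counts cache-skip lines and match_finished lines in a stderr JSON log.
# Assumption: one JSON object per line, with "msg":"<name>" and no spaces.
cached_vs_executed() {
  skipped=$(grep -c '"msg":"evaluate_matrix\.match_skipped_cached"' "$1")
  finished=$(grep -c '"msg":"evaluate_matrix\.match_finished"' "$1")
  echo "skipped_cached=${skipped} finished=${finished}"
}
```

Usage: cached_vs_executed run.log. A high skipped_cached count with few evaluator lines suggests the run needed --matrix-force.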

Higher presets: evaluate_matrix_{10,20,50}.

Turning on observability

  • Structured evaluator events on StructuredLog (stderr JSON lines with evaluator.eino.callback by default alongside matrix and HTTP logs).
  • Bundle NDJSON mirror (observability/ in each attempt bundle):
```bash
SEARCHBENCH_BUNDLE_EVALUATOR_TRACE=1 nix develop -c buck2 run '//configs/rounds/live-ic-vs-jcodemunch:evaluate_matrix_5_force'
```
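After such a run, it can help to confirm that trace files were actually written. The sketch below lists files matching the observability/eino-evaluator-*-*.ndjson pattern under a directory you pass in. The function name is hypothetical, and where attempt bundles live on disk is not stated here, so the root is an argument rather than a fixed path.

```bash
# list_evaluator_traces [ROOT]
# Prints "<line-count><TAB><path>" for each evaluator trace file under ROOT
# (default: current directory). ROOT is an assumption; point it at wherever
# your attempt bundles are written.
list_evaluator_traces() {
  find "${1:-.}" -type f -path '*/observability/eino-evaluator-*-*.ndjson' \
    | sort \
    | while IFS= read -r f; do
        printf '%s\t%s\n' "$(wc -l < "$f" | tr -d ' ')" "$f"
      done
}
```

An empty result after a forced run usually means the environment variable did not reach the process.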

Why tail -f can look idle (no “new logs”)

  1. Matches are long. One match commonly runs for minutes (roughly 10–13+ observed) while others finish faster, and stderr can go quiet simply because agents are executing tools or LLM rounds without emitting new lines. evaluate_matrix.match_finished reports elapsed_seconds, so check it before assuming a hang.
  2. Very large JSON log lines. Tool payloads (e.g. list_repos returning tens of repos) inflate line length massively; terminals and editors choke or appear frozen while buffering.
  3. Parallel matches (parallel_matches) interleave bursts from distinct match_id / run_id — follow a slice with jq/grep:
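```bash
# Per-match completion lines (single quotes, so one backslash escapes the dot):
grep 'evaluate_matrix\.match_finished' run.log | jq .

# Evaluator callbacks for one slice; lines without match_id are skipped, not errors:
grep '"msg":"evaluator\.eino\.callback"' run.log \
  | jq -c 'select((.match_id // "") | test("<your slice>"))'

# Faster than reading raw:
grep 'evaluate_matrix\.' run.log | tail -40
grep 'timing":"on_error' run.log
# "exhausted" also matches retries_exhausted:
grep -E 'exhausted|evaluate_matrix\.run_finished' run.log
```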
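For the oversized-line problem, two small filters are usually enough: one finds the longest lines, the other clips lines before they reach the terminal. Both names are hypothetical and the default width of 400 is an arbitrary assumption.

```bash
# longest_lines LOGFILE [N]
# Prints "<length><TAB><line-number>" for the N longest lines (default 5).
longest_lines() {
  awk '{ print length($0) "\t" NR }' "$1" | sort -rn | head -n "${2:-5}"
}

# clip_lines [WIDTH]
# Filter: truncates each stdin line to WIDTH characters (default 400).
clip_lines() {
  cut -c "1-${1:-400}"
}
```

Usage: tail -f run.log | clip_lines 300 keeps the terminal responsive; use longest_lines run.log to find which tool payloads to inspect separately.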

High-signal log fields vs noise

Prefer these msg prefixes first:

| Prefix | Insight |
| --- | --- |
| evaluate_matrix.run_started / run_finished | force_rerun, match_total, promotion_verdict. |
| evaluate_matrix.match_started / match_finished | elapsed_seconds, failures_incumbent / failures_challenger, sample_*_failure. |
| evaluator.eino.callback | timing: on_start / on_end / on_error; component (Graph, Chain, ToolsNode, ChatModel…); summaries for truncation. |
| provider.http.round_trip | Slow or failed model HTTP (HTML 400, timeouts). |
| run.failed / comparison.completed | Rolled-up failure explanations; REVIEW/REJECT_*/PROMOTE posture. |
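A quick way to see which of these prefixes dominate a log is a histogram over msg. This is a sketch under the assumption that msg values contain no spaces or quotes; msg_histogram is a hypothetical helper name.

```bash
# msg_histogram LOGFILE
# Prints "<count><TAB><msg>" sorted by descending count.
# Assumption: top-level "msg":"<name>" with no whitespace inside the name.
msg_histogram() {
  grep -o '"msg":"[^"]*"' "$1" \
    | sed -e 's/^"msg":"//' -e 's/"$//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1 "\t" $2 }'
}
```

Escaped occurrences of msg inside tool payload strings are not counted, because the pattern requires unescaped quotes.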

evaluate_matrix.run_finished may include first_infrastructure_error when the aggregate cannot promote — start there before reading tens of evaluator.eino.callback lines.
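Pulling that field out directly might look like the following. It is a minimal sketch that assumes first_infrastructure_error is a plain string on the run_finished line; a value containing escaped quotes would be cut short, so fall back to jq in that case. The function name is hypothetical.

```bash
# first_infra_error LOGFILE
# Prints the first_infrastructure_error value from each run_finished line.
# Prints nothing when the field is absent.
first_infra_error() {
  grep '"msg":"evaluate_matrix\.run_finished"' "$1" \
    | grep -o '"first_infrastructure_error":"[^"]*"' \
    | sed -e 's/^"first_infrastructure_error":"//' -e 's/"$//'
}
```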

Recurring signatures we saw in the wild

These are hypotheses to validate per run; always anchor on timestamps and match_id from your own run.log.

A) MCP index_repo: GitHub 301 Moved Permanently

When evaluator tool summaries show Indexing failed and a 301 on api.github.com/...repos/.../git/trees/HEAD, the MCP path is likely treating a canonical repo URL redirect as failure (no redirect follow).

Impact: incumbent may loop on “index jax again”; combines badly with evaluator turn limits.

Fix direction: follow redirects / use canonical numeric repo URLs in MCP; or ensure local materialize/checkout satisfies indexing without noisy remote tree scans.
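To see which matches are affected, one option is to count candidate lines per match_id. The sketch below assumes that "Indexing failed" and "301" appear literally on the same log line and that match_id is a top-level string. A bare 301 can also match unrelated numbers, so treat the counts as a starting point. The function name is hypothetical.

```bash
# index_301_by_match LOGFILE
# Prints "<count><TAB><match_id>" for lines mentioning both
# "Indexing failed" and "301", sorted by descending count.
index_301_by_match() {
  grep 'Indexing failed' "$1" | grep '301' \
    | grep -o '"match_id":"[^"]*"' \
    | sed -e 's/^"match_id":"//' -e 's/"$//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1 "\t" $2 }'
}
```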

B) exceeds max iterations (NodeRunError, run node[ChatModel] pre processor fail)

Seen repeatedly on evaluator.eino.callback lines with "timing":"on_error" (Graph / ReAct / searchbench_evaluator), for both challenger and incumbent.

Interpretation: the ReAct/agent graph repeatedly hit its iteration cap, and the error surfaces in the ChatModel node's pre-processor. Combined with evaluator retry exhaustion (summaries show evaluator retries exhausted after 6 attempt(s)), comparisons may end REVIEW/REJECT_*/INSUFFICIENT without a crisp winner.
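To tell iteration-cap bursts apart from retry exhaustion and HTTP traffic, a one-pass count per bucket is often enough. This sketch matches on literal substrings taken from the messages quoted in this note; error_buckets is a hypothetical name, and the http_round_trip bucket counts every provider.http.round_trip line, not only failures.

```bash
# error_buckets LOGFILE
# Counts lines per failure signature in a single pass over the log.
error_buckets() {
  awk '
    /exceeds max iterations/             { max_iter++ }
    /retries[ _]exhausted/               { retries++ }
    /"msg":"provider\.http\.round_trip"/ { http++ }
    END {
      printf "max_iterations=%d\n", max_iter
      printf "retries_exhausted=%d\n", retries
      printf "http_round_trip=%d\n", http
    }
  ' "$1"
}
```

A max_iterations count far above the other two points at agent loops rather than provider outages.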

Mitigation knobs (conceptual) — adjust per product/policy; not logging-only fixes:

  • Manifest runtime / evaluator bounds (maxSteps, evaluator.bounds, timeouts) so honest long M+C runs can finish rather than starving mid-graph.
  • Prompt/tool policy: avoid spirals on flaky tools (above).
  • Backend/MCP correctness before raising caps blindly.

C) “Only one completed_both row” summaries

A concurrent matrix run may report asymmetric completion (completed_both, challenger vs incumbent counters) when challenger paths fail earlier under the same infra errors. Use match_finished per row as ground truth for latency and failure excerpts.
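A per-row latency view can be built from the match_finished lines alone. The sketch below assumes match_id is a string and elapsed_seconds is a bare number on the same line; slowest_matches is a hypothetical name.

```bash
# slowest_matches LOGFILE [N]
# Prints "<elapsed_seconds><TAB><match_id>" for the N slowest matches
# (default 10). Missing fields are printed as "?".
slowest_matches() {
  grep '"msg":"evaluate_matrix\.match_finished"' "$1" | awk '
    {
      id = "?"; secs = "?"
      # "match_id":" is 12 characters; drop it and the closing quote.
      if (match($0, /"match_id":"[^"]*"/))
        id = substr($0, RSTART + 12, RLENGTH - 13)
      # "elapsed_seconds": is 18 characters.
      if (match($0, /"elapsed_seconds":[0-9.]+/))
        secs = substr($0, RSTART + 18, RLENGTH - 18)
      print secs "\t" id
    }' | sort -rn | head -n "${2:-10}"
}
```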

D) Cost / spend gates

Spend approval is orthogonal to stderr content. Live execute targets refuse to run without evidence/cost/cost-estimate.json (non-BLOCK) plus a matching approve_cost (approval.json digest). Only automated tests may use SEARCHBENCH_SKIP_COST_SPEND_APPROVAL=1 (see AGENTS.md).

Diagnostic pairwise rounds (isolate layers)

Minimal manifests under configs/rounds/diagnostic-* intentionally change one knob versus the live live-ic-vs-jcodemunch baseline so you can A/B report.json + stderr prefixes without matrix-scale noise.

| Round package | Switch | Compared with |
| --- | --- | --- |
| diagnostic-fake-eval-live-players | Fake Evaluator (no Cerebras/evaluator.eino.callback traffic on Comparison); real Incumbent + Challenger. | //configs/rounds/live-ic-vs-jcodemunch:live_smoke (same default dataset flags): isolates evaluator vs Evidence capture. |
| diagnostic-incumbent-fake-eval-ic | Fake IncumbentPolicy (stub-incumbent) + real IC Challenger + fake Evaluator. | diagnostic-fake-eval-live-players (both real agents, fake evaluator): separates IC-only MCP churn from incumbent + challenger interplay. |

See each README.md for interpretation notes (configs/rounds/diagnostic-fake-eval-live-players/README.md, configs/rounds/diagnostic-incumbent-fake-eval-ic/README.md).

GitHub epic repro rounds (#114 stack: #115–#119)

Each numbered issue owns a configs/rounds/diagnostic-issue-* Buck package (validate, live_smoke, evaluate_n, real_lca_smoke, evaluate_matrix_*). Each package's README.md spells out targets, expected proof, and canonical bundles, plus buck2 … :validate_bundle for offline completeness (see AGENTS.md).

| GitHub issue | Round package | Canonical single-plane …/games/code-localization/rounds/ basename |
| --- | --- | --- |
| #115 | diagnostic-issue-115-evaluator-scope (prefer :evaluate_n) | diagnostic-issue-115-eval-scope-001/ |
| #116 | diagnostic-issue-116-round-bundle-id (:live_smoke / :evaluate_n) | diagnostic-issue-116-round-bundle-001/ |
| #117 | diagnostic-issue-117-matrix-observability (matrix + traces) | diagnostic-issue-117-matrix-trace-001/ + matrix artefacts under evidence/matrix/ |
| #118 | diagnostic-issue-118-ic-anchor (:evaluate_n) | diagnostic-issue-118-ic-anchor-001/ |
| #119 | diagnostic-issue-119-proof-modes (modes on one manifest) | diagnostic-issue-119-proof-modes-001/ ⚠ rewired between runs |

| Resource | Topic |
| --- | --- |
| bundles.md | Bundle observability/ NDJSON contract. |
| run-entrypoints.md | Spend gates and Buck entry semantics. |
| configs/rounds/live-ic-vs-jcodemunch/README.md | Target table including evaluate_matrix_5* presets. |
| configs/rounds/diagnostic-fake-eval-live-players/README.md | Pair live_smoke vs fake-eval comparator. |
| configs/rounds/diagnostic-incumbent-fake-eval-ic/README.md | IC-only MCP isolate (stub incumbent). |

Cursor IDE: .cursor/ is intentionally gitignored in this repo. If you use a Cursor subagent markdown that mirrors these greps/workflows, duplicate it privately under ~/.cursor/agents/ or keep prompts in src/docs (this file).

Quick checklist before filing an infra issue

  1. Grepped evaluate_matrix\.run_finished and match_finished (elapsed_seconds, first_infrastructure_error).
  2. Confirmed --matrix-force if you intended a fresh rerun (else cache skips).
  3. Separated on_error / max iterations bursts from intermittent HTTP outages (provider.http.round_trip).
  4. Redacted list_repos / MCP payloads when sharing excerpts (they enumerate local workspaces).
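For the redaction step, a blunt but safe option is to drop every line that mentions list_repos before sharing. This is a sketch: redact_list_repos is a hypothetical name, it removes the whole line rather than only the payload, and other MCP payloads would need their own patterns.

```bash
# redact_list_repos
# Filter: replaces any stdin line mentioning list_repos with a short
# placeholder that records only the original line length.
redact_list_repos() {
  awk '
    /list_repos/ {
      printf "[redacted: list_repos line, %d chars]\n", length($0)
      next
    }
    { print }
  '
}
```

Usage: grep 'timing":"on_error' run.log | redact_list_repos > excerpt.log.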