SearchBench: Evaluating Agent Interfaces, Not Just Agents
Research note — not the operational docs index. Start at ../index.md.
SearchBench is an evaluation harness for studying how different repository interfaces change the behavior of coding agents.
The central claim is simple:
The model is not the whole system. The interface changes what the agent is capable of.
Most coding-agent benchmarks ask:
Which model writes better code?
SearchBench asks a different question:
Which environment makes an agent behave like a better engineer?
That means evaluating not only models, but the tools, graphs, validation surfaces, artifacts, and feedback loops surrounding them.
Thesis
Agent performance is not only a property of the model.
It is distributed across:
model
prompt
tool surface
repo structure
build graph
code graph
validation loop
artifact format
human inspection surface
A weak interface can make a strong model wasteful, confused, or overconfident.
A strong interface can make the same model more reliable, efficient, and inspectable.
SearchBench exists to measure that difference.
Core Research Question
Given the same model, same repository, and same task:
How does the available interface change the agent's behavior?
For example:
same model
same repo
same task
different tool surface
different outcome
This lets us test whether agents improve when the repository exposes better operational structure.
Interface Families
SearchBench can compare at least three broad interface families.
1. Raw Repository Interface
The agent receives normal repo access:
files
grep/search
shell commands
README prose
ad hoc scripts
This is the default world most coding agents operate in.
The agent has to infer:
what files matter
what commands exist
what tests prove the change
what generated files need updating
when it is done
This interface is realistic, but it forces the model to recover project policy from scattered prose and convention.
2. Code Intelligence Interface
The agent receives code-structure tools:
symbols
references
call graph
file graph
semantic search
bounded lookahead
Examples:
CodeQL
LSP
jCodeMunch
Iterative Context
static code graphs
This interface helps answer:
Where should I look?
What code is related?
What calls what?
What files are likely relevant?
This is the current core SearchBench direction: measuring whether better code-search and graph-exploration tools improve localization and repair behavior.
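One way to read "bounded lookahead" is as a depth-limited walk over a code graph. The sketch below assumes that reading. It is not how Iterative Context or any listed tool implements it, and the function names in the toy call graph are hypothetical.

```python
from collections import deque


def bounded_lookahead(graph, start, max_depth):
    """Breadth-first walk from start, stopping max_depth edges out.

    Returns a dict mapping each reached node to its distance from start.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # at the bound: do not expand this node further
        for neighbor in graph.get(node, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


# Hypothetical call graph: caller -> callees.
call_graph = {
    "run_round": ["load_manifest", "score"],
    "load_manifest": ["parse_pkl"],
    "parse_pkl": ["read_file"],
    "score": [],
}

nearby = bounded_lookahead(call_graph, "run_round", max_depth=2)
```

The bound is the point: the agent sees what is close to its current position without paying for the whole graph.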
3. Work Graph Interface
The agent receives repository-operation tools:
Buck2 targets
build graph queries
test suites
proof targets
generated artifact dependencies
legal operations
Examples:
buck2 targets
buck2 uquery
buck2 test //:check
buck2 test //:check_full
config-bundle targets
release-candidate targets
This interface helps answer:
What am I allowed to do?
What does this action depend on?
What proves this change?
Which validation target is appropriate?
What is too expensive or too live for normal checks?
When am I done?
This may be as important as code search.
Code graphs help the agent understand code.
Work graphs help the agent understand engineering lifecycle.
Key Distinction
Code graph:
lookahead over meaning
Work graph:
lookahead over action
A code graph helps an agent find relevant files.
A work graph helps an agent choose the right operation and proof.
Many agent failures are not failures of syntax or code generation. They are lifecycle failures:
ran the wrong checks
skipped generated files
edited the wrong layer
over-tested
under-tested
used live/manual targets in deterministic gates
did not know what counted as done
Buck-like systems can expose those lifecycle rules as graph structure instead of prose.
Buck2 as an Agent Interface
In this framing, Buck2 is not only a build system.
It is an agent-facing action graph.
A Buck target is a named legal move:
//:check
//:check_full
//src/searchbench-go:check
//src/iterative-context:check_full
//configs/rounds/optimize-ic:round_validate
//configs/rounds/optimize-ic:ic_workspace_smoke
Instead of asking an agent to infer commands like:
go test ./...
pytest
ruff check
basedpyright
pkl eval
repomix
the repository can expose:
these are the legal operations
these are their dependencies
these are their costs
these are their artifacts
these are their proof obligations
That turns the repo from a pile of scripts into a structured operational environment.
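As a sketch of what "legal operations with dependencies, costs, artifacts, and proof obligations" could look like as data, consider the record below. The field names, the cost tiers, and the cost assigned to each target are assumptions made for illustration. They are not part of Buck2 or of SearchBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalOperation:
    """One agent-visible target: a named legal move plus its metadata."""

    label: str                       # Buck target label, e.g. "//:check"
    deps: tuple[str, ...] = ()       # labels this operation depends on
    cost: str = "cheap"              # hypothetical tiers: "cheap", "expensive", "live"
    artifacts: tuple[str, ...] = ()  # files the operation is expected to produce
    proves: tuple[str, ...] = ()     # hypothetical proof obligations it discharges


def allowed_in_deterministic_gate(op: LegalOperation) -> bool:
    """Live or provider-backed work stays out of deterministic gates."""
    return op.cost != "live"


# Labels come from the list above; the cost tiers are assumed.
catalog = [
    LegalOperation("//:check"),
    LegalOperation(
        "//configs/rounds/optimize-ic:round_validate",
        proves=("round manifest is valid",),
    ),
    LegalOperation("//configs/rounds/optimize-ic:ic_workspace_smoke", cost="live"),
]

gate_candidates = [op.label for op in catalog if allowed_in_deterministic_gate(op)]
```

With a catalog like this, "is this target too live for normal checks?" becomes a lookup instead of an inference from README prose.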
Structured Agent Operations
A future SearchBench experiment could avoid asking the agent to write raw Buck/Starlark.
Instead, the agent could emit structured operations:
{
  "operation": "extend_test_suite",
  "package": "",
  "suite": "check_full",
  "add_tests": [
    "//configs/rounds/optimize-ic:round_validate"
  ],
  "reason": "deterministic config validation belongs in the full gate"
}
A deterministic renderer would turn this into a Buck file edit.
Then Buck validates the graph.
Then SearchBench records whether the agent chose the right operation and proof.
This is the important pattern:
agent emits semantic intent
system renders sanctioned repo operation
Buck validates the work graph
tests validate behavior
bundle records evidence
The agent does not need arbitrary shell access to be useful.
It needs a good action language.
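A minimal sketch of the deterministic renderer, under stated assumptions: the allow-list, the function names, and the existing contents of the suite are hypothetical. The output is a test_suite stanza as plain text. How a real renderer would patch an existing Buck file is left open.

```python
import json

ALLOWED_OPERATIONS = {"extend_test_suite"}  # hypothetical allow-list


def render_test_suite(name: str, tests: list[str]) -> str:
    """Render a test_suite stanza as text, with sorted, de-duplicated tests."""
    lines = ["test_suite(", f'    name = "{name}",', "    tests = ["]
    lines += [f'        "{label}",' for label in sorted(set(tests))]
    lines += ["    ],", ")"]
    return "\n".join(lines)


def apply_operation(raw: str, existing_tests: list[str]) -> str:
    """Validate a structured operation and return the new stanza text."""
    op = json.loads(raw)
    if op.get("operation") not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {op.get('operation')!r}")
    for label in op["add_tests"]:
        if not label.startswith("//"):
            raise ValueError(f"not an absolute target label: {label!r}")
    return render_test_suite(op["suite"], existing_tests + op["add_tests"])


request = json.dumps({
    "operation": "extend_test_suite",
    "package": "",
    "suite": "check_full",
    "add_tests": ["//configs/rounds/optimize-ic:round_validate"],
    "reason": "deterministic config validation belongs in the full gate",
})

# Assumed starting contents of the suite, for illustration only.
stanza = apply_operation(request, existing_tests=["//src/searchbench-go:check"])
```

Because rendering is sorted and de-duplicated, the same operation always yields the same edit. The agent's output is then judged on the operation it chose, not on its Starlark formatting.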
Possible Experiment Design
SearchBench can run the same task under different tool surfaces.
Baseline
files
grep
shell
README prose
Candidate A: Code Graph
files
grep
shell
code graph tools
symbol lookup
references
bounded lookahead
Candidate B: Work Graph
files
grep
shell
Buck target listing
Buck query
Buck target execution
structured Buck operations
Candidate C: Hybrid
code graph
work graph
bundle evidence
release report
The model stays constant.
The repo stays constant.
The task stays constant.
Only the interface changes.
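The design above can be written down as a small run matrix. This is a sketch: the arm names, the Run record, and the model and repo identifiers are placeholders, and the tool lists are copied from the arms described above.

```python
from dataclasses import dataclass

# Tool surface per arm, taken from the lists above.
ARMS = {
    "baseline": {"files", "grep", "shell", "README prose"},
    "code_graph": {
        "files", "grep", "shell", "code graph tools",
        "symbol lookup", "references", "bounded lookahead",
    },
    "work_graph": {
        "files", "grep", "shell", "Buck target listing", "Buck query",
        "Buck target execution", "structured Buck operations",
    },
    "hybrid": {"code graph", "work graph", "bundle evidence", "release report"},
}


@dataclass(frozen=True)
class Run:
    model: str
    repo: str
    task: str
    arm: str


def plan_runs(model: str, repo: str, task: str) -> list[Run]:
    """One run per arm, with model, repo, and task held fixed."""
    return [Run(model, repo, task, arm) for arm in ARMS]


def only_interface_varies(runs: list[Run]) -> bool:
    """True when all runs share (model, repo, task) and no arm repeats."""
    fixed = {(r.model, r.repo, r.task) for r in runs}
    return len(fixed) == 1 and len({r.arm for r in runs}) == len(runs)


# Placeholder identifiers; the task text is one of the example tasks.
runs = plan_runs(
    "model-x", "repo@rev", "Fix a failing Buck target with the smallest patch."
)
```

Checking only_interface_varies before a batch starts guards against comparing arms that differ in more than the tool surface.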
Example Tasks
SearchBench could evaluate tasks like:
Add a config bundle and expose the correct validation target.
Determine which target proves a change to a round manifest.
Add a generated-file check to the full deterministic gate.
Fix a failing Buck target with the smallest patch.
Add an IC workspace smoke target without putting live/provider-backed work into //:check.
Given a changed file set, choose the minimal proof targets.
Promote a release candidate only if the correct evidence bundle passes.
These tasks measure agentic engineering behavior, not just code editing.
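For the "changed file set to minimal proof targets" task, a grader needs a reference answer. The sketch below derives one from a hand-written coverage table. The table, the path patterns, and the file names are hypothetical. In a real repository the mapping would more plausibly come from build graph queries than from a static dict.

```python
from fnmatch import fnmatch

# Hypothetical mapping from proof target to the path patterns it covers.
PROOF_COVERAGE = {
    "//configs/rounds/optimize-ic:round_validate": ["configs/rounds/optimize-ic/*"],
    "//src/searchbench-go:check": ["src/searchbench-go/*"],
    "//src/iterative-context:check_full": ["src/iterative-context/*"],
}


def minimal_proof_targets(changed_files):
    """Return (sorted targets covering the changes, files no target covers)."""
    selected = set()
    uncovered = []
    for path in changed_files:
        owners = [
            target
            for target, patterns in PROOF_COVERAGE.items()
            if any(fnmatch(path, pattern) for pattern in patterns)
        ]
        if owners:
            selected.update(owners)
        else:
            uncovered.append(path)
    return sorted(selected), uncovered


# Hypothetical changed file set.
targets, uncovered = minimal_proof_targets(
    ["configs/rounds/optimize-ic/round.pkl", "docs/notes.md"]
)
```

The result is minimal only while the patterns do not overlap. Uncovered files are returned separately, since a change that no target proves is itself worth recording.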
Metrics
SearchBench should score more than pass/fail.
Potential metrics:
correct patch
correct target selected
minimal proof selected
invalid commands attempted
irrelevant commands attempted
tokens spent
wall-clock time
number of retries
files touched
unrelated files touched
over-testing
under-testing
lifecycle policy violations
whether the agent stopped with a valid proof
A key metric could be:
proof distance
That means:
How far was the agent's chosen validation path from the minimal correct proof path?
For example, if the correct proof is:
//configs/rounds/optimize-ic:round_validate
but the agent runs:
go test ./...
pytest
nix flake check
buck2 test //:check_full
the task may pass, but the agent has shown weak operational understanding.
SearchBench can measure that.
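The note does not fix a formula for proof distance. One possible definition, offered as an assumption, counts required proofs the agent never covered (under-testing) plus steps it ran that were not in the minimal proof (over-testing). The covers table, which says that //:check_full includes round_validate, is also assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProofDistance:
    missing: int  # required proofs never covered (under-testing)
    extra: int    # steps run that are not in the minimal proof (over-testing)

    @property
    def total(self) -> int:
        return self.missing + self.extra


def proof_distance(
    ran: list[str],
    minimal: set[str],
    covers: Optional[dict[str, set[str]]] = None,
) -> ProofDistance:
    """Compare the steps an agent ran against the minimal proof set.

    covers maps a step to the proofs it transitively includes.
    """
    covers = covers or {}
    covered = set()
    for step in ran:
        covered.add(step)
        covered |= covers.get(step, set())
    missing = len(minimal - covered)
    extra = len(set(ran) - minimal)
    return ProofDistance(missing=missing, extra=extra)


ROUND_VALIDATE = "//configs/rounds/optimize-ic:round_validate"

# The example above: four broad commands instead of one targeted proof.
observed = proof_distance(
    ran=["go test ./...", "pytest", "nix flake check", "buck2 test //:check_full"],
    minimal={ROUND_VALIDATE},
    covers={"buck2 test //:check_full": {ROUND_VALIDATE}},  # assumed coverage
)
ideal = proof_distance(ran=[ROUND_VALIDATE], minimal={ROUND_VALIDATE})
```

Under these assumptions the observed run has nothing missing and four extra steps. It passes, at a distance of four from the minimal proof. Keeping missing and extra separate lets over-testing and under-testing be reported as different failures.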
Hypothetical Result
A useful SearchBench result might look like:
Same model, same repo, same task.
Raw repo interface:
42k tokens
17 tool calls
6 irrelevant commands
wrong validation target
Code graph interface:
24k tokens
9 tool calls
found correct files faster
still unsure what to run
Work graph interface:
18k tokens
6 tool calls
selected correct proof target
stopped cleanly
Hybrid interface:
14k tokens
5 tool calls
correct patch
correct proof
clean evidence bundle
This would support the claim that interface design changes effective agent capability.
Relationship to SearchBench's Existing Direction
SearchBench already studies code-localization interfaces:
baseline retrieval
vs
candidate graph exploration
The Buck/work-graph direction extends the same idea from code search to engineering lifecycle.
Iterative Context:
better code-search lookahead
Buck2:
better action/proof lookahead
Bundles:
better evidence and release memory
Visualization:
better human inspection
Together, these form a broader research product:
SearchBench evaluates agent environments.
Not just agents.
Product Thesis
SearchBench should not only answer:
Which model is best?
It should answer:
Which interface makes this model behave better?
This is the deeper product category.
SearchBench can compare:
models
prompts
tools
code graphs
work graphs
artifact bundles
validation loops
visualization surfaces
under a shared evaluation harness.
The result is a way to study agentic engineering systems as systems.
Philosophical Claim
The underlying philosophy is:
Tools matter because they change what cognition is cheap, expensive, visible, or impossible.
For agents, this is especially important.
A model operating over raw files and shell commands must infer too much.
A model operating inside a well-designed environment can spend more of its effort on the actual engineering problem.
The goal is not to make the model smarter in isolation.
The goal is to design an environment where the model's intelligence is usable, bounded, inspectable, and correctable.
Short Version
CodeQL makes code queryable.
Buck makes work queryable.
SearchBench can measure whether queryable work makes agents better engineers.
Or:
The benchmark is not just the model.
The benchmark is the model inside an environment.