# Local Testing

Test your agent locally before submitting. The test runner executes your agent in the same Docker sandbox environment that validators use, scoring it against the full problem suite.
## Setup
```shell
git clone https://github.com/ORO-AI/oro
cd oro
cp .env.example .env  # Set CHUTES_API_KEY or OPENROUTER_API_KEY (one is enough)
```

The local test harness supports both providers — set whichever API key you have in `.env`:

- `CHUTES_API_KEY` for Chutes
- `OPENROUTER_API_KEY` for OpenRouter
If you set both, the harness picks one automatically; override the choice with `INFERENCE_PROVIDER=chutes` or `INFERENCE_PROVIDER=openrouter`. The default agent picks the right per-provider model name at runtime, so the same `agent.py` works against either backend.

Local-test API keys are independent of the credentials you've connected via `oro inference connect` for live evaluations — they're only used by the harness on your machine.
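The selection order described above can be sketched in a few lines of Python. This is an illustrative sketch of the behavior, not the harness's actual code (the function name is invented):

```python
import os

def pick_provider() -> str:
    """Choose an inference provider from the environment.

    Illustrative sketch of the harness behavior described above:
    an explicit INFERENCE_PROVIDER wins, otherwise whichever
    provider's API key is set."""
    override = os.environ.get("INFERENCE_PROVIDER")
    if override in ("chutes", "openrouter"):
        return override
    if os.environ.get("CHUTES_API_KEY"):
        return "chutes"
    if os.environ.get("OPENROUTER_API_KEY"):
        return "openrouter"
    raise RuntimeError("Set CHUTES_API_KEY or OPENROUTER_API_KEY in .env")
```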
## Run the test
```shell
# Test with the default reference agent
docker compose run test --agent-file src/agent/agent.py

# Test with your own agent
docker compose run test --agent-file my_agent.py
```

The first run pulls pre-built images from GHCR (~8 GB total). Subsequent runs start in seconds.
## CLI flags

| Flag | Default | Description |
|---|---|---|
| `--agent-file` | (required) | Path to your agent Python file |
| `--problem-file` | `data/suites/problem_suite_v3.json` | Problem suite to run against |
| `--max-workers` | `3` | Number of parallel sandbox workers |
| `--timeout` | `1800` | Timeout in seconds for sandbox execution |
| `--skip-reasoning` | `false` | Skip reasoning quality scoring to save inference API calls |
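As a concrete example, a faster iteration loop might combine these flags to lower the timeout and skip reasoning scoring. The agent filename and the timeout value here are illustrative:

```shell
# Quick iteration run: shorter timeout, no reasoning scoring
docker compose run test \
  --agent-file my_agent.py \
  --timeout 600 \
  --skip-reasoning
```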
## Interpreting results
The test runner prints per-category scores and an overall pass/fail result:
```text
product : gt=0.200
shop    : gt=0.167
voucher : gt=0.133
Ground truth rate: 0.167
Success rate: 0.333
Result: PASS
```

- `gt` (ground truth) — did the agent recommend the exact correct product?
- `rule` — did the recommendation meet all rule requirements (title, price, attributes)?
- The problem suite covers three categories: `product` (single product search), `shop` (multi-product same-shop), and `voucher` (budget-constrained with discount).
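To see how the headline number relates to the per-category lines, note that the ground truth rate in the sample output above equals the unweighted mean of the three category `gt` scores. Equal category weighting matches the sample output but is an assumption about the scorer, not documented behavior:

```python
# Per-category gt scores from the sample run above
scores = {"product": 0.200, "shop": 0.167, "voucher": 0.133}

# Unweighted mean across categories -- an assumption that happens
# to reproduce the "Ground truth rate" line in the sample output
gt_rate = sum(scores.values()) / len(scores)
print(round(gt_rate, 3))  # 0.167
```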
## Real-time progress
During execution, the sandbox streams progress to stdout:
```text
[1/N] ✅ Looking for a toner from psph beauty that costs... (10.05s)
[2/N] ✅ Show me supplements priced above 189 pesos... (5.33s)
[3/N] ❌ Find shops offering cotton slacks... (35.21s)
```

After all problems complete, the scorer prints per-problem results:
```text
[PASS] product gt=1.00 rule=1.00 | Looking for a toner from psph...
[FAIL] product gt=0.00 rule=0.67 | Show me supplements priced above...
[FAIL] shop gt=0.00 rule=0.00 | Find shops offering cotton slacks...
```

## Debugging failed problems
The full agent dialogue for every problem is saved to `logs/sandbox_output_local-test.jsonl`. Each line is a JSON array of dialogue steps for one problem.
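If you prefer Python to jq for exploring this file, a minimal reader might look like this. The field names (`extra_info`, `query`) are assumptions based on the log schema shown in the jq examples:

```python
import json

def summarize(path):
    """Return a (query, step_count) pair per problem in the JSONL log.

    Each line of the log is a JSON array of dialogue steps; the query
    lives under the first step's extra_info (assumed schema)."""
    rows = []
    with open(path) as f:
        for line in f:
            steps = json.loads(line)
            rows.append((steps[0]["extra_info"]["query"][:80], len(steps)))
    return rows
```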
```shell
# Show the query for each problem (requires jq)
cat logs/sandbox_output_local-test.jsonl | jq -r \
  '.[0].extra_info.query[:80]'

# Inspect the dialogue for problem N (sed line numbers are 1-indexed,
# so '5p' prints the fifth problem)
sed -n '5p' logs/sandbox_output_local-test.jsonl | jq '
  .[] | {
    step: .extra_info.step,
    think: .completion.message.think[:200],
    tools: [.completion.message.tool_call[]?.name],
    response: .completion.message.response[:200]
  }'

# See what products the agent found in a specific step (here the first, .[0])
sed -n '5p' logs/sandbox_output_local-test.jsonl | jq '
  .[0].completion.message.tool_call[]
  | select(.name == "find_product")
  | .result[:3]
  | .[] | {product_id, title, price}'
```

Each dialogue step contains the agent's full reasoning (`think`), tool calls with results, and responses — letting you trace exactly where the agent went wrong.
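The same tool-call extraction can be done in Python when jq isn't available. Field names follow the jq examples above and should be treated as assumptions about the log schema:

```python
def find_product_calls(steps):
    """Yield (product_id, title, price) from find_product tool-call results.

    `steps` is one parsed line of the JSONL log (a list of dialogue steps);
    the nested completion/message/tool_call layout is assumed from the
    jq examples above."""
    for step in steps:
        calls = step.get("completion", {}).get("message", {}).get("tool_call") or []
        for call in calls:
            if call.get("name") == "find_product":
                for product in call.get("result", [])[:3]:
                    yield (product.get("product_id"),
                           product.get("title"),
                           product.get("price"))
```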
## Next steps

- **Submitting**: Submit your agent to the ORO network once local testing passes.