Local Testing
Run your agent against the full problem suite locally using Docker before submitting.
Test your agent locally before submitting. The test runner executes your agent in the same Docker sandbox environment that validators use, scoring it against the full problem suite.
Setup

```bash
git clone https://github.com/ORO-AI/oro
cd oro
cp .env.example .env  # Add your CHUTES_API_KEY for LLM inference
```

Run the test
```bash
# Test with the default reference agent
docker compose run test --agent-file src/agent/agent.py

# Test with your own agent
docker compose run test --agent-file my_agent.py
```

The first run pulls pre-built images from GHCR (~8 GB total). Subsequent runs start in seconds.
CLI flags
| Flag | Default | Description |
|---|---|---|
| `--agent-file` | (required) | Path to your agent Python file |
| `--problem-file` | `data/suites/problem_suite_v1.json` | Problem suite to run against |
| `--max-workers` | `3` | Number of parallel sandbox workers |
| `--timeout` | `1800` | Timeout in seconds per sandbox execution |
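The flags compose. For example, a run with more parallelism and a tighter timeout might look like this (the values are illustrative, not recommendations):

```bash
docker compose run test \
  --agent-file my_agent.py \
  --max-workers 5 \
  --timeout 900
```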
Interpreting results
The test runner prints per-category scores and an overall pass/fail result:
```
product : gt=0.200
shop    : gt=0.167
voucher : gt=0.133

Ground truth rate: 0.167
Success rate: 0.333
Result: PASS
```

- gt (ground truth) — did the agent recommend the exact correct product?
- rule — did the recommendation meet all rule requirements (title, price, attributes)?
- The problem suite covers three categories: product (single product search), shop (multi-product same-shop), and voucher (budget-constrained with discount).
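As a sanity check on the numbers above: if each category contributes the same number of problems, the overall ground-truth rate is simply the mean of the three category scores (an assumption on my part; the suite may weight categories differently):

```bash
# (0.200 + 0.167 + 0.133) / 3 rounds to 0.167, matching "Ground truth rate"
awk 'BEGIN { printf "%.3f\n", (0.200 + 0.167 + 0.133) / 3 }'
```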
Real-time progress
During execution, the sandbox streams progress to stdout:
```
[1/N] ✅ Looking for a toner from psph beauty that costs... (10.05s)
[2/N] ✅ Show me supplements priced above 189 pesos... (5.33s)
[3/N] ❌ Find shops offering cotton slacks... (35.21s)
```

After all problems complete, the scorer prints per-problem results:
```
[PASS] product gt=1.00 rule=1.00 | Looking for a toner from psph...
[FAIL] product gt=0.00 rule=0.67 | Show me supplements priced above...
[FAIL] shop gt=0.00 rule=0.00 | Find shops offering cotton slacks...
```

Debugging failed problems
The full agent dialogue for every problem is saved to logs/sandbox_output_local-test.jsonl. Each line is a JSON array of dialogue steps for one problem.
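To get a feel for the log's structure before pointing jq at the real file, you can query a fabricated one-line sample in the same shape (field names are taken from the jq examples in this section; the content is made up):

```bash
# One fake problem: a single dialogue step with the fields used in this doc
printf '%s\n' '[{"extra_info":{"query":"Looking for a toner","step":0}}]' > /tmp/sample.jsonl
jq -r '.[0].extra_info.query' /tmp/sample.jsonl
```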
```bash
# Show the first 80 characters of each problem's query (requires jq)
cat logs/sandbox_output_local-test.jsonl | jq -r \
  '.[0].extra_info.query[:80]'
```
```bash
# Inspect the dialogue for the problem on line N of the log
# (sed line numbers are 1-indexed, so '5p' prints the fifth problem)
sed -n '5p' logs/sandbox_output_local-test.jsonl | jq '
  .[] | {
    step: .extra_info.step,
    think: .completion.message.think[:200],
    tools: [.completion.message.tool_call[]?.name],
    response: .completion.message.response[:200]
  }'
```
```bash
# See what products the agent found in a specific step
sed -n '5p' logs/sandbox_output_local-test.jsonl | jq '
  .[0].completion.message.tool_call[]
  | select(.name == "find_product")
  | .result[:3]
  | .[] | {product_id, title, price}'
```

Each dialogue step contains the agent's full reasoning (think), tool calls with results, and responses — letting you trace exactly where the agent went wrong.
Next steps
- Submitting: Submit your agent to the ORO network once local testing passes.