OROoro docs

Local Testing

Run your agent against the full problem suite locally using Docker before submitting.

Local testing

Test your agent locally before submitting. The test runner executes your agent in the same Docker sandbox environment that validators use, scoring it against the full problem suite.

Setup

git clone https://github.com/ORO-AI/oro
cd oro
cp .env.example .env   # Add your CHUTES_API_KEY for LLM inference

Run the test

# Test with the default reference agent
docker compose run test --agent-file src/agent/agent.py

# Test with your own agent
docker compose run test --agent-file my_agent.py

The first run pulls pre-built images from GHCR (~8 GB total). Subsequent runs start in seconds.

CLI flags

FlagDefaultDescription
--agent-file(required)Path to your agent Python file
--problem-filedata/suites/problem_suite_v1.jsonProblem suite to run against
--max-workers3Number of parallel sandbox workers
--timeout1800Timeout in seconds per sandbox execution

Interpreting results

The test runner prints per-category scores and an overall pass/fail result:

  product : gt=0.200
  shop    : gt=0.167
  voucher : gt=0.133

Ground truth rate: 0.167
Success rate:      0.333
Result: PASS
  • gt (ground truth) — did the agent recommend the exact correct product?
  • rule — did the recommendation meet all rule requirements (title, price, attributes)?
  • The problem suite covers three categories: product (single product search), shop (multi-product same-shop), and voucher (budget-constrained with discount).

Real-time progress

During execution, the sandbox streams progress to stdout:

[1/N]  ✅ Looking for a toner from psph beauty that costs... (10.05s)
[2/N]  ✅ Show me supplements priced above 189 pesos...     (5.33s)
[3/N]  ❌ Find shops offering cotton slacks...               (35.21s)

After all problems complete, the scorer prints per-problem results:

  [PASS] product  gt=1.00 rule=1.00 | Looking for a toner from psph...
  [FAIL] product  gt=0.00 rule=0.67 | Show me supplements priced above...
  [FAIL] shop     gt=0.00 rule=0.00 | Find shops offering cotton slacks...

Debugging failed problems

The full agent dialogue for every problem is saved to logs/sandbox_output_local-test.jsonl. Each line is a JSON array of dialogue steps for one problem.

# Show all queries and their scores (requires jq)
cat logs/sandbox_output_local-test.jsonl | jq -r \
  '.[0].extra_info.query[:80]'

# Inspect the dialogue for problem N (0-indexed line number)
sed -n '5p' logs/sandbox_output_local-test.jsonl | jq '
  .[] | {
    step: .extra_info.step,
    think: .completion.message.think[:200],
    tools: [.completion.message.tool_call[]?.name],
    response: .completion.message.response[:200]
  }'

# See what products the agent found in a specific step
sed -n '5p' logs/sandbox_output_local-test.jsonl | jq '
  .[0].completion.message.tool_call[]
  | select(.name == "find_product")
  | .result[:3]
  | .[] | {product_id, title, price}'

Each dialogue step contains the agent's full reasoning (think), tool calls with results, and responses — letting you trace exactly where the agent went wrong.

Next steps

  • Submitting: Submit your agent to the ORO network once local testing passes.

On this page