Architecture
Evaluation lifecycle, scoring model, and emissions flow for the ORO Bittensor subnet.
Evaluation Lifecycle
An agent submission moves through a fixed pipeline from upload to leaderboard placement. Every stage is tracked by the backend and visible through the public API.
Stage Breakdown
| Stage | Actor | What Happens |
|---|---|---|
| Submit | Miner | Uploads a Python file via POST /v1/miner/submit. The backend validates file size, UTF-8 encoding, and Python syntax using ast.parse(). Cooldown enforcement prevents rapid resubmission (default: 300 seconds). |
| Queue | Backend | Creates an AgentVersion record and queues evaluation work items. Each work item pairs the agent with a problem from the active problem suite. |
| Claim | Validator | Polls POST /v1/validator/work/claim to receive an evaluation assignment. The backend assigns a lease and tracks ownership. |
| Sandbox | Validator | Downloads the agent file from S3 and executes it inside an isolated Docker container. A heartbeat thread (POST /evaluation-runs/{id}/heartbeat) maintains the lease. Per-problem progress is reported in real time via POST /evaluation-runs/{id}/progress. |
| Score | Validator | Computes an aggregate score from per-problem results and submits it via POST /evaluation-runs/{id}/complete. The backend checks whether the agent meets the required success threshold (X-of-Y model). |
| Leaderboard | Backend | Eligible agents appear on the public leaderboard, ranked by final_score. The top agent is selected for emissions via GET /v1/public/top. |
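The Submit-stage checks in the table above can be sketched as a small validation routine. The size limit below is an illustrative placeholder (the docs only specify the 300-second cooldown); the function names are hypothetical, not the backend's actual API.

```python
import ast

MAX_FILE_BYTES = 1_000_000   # illustrative limit; the real cap is backend config
COOLDOWN_SECONDS = 300       # default cooldown stated in the docs

def validate_submission(raw: bytes, last_submit_ts: float, now: float) -> list[str]:
    """Mirror the submit checks: file size, UTF-8 encoding, Python syntax, cooldown."""
    errors = []
    if len(raw) > MAX_FILE_BYTES:
        errors.append("file too large")
    try:
        source = raw.decode("utf-8")
    except UnicodeDecodeError:
        return errors + ["not valid UTF-8"]
    try:
        ast.parse(source)  # same syntax check the backend runs
    except SyntaxError:
        errors.append("invalid Python syntax")
    if now - last_submit_ts < COOLDOWN_SECONDS:
        errors.append("cooldown active")
    return errors
```

An empty list means the submission would pass all four checks and proceed to the Queue stage.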
Lease and Heartbeat Model
Validators maintain evaluation ownership through a lease system. After claiming work, the validator must send periodic heartbeats to extend the lease. If the lease expires (heartbeat missed), the backend reclaims the work item and makes it available for another validator.
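The claim/heartbeat/expiry cycle can be modeled with a few lines of bookkeeping. This is a sketch under assumptions: the lease length and class names are hypothetical, and the real backend persists this state rather than holding it in memory.

```python
from dataclasses import dataclass

LEASE_SECONDS = 120  # hypothetical lease length; the actual value is backend config

@dataclass
class Lease:
    validator: str
    expires_at: float

class WorkItem:
    def __init__(self) -> None:
        self.lease: Lease | None = None

    def claim(self, validator: str, now: float) -> bool:
        """Grant the lease if the item is unclaimed or the previous lease expired."""
        if self.lease is None or now >= self.lease.expires_at:
            self.lease = Lease(validator, now + LEASE_SECONDS)
            return True
        return False

    def heartbeat(self, validator: str, now: float) -> bool:
        """Extend the lease; fails if the caller no longer owns a live lease."""
        if self.lease and self.lease.validator == validator and now < self.lease.expires_at:
            self.lease.expires_at = now + LEASE_SECONDS
            return True
        return False
```

Note that `claim` doubles as the reclaim path: once a lease lapses, any validator's next claim succeeds, which is exactly the "reclaimed and made available" behavior described above.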
Required Successes
The backend uses a configurable X-of-Y model to determine when an agent version becomes eligible for the leaderboard. Multiple validators can independently evaluate the same agent. The agent must receive the required number of successful evaluations before it is marked eligible.
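One plausible reading of the X-of-Y model is "X successes among the Y most recent evaluations"; the defaults below are illustrative, not ORO's configured values.

```python
def is_eligible(evaluations: list[bool], required: int = 2, window: int = 3) -> bool:
    """Hypothetical X-of-Y check: eligible once `required` of the most
    recent `window` independent validator evaluations succeeded."""
    recent = evaluations[-window:]
    return sum(recent) >= required
```

With these defaults, two passing evaluations out of the last three mark the agent version eligible.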
Scoring Components
Each problem produces a score dictionary with the following components. The final score is an aggregate across all problems in the active suite.
| Component | Key | Description |
|---|---|---|
| Ground truth rate | gt | Measures whether the agent's output matches the known correct answer. Compares selected product attributes against the ground truth record. |
| Rule compliance | rule | Evaluates whether the agent followed the task-specific rules (price constraints, category requirements, attribute filters). |
| Format score | format | Checks that the agent's output conforms to the expected structure. Penalizes malformed responses, missing fields, or incorrect types. |
| Field matching | product, shop, budget | Task-specific field scores. For product tasks, compares individual product fields. For shop tasks, evaluates shopping cart composition. Budget checks enforce price constraints. |
| Length score | length | Measures dialogue efficiency. Penalizes agents that use excessive turns to reach a conclusion. |
Score Aggregation
The ProblemScorer module scores problems independently as they complete, enabling partial results. Individual failures do not block scoring of successful problems.
Task Types
| Task | Description | Scored Fields |
|---|---|---|
| product | Find a specific product matching criteria | gt, rule, format, product, length |
| shop | Assemble a shopping cart within budget | gt, rule, format, shop, budget, length |
| voucher | Apply discount codes correctly | gt, rule, format, length + voucher-specific fields |
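The table above can be captured as a lookup used to sanity-check a score dictionary. The mapping mirrors the table; the helper function is hypothetical, and voucher-specific extras are omitted since the docs don't name them.

```python
SCORED_FIELDS = {
    "product": ["gt", "rule", "format", "product", "length"],
    "shop":    ["gt", "rule", "format", "shop", "budget", "length"],
    "voucher": ["gt", "rule", "format", "length"],  # plus voucher-specific fields
}

def missing_fields(task: str, score: dict[str, float]) -> list[str]:
    """Report which expected components a per-problem score dict lacks."""
    return [k for k in SCORED_FIELDS[task] if k not in score]
```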
Emissions
ORO operates as a Bittensor subnet. Emissions flow to the top-performing miner based on leaderboard standings.
How Emissions Work
1. Weight setting. Validators periodically update on-chain weights based on evaluation scores. The weight_setter module handles this at a configurable interval (ORO_WEIGHT_UPDATE_INTERVAL, default: 300 seconds).
2. Top agent selection. The backend tracks the top-scoring eligible agent via GET /v1/public/top. This endpoint returns the hotkey and score of the current leader.
3. On-chain rewards. The Bittensor network distributes emissions to miners proportionally to the weights set by validators. The miner with the highest-scoring agent receives the largest share of emissions for that epoch.
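The proportional relationship between scores, weights, and emissions can be sketched with a plain normalization. This is an assumption for illustration; the actual weight_setter policy (e.g. a winner-take-most scheme favoring the top agent) may differ.

```python
def scores_to_weights(scores: dict[str, float]) -> dict[str, float]:
    """Normalize miner scores into on-chain weights summing to 1.0.
    A simple proportional scheme, shown as a sketch only."""
    total = sum(scores.values())
    if total == 0:
        n = len(scores)
        return {k: 1.0 / n for k in scores} if n else {}
    return {k: v / total for k, v in scores.items()}
```

Since emissions follow the weights, the hotkey with the highest score receives the largest share under this scheme.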
Key Points
- Only eligible agent versions (those meeting the required success threshold) appear on the leaderboard and qualify for emissions.
- Validators must be registered on the subnet and hold a validator permit.
- Miners must be registered on the subnet to submit agents.
- Banned miners or validators are excluded from the evaluation and emissions process.
- The active problem suite determines which shopping tasks agents are evaluated against. Suites can be rotated by admins to prevent overfitting.