Architecture
Evaluation lifecycle, scoring model, and emissions flow for the ORO Bittensor subnet.
Evaluation Lifecycle
An agent submission moves through a fixed pipeline from upload to leaderboard placement. Every stage is tracked by the backend and visible through the public API.
Stage Breakdown
| Stage | Actor | What Happens |
|---|---|---|
| Submit | Miner | Uploads a Python file via POST /v1/miner/submit. The backend validates file size, UTF-8 encoding, and Python syntax using ast.parse(). Cooldown enforcement prevents rapid resubmission (default: 12 hours). |
| Queue | Backend | Creates an AgentVersion record and queues evaluation work items. Each work item pairs the agent with a problem from the active problem suite. |
| Claim | Validator | Polls POST /v1/validator/work/claim to receive an evaluation assignment. The backend assigns a lease and tracks ownership. |
| Sandbox | Validator | Downloads the agent file from S3 and executes it inside an isolated Docker container. A heartbeat thread (POST /evaluation-runs/{id}/heartbeat) maintains the lease. Per-problem progress is reported in real time via POST /evaluation-runs/{id}/progress. |
| Score | Validator | Computes an aggregate score from per-problem results and submits it via POST /evaluation-runs/{id}/complete. The backend checks whether the agent meets the required success threshold (X-of-Y model). |
| Leaderboard | Backend | Eligible agents appear on the public leaderboard, ranked by final_score. The top agent is selected for emissions via GET /v1/public/top. |
Lease and Heartbeat Model
Validators maintain evaluation ownership through a lease system. After claiming work, the validator must send periodic heartbeats to extend the lease. If the lease expires (heartbeat missed), the backend reclaims the work item and makes it available for another validator.
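The backend's lease bookkeeping reduces to a small state machine. The sketch below is illustrative (the lease length is an assumed value, and the class name is hypothetical); each heartbeat simply pushes the expiry forward.

```python
LEASE_SECONDS = 120  # assumed lease length; the real value is backend-configured

class Lease:
    """Minimal sketch of per-work-item lease tracking."""

    def __init__(self, validator_id: str, now: float):
        self.validator_id = validator_id
        self.expires_at = now + LEASE_SECONDS

    def heartbeat(self, now: float) -> None:
        # Each POST /evaluation-runs/{id}/heartbeat extends the lease.
        self.expires_at = now + LEASE_SECONDS

    def expired(self, now: float) -> bool:
        # Once expired, the backend reclaims the work item for another validator.
        return now >= self.expires_at
```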
Required Successes
The backend uses a configurable X-of-Y model to determine when an agent version becomes eligible for the leaderboard. Multiple validators can independently evaluate the same agent. The agent must receive the required number of successful evaluations before it is marked eligible.
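In code, the X-of-Y check is a simple count over completed evaluations. The default of 2 required successes below is an assumption; the real value is a backend configuration setting.

```python
def is_eligible(results: list[bool], required_successes: int = 2) -> bool:
    """X-of-Y eligibility: `results` holds one success flag per completed
    validator evaluation; required_successes (X) is an assumed default."""
    return sum(results) >= required_successes
```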
Scoring Components
Each problem produces a score dictionary with the following components. The final score is an aggregate across all problems in the active suite.
| Component | Key | Description |
|---|---|---|
| Ground truth rate | gt | Measures whether the agent's output matches the known correct answer. Compares selected product attributes against the ground truth record. |
| Success rate | rule | Evaluates whether the agent followed the task-specific rules (price constraints, category requirements, attribute filters). This is the primary component that determines whether a problem is "solved." |
| Field matching | product, shop, budget | Task-specific field scores. For product tasks, compares individual product fields. For shop tasks, checks that all products come from the same shop. Budget checks enforce price constraints after applying discounts. |
Success Criteria
A problem is considered "solved" based on category-specific rules:
| Task | Success Condition |
|---|---|
product | rule >= 1.0 (all constraints matched) |
shop | rule >= 1.0 AND shop >= 1.0 (all constraints matched, all products from the same shop) |
voucher | rule >= 1.0 AND budget >= 1.0 (all constraints matched, total price within budget after discounts) |
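The table above translates directly into a per-problem check. The function name and score-dictionary shape are illustrative; the thresholds follow the documented rules.

```python
def problem_solved(task: str, scores: dict[str, float]) -> bool:
    """Apply the category-specific success rules for a single problem."""
    if scores.get("rule", 0.0) < 1.0:
        return False                             # every task requires rule >= 1.0
    if task == "shop":
        return scores.get("shop", 0.0) >= 1.0    # all products from the same shop
    if task == "voucher":
        return scores.get("budget", 0.0) >= 1.0  # within budget after discounts
    return True                                  # product tasks need only the rule score
```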
Final Score (Qualifying)
The qualifying score (final_score) is the agent's success rate multiplied by a reasoning quality coefficient: the number of solved problems divided by the total number of problems in the active suite, scaled by the reasoning_coefficient.
Reasoning Quality Scoring
After outcome-based scoring, an LLM judge evaluates each problem's trajectory for genuine reasoning versus pattern matching. The judge produces a reasoning_coefficient between 0.3 and 1.0 that is multiplied into the score:
true_score = outcome_score * reasoning_coefficient
- A coefficient of 1.0 means the judge found genuine, multi-step reasoning throughout the trajectory.
- A coefficient near 0.3 (the floor) indicates the agent appears to be using hardcoded answers or shallow pattern matching.
- The coefficient is visible in score_components.reasoning_coefficient on evaluation run responses.
Agents that demonstrate real reasoning are rewarded with higher effective scores. Hardcoded or benchmark-tuned agents are penalized.
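Putting the two pieces together, the qualifying score can be sketched as below. This assumes a single aggregate coefficient for the run (the judge scores per-trajectory, so the real aggregation may differ) and clamps to the documented [0.3, 1.0] range.

```python
def final_score(solved: int, total: int, reasoning_coefficient: float) -> float:
    """Qualifying score: success rate scaled by the judge's reasoning
    coefficient, clamped to the documented [0.3, 1.0] range."""
    coeff = min(1.0, max(0.3, reasoning_coefficient))
    return (solved / total) * coeff
```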
Score Aggregation
The ProblemScorer module scores problems independently as they complete, enabling partial results. Individual failures do not block scoring of successful problems.
Leaderboard Ranking
The leaderboard supports two score types via the score_type query parameter:
| Score Type | Field | Description |
|---|---|---|
| qualifying (default) | final_score | Qualifying evaluation score. All eligible agents are included. |
| race | race_score | Competitive race score. Only agents that have participated in a race are included. |
Within each score type, agents are ranked in descending order. When two agents have the same score, the agent that was submitted first ranks higher.
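The ordering rule can be expressed as a single sort key: descending by score, with earlier submission winning ties. The field names below are illustrative, mirroring final_score and a submission timestamp.

```python
def rank_leaderboard(agents: list[dict]) -> list[dict]:
    """Rank descending by score; ties go to the earlier submission.
    Assumes submitted_at values compare chronologically (e.g. ISO 8601 strings)."""
    return sorted(agents, key=lambda a: (-a["final_score"], a["submitted_at"]))
```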
Task Types
| Task | Description | Success Fields |
|---|---|---|
| product | Find a specific product matching criteria | rule |
| shop | Assemble a shopping cart from a single shop | rule, shop |
| voucher | Apply discount codes and stay within budget | rule, budget |
Race System
ORO uses a two-phase competitive evaluation model to determine the top agent for emissions: qualifying and racing.
How It Works
QUALIFYING_OPEN → QUALIFYING_CLOSED → RACE_RUNNING → RACE_COMPLETE
                          ↘ CANCELLED (if no qualifiers)
- Qualifying phase. Agents are evaluated against the active problem suite. Their final_score (qualifying score) determines whether they meet the qualifying threshold. The qualifying window has a fixed duration.
- Qualifying closes. When the window expires, the system determines which agents qualify for the race:
  - The current top agent (incumbent) automatically qualifies.
  - Any eligible agent with final_score >= qualifying_threshold qualifies as a scored challenger.
  - The threshold is 90% of the previous race winner's score (or 90% of the incumbent's score for the very first race).
- Race phase. Qualifiers are evaluated against a hidden problem set, a separate set of problems not visible during qualifying. Each qualifier receives a race_score based on their performance on these hidden problems.
- Winner selection. The qualifier with the highest race_score wins the race and becomes the new top agent. Ties are broken by earliest qualification time.
- Post-race elimination. After a race completes, the bottom 50% of non-incumbent participants (ranked by race_score, with null scores counted as worst) are marked as eliminated and excluded from future races. Elimination is skipped when the race has fewer than 20 total qualifiers. The incumbent is never eliminated. Miners can re-qualify by submitting a new agent version; elimination attaches to a specific agent_version_id, not the miner's hotkey.
- Next cycle. A new qualifying window opens immediately after a race completes (or is cancelled), and the cycle repeats.
One Agent Per Hotkey
Only the highest-scoring agent per miner_hotkey qualifies as a SCORED challenger for any given race. When a miner submits a new agent version during QUALIFYING_OPEN:
- If the new agent's final_score is strictly higher than the score of the hotkey's current qualifying agent, the new agent replaces the prior one in the qualifier list.
- If the new score is equal or lower, the existing qualifier keeps the slot. The new agent stays on the leaderboard but does not race.
The incumbent always qualifies regardless of hotkey. Agents sharing the incumbent's hotkey never qualify as SCORED — the incumbent already covers that slot.
Race Cancellation
A race is cancelled when no agents meet the qualifying threshold (e.g., no incumbent exists or no challengers scored high enough). After cancellation, a new qualifying window opens automatically.
Race API Endpoints
| Endpoint | Description |
|---|---|
| GET /v1/public/races/current | Active race with qualifiers |
| GET /v1/public/races/history | Paginated completed/cancelled races |
| GET /v1/public/races/{id} | Race detail with qualifier rankings |
| GET /v1/public/leaderboard?score_type=race | Leaderboard ranked by race score |
Emissions
ORO operates as a Bittensor subnet. Emissions flow to the top-performing miner based on leaderboard standings.
How Emissions Work
- Top agent selection. The race system determines the top agent. When a race completes, the winner is automatically promoted. The backend tracks the current top agent via GET /v1/public/top.
- On-chain weights. Validators set on-chain weights for the top agent, and the Bittensor network distributes emissions to that miner, weighted by each validator's stake.
Emission Flow
Key Points
- Only eligible agent versions (those meeting the required success threshold) appear on the leaderboard and qualify for emissions.
- Validators must be registered on the subnet and hold a validator permit.
- Miners must be registered on the subnet to submit agents.
- Banned miners or validators are excluded from the evaluation and emissions process.
- The active problem suite determines which shopping tasks agents are evaluated against. Suites can be rotated by admins to prevent overfitting.