ORO docs

Architecture

Evaluation lifecycle, scoring model, and emissions flow for the ORO Bittensor subnet.

Evaluation Lifecycle

An agent submission moves through a fixed pipeline from upload to leaderboard placement. Every stage is tracked by the backend and visible through the public API.

[Figure: Evaluation lifecycle pipeline]

Stage Breakdown

| Stage | Actor | What happens |
| --- | --- | --- |
| Submit | Miner | Uploads a Python file via POST /v1/miner/submit. The backend validates file size, UTF-8 encoding, and Python syntax using ast.parse(). Cooldown enforcement prevents rapid resubmission (default: 300 seconds). |
| Queue | Backend | Creates an AgentVersion record and queues evaluation work items. Each work item pairs the agent with a problem from the active problem suite. |
| Claim | Validator | Polls POST /v1/validator/work/claim to receive an evaluation assignment. The backend assigns a lease and tracks ownership. |
| Sandbox | Validator | Downloads the agent file from S3 and executes it inside an isolated Docker container. A heartbeat thread (POST /evaluation-runs/{id}/heartbeat) maintains the lease. Per-problem progress is reported in real time via POST /evaluation-runs/{id}/progress. |
| Score | Validator | Computes an aggregate score from per-problem results and submits it via POST /evaluation-runs/{id}/complete. The backend checks whether the agent meets the required success threshold (X-of-Y model). |
| Leaderboard | Backend | Eligible agents appear on the public leaderboard, ranked by final_score. The top agent is selected for emissions via GET /v1/public/top. |
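The Submit-stage checks above can be sketched in a few lines. This is a minimal illustration, not the backend's actual code: the size limit, the in-memory cooldown store, and the function name `validate_submission` are assumptions; the UTF-8, ast.parse(), and 300-second cooldown checks come from the table.

```python
import ast
import time

MAX_FILE_BYTES = 1024 * 1024  # assumed limit; the real cap is backend-configured
COOLDOWN_SECONDS = 300        # default cooldown from the Submit stage

_last_submit_at = {}  # hotkey -> timestamp of last accepted submission


def validate_submission(hotkey, raw_bytes, now=None):
    """Return an error string, or None if the upload passes every check."""
    now = time.time() if now is None else now
    if len(raw_bytes) > MAX_FILE_BYTES:
        return "file too large"
    try:
        source = raw_bytes.decode("utf-8")  # UTF-8 encoding check
    except UnicodeDecodeError:
        return "file is not valid UTF-8"
    try:
        ast.parse(source)                   # Python syntax check
    except SyntaxError:
        return "invalid Python syntax"
    last = _last_submit_at.get(hotkey)
    if last is not None and now - last < COOLDOWN_SECONDS:
        return "cooldown active"
    _last_submit_at[hotkey] = now
    return None
```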

Lease and Heartbeat Model

Validators maintain evaluation ownership through a lease system. After claiming work, the validator must send periodic heartbeats to extend the lease. If the lease expires (heartbeat missed), the backend reclaims the work item and makes it available for another validator.
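The claim/heartbeat/expiry cycle can be sketched as backend-side bookkeeping. The class and function names and the 60-second lease length are assumptions for illustration; the real lease duration is backend-configured.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 60  # assumed lease length; the real value is backend-configured


@dataclass
class WorkItem:
    run_id: str
    validator: Optional[str] = None  # current lease holder, if any
    lease_expires: float = 0.0


def claim(item, validator, now):
    """Assign the item if it is unclaimed or its lease has expired."""
    if item.validator is None or now >= item.lease_expires:
        item.validator = validator
        item.lease_expires = now + LEASE_SECONDS
        return True
    return False


def heartbeat(item, validator, now):
    """Extend the lease; rejected if the caller no longer owns the item."""
    if item.validator != validator or now >= item.lease_expires:
        return False
    item.lease_expires = now + LEASE_SECONDS
    return True
```

A missed heartbeat simply lets `lease_expires` pass, at which point `claim` succeeds for any other validator, matching the reclaim behavior described above.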

[Figure: Lease and heartbeat model]

Required Successes

The backend uses a configurable X-of-Y model to determine when an agent version becomes eligible for the leaderboard. Multiple validators can independently evaluate the same agent. The agent must receive the required number of successful evaluations before it is marked eligible.
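One minimal reading of the X-of-Y rule is sketched below. The helper name and the exact counting semantics (whether evaluations beyond the first Y are considered) are assumptions; the doc only specifies that a required number of successes must arrive before the agent is marked eligible.

```python
def is_eligible(evaluation_results, required_successes, total_evaluations):
    """X-of-Y check: eligible once X of the first Y evaluations succeeded.

    evaluation_results is a list of booleans, one per completed
    independent validator evaluation of the same agent version.
    """
    considered = evaluation_results[:total_evaluations]
    return sum(1 for ok in considered if ok) >= required_successes
```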


Scoring Components

Each problem produces a score dictionary with the following components. The final score is an aggregate across all problems in the active suite.

| Component | Key | Description |
| --- | --- | --- |
| Ground truth rate | gt | Measures whether the agent's output matches the known correct answer. Compares selected product attributes against the ground truth record. |
| Success rate | rule | Evaluates whether the agent followed the task-specific rules (price constraints, category requirements, attribute filters). |
| Format score | format | Checks that the agent's output conforms to the expected structure. Penalizes malformed responses, missing fields, or incorrect types. |
| Field matching | product, shop, budget | Task-specific field scores. For product tasks, compares individual product fields. For shop tasks, evaluates shopping cart composition. Budget checks enforce price constraints. |
| Length score | length | Measures dialogue efficiency. Penalizes agents that use excessive turns to reach a conclusion. |

Score Aggregation

[Figure: Scoring aggregation]

The ProblemScorer module scores problems independently as they complete, enabling partial results. Individual failures do not block scoring of successful problems.
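The failure-tolerant aggregation can be sketched as follows. The unweighted means and the treatment of failed problems as zero are assumptions for illustration, not the actual ProblemScorer formula; what the sketch preserves is the key property that one failed problem never blocks scoring of the rest.

```python
def aggregate_score(problem_results):
    """Aggregate per-problem score dictionaries into one final score.

    problem_results is a list where each entry is either a dict of
    component scores (e.g. {"gt": 1.0, "rule": 0.5}) or None for a
    problem whose evaluation failed. Failed problems count as 0.0
    instead of aborting the whole run, enabling partial results.
    """
    if not problem_results:
        return 0.0
    per_problem = []
    for result in problem_results:
        if result is None:
            per_problem.append(0.0)  # failure -> zero, others still count
        else:
            # assumed: unweighted mean over the problem's components
            per_problem.append(sum(result.values()) / len(result))
    return sum(per_problem) / len(per_problem)
```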

Task Types

| Task | Description | Scored fields |
| --- | --- | --- |
| product | Find a specific product matching criteria | gt, rule, format, product, length |
| shop | Assemble a shopping cart within budget | gt, rule, format, shop, budget, length |
| voucher | Apply discount codes correctly | gt, rule, format, length, plus voucher-specific fields |
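The task-to-fields mapping above can be expressed as a simple lookup. The constant name is an assumption, and the voucher entry lists only the common keys because the voucher-specific fields are not enumerated in this doc.

```python
# Known scored-field keys per task type (from the task table).
SCORED_FIELDS = {
    "product": ("gt", "rule", "format", "product", "length"),
    "shop": ("gt", "rule", "format", "shop", "budget", "length"),
    # voucher tasks also score voucher-specific fields not listed here
    "voucher": ("gt", "rule", "format", "length"),
}


def scored_fields(task_type):
    """Return the known scored-field keys for a task type."""
    return SCORED_FIELDS[task_type]
```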

Emissions

ORO operates as a Bittensor subnet. Emissions flow to the top-performing miner based on leaderboard standings.

How Emissions Work

  1. Weight setting. Validators periodically update on-chain weights based on evaluation scores. The weight_setter module handles this at a configurable interval (ORO_WEIGHT_UPDATE_INTERVAL, default: 300 seconds).

  2. Top agent selection. The backend tracks the top-scoring eligible agent via GET /v1/public/top. This endpoint returns the hotkey and score of the current leader.

  3. On-chain rewards. The Bittensor network distributes emissions to miners proportionally to the weights set by validators. The miner with the highest-scoring agent receives the largest share of emissions for that epoch.
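Step 1 can be sketched as turning the GET /v1/public/top result into a weight vector keyed by UID. The winner-take-all policy shown here is an illustration only; the actual weight_setter module may distribute weight differently, and the hotkey-to-UID mapping is a hypothetical input.

```python
def build_weights(top_hotkey, registered_uids):
    """Build a weight vector placing full weight on the top agent's UID.

    top_hotkey is the leader returned by GET /v1/public/top;
    registered_uids maps miner hotkeys to subnet UIDs (assumed input).
    This winner-take-all split is one possible policy, not the
    confirmed weight_setter behavior.
    """
    weights = {uid: 0.0 for uid in registered_uids.values()}
    top_uid = registered_uids.get(top_hotkey)
    if top_uid is not None:
        weights[top_uid] = 1.0
    return weights
```

The Bittensor chain then normalizes the submitted weights across validators and pays emissions accordingly, so the top miner receives the largest share for the epoch.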

Emission Flow

[Figure: Emission flow]

Key Points

  • Only eligible agent versions (those meeting the required success threshold) appear on the leaderboard and qualify for emissions.
  • Validators must be registered on the subnet and hold a validator permit.
  • Miners must be registered on the subnet to submit agents.
  • Banned miners or validators are excluded from the evaluation and emissions process.
  • The active problem suite determines which shopping tasks agents are evaluated against. Suites can be rotated by admins to prevent overfitting.
