How We Stop Miners From Abusing Our Incentive Mechanism

Deep Dive · Seth Schilbe

We created an arena on Bittensor to build the best agentic shopping agent. Miners submit agents, validators evaluate them on synthetic shopping tasks, and the top agent earns emissions.

But there was a problem.

Within weeks, the leaderboard was dominated by agents that didn't shop at all.

They pattern-matched queries to pre-computed product IDs. Their "reasoning" traces were single words — "Processing." — before returning hardcoded answers that scored 90%+. At first it was trivial: plaintext answers to the problems directly in the code they submitted. But as we got smarter, so did they. Some used Caesar ciphers to obfuscate product IDs. Others planted benchmark-specific keywords in lists and inference prompts to tilt results in their favor.

But we can't fault them. We thought we were giving them the problem of creating the best shopping agent. In reality, we gave them the problem of getting the highest score. They weren't solving shopping problems — they were optimizing for the incentive we gave them.

Static analysis helps. We block known cheating patterns at upload time — AST analysis catches obfuscated IDs, hash matching catches resubmissions, and we scan strings in the submission against the problem suite to find planted keywords. But in an open-source environment where the problems are public and the code is visible, determined cheaters will always adapt faster than detection can keep up. The fundamental issue: if you evaluate agents on the same problems they trained against, you're testing memorization, not capability.
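To make the upload-time checks concrete, here is a minimal sketch of the three ideas mentioned above: scanning string literals against known problem data, undoing Caesar shifts, and fingerprinting resubmissions. It is an illustration under assumptions, not our production scanner. The function names, the `known_ids` set, and the product ID format `SKU-RED-SPKR` are all hypothetical.

```python
import ast
import hashlib
import string


def string_literals(source):
    """Collect every string constant that appears in the submitted source."""
    tree = ast.parse(source)
    return [node.value for node in ast.walk(tree)
            if isinstance(node, ast.Constant) and isinstance(node.value, str)]


def caesar_shift(text, shift):
    """Rotate ASCII letters by `shift` positions; other characters are unchanged."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)


def find_leaked_ids(source, known_ids):
    """Return known IDs found in the source, in plaintext or under any Caesar shift."""
    hits = set()
    for literal in string_literals(source):
        for shift in range(26):  # shift 0 covers the plaintext case
            candidate = caesar_shift(literal, shift)
            if candidate in known_ids:
                hits.add(candidate)
    return hits


def fingerprint(source):
    """Hash the parsed AST so whitespace or comment edits do not hide a resubmission."""
    return hashlib.sha256(ast.dump(ast.parse(source)).encode()).hexdigest()


# Hypothetical submission hiding a public-suite product ID behind a shift of 3.
submission = 'ANSWERS = {"q1": "VNX-UHG-VSNU"}'
leaked = find_leaked_ids(submission, {"SKU-RED-SPKR"})
```

A check like this only catches the obfuscations it was written for. A miner who switches from a Caesar shift to any other encoding slips past it, which is the adaptation problem described above.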

We spent weeks researching how other systems solve this. We studied multi-phase competition formats across ML competitions, competitive programming, poker tournaments, and esports — Kaggle's public/private leaderboard split, ARC-AGI's three-tier hidden test sets, Google Code Jam's escalating elimination rounds, TopCoder's peer "hack" phase, even the World Series of Poker's blind escalation structure. The universal pattern: every system that successfully combats gaming uses multiple independent defenses with different failure modes. No single technique is sufficient.

The constraint that shapes everything is that we're open source. The moment a problem is used in evaluation, it's known to every competitor. Traditional solutions like "just hide the test set" work for Kaggle, where code runs in a sandbox and the test data is never exposed to competitors. In a decentralized network where validators can collude with miners and agent code is public, there's no durable secret. Every defense has to assume the attacker knows the mechanism.

This brought us back to first principles: incentives. The game theory is clear — if the expected value of cheating exceeds the expected value of building, rational actors will cheat. You can't patch this with better detection alone. You have to change the payoff structure so genuine development is the dominant strategy.

So we built a two-phase system.

Phase 1: Qualifying. Every agent is evaluated against the public problem suite. This filters broken submissions and establishes a baseline, giving miners quick feedback on their agent's performance. The qualifying threshold is set based on the previous winner's score. Clear the bar, and you advance to the race.

Phase 2: The Race. Qualifiers are re-evaluated on a hidden problem bank — problems that no agent, miner, validator, or training dataset has ever seen. The problems cover the same categories (product search, shop matching, voucher application) but with different queries, different products, different constraints. No pre-computation possible. The highest race score wins and becomes the new top agent for emissions.

Genuine agents benefit from hidden problems. Their general reasoning transfers to unseen queries. A real shopping agent that can find a "red wireless speaker under $50" can also find a "blue portable charger under $30." The capability is general.

Hardcoders are exposed. Memorized answers for 30 public problems are worthless against 30 hidden ones. Their race scores collapse.
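The two phases can be sketched in a few lines. This is a toy model under stated assumptions rather than the validator code. The `evaluate` and `run_round` functions, the problem format, and the stand-in task (pick the most expensive price within a budget) are all invented for illustration.

```python
def evaluate(agent, problems):
    """Fraction of problems the agent answers correctly."""
    correct = sum(1 for question, expected in problems if agent(question) == expected)
    return correct / len(problems)


def run_round(agents, public_problems, hidden_problems, threshold):
    """Phase 1 filters on the public suite; Phase 2 ranks qualifiers on hidden problems."""
    qualifiers = {name: agent for name, agent in agents.items()
                  if evaluate(agent, public_problems) >= threshold}
    race_scores = {name: evaluate(agent, hidden_problems)
                   for name, agent in qualifiers.items()}
    winner = max(race_scores, key=race_scores.get) if race_scores else None
    return winner, race_scores


# Toy task: given (prices, budget), return the most expensive price within budget.
public_problems = [(((20, 45, 60), 50), 45), (((10, 30), 25), 10)]
hidden_problems = [(((15, 28, 40), 30), 28), (((5, 50), 60), 50)]


def genuine(question):
    """Actually solves the task, so it generalizes to unseen problems."""
    prices, budget = question
    return max(p for p in prices if p <= budget)


memorized = dict(public_problems)  # answers copied from the public suite


def hardcoder(question):
    """Looks up a memorized answer; returns None for anything it has not seen."""
    return memorized.get(question)


winner, race_scores = run_round(
    {"genuine": genuine, "hardcoder": hardcoder},
    public_problems, hidden_problems, threshold=0.5,
)
```

Both agents clear qualifying with a perfect public score, but only the agent that actually solves the task keeps its score in the race.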

The threshold ratchets upward. Each new qualifying round uses the previous race winner's score as the baseline. Win Race #1 with 55%, and Race #2's qualifying bar becomes 50%. Win Race #2 with 60%, and Race #3 requires 54%. The bar rises with the state of the art. Races run daily — qualifying is open for 24 hours, the race runs immediately after, and the next qualifying window opens as soon as the race completes.
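The two examples above are consistent with a bar set at roughly 90% of the previous winning score, rounded up to a whole percentage point. That ratio is an inference from the example numbers, not a rule stated here, so treat the sketch below as an assumption. The function name and the `ratio_pct` parameter are hypothetical.

```python
def next_qualifying_bar(winning_score_pct, ratio_pct=90):
    """Next round's qualifying bar in whole percentage points, rounded up.

    ratio_pct=90 is an assumed value that happens to reproduce the
    examples in the text; the real ratio and rounding may differ.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return (winning_score_pct * ratio_pct + 99) // 100


bar_for_race_2 = next_qualifying_bar(55)
bar_for_race_3 = next_qualifying_bar(60)
```

Because the bar is derived from the latest winner, it rises only when someone actually beats the previous best on hidden problems.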

We also added an LLM reasoning judge as a second independent defense axis. After each evaluation, a separate model reviews the agent's outputs — its thinking steps, tool calls, and result analysis. Agents showing genuine multi-step reasoning (comparing products, checking prices against constraints, explaining their choices) score a high reasoning coefficient. Agents with minimal or fake reasoning get their score penalized. Even if a hardcoder somehow passes the hidden-problem test, the reasoning judge catches the absence of genuine thought.
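One simple way to combine the two signals is to multiply the task score by the reasoning coefficient. The sketch below assumes a coefficient in the range 0 to 1 and a multiplicative penalty. Both are illustrative assumptions, and `adjusted_score` is a hypothetical name rather than the scoring function we run.

```python
def adjusted_score(task_score, reasoning_coefficient):
    """Scale a task score by a reasoning coefficient in [0, 1] (assumed range)."""
    if not 0.0 <= reasoning_coefficient <= 1.0:
        raise ValueError("reasoning_coefficient must be between 0 and 1")
    return task_score * reasoning_coefficient


# A high raw score with almost no reasoning vs. a lower score with real reasoning.
hardcoder_final = adjusted_score(0.90, 0.20)
genuine_final = adjusted_score(0.60, 0.95)
```

Under this assumption, an agent that returns correct answers with a one-word trace can still finish below an agent with a lower raw score and a genuine reasoning trace.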

The early results validated the approach. Across our first three races, agents that scored well on qualifying consistently transferred that performance to hidden problems — confirming they were genuinely reasoning, not memorizing. We caught and discarded hardcoders mid-race and watched the qualifying threshold climb as agents improved.

The bigger lesson applies beyond our subnet and agentic shopping: static benchmarks are honeypots for memorization. If your evaluation uses fixed problems, your top performers are probably solving your test, not your task. The fix isn't better detection; it's changing what you measure. Evaluate on problems the agent has never seen, verify the reasoning process independently, and make the payoff structure reward genuine capability over benchmark optimization.

That's what we're building toward. The race system is live, the reasoning judge is scoring, and every day the bar gets higher.