68% → 90%
Benchmark: Helix SDK — HumanEval + Live MPP Validation
We didn't build a better model.
We built infrastructure that doesn't let failures go to waste.
seed 42 · reproducible · +22 points from infrastructure alone · 12/14 recovered · 2 beyond repair
helix — npm run benchmark

Without Helix

pass@1: 68%
failures: 14
MTTR: ~18 min

pass@1: +22 points
tasks saved: +12
recovery rate: 85.7%

With Helix

pass@1: 90%
recovered: 12
MTTR: <2s
protected: $4,200
immune hits: 9
2 failures were beyond repair: 1 genuine context overflow (task needs 200K+ tokens) and 1 persistent server outage (3/3 retries exhausted). We report both.
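
The headline figures can be re-derived from the counts reported on this page. The sketch below uses only those numbers; the variable names are ours, chosen for illustration.

```typescript
// Headline arithmetic, using only figures reported on this page.
const baselinePassAt1 = 68; // percent, without Helix
const helixPassAt1 = 90; // percent, with Helix
const injectedFailures = 14;
const recoveredFailures = 12;

// Round to one decimal place for display.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

// Absolute gain, in percentage points.
const pointGain = helixPassAt1 - baselinePassAt1;

const summary = {
  pointGain,
  // Relative gain over the 68% baseline, in percent.
  relativeGain: round1((pointGain / baselinePassAt1) * 100),
  // Share of injected failures that were repaired, in percent.
  recoveryRate: round1((recoveredFailures / injectedFailures) * 100),
  // Failures that stayed failed.
  unrecovered: injectedFailures - recoveredFailures,
};
```

The gain is 22 percentage points, which is roughly a 32% relative improvement over the baseline, and 12 of 14 works out to the 85.7% recovery rate shown above.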

14 injected failures · 12 recovered

Green = recovered by PCEC. Red = unrecoverable.
Timeout: 5/5 → increase_timeout
Rate Limit: 3/3 → backoff_retry · 2 immune
Context Overflow: 1/2 → chunk_input · 1 failed
Server Error: 1/2 → retry_endpoint · 1 persistent
JSON Parse: 2/2 → repair_json · 1 immune
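
As a sanity check, the per-type breakdown can be tallied to confirm it adds up to the 14 injected and 12 recovered failures. This is a small sketch: the counts and strategy names come from the list above, while the `FailureClass` shape and function names are hypothetical.

```typescript
// One row per failure type; the interface is illustrative, not a Helix type.
interface FailureClass {
  type: string;
  strategy: string;
  injected: number;
  recovered: number;
}

const breakdown: FailureClass[] = [
  { type: "timeout", strategy: "increase_timeout", injected: 5, recovered: 5 },
  { type: "rate_limit", strategy: "backoff_retry", injected: 3, recovered: 3 },
  { type: "context_overflow", strategy: "chunk_input", injected: 2, recovered: 1 },
  { type: "server_error", strategy: "retry_endpoint", injected: 2, recovered: 1 },
  { type: "json_parse", strategy: "repair_json", injected: 2, recovered: 2 },
];

// Sum injected and recovered counts across all failure types.
function totals(rows: FailureClass[]): { injected: number; recovered: number } {
  return rows.reduce(
    (acc, row) => ({
      injected: acc.injected + row.injected,
      recovered: acc.recovered + row.recovered,
    }),
    { injected: 0, recovered: 0 }
  );
}

// Failure types with at least one unrecovered case.
function typesWithLosses(rows: FailureClass[]): string[] {
  return rows.filter((row) => row.recovered < row.injected).map((row) => row.type);
}
```

`totals(breakdown)` gives 14 injected and 12 recovered, and `typesWithLosses(breakdown)` names the two categories behind the unrecoverable cases.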

Selected repairs from the benchmark run

Real repairs executed by the PCEC engine during the Helix run.
Task #4 · below_zero · 429 Rate Limit
P: rate_limit → C: backoff_retry (92) vs switch_key (71)
E: backoff_retry wins → K: wait 2s, retry
✓ Fixed in 2.1s · Gene stored
Task #7 · filter_by_substring · TIMEOUT
P: timeout → C: increase_timeout (88) vs chunk_input (65)
E: increase_timeout wins → K: set 20s, retry
✓ Fixed in 1.4s · Gene stored
Task #34 · unique · TIMEOUT
⚡ IMMUNE — Gene hit: timeout → increase_timeout (used 3×)
✓ Instant fix: 0.1s
Task #41 · car_race_collision · Context Overflow
P: context_overflow → C: chunk_input (78) vs reduce_context (62)
E: chunk_input → K: split prompt, retry
✗ STILL FAILED — task genuinely needs 200K+ tokens
Gene note: “HumanEval/40 unrecoverable — flag for human review”
Task #48 · is_palindrome · 500 Server Error
P: server_error → C: retry (85) vs switch_endpoint (72)
E: retry → K: attempt 1/3… 2/3… 3/3…
✗ PERSISTENT — all 3 retries failed
Gene note: “server outage detected — alert + skip”
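
A minimal sketch of how traces like these could be produced, assuming P classifies the failure, C lists scored candidate strategies, E picks the highest score, and K applies it. This page does not expand the PCEC acronym, so the stage meanings, types, and function names below are hypothetical and are not the Helix API. Only the strategy names and scores are taken from the traces.

```typescript
// Hypothetical types; only strategy names and scores come from the traces above.
type FailureType = "rate_limit" | "timeout" | "context_overflow" | "server_error";

interface Candidate {
  strategy: string;
  score: number;
}

interface RepairPlan {
  failure: FailureType;
  candidates: Candidate[];
  chosen: Candidate;
}

// C: candidate strategies per failure type, with the scores shown in the traces.
const CANDIDATES: Record<FailureType, Candidate[]> = {
  rate_limit: [
    { strategy: "backoff_retry", score: 92 },
    { strategy: "switch_key", score: 71 },
  ],
  timeout: [
    { strategy: "increase_timeout", score: 88 },
    { strategy: "chunk_input", score: 65 },
  ],
  context_overflow: [
    { strategy: "chunk_input", score: 78 },
    { strategy: "reduce_context", score: 62 },
  ],
  server_error: [
    { strategy: "retry", score: 85 },
    { strategy: "switch_endpoint", score: 72 },
  ],
};

// P: map a raw error message to a failure type (assumed substring rules).
function perceive(message: string): FailureType | null {
  const m = message.toLowerCase();
  if (m.indexOf("429") !== -1 || m.indexOf("rate limit") !== -1) return "rate_limit";
  if (m.indexOf("timeout") !== -1) return "timeout";
  if (m.indexOf("context") !== -1) return "context_overflow";
  if (m.indexOf("500") !== -1 || m.indexOf("server error") !== -1) return "server_error";
  return null;
}

// E: pick the highest-scoring candidate.
function evaluate(candidates: Candidate[]): Candidate {
  return candidates.reduce((best, c) => (c.score > best.score ? c : best));
}

// P, then C, then E. The K step (actually applying the fix) is left to the caller.
function planRepair(message: string): RepairPlan | null {
  const failure = perceive(message);
  if (failure === null) return null;
  const candidates = CANDIDATES[failure];
  return { failure, candidates, chosen: evaluate(candidates) };
}
```

For example, `planRepair("429 Rate Limit")` selects `backoff_retry` over `switch_key`, matching the Task #4 trace. An unrecognized message returns `null` so the caller can escalate instead of guessing.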

5 capsules stored · 9 immune hits

Every repair teaches the system. Repeat failures get instant fixes.
rate_limit → backoff_retry · Immune: 2 · Avg: 2.1s
timeout → increase_timeout · Immune: 4 · Avg: 1.4s
context_overflow → chunk_input · Immune: 0 · Avg: 0.9s
json_parse → repair_json · Immune: 1 · Avg: 0.3s
server_error → retry_endpoint · Immune: 0 · Avg: 1.8s
9 immune hits — repairs that cost <100ms because the Gene Map already knew the fix.
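
One way to picture the Gene Map is as a lookup table from failure type to a known fix, consulted before any slower diagnosis runs. The sketch below works under that assumption. The `GeneMap` class, its methods, and the `repair` helper are hypothetical illustrations, not the Helix implementation.

```typescript
// A stored fix plus usage counters (illustrative shape).
interface Gene {
  strategy: string;
  used: number;
  immuneHits: number;
}

interface RepairResult {
  strategy: string;
  immune: boolean;
}

class GeneMap {
  private genes: { [failure: string]: Gene | undefined } = Object.create(null);

  // Return the stored strategy and count an immune hit, or undefined on a miss.
  lookup(failure: string): string | undefined {
    const gene = this.genes[failure];
    if (gene === undefined) return undefined;
    gene.used += 1;
    gene.immuneHits += 1;
    return gene.strategy;
  }

  // Record a fix the first time it is found; later calls keep the original.
  store(failure: string, strategy: string): void {
    if (this.genes[failure] === undefined) {
      this.genes[failure] = { strategy, used: 1, immuneHits: 0 };
    }
  }

  // Copy of the counters for one failure type, if any.
  stats(failure: string): Gene | undefined {
    const gene = this.genes[failure];
    return gene === undefined ? undefined : { ...gene };
  }
}

// Try the map first; fall back to a slower diagnosis and remember its answer.
function repair(
  map: GeneMap,
  failure: string,
  diagnose: (failure: string) => string
): RepairResult {
  const known = map.lookup(failure);
  if (known !== undefined) return { strategy: known, immune: true };
  const strategy = diagnose(failure);
  map.store(failure, strategy);
  return { strategy, immune: false };
}
```

The first `timeout` goes through `diagnose` and stores a gene. Every later `timeout` returns immediately with `immune: true`, which is the behavior the Task #34 trace shows.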
GPT-4o-mini scores 68% on HumanEval.
With Helix: 90%.
+22 percentage points.
Not from a better model.
From infrastructure that doesn't let failures go to waste.
12 of 14 recovered. 2 beyond repair. We're honest about limits.
Every repair makes the next one faster. Every Gene Capsule makes the network more resilient. That's the moat.

Deterministic · Seed 42 · Same results every run

Clone, install, run. You'll get the exact same numbers.
git clone https://github.com/adrianhihi/helix
cd helix
npm install

# Run the exact same benchmark (seed 42, same failures)
OPENAI_API_KEY=sk-xxx npm run benchmark

# View results
npx helix dash
# Open localhost:3710/benchmark
Deterministic. Seed 42 produces identical failure injection every run. Our results are reproducible by anyone.
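
To illustrate what seeded failure injection means, here is a sketch of picking the same set of tasks to break on every run. It is not the Helix injector: it swaps in a Park-Miller generator and a partial Fisher-Yates shuffle, and the function names and task count are assumptions.

```typescript
// Park-Miller "minimal standard" generator: the same seed yields the same sequence.
function makeRng(seed: number): () => number {
  let state = Math.floor(seed) % 2147483647;
  if (state <= 0) state += 2147483646;
  return () => {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646; // uniform in [0, 1)
  };
}

// Pick `count` distinct task indices from 0..taskCount-1 with a partial
// Fisher-Yates shuffle driven by the seeded generator.
function pickFailureTasks(seed: number, taskCount: number, count: number): number[] {
  const rng = makeRng(seed);
  const n = Math.min(count, taskCount);
  const indices: number[] = [];
  for (let i = 0; i < taskCount; i++) indices.push(i);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (taskCount - i));
    const tmp = indices[i];
    indices[i] = indices[j];
    indices[j] = tmp;
  }
  return indices.slice(0, n);
}
```

Calling `pickFailureTasks(42, 164, 14)` twice returns the same 14 indices. The 164 is the size of the full HumanEval set; whether the benchmark runs all of it is an assumption here.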

This is one benchmark on one model. Now imagine this at scale.

Visa/MC volume today: $1.8T/day
Agent failure rate: 5–8%
At risk: $500B/day

Helix on Tempo's payment lane: every stablecoin transaction that fails gets auto-repaired. Same PCEC engine. Same Gene Map. Same network effect.

Live MPP Validation

Beyond HumanEval, we validated against real MPP services on Tempo:

mpp.dev/api/ping/paid: Testnet payment successful
openai.mpp.tempo.xyz: TIP20 Uninitialized (mainnet)
parallelmpp.dev: TIP20 Uninitialized (mainnet)
exa.mpp.tempo.xyz: TIP20 Uninitialized (mainnet)

3 of 4 calls failed with network mismatch — the #1 failure agents face as MPP services launch on mainnet while agents develop on testnet.
Helix Scenario #13 handles this automatically.
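
A rough sketch of how an agent might flag this failure class from the responses listed above. It assumes the `TIP20 Uninitialized` message is the signal for a testnet agent calling a mainnet service, as the results suggest. The types and function names are hypothetical, and this is not how Scenario #13 is actually implemented.

```typescript
// One probe per service, with the messages reported on this page.
interface ProbeResult {
  service: string;
  message: string;
}

const probes: ProbeResult[] = [
  { service: "mpp.dev/api/ping/paid", message: "Testnet payment successful" },
  { service: "openai.mpp.tempo.xyz", message: "TIP20 Uninitialized (mainnet)" },
  { service: "parallelmpp.dev", message: "TIP20 Uninitialized (mainnet)" },
  { service: "exa.mpp.tempo.xyz", message: "TIP20 Uninitialized (mainnet)" },
];

// Assumed rule: this error text means the token is not set up on the
// network the agent is paying from.
function isNetworkMismatch(message: string): boolean {
  return message.indexOf("TIP20 Uninitialized") !== -1;
}

// Services whose response matches the mismatch rule.
function mismatchedServices(results: ProbeResult[]): string[] {
  return results.filter((r) => isNetworkMismatch(r.message)).map((r) => r.service);
}
```

`mismatchedServices(probes)` returns the three mainnet services, matching the 3 of 4 failures reported above.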

Active agents: 488
Transactions: 11.4K
MPP services: 50+
(source: mppscan.com, live data)