Self-Evolution Research
How Helix evolves itself
Five research methods. Four arXiv papers. One recursive improvement loop. The system that gets better at getting better.
G(generate) → V(verify) → U(update) → κ > 0 → ∞ improvement
GVU Operator — arXiv:2512.02731 — PCEC is a GVU instance. Mathematically proven: if signal > noise, self-improvement is guaranteed.
Six approaches compared: five methods plus the GVU theory
Two families: distillation-based (need GPU/training) and memory-based (no training needed). We chose the best from each.
🧠
MemRL
Episodic memory + Q-value utility. High-utility memories retrieved more, low-utility forgotten. Frozen backbone — LLM unchanged, memory evolves.
✓ Already shipped in v1.5
⚗️
EvolveR
Experience → abstract strategic principles. Not "nonce → refresh" but "state desync → refresh from source of truth." Knowledge distillation, not model distillation.
→ Next: Q2 2026
⚔️
Self-Play SWE-RL
Challenger generates failures → Repair agent fixes → Verifier validates. No real users needed. 24/7 evolution. Paper title: "toward superintelligent agents."
⭐ Most recommended
🔄
SDPO
Self-Distillation Policy Optimization. Model sees own failures + rich feedback → teaches itself. No external teacher. No reward model. Gene Map is the feedback.
→ Post-funding: Q4 2026
🧪
Classic Distillation
DeepSeek-style teacher→student
GPT-4 as teacher → train 7B student model. Student replaces LLM fallback. 100× faster, 1000× cheaper. But ceiling = teacher level.
△ Needs GPU + ML team
∞
GVU Theory
Mathematical proof: Generator→Verifier→Updater. If signal > noise, κ > 0, improvement is guaranteed. "Second Law of AGI Dynamics." PCEC is a GVU instance.
Theory foundation
MemRL — Why it fits Helix perfectly
Gene Map IS episodic memory. Q-value IS utility scoring. We already do: high-Q genes retrieved first, low-Q genes naturally forgotten (natural selection). The MemRL paper validates our architecture — we implemented it independently before the paper was published.
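The MemRL-style retrieval can be sketched in a few lines. This is a minimal illustration, not the actual Helix API: the `Gene` fields and the word-overlap similarity are stand-ins (production would use error embeddings), but it shows the core idea that ranking by similarity × Q surfaces high-utility genes first and lets low-utility ones sink.

```typescript
// Hypothetical shape of a Gene Map entry; field names are illustrative.
interface Gene {
  pattern: string;   // error signature this repair applies to
  strategy: string;  // repair action, e.g. "refresh_nonce"
  q: number;         // utility estimate in [0, 1]
  uses: number;      // retrieval count
}

// Rank candidate repairs by similarity x utility: high-Q genes are
// retrieved first, zero-overlap genes are filtered out entirely.
function retrieve(genes: Gene[], errorSig: string, topK = 3): Gene[] {
  const similarity = (g: Gene) =>
    g.pattern.split(" ").filter(w => errorSig.includes(w)).length /
    g.pattern.split(" ").length;
  return genes
    .map(g => ({ g, score: similarity(g) * g.q }))
    .filter(x => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(x => x.g);
}
```

The Q factor is what turns plain lookup into natural selection: two genes matching the same error equally well are separated purely by their track record.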
Limitation
Learning ceiling = the LLM's reasoning ability. Memory gets better, but the "brain" (the LLM) stays the same. Addressed later by combining with Self-Play and SDPO.
EvolveR — Knowledge distillation, not model distillation
Current Gene Map: nonce error on Tempo → refresh_nonce (Q=0.92). After EvolveR: state desync pattern → refresh from authoritative source (abstract principle). The abstract principle transfers better to new platforms than concrete examples.
Implementation
Offline phase: LLM summarizes clusters of repair records → extracts patterns → stores as "strategic principles" in Gene Map. Online phase: when new error arrives, match against principles first, concrete genes second. 2-3 weeks of work.
Self-Play SWE-RL — The AlphaGo moment
Three internal agents play against each other:
- Challenger: generates increasingly hard payment failures using error taxonomy + mutation
- Repair: PCEC attempts to fix (the agent being trained)
- Verifier: validates repair correctness against expected state
Key insight from paper: "no need for human-labeled issues or tests." Only needs a sandbox. We can build a payment failure simulator using our 31 existing scenarios as seeds, then mutate/combine them to create harder ones.
Why it can surpass GPT-4
GPT-4 has never seen most of these payment failures. Self-play generates failures GPT-4 never encountered. The repair agent learns from these → becomes better than GPT-4 at payment repair. This is the AlphaGo Zero effect.
SDPO — Self-teaching without a teacher
Traditional distillation: GPT-4 (teacher) → train student. SDPO: model sees own mistakes + Gene Map feedback → teaches itself. The "self-teacher" is the same model conditioned on feedback — it can identify what went wrong in hindsight.
Why post-funding
Needs GPU compute for training. Needs 100K+ Gene Map records for good feedback. Both require scale we don't have yet. But the architecture (Gene Map as rich feedback) is already in place.
Classic Distillation — Why we defer it
Pros: conceptually simple, VC understands "our DeepSeek." Cons:
- Needs GPU cluster ($10K+/month)
- Needs ML engineer (hiring)
- Student ceiling = teacher level (can't surpass GPT-4)
- Model goes stale (GPT-4 → GPT-5 → re-distill)
- 6+ months to production
Self-Play + EvolveR achieves similar goals without any of these costs. We distill knowledge into Gene Map, not parameters into a model.
GVU — The mathematical guarantee
The paper proves that ANY self-improving system can be decomposed into Generator→Verifier→Updater. PCEC maps directly:
- Generator = Construct (generates repair candidates)
- Verifier = Evaluate + Verify (scores and validates)
- Updater = Gene Map Q-value update
The "Variance Inequality" theorem: κ > 0 when generation+verification noise < signal strength. In our case: when Gene Map has enough data and Perceive is accurate enough, improvement is mathematically guaranteed.
We cite this as theoretical validation. We don't need to implement anything new — PCEC already IS a GVU operator.
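As a toy illustration of the Variance Inequality (the paper's exact functional form differs; this simplified linear model is an assumption made here for clarity): per-cycle gain κ is positive exactly when verification signal exceeds the combined generation and verification noise, and any positive κ compounds over cycles.

```typescript
// Simplified sketch: kappa > 0 iff signal strength exceeds total noise.
// The paper's actual theorem uses a different functional form.
function kappa(signal: number, genNoise: number, verNoise: number): number {
  return signal - (genNoise + verNoise);
}

// Quality after n GVU cycles under a (toy) linear accumulation assumption.
function qualityAfter(q0: number, k: number, n: number): number {
  return q0 + k * n;
}
```

The practical reading: shrinking Perceive's noise (better error classification) and raising Gene Map's signal (more data) both push κ above zero.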
The core trade-off
Do we train our own model, or evolve without training?
Distillation path
Train a dedicated model from Gene Map data
- ✓ Zero external LLM dependency
- ✓ Cost → $0 per repair
- ✓ VC loves "our own model"
- ✗ Needs GPU ($10K+/mo)
- ✗ Needs ML engineer
- ✗ Ceiling = teacher model
- ✗ 6+ months to production
vs
Memory evolution path ⭐
Evolve Gene Map knowledge without model training
- ✓ No GPU needed
- ✓ No ML team needed
- ✓ 3 months to results
- ✓ Can surpass GPT-4 (self-play)
- ✓ LLM calls approach 0 naturally
- △ Still uses LLM for novel errors
- △ Less "sexy" than own model
"We don't distill models. We distill knowledge. Gene Map remembers experience (MemRL), abstracts it into principles (EvolveR), generates new knowledge through self-play (SWE-RL), and proves convergence mathematically (GVU). No GPU needed."
— Our answer to "why not just fine-tune?"
→
Decision: Memory evolution now. Distillation after funding. Best of both worlds, sequenced by resources.
Recommended architecture
Four layers now (no GPU). Two layers post-funding (with GPU). Theory foundation throughout.
Layer 0 — Gene Dream cycle
next sprint
Background memory consolidation. Cluster → prune → consolidate → enrich → reindex. REM sleep for Gene Map.
Layer 1 — Memory-driven learning
✓ v1.5 shipped
Gene Map + Q-value utility + cross-platform transfer + predictive graph
Layer 2 — Experience distillation
Q2 2026
Raw repair records → abstract strategic principles → cross-domain transfer
Layer 3 — Self-play evolution
Q3 2026
Challenger ↔ Repair ↔ Verifier. 24/7 autonomous evolution. No users needed.
Layer 4 — Self-distillation (post-funding)
Q4 2026
SDPO: model teaches itself from Gene Map feedback. Zero external dependency.
Layer 5 — Full recursive loop
2027
Better model → better self-play → better data → better model → κ > 0 recursive
Gene Dream runs as a background daemon (or manually via npx helix dream). Triggers when Gene Map exceeds 1000 records AND 24h since last dream AND 50+ new repairs. Uses file lock to prevent concurrent runs. Read-only on agent code, write access only to Gene Map. Five stages: cluster similar genes → prune Q<0.2 stale genes → consolidate clusters into meta-genes (abstract principles) → enrich with conditional context → reindex predictive graph. The consolidation step is what makes this more than memory management — it's knowledge creation.
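The trigger gate described above can be sketched as a single predicate. Thresholds match the text; the `DreamState` shape and function name are illustrative, not the actual Helix internals:

```typescript
// Sketch of the Gene Dream trigger: all three thresholds must hold,
// and a file lock (modeled here as a boolean) blocks concurrent runs.
interface DreamState {
  geneCount: number;
  hoursSinceLastDream: number;
  repairsSinceLastDream: number;
  lockHeld: boolean; // another dream process is already running
}

function shouldDream(s: DreamState): boolean {
  return (
    !s.lockHeld &&
    s.geneCount > 1000 &&
    s.hoursSinceLastDream >= 24 &&
    s.repairsSinceLastDream >= 50
  );
}
```

Gating on all three conditions keeps dreams rare and worthwhile: enough total knowledge to cluster, enough elapsed time to amortize the LLM cost, and enough fresh repairs to have something new to consolidate.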
Already live: Gene Map stores repair capsules with Bayesian Q ± σ. Adaptive α (new genes learn fast). Context-aware lookup (adjusts by gas price, time, chain). Cross-platform: Tempo Gene protects Coinbase agents. Predictive graph preloads next failure.
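A minimal sketch of the Bayesian Q ± σ update with adaptive α, assuming a simple exponential-moving-average form (the exact update rule Helix uses may differ): young genes take large steps toward each new outcome, mature genes stabilize.

```typescript
// Illustrative Q +/- sigma estimate; the real update rule is not shown here.
interface QEstimate { q: number; sigma: number; n: number; }

function updateQ(est: QEstimate, reward: number): QEstimate {
  // Adaptive alpha: 1/(n+1) early so new genes learn fast,
  // floored at 0.05 so mature genes never stop adapting entirely.
  const alpha = Math.max(1 / (est.n + 1), 0.05);
  const q = est.q + alpha * (reward - est.q);
  // Uncertainty shrinks with consistent outcomes, widens on surprises.
  const sigma = Math.sqrt((1 - alpha) * (est.sigma ** 2 + alpha * (reward - est.q) ** 2));
  return { q, sigma, n: est.n + 1 };
}
```

The σ term is what makes lookup risk-aware: two genes with equal Q but different σ can be ranked by confidence, not just by mean utility.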
Offline phase: LLM clusters similar repair records → extracts abstract patterns → stores as strategic principles. Example: 50 nonce-related repairs across 3 platforms → one abstract principle: "when nonce state diverges from chain source of truth, re-derive nonce from chain head." This principle applies to any future blockchain nonce mechanism.
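The offline pass above can be sketched as cluster-then-summarize. In the real pipeline an LLM writes the principle text; here a placeholder template stands in, and the `RepairRecord`/`Principle` shapes are assumptions for illustration:

```typescript
// Illustrative record shapes; not the actual Gene Map schema.
interface RepairRecord { failureClass: string; platform: string; fix: string; }
interface Principle { failureClass: string; text: string; support: number; }

function distillPrinciples(records: RepairRecord[], minSupport = 3): Principle[] {
  // Stage 1: cluster records by failure class (embeddings in production).
  const clusters = new Map<string, RepairRecord[]>();
  for (const r of records) {
    const bucket = clusters.get(r.failureClass) ?? [];
    bucket.push(r);
    clusters.set(r.failureClass, bucket);
  }
  // Stage 2: abstract each well-supported cluster into one principle.
  const principles: Principle[] = [];
  for (const [cls, recs] of clusters) {
    if (recs.length < minSupport) continue; // too little evidence to abstract
    const platforms = new Set(recs.map(r => r.platform)).size;
    principles.push({
      failureClass: cls,
      // In production an LLM writes this; a template stands in here.
      text: `For ${cls}, re-derive state from the authoritative source (seen on ${platforms} platform(s))`,
      support: recs.length,
    });
  }
  return principles;
}
```

The `minSupport` threshold is the key design choice: abstracting from one or two examples produces superstition, not principles.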
Build a payment failure simulator using 31 existing scenarios as seeds. Challenger mutates and combines them: "what if nonce error + gas spike happen simultaneously?" → compound scenarios humans never designed. Repair agent learns to handle these. After 10K self-play rounds, the system has seen more failure combinations than any production deployment.
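The Challenger's combine operator can be sketched as follows. The `Scenario` shape and the additive difficulty rating are illustrative assumptions, not the simulator's actual design:

```typescript
// Illustrative scenario shape; real scenarios carry full chain state.
interface Scenario { name: string; faults: string[]; difficulty: number; }

// Combine two seed scenarios into a compound failure: the union of their
// faults, rated harder than either parent (capped at 1).
function combine(a: Scenario, b: Scenario): Scenario {
  return {
    name: `${a.name}+${b.name}`,
    faults: [...new Set([...a.faults, ...b.faults])],
    difficulty: Math.min(1, a.difficulty + b.difficulty),
  };
}
```

With 31 seeds, pairwise combination alone yields 465 compound scenarios; iterating the operator on its own outputs is what produces failure spaces no human test suite covers.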
With 100K+ Gene Map records (from real users + self-play), SDPO trains a dedicated Helix model. The model reads its own repair attempts + Gene Map feedback → learns where it went wrong → improves. No external teacher needed. Just Gene Map as "rich feedback" per the SDPO paper.
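Constructing SDPO training pairs from Gene Map feedback might look like the sketch below. The `Attempt` shape and prompt template are assumptions for illustration: a failed attempt whose correct fix was later found in Gene Map becomes a (prompt, target) pair the model trains on.

```typescript
// Illustrative shapes; not the actual Gene Map record format.
interface Attempt { error: string; tried: string; succeeded: boolean; correctFix?: string; }
interface TrainingPair { prompt: string; target: string; }

// Only failures with known-good fixes become training signal: the model
// sees its own mistake plus hindsight feedback, and learns the correction.
function toPairs(attempts: Attempt[]): TrainingPair[] {
  return attempts
    .filter(a => !a.succeeded && a.correctFix !== undefined)
    .map(a => ({
      prompt: `Error: ${a.error}\nFailed attempt: ${a.tried}\nCorrect repair:`,
      target: a.correctFix!,
    }));
}
```

This is why 100K+ records matter: the pairs come only from the failure-then-success subset, which is a fraction of total repairs.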
The trained Helix model generates higher-quality self-play data → better data trains a better model → better model generates even better data. This is the recursive loop. GVU theory proves κ > 0 when Gene Map is large enough and Perceive is accurate enough. Both conditions will be met by 2027.
The recursive loop
Each cycle produces a system strictly better than the last. Not a promise — a theorem.
① Remember — MemRL
Gene Map accumulates repair experience. Q-value tracks what works. Memory is the substrate of intelligence.
② Abstract — EvolveR via Gene Dream
Raw memories distilled into strategic principles. The Gene Dream cycle (inspired by REM sleep + Claude Code Auto Dream) runs offline: cluster → prune → consolidate → enrich → reindex. Concrete → abstract. Instance → pattern.
③ Challenge — Self-Play
System generates harder problems than reality provides. Solves them. Learns. Surpasses teacher models. The AlphaGo moment.
④ Internalize — SDPO
All new knowledge distilled back into a self-taught model. No external teacher. Gene Map feedback is the curriculum. Cost → $0.
⑤ Recurse — GVU Loop
Better model → better self-play → better data → better abstraction → better model. Each cycle: κ > 0. Improvement accelerates.
↻
Step ⑤ feeds back to step ①. Each cycle the system is strictly better than the last. GVU proves: when signal > noise, this convergence is mathematically guaranteed.
Implementation timeline
v1.5 ✓
MemRL
Gene Map + Q-value
Q2 2026
EvolveR
Abstract principles
Q3 2026
Self-Play
24/7 auto-evolve
Q4 2026
SDPO
Self-distillation
2027
Full loop
κ > 0 recursive
Gene Dream cycle
Inspired by human REM sleep and Claude Code's Auto Dream — memory consolidation that creates new knowledge.
Awake phase (real-time)
Gene Map accumulates repair records as they happen. Fast, raw, unprocessed. Every repair = one new gene. Q-values update incrementally. No consolidation.
Trigger: every repair() call
Dream phase (background)
Gene Dream runs offline. Clusters similar genes, prunes stale knowledge, consolidates into abstract principles, enriches context, rebuilds indices. Wakes up smarter.
Trigger: 1000+ genes && 24h+ && 50+ new repairs
Five dream stages
①
Cluster
Group similar genes by error embedding similarity. "nonce mismatch on Tempo" + "AA25 nonce on Coinbase" + "privy nonce desync" → cluster: nonce state divergence
②
Prune
Remove genes with Q < 0.2 and age > 7 days. These strategies have been tried and failed consistently. Dead branches cut.
③
Consolidate
Merge clusters into abstract principles via LLM. 3 concrete nonce genes → 1 meta-gene: "When nonce state diverges from chain source of truth, re-derive from chain head. Works across all EVM chains." This is knowledge creation, not just cleanup.
④
Enrich
Add conditional context. "refresh_nonce works" becomes "refresh_nonce works when gas < 50 gwei, fails when > 200 gwei (use speed_up instead)." Vague → precise.
⑤
Reindex
Rebuild Predictive Failure Graph, update Error Embedding signatures, refresh Strategy Chain detection rules. The whole system benefits from cleaner knowledge.
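The Prune stage is the most mechanical of the five and can be sketched directly. Thresholds match the text; the `GeneRecord` fields are illustrative:

```typescript
// Illustrative gene record; the real capsule carries far more context.
interface GeneRecord { id: string; q: number; ageDays: number; }

// Drop genes that are BOTH low-utility and past the grace window.
// Requiring both conditions means young low-Q genes keep a chance
// to prove themselves before being cut.
function prune(genes: GeneRecord[], qFloor = 0.2, maxAgeDays = 7): GeneRecord[] {
  return genes.filter(g => !(g.q < qFloor && g.ageDays > maxAgeDays));
}
```

The AND condition is the important detail: pruning on low Q alone would kill promising strategies that simply haven't accumulated enough trials yet.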
Claude Code Auto Dream
Reviews past transcripts. Prunes stale memories. Consolidates contradictions. Organizes into indexed files. Memory management.
→ Better retrieval
Helix Gene Dream
Everything Auto Dream does, plus: abstracts concrete repairs into strategic principles. Creates meta-genes. Generates new knowledge from existing knowledge. Knowledge creation.
→ Better retrieval + new knowledge
🌙
Gene Dream is the bridge between MemRL (raw memory) and EvolveR (abstract principles). It's how raw experience becomes wisdom. Same biology, applied to agents.
For investors
"DeepSeek distills models. We distill knowledge. Gene Map remembers (MemRL) → abstracts (EvolveR) → self-challenges (Self-Play SWE-RL) → self-teaches (SDPO) → recurses (GVU). No GPU. No ML team. The intelligence lives in the data, not the parameters."
— Why we chose this path
"AlphaGo didn't need human games to surpass humans. Helix doesn't need real failures to surpass GPT-4 at repair. Self-play generates failure scenarios no human designed. The repair agent evolves in simulation, deploys in production. Four arXiv papers validate every step."
— The AlphaGo analogy
"Competitors copy code. They can't copy 10K rounds of self-play. They can't copy 100K Gene Map records. They can't copy abstract strategic principles distilled over 6 months. Every day Helix runs, the moat widens. Not because we write more code — because the system evolves itself."
— The compounding moat