Helix · Engineering Plan
20 failure scenarios across 3 platforms at launch · Cross-platform Gene immunity proven in demo

Milestone & Infrastructure Roadmap

From hackathon demo to production payment operating system. Each phase covers: what to build, what infrastructure you need, what it costs, and the key technical decisions.

Phase 1: Production SDK

Timeline: 2 weeks · Team: 1 engineer · Infra cost: ~$0-20/mo

Turn the hackathon demo into a production SDK that moves real money on Tempo. PCEC Commit calls real chain functions. Gene Map stores real repair history. Everything else stays the same.

Infrastructure

Helix SDK (library)

Node.js · TypeScript · npm package
Not a server. A library that runs inside the agent's own process. wrap() intercepts errors, PCEC repairs, retries. Zero separate infrastructure.
Key decision: SDK is a library, not a microservice. This means zero deployment overhead for users — just npm install. No Docker, no sidecar, no port forwarding.
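A minimal sketch of what that in-process surface could look like — the names (`wrap`, `RepairFn`) and signature are assumptions for illustration, not the published @helix-agent/core API:

```typescript
// Hypothetical sketch of the wrap() surface: intercept an error, let a
// repair hook (PCEC) try to fix the cause, then retry the original call.
type RepairFn = (err: unknown) => Promise<boolean>; // true = worth retrying

function wrap<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  repair: RepairFn,
  maxRetries = 2,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn(...args); // happy path: no interception cost
      } catch (err) {
        // Repair hook runs in the same process — no sidecar, no RPC hop.
        if (attempt >= maxRetries || !(await repair(err))) throw err;
      }
    }
  };
}
```

Usage would be a one-liner like `const pay = wrap(sendPayment, repairWithPcec)` — the agent's own call site, nothing deployed.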

Gene Map

SQLite · WAL mode
Local .db file per agent instance. Stores Gene Capsules with success counts and timing. Zero ops overhead. Survives process restarts.
Why not Postgres? Single-writer workload. Embedded database. Zero config. You don't need Postgres until Phase 2 when multiple instances share Gene data.
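An illustrative shape for the embedded store — the column names here are assumptions for this sketch, not the shipped schema. The running-average helper shows how a Gene Capsule can keep timing data without storing raw samples:

```typescript
// Illustrative Gene Map schema for the embedded SQLite store (WAL mode).
// Column names are assumptions, not the shipped schema.
export const GENE_MAP_DDL = `
  PRAGMA journal_mode = WAL;
  CREATE TABLE IF NOT EXISTS genes (
    failure_code   TEXT PRIMARY KEY,  -- e.g. a 402 sub-code from Tempo
    repair_plan    TEXT NOT NULL,     -- serialized PCEC Commit steps (JSON)
    success_count  INTEGER NOT NULL DEFAULT 0,
    avg_repair_ms  REAL    NOT NULL DEFAULT 0,
    updated_at     INTEGER NOT NULL   -- unix ms; survives process restarts
  );
`;

// Incremental running average for repair timing, updated after each
// success, so the capsule stores two numbers instead of every sample.
export function nextAvgRepairMs(avgMs: number, successCount: number, sampleMs: number): number {
  return (avgMs * successCount + sampleMs) / (successCount + 1);
}
```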

Tempo chain access

Tempo RPC · mppx SDK
Connect to Tempo public RPC or self-hosted node. Execute real DEX swaps, TIP-20 transfers, session renewals, nonce queries via mppx TypeScript SDK.
This is the only new dependency. Everything else from hackathon stays. mppx is Tempo's official SDK. Start with public RPC, self-host later for reliability.

What you DON'T need

No Redis. No message queue. No Postgres. No Kubernetes. No Docker. No separate microservices. No load balancer. Just npm install and it works.
Resist over-architecture. Phase 1 goal: prove real money flows through PCEC correctly. One successful DEX swap repair in production is worth more than a perfect K8s setup.
Deliverables

1. Real Commit execution — swap_currency calls Tempo DEX, renew_session calls MPP API, refresh_nonce queries RPC — 5 days
2. End-to-end test — trigger a real 402 on testnet, PCEC repairs, payment succeeds, Gene stored — 3 days
3. Gene Map migration — add source/confidence fields, expiry column, migration system for schema updates — 1 day
4. Safety guards — monthly budget cap enforcement, per-repair cost ceiling, wallet key isolation — 1 day
5. npm publish — @helix-agent/core on npm registry, README, TypeDoc API docs, example project — 2 days
6. Multi-platform adapter system — Tempo (13 scenarios), Privy (4 scenarios), Generic HTTP (3 scenarios), pluggable adapter registry — done
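The safety guards in deliverable 4 can be sketched as a pure check that runs before PCEC commits funds — the names and the idea of a single `BudgetState` record are illustrative assumptions:

```typescript
// Sketch of the Phase 1 safety guards: a monthly budget cap and a
// per-repair cost ceiling, both checked before any Commit moves money.
export interface BudgetState {
  monthlyCapUsd: number;     // hard cap across all repairs this month
  perRepairCapUsd: number;   // ceiling for any single repair
  spentThisMonthUsd: number; // running total, persisted alongside the Gene Map
}

export function canSpend(state: BudgetState, repairCostUsd: number): boolean {
  if (repairCostUsd <= 0) return false;                    // reject nonsense inputs
  if (repairCostUsd > state.perRepairCapUsd) return false; // single-repair ceiling
  return state.spentThisMonthUsd + repairCostUsd <= state.monthlyCapUsd;
}
```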
Infrastructure summary

Component          Choice
Servers            0 (library, not service)
Database           SQLite (embedded)
Cache              none
Message queue      none
External APIs      Tempo RPC + mppx SDK
Deploy complexity  npm install

Monthly infrastructure cost: ~$0 (SDK runs in the agent's existing process)

Phase 2: Intelligence Layer

Timeline: 2 weeks · Team: 1-2 engineers · Infra cost: ~$50-150/mo

Add LLM Fallback for unknown failures, upgrade Gene Map to shared Postgres, add Redis caching on the critical path, and build the event replay system for debugging and audit.

Infrastructure

Helix API server

Node.js · Express
Same SDK process. Adds LLM Fallback: on Gene Map miss, sends failure context to Claude/GPT API, validates candidates in sandbox, stores result as new Gene with confidence:0.6.
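The fallback flow above can be sketched with the Gene store and LLM client as injected interfaces — every name here is illustrative, and the in-memory store stands in for the real one so the sketch is runnable:

```typescript
// Flow sketch of LLM Fallback: Gene Map hit short-circuits; a miss asks
// the LLM once, then memoizes the result with confidence 0.6.
export interface Gene { plan: string; source: "pcec" | "llm" | "human"; confidence: number }

export interface GeneStore {
  get(code: string): Gene | undefined;
  put(code: string, gene: Gene): void;
}

export async function resolveRepair(
  code: string,
  store: GeneStore,
  askLlm: (code: string) => Promise<string>, // returns a sandbox-validated plan
): Promise<Gene> {
  const known = store.get(code);
  if (known) return known; // Gene Map hit: no LLM call, no extra cost

  // Miss: ask the LLM once, store the result so the same failure
  // never triggers the LLM twice.
  const plan = await askLlm(code);
  const gene: Gene = { plan, source: "llm", confidence: 0.6 };
  store.put(code, gene);
  return gene;
}
```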

Gene Map → PostgreSQL

PostgreSQL 16 · Supabase or RDS
Upgrade from SQLite when 2+ agent instances need shared Gene data. Adds columns: source ('pcec'|'llm'|'human'), confidence (0-1), expires_at. Full migration system.
Trigger: Switch when you have 2+ instances. If still single-instance, SQLite is fine. Don't upgrade prematurely.

Redis cache

Redis 7 · Upstash or ElastiCache
Cache hot Gene lookups. PCEC Perceive checks Redis first (sub-ms), falls back to Postgres (5ms). TTL = 5 min. Invalidate on Gene store.
Why now: Gene lookup is on every payment's critical path. At 100+ repairs/min, 5ms Postgres round-trip adds up. Redis gives you <1ms.

LLM API (Claude/GPT)

Claude API · OpenAI API
Unknown failure handler. Only called on Gene Map MISS (~5% initially, drops to 1-2% after first weeks). Each call creates a Gene → same failure never triggers LLM twice.
Cost model: ~$0.01 per LLM call × 2% miss rate × 1000 repairs/day = ~$0.20/day. Converges to near-zero as Gene Map fills up.

Event log / replay store

PostgreSQL · NATS JetStream (optional)
Every PCEC step logged as append-only event. Enables: deterministic replay ("why did it swap instead of retry?"), audit trails, debugging. Postgres append-only table for storage. NATS JetStream only if you need real-time streaming to multiple consumers (dashboard, alerting).
Start simple: Postgres append-only table. Add NATS only when you need real-time event streaming to 3+ consumers.
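Deterministic replay is just a fold over the append-only log. The event shape below is an illustrative assumption; the point is that replaying up to any sequence number reconstructs the decision state at that moment:

```typescript
// Replay sketch: each PCEC step is an append-only event; folding the log
// answers questions like "why did it swap instead of retry?" at any point.
export interface PcecEvent {
  seq: number;                                  // append-only ordering key
  step: "perceive" | "classify" | "evolve" | "commit";
  detail: string;
}

export interface ReplayState { steps: string[]; committed: boolean }

export function replay(events: PcecEvent[], upToSeq = Infinity): ReplayState {
  const state: ReplayState = { steps: [], committed: false };
  for (const e of [...events].sort((a, b) => a.seq - b.seq)) {
    if (e.seq > upToSeq) break;                 // time-travel: stop mid-history
    state.steps.push(`${e.step}:${e.detail}`);
    if (e.step === "commit") state.committed = true;
  }
  return state;
}
```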
Key decisions

When to switch from SQLite? → 2+ instances. SQLite is single-process. The moment you run two agents sharing repair knowledge, you need Postgres.
Redis vs in-memory cache? → Redis. A process restart wipes an in-memory cache. Redis persists across restarts and can be shared across instances.
Claude vs GPT for fallback? → Both, with failover. Call Claude primary, GPT fallback. If Claude is down, don't let unknown failures go unhandled.
NATS vs Redis pub/sub? → Skip for now. Redis pub/sub loses messages when a subscriber is offline. But you don't need real-time broadcast yet; Postgres polling is fine.
Phase 2 Platform Expansion

A. Privy wallet failure handling — nonce desync, gas sponsor depletion, cross-chain mismatch, policy spending limits — done
B. Stripe adapter — card_declined, expired_card, rate_limit (placeholder → full implementation) — 1 week
C. Cross-platform Gene immunity — proven: Tempo Genes heal Privy failures automatically — done
Infrastructure summary

Component      Choice
Servers        1 (API + PCEC, can be same box)
Database       PostgreSQL 16 (shared Gene Map + event log)
Cache          Redis 7 (Gene lookup hot path)
Message queue  none yet (Postgres polling)
External APIs  Tempo RPC + Claude API + OpenAI API
Deploy         Docker Compose (3-4 containers)

Monthly infrastructure cost: ~$80. Breakdown: Postgres $25 (Supabase) + Redis $15 (Upstash) + LLM ~$20 + server $20. Scales to ~$150 at 10K repairs/day.

Phase 3: Network Layer

Timeline: 2 weeks · Team: 1-2 engineers · Infra cost: ~$300-800/mo

The architecture complexity jump. From single-process library to distributed system. Gene Registry server, NATS message broker, agent mesh, on-chain smart contracts. This is where the network effect starts.

Infrastructure

Gene Registry server

Node.js or Go · REST + WebSocket
NEW service. Receives Gene pushes from all agents, deduplicates, quality-scores (successRate × usage × proof), broadcasts to subscribers. Serves helix gene push/pull CLI commands.
Why separate service: Cannot run inside SDK. Needs to be always-on, handle concurrent pushes from hundreds of agents, deduplicate efficiently. This is your "npm registry" for Genes.
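The quality score (successRate × usage × proof) might be implemented along these lines — the log-dampening of raw usage is an illustrative assumption, not a specified weighting:

```typescript
// Sketch of the Registry's Gene quality score. A failed proof zeroes the
// score; usage is log-dampened so raw popularity can't drown out quality.
export function geneQuality(
  successRate: number,  // 0..1 observed repair success
  usageCount: number,   // how many agents have exercised this Gene
  proofValid: boolean,  // proof check passed (e.g. against the on-chain anchor)
): number {
  if (!proofValid) return 0;            // unproven Genes never rank
  return successRate * Math.log1p(usageCount);
}
```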

NATS JetStream

NATS · JetStream
Required now. Gene broadcast to all subscribed agents. Agent subscribes to error codes it cares about. NATS handles: pub/sub, persistent delivery, replay from offset.
Why NATS not Kafka: Gene broadcast doesn't need Kafka's throughput (millions/sec). NATS is lighter, JetStream adds persistence. Kafka is overkill until 10K+ agents.
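One way to let agents subscribe only to the error codes they care about is a subject scheme like `genes.<platform>.<failure_code>` — that naming is an assumption of this sketch. The matcher below mirrors NATS wildcard semantics (`*` matches one token, `>` matches one or more trailing tokens):

```typescript
// Subject matcher mirroring NATS semantics, for subjects such as
// "genes.tempo.DEX_SLIPPAGE" (the subject scheme is an assumption).
export function subjectMatches(pattern: string, subject: string): boolean {
  const p = pattern.split(".");
  const s = subject.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") return s.length > i;  // ">" needs at least one more token
    if (i >= s.length) return false;        // subject ran out of tokens
    if (p[i] !== "*" && p[i] !== s[i]) return false;
  }
  return p.length === s.length;             // pattern must cover the whole subject
}
```

An agent handling only Tempo failures would subscribe to `genes.tempo.>`; one watching nonce desync on any platform could use `genes.*.NONCE_DESYNC`.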

PostgreSQL cluster

PostgreSQL 16 · Primary + read replica · pgvector (optional)
All published Genes. Indexed by failure_code, category, chain, success_rate. Read replica for Registry queries (high read volume). pgvector if you want "find Gene closest to this failure pattern" similarity search.

Redis cluster

Redis Cluster · 3 nodes
Gene lookup cache for all agents. Registry popularity rankings. Rate limiting for push/pull API. WebSocket session store for mesh connections.

Tempo smart contracts

Solidity · Tempo mainnet
On-chain Gene anchoring: store Gene hash + proof. $HELIX ERC-20 token on Tempo. Staking contract for Gene quality. Governance contract for protocol parameters.
Trust layer: Without contracts, Registry is centralized. With them, anyone can verify Gene proofs and the registry can't censor contributions.

Agent mesh relay

WebSocket relay → libp2p later
Agent-to-agent communication for cascade scenarios (A→B→C). State sharing, coordinated rollback, distributed SAGA. Start with WebSocket relay server. Migrate to libp2p for true P2P later.
Start centralized: WebSocket relay is 1 week to build. libp2p is 6 weeks. Get the flow working first, decentralize later.
Key decisions

NATS vs Kafka? → NATS JetStream. 10x lighter. JetStream adds persistence. Kafka only at 10K+ agents with millions of events/sec.
pgvector needed? → Nice-to-have. Exact match on failure_code covers 95%. Similarity search helps with "close but not identical" failures; add it when you have data proving it helps.
WebSocket vs libp2p? → WebSocket first. 1 week vs 6 weeks. Get the mesh working, decentralize later. Most agents are behind NAT anyway.
Token on Tempo vs L2? → Tempo native. You're building FOR Tempo. A token on Tempo means Gene anchoring and token transfers are same-chain atomic.
Kubernetes needed? → Yes, now. 5+ services, NATS, a Postgres cluster, a Redis cluster. Docker Compose won't cut it for reliability.
Infrastructure summary

Component        Choice
Servers          3 (API, Registry, Mesh relay)
Database         PostgreSQL cluster (primary + read replica)
Cache            Redis Cluster (3 nodes)
Message queue    NATS JetStream (Gene broadcast + events)
Smart contracts  $HELIX token + Gene anchor + staking
Deploy           Kubernetes (8-12 pods)

Monthly infrastructure cost: ~$500. Breakdown: 3 servers ($150) + Postgres ($80) + Redis ($50) + NATS ($30) + K8s ($100) + Tempo gas ($50). Scales to ~$800 at 100 agents.

Phase 4: Full Operating System

Timeline: 2 weeks · Team: 1-2 engineers · Infra cost: ~$1,500-3,000/mo

Natural language objectives → task decomposition → DAG execution → PCEC at every step. Uses Claude/GPT for intent parsing (don't build your own NLU). Temporal.io for orchestration (don't build your own scheduler).

Infrastructure

ODE intent server

Node.js · Claude API
Receives natural language objectives. Calls Claude for intent parsing → structured JSON task graph. Validates output. Builds DAG. Hands off to Temporal orchestrator.
Don't build NLU. Claude/GPT do intent parsing. You build the validation + DAG construction + execution framework. LLM = brain. Helix = hands + immune system.
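The "validates output" step matters because the LLM's task graph is untrusted input. A sketch of that gate — the task shape is an illustrative assumption — checks ids and dependencies and rejects cyclic graphs via Kahn's algorithm before anything executes:

```typescript
// Validation gate for an LLM-produced task graph: reject duplicate ids,
// unknown dependencies, and cycles before building the DAG.
export interface TaskNode { id: string; action: string; deps: string[] }

export function validateTaskGraph(tasks: TaskNode[]): string[] {
  const errors: string[] = [];
  const ids = new Set(tasks.map((t) => t.id));
  if (ids.size !== tasks.length) errors.push("duplicate task id");

  for (const t of tasks) {
    for (const d of t.deps) {
      if (!ids.has(d)) errors.push(`task ${t.id} depends on unknown ${d}`);
    }
  }

  // Kahn's algorithm: if we cannot order every node, there is a cycle.
  const indeg = new Map(tasks.map((t) => [t.id, t.deps.length]));
  const queue = tasks.filter((t) => t.deps.length === 0).map((t) => t.id);
  let ordered = 0;
  while (queue.length > 0) {
    const id = queue.shift()!;
    ordered += 1;
    for (const t of tasks) {
      if (t.deps.includes(id)) {
        const left = indeg.get(t.id)! - 1;
        indeg.set(t.id, left);
        if (left === 0) queue.push(t.id);
      }
    }
  }
  if (ordered !== tasks.length && errors.length === 0) errors.push("cycle detected");
  return errors; // empty = safe to hand to the orchestrator
}
```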

Temporal.io orchestrator

Temporal server · Worker pool
Executes task DAGs with durability. Each task node is a Temporal Activity wrapped in PCEC. Handles: parallel execution, conditional branches, timeout, retry, rollback, visibility.
Don't build your own orchestrator. Temporal gives you: durable execution, automatic retry, workflow persistence, replay, visibility UI. Building this yourself = 3+ months. Temporal = 2 weeks to integrate.
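Temporal owns durability and retries, but the DAG fan-out order is easy to picture: tasks group into "waves" where every member's dependencies were satisfied by earlier waves, so each wave can run its activities in parallel. A runnable sketch (the task shape is an illustrative assumption):

```typescript
// Compute parallel execution "waves" for a task DAG: each wave contains
// every remaining task whose dependencies are already done.
export interface DagTask { id: string; deps: string[] }

export function executionWaves(tasks: DagTask[]): string[][] {
  const done = new Set<string>();
  const pending = new Map(tasks.map((t) => [t.id, t]));
  const waves: string[][] = [];

  while (pending.size > 0) {
    const wave = [...pending.values()]
      .filter((t) => t.deps.every((d) => done.has(d)))
      .map((t) => t.id);
    if (wave.length === 0) throw new Error("cycle or unknown dependency");
    for (const id of wave) { done.add(id); pending.delete(id); }
    waves.push(wave); // every task in a wave can fan out concurrently
  }
  return waves;
}
```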

Evolution worker

Python or Node.js · NATS consumer
Background worker. Reads Gene usage data from NATS stream. Runs evolutionary parameter optimization on slippage, splitCount, maxWait. Writes optimized params back to Postgres.
Not on critical path. Runs hourly or triggered by enough new data. Can start as a cron job. Doesn't touch real-time repair flow.
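The worker's core loop reduces to: for each tunable parameter, compare observed outcomes per candidate value and only switch once there is enough data. A sketch in TypeScript for consistency with the rest of this document (the plan suggests Python for this worker); thresholds and field names are illustrative assumptions:

```typescript
// Pick the best-performing value for a parameter (e.g. slippage tolerance)
// from observed outcomes, requiring a minimum sample size before switching.
export interface ParamStats { value: number; successes: number; attempts: number }

export function bestParam(
  candidates: ParamStats[],
  current: number,
  minSamples = 20,
): number {
  let best = current;
  let bestRate = -1;
  for (const c of candidates) {
    if (c.attempts < minSamples) continue; // not enough data to trust yet
    const rate = c.successes / c.attempts;
    if (rate > bestRate) { bestRate = rate; best = c.value; }
  }
  return best; // falls back to the current value when nothing qualifies
}
```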

Predictive engine

Python · TimescaleDB
Time-series analysis on Gene Map data. Detects patterns: "Service X fails Mondays 2am." Pre-emptive routing before failure. TimescaleDB = Postgres extension for time-range queries.
No new database. TimescaleDB is a PostgreSQL extension. Add it to existing Postgres. No new ops burden.
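The "fails Mondays 2am" detection reduces to bucketing failures by hour-of-week (168 buckets) and flagging outliers — TimescaleDB's time-bucketing would produce the counts; the threshold below is an illustrative assumption:

```typescript
// Flag hour-of-week buckets whose failure count is far above the mean,
// e.g. "Service X fails Mondays 2am" → a spike in one of 168 buckets.
export function anomalousHours(
  countsByHourOfWeek: number[], // 168 buckets: failures per hour-of-week
  multiplier = 3,               // flag buckets above multiplier × mean
): number[] {
  const mean =
    countsByHourOfWeek.reduce((a, b) => a + b, 0) / countsByHourOfWeek.length;
  const flagged: number[] = [];
  countsByHourOfWeek.forEach((count, hour) => {
    if (count > multiplier * mean) flagged.push(hour);
  });
  return flagged; // hours worth pre-emptive routing around
}
```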

Complete system map

User objective ("Pay 100 employees")
→ ODE + Claude (intent → task DAG)
→ Temporal.io (DAG orchestration)
→ PCEC per task (self-heal at each step)
→ Gene Map (learn + evolve)

Key decisions

Build NLU or use LLM? → Claude/GPT API. Intent parsing goes from 12 weeks to 2 weeks. The LLM understands "pay 100 employees in 5 currencies"; you validate + execute.
Build orchestrator or Temporal? → Temporal.io. DAG scheduling is a solved problem. Building your own: 3 months. Temporal integration: 2 weeks. Plus a free visibility UI.
TimescaleDB or InfluxDB? → TimescaleDB. It's a Postgres extension. No new database, no new ops, the same SQL you already know.
Evolution: Python or Node? → Python. Better ML ecosystem (numpy, scipy for parameter optimization). But Node works too if you prefer one language.
Infrastructure summary

Component        Choice
Servers          5-7 (API, Registry, ODE, Temporal, Evolution, Predictive, Mesh)
Database         PostgreSQL + TimescaleDB (primary + 2 replicas)
Cache            Redis Cluster (3+ nodes)
Message queue    NATS JetStream (Gene broadcast + events + evolution data)
Workflow engine  Temporal.io (2 servers + worker pool)
Smart contracts  $HELIX + Gene anchor + staking + governance
Monitoring       Grafana + Prometheus
Deploy           Kubernetes cluster (15-20 pods, 3 namespaces)

Monthly infrastructure cost: ~$2,000. Breakdown: 7 servers ($400) + Postgres cluster ($150) + Redis ($80) + NATS ($50) + Temporal ($200) + LLM API ($200) + Tempo gas ($100) + K8s ($400) + monitoring ($50). Scales to ~$3,000 at 1,000 agents.