Helix · Engineering Plan
20 failure scenarios across 3 platforms at launch · Cross-platform Gene immunity proven in demo

Milestone & Infrastructure Roadmap

From hackathon demo to production payment operating system. Each phase covers: what to build, what infrastructure you need, what it costs, and the key technical decisions.

Phase 1: Production SDK

Timeline: 2 weeks · Team: 1 engineer · Infra cost: ~$0-20/mo

Turn the hackathon demo into a production SDK that moves real money on Tempo. PCEC Commit calls real chain functions. Gene Map stores real repair history. Everything else stays the same.

Infrastructure

Helix SDK (library)

Node.js · TypeScript · npm package
Not a server. A library that runs inside the agent's own process. wrap() intercepts errors, PCEC repairs, retries. Zero separate infrastructure.
Key decision: SDK is a library, not a microservice. This means zero deployment overhead for users — just npm install. No Docker, no sidecar, no port forwarding.
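A minimal sketch of what that in-process surface could look like — the names (`wrap`, `RepairFn`) and signature are assumptions for illustration, not the published @helix-agent/core API:

```typescript
// Hypothetical sketch of the wrap() surface: intercept an error, let a
// repair hook (PCEC) try to fix the cause, then retry the original call.
type RepairFn = (err: unknown) => Promise<boolean>; // true = worth retrying

function wrap<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  repair: RepairFn,
  maxRetries = 2,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn(...args); // happy path: no interception cost
      } catch (err) {
        // Repair hook runs in the same process — no sidecar, no RPC hop.
        if (attempt >= maxRetries || !(await repair(err))) throw err;
      }
    }
  };
}
```

Usage would be a one-liner like `const pay = wrap(sendPayment, repairWithPcec)` — the agent's own call site, nothing deployed.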

Gene Map

SQLite · WAL mode
Local .db file per agent instance. Stores Gene Capsules with success counts and timing. Zero ops overhead. Survives process restarts.
Why not Postgres? Single-writer workload. Embedded database. Zero config. You don't need Postgres until Phase 2 when multiple instances share Gene data.
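An illustrative shape for the embedded store — the column names here are assumptions for this sketch, not the shipped schema. The running-average helper shows how a Gene Capsule can keep timing data without storing raw samples:

```typescript
// Illustrative Gene Map schema for the embedded SQLite store (WAL mode).
// Column names are assumptions, not the shipped schema.
export const GENE_MAP_DDL = `
  PRAGMA journal_mode = WAL;
  CREATE TABLE IF NOT EXISTS genes (
    failure_code   TEXT PRIMARY KEY,  -- e.g. a 402 sub-code from Tempo
    repair_plan    TEXT NOT NULL,     -- serialized PCEC Commit steps (JSON)
    success_count  INTEGER NOT NULL DEFAULT 0,
    avg_repair_ms  REAL    NOT NULL DEFAULT 0,
    updated_at     INTEGER NOT NULL   -- unix ms; survives process restarts
  );
`;

// Incremental running average for repair timing, updated after each
// success, so the capsule stores two numbers instead of every sample.
export function nextAvgRepairMs(avgMs: number, successCount: number, sampleMs: number): number {
  return (avgMs * successCount + sampleMs) / (successCount + 1);
}
```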

Tempo chain access

Tempo RPC · mppx SDK
Connect to Tempo public RPC or self-hosted node. Execute real DEX swaps, TIP-20 transfers, session renewals, nonce queries via mppx TypeScript SDK.
This is the only new dependency. Everything else from hackathon stays. mppx is Tempo's official SDK. Start with public RPC, self-host later for reliability.

What you DON'T need

No Redis. No message queue. No Postgres. No Kubernetes. No Docker. No separate microservices. No load balancer. Just npm install and it works.
Resist over-architecture. Phase 1 goal: prove real money flows through PCEC correctly. One successful DEX swap repair in production is worth more than a perfect K8s setup.
Deliverables

1. Real Commit execution — swap_currency calls Tempo DEX, renew_session calls MPP API, refresh_nonce queries RPC — 5 days
2. End-to-end test — trigger a real 402 on testnet, PCEC repairs, payment succeeds, Gene stored — 3 days
3. Gene Map migration — add source/confidence fields, expiry column, migration system for schema updates — 1 day
4. Safety guards — monthly budget cap enforcement, per-repair cost ceiling, wallet key isolation — 1 day
5. npm publish — @helix-agent/core on npm registry, README, TypeDoc API docs, example project — 2 days
6. Multi-platform adapter system — Tempo (13 scenarios), Privy (4 scenarios), Generic HTTP (3 scenarios), pluggable adapter registry — done
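The safety guards in deliverable 4 can be sketched as a pure check that runs before PCEC commits funds — the names and the idea of a single `BudgetState` record are illustrative assumptions:

```typescript
// Sketch of the Phase 1 safety guards: a monthly budget cap and a
// per-repair cost ceiling, both checked before any Commit moves money.
export interface BudgetState {
  monthlyCapUsd: number;     // hard cap across all repairs this month
  perRepairCapUsd: number;   // ceiling for any single repair
  spentThisMonthUsd: number; // running total, persisted alongside the Gene Map
}

export function canSpend(state: BudgetState, repairCostUsd: number): boolean {
  if (repairCostUsd <= 0) return false;                    // reject nonsense inputs
  if (repairCostUsd > state.perRepairCapUsd) return false; // single-repair ceiling
  return state.spentThisMonthUsd + repairCostUsd <= state.monthlyCapUsd;
}
```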
Infrastructure summary

Component          Choice
Servers            0 (library, not service)
Database           SQLite (embedded)
Cache              none
Message queue      none
External APIs      Tempo RPC + mppx SDK
Deploy complexity  npm install

Monthly infrastructure cost: ~$0 (SDK runs in the agent's existing process)

Phase 2: Intelligence Layer

Timeline: 2 weeks · Team: 1-2 engineers · Infra cost: ~$50-150/mo

Add LLM Fallback for unknown failures, upgrade Gene Map to shared Postgres, add Redis caching on the critical path, and build the event replay system for debugging and audit.

Infrastructure

Helix API server

Node.js · Express
Same SDK process. Adds LLM Fallback: on Gene Map miss, sends failure context to Claude/GPT API, validates candidates in sandbox, stores result as new Gene with confidence:0.6.
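The fallback flow above can be sketched with the Gene store and LLM client as injected interfaces — every name here is illustrative, and the in-memory store stands in for the real one so the sketch is runnable:

```typescript
// Flow sketch of LLM Fallback: Gene Map hit short-circuits; a miss asks
// the LLM once, then memoizes the result with confidence 0.6.
export interface Gene { plan: string; source: "pcec" | "llm" | "human"; confidence: number }

export interface GeneStore {
  get(code: string): Gene | undefined;
  put(code: string, gene: Gene): void;
}

export async function resolveRepair(
  code: string,
  store: GeneStore,
  askLlm: (code: string) => Promise<string>, // returns a sandbox-validated plan
): Promise<Gene> {
  const known = store.get(code);
  if (known) return known; // Gene Map hit: no LLM call, no extra cost

  // Miss: ask the LLM once, store the result so the same failure
  // never triggers the LLM twice.
  const plan = await askLlm(code);
  const gene: Gene = { plan, source: "llm", confidence: 0.6 };
  store.put(code, gene);
  return gene;
}
```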

Gene Map → PostgreSQL

PostgreSQL 16 · Supabase or RDS
Upgrade from SQLite when 2+ agent instances need shared Gene data. Adds columns: source ('pcec'|'llm'|'human'), confidence (0-1), expires_at. Full migration system.
Trigger: Switch when you have 2+ instances. If still single-instance, SQLite is fine. Don't upgrade prematurely.

Redis cache

Redis 7 · Upstash or ElastiCache
Cache hot Gene lookups. PCEC Perceive checks Redis first (sub-ms), falls back to Postgres (5ms). TTL = 5 min. Invalidate on Gene store.
Why now: Gene lookup is on every payment's critical path. At 100+ repairs/min, 5ms Postgres round-trip adds up. Redis gives you <1ms.

LLM API (Claude/GPT)

Claude API · OpenAI API
Unknown failure handler. Only called on Gene Map MISS (~5% initially, drops to 1-2% after first weeks). Each call creates a Gene → same failure never triggers LLM twice.
Cost model: ~$0.01 per LLM call × 2% miss rate × 1000 repairs/day = ~$0.20/day. Converges to near-zero as Gene Map fills up.

Event log / replay store

PostgreSQL · NATS JetStream (optional)
Every PCEC step logged as append-only event. Enables: deterministic replay ("why did it swap instead of retry?"), audit trails, debugging. Postgres append-only table for storage. NATS JetStream only if you need real-time streaming to multiple consumers (dashboard, alerting).
Start simple: Postgres append-only table. Add NATS only when you need real-time event streaming to 3+ consumers.
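Deterministic replay is just a fold over the append-only log. The event shape below is an illustrative assumption; the point is that replaying up to any sequence number reconstructs the decision state at that moment:

```typescript
// Replay sketch: each PCEC step is an append-only event; folding the log
// answers questions like "why did it swap instead of retry?" at any point.
export interface PcecEvent {
  seq: number;                                  // append-only ordering key
  step: "perceive" | "classify" | "evolve" | "commit";
  detail: string;
}

export interface ReplayState { steps: string[]; committed: boolean }

export function replay(events: PcecEvent[], upToSeq = Infinity): ReplayState {
  const state: ReplayState = { steps: [], committed: false };
  for (const e of [...events].sort((a, b) => a.seq - b.seq)) {
    if (e.seq > upToSeq) break;                 // time-travel: stop mid-history
    state.steps.push(`${e.step}:${e.detail}`);
    if (e.step === "commit") state.committed = true;
  }
  return state;
}
```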
Key decisions

When to switch from SQLite? → 2+ instances. SQLite is single-process. The moment you run two agents sharing repair knowledge, you need Postgres.
Redis vs in-memory cache? → Redis. A process restart wipes an in-memory cache. Redis persists across restarts and can be shared across instances.
Claude vs GPT for fallback? → Both, with failover. Call Claude primary, GPT fallback. If Claude is down, don't let unknown failures go unhandled.
NATS vs Redis pub/sub? → Skip for now. Redis pub/sub loses messages when a subscriber is offline. But you don't need real-time broadcast yet; Postgres polling is fine.
Phase 2 Platform Expansion

A. Privy wallet failure handling — nonce desync, gas sponsor depletion, cross-chain mismatch, policy spending limits — done
B. Stripe adapter — card_declined, expired_card, rate_limit (placeholder → full implementation) — 1 week
C. Cross-platform Gene immunity — proven: Tempo Genes heal Privy failures automatically — done
Infrastructure summary

Component      Choice
Servers        1 (API + PCEC, can be same box)
Database       PostgreSQL 16 (shared Gene Map + event log)
Cache          Redis 7 (Gene lookup hot path)
Message queue  none yet (Postgres polling)
External APIs  Tempo RPC + Claude API + OpenAI API
Deploy         Docker Compose (3-4 containers)

Monthly infrastructure cost: ~$80. Breakdown: Postgres $25 (Supabase) + Redis $15 (Upstash) + LLM ~$20 + server $20. Scales to ~$150 at 10K repairs/day.

Phase 3: Network Layer

Timeline: 2 weeks · Team: 1-2 engineers · Infra cost: ~$300-800/mo

The architecture complexity jump. From single-process library to distributed system. Gene Registry server, NATS message broker, agent mesh, on-chain smart contracts. This is where the network effect starts.

Infrastructure

Gene Registry server

Node.js or Go · REST + WebSocket
NEW service. Receives Gene pushes from all agents, deduplicates, quality-scores (successRate × usage × proof), broadcasts to subscribers. Serves helix gene push/pull CLI commands.
Why separate service: Cannot run inside SDK. Needs to be always-on, handle concurrent pushes from hundreds of agents, deduplicate efficiently. This is your "npm registry" for Genes.
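The quality score (successRate × usage × proof) might be implemented along these lines — the log-dampening of raw usage is an illustrative assumption, not a specified weighting:

```typescript
// Sketch of the Registry's Gene quality score. A failed proof zeroes the
// score; usage is log-dampened so raw popularity can't drown out quality.
export function geneQuality(
  successRate: number,  // 0..1 observed repair success
  usageCount: number,   // how many agents have exercised this Gene
  proofValid: boolean,  // proof check passed (e.g. against the on-chain anchor)
): number {
  if (!proofValid) return 0;            // unproven Genes never rank
  return successRate * Math.log1p(usageCount);
}
```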

NATS JetStream

NATS · JetStream
Required now. Gene broadcast to all subscribed agents. Agent subscribes to error codes it cares about. NATS handles: pub/sub, persistent delivery, replay from offset.
Why NATS not Kafka: Gene broadcast doesn't need Kafka's throughput (millions/sec). NATS is lighter, JetStream adds persistence. Kafka is overkill until 10K+ agents.
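One way to let agents subscribe only to the error codes they care about is a subject scheme like `genes.<platform>.<failure_code>` — that naming is an assumption of this sketch. The matcher below mirrors NATS wildcard semantics (`*` matches one token, `>` matches one or more trailing tokens):

```typescript
// Subject matcher mirroring NATS semantics, for subjects such as
// "genes.tempo.DEX_SLIPPAGE" (the subject scheme is an assumption).
export function subjectMatches(pattern: string, subject: string): boolean {
  const p = pattern.split(".");
  const s = subject.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") return s.length > i;  // ">" needs at least one more token
    if (i >= s.length) return false;        // subject ran out of tokens
    if (p[i] !== "*" && p[i] !== s[i]) return false;
  }
  return p.length === s.length;             // pattern must cover the whole subject
}
```

An agent handling only Tempo failures would subscribe to `genes.tempo.>`; one watching nonce desync on any platform could use `genes.*.NONCE_DESYNC`.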

PostgreSQL cluster

PostgreSQL 16 · Primary + read replica · pgvector (optional)
All published Genes. Indexed by failure_code, category, chain, success_rate. Read replica for Registry queries (high read volume). pgvector if you want "find Gene closest to this failure pattern" similarity search.

Redis cluster

Redis Cluster · 3 nodes
Gene lookup cache for all agents. Registry popularity rankings. Rate limiting for push/pull API. WebSocket session store for mesh connections.

Tempo smart contracts

Solidity · Tempo mainnet
On-chain Gene anchoring: store Gene hash + proof. $HELIX ERC-20 token on Tempo. Staking contract for Gene quality. Governance contract for protocol parameters.
Trust layer: Without contracts, Registry is centralized. With them, anyone can verify Gene proofs and the registry can't censor contributions.

Agent mesh relay

WebSocket relay → libp2p later
Agent-to-agent communication for cascade scenarios (A→B→C). State sharing, coordinated rollback, distributed SAGA. Start with WebSocket relay server. Migrate to libp2p for true P2P later.
Start centralized: WebSocket relay is 1 week to build. libp2p is 6 weeks. Get the flow working first, decentralize later.
Key decisions

NATS vs Kafka? → NATS JetStream. 10x lighter. JetStream adds persistence. Kafka only at 10K+ agents with millions of events/sec.
pgvector needed? → Nice-to-have. Exact match on failure_code covers 95%. Similarity search helps with "close but not identical" failures; add it when you have data proving it helps.
WebSocket vs libp2p? → WebSocket first. 1 week vs 6 weeks. Get the mesh working, decentralize later. Most agents are behind NAT anyway.
Token on Tempo vs L2? → Tempo native. You're building FOR Tempo. A token on Tempo means Gene anchoring and token transfers are same-chain atomic.
Kubernetes needed? → Yes, now. 5+ services, NATS, a Postgres cluster, a Redis cluster. Docker Compose won't cut it for reliability.
Infrastructure summary

Component        Choice
Servers          3 (API, Registry, Mesh relay)
Database         PostgreSQL cluster (primary + read replica)
Cache            Redis Cluster (3 nodes)
Message queue    NATS JetStream (Gene broadcast + events)
Smart contracts  $HELIX token + Gene anchor + staking
Deploy           Kubernetes (8-12 pods)

Monthly infrastructure cost: ~$500. Breakdown: 3 servers ($150) + Postgres ($80) + Redis ($50) + NATS ($30) + K8s ($100) + Tempo gas ($50). Scales to ~$800 at 100 agents.

Phase 4: Full Operating System

Timeline: 2 weeks · Team: 1-2 engineers · Infra cost: ~$1,500-3,000/mo

Natural language objectives → task decomposition → DAG execution → PCEC at every step. Uses Claude/GPT for intent parsing (don't build your own NLU). Temporal.io for orchestration (don't build your own scheduler).

Infrastructure

ODE intent server

Node.js · Claude API
Receives natural language objectives. Calls Claude for intent parsing → structured JSON task graph. Validates output. Builds DAG. Hands off to Temporal orchestrator.
Don't build NLU. Claude/GPT do intent parsing. You build the validation + DAG construction + execution framework. LLM = brain. Helix = hands + immune system.
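The "validates output" step matters because the LLM's task graph is untrusted input. A sketch of that gate — the task shape is an illustrative assumption — checks ids and dependencies and rejects cyclic graphs via Kahn's algorithm before anything executes:

```typescript
// Validation gate for an LLM-produced task graph: reject duplicate ids,
// unknown dependencies, and cycles before building the DAG.
export interface TaskNode { id: string; action: string; deps: string[] }

export function validateTaskGraph(tasks: TaskNode[]): string[] {
  const errors: string[] = [];
  const ids = new Set(tasks.map((t) => t.id));
  if (ids.size !== tasks.length) errors.push("duplicate task id");

  for (const t of tasks) {
    for (const d of t.deps) {
      if (!ids.has(d)) errors.push(`task ${t.id} depends on unknown ${d}`);
    }
  }

  // Kahn's algorithm: if we cannot order every node, there is a cycle.
  const indeg = new Map(tasks.map((t) => [t.id, t.deps.length]));
  const queue = tasks.filter((t) => t.deps.length === 0).map((t) => t.id);
  let ordered = 0;
  while (queue.length > 0) {
    const id = queue.shift()!;
    ordered += 1;
    for (const t of tasks) {
      if (t.deps.includes(id)) {
        const left = indeg.get(t.id)! - 1;
        indeg.set(t.id, left);
        if (left === 0) queue.push(t.id);
      }
    }
  }
  if (ordered !== tasks.length && errors.length === 0) errors.push("cycle detected");
  return errors; // empty = safe to hand to the orchestrator
}
```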

Temporal.io orchestrator

Temporal server · Worker pool
Executes task DAGs with durability. Each task node is a Temporal Activity wrapped in PCEC. Handles: parallel execution, conditional branches, timeout, retry, rollback, visibility.
Don't build your own orchestrator. Temporal gives you: durable execution, automatic retry, workflow persistence, replay, visibility UI. Building this yourself = 3+ months. Temporal = 2 weeks to integrate.
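Temporal owns durability and retries, but the DAG fan-out order is easy to picture: tasks group into "waves" where every member's dependencies were satisfied by earlier waves, so each wave can run its activities in parallel. A runnable sketch (the task shape is an illustrative assumption):

```typescript
// Compute parallel execution "waves" for a task DAG: each wave contains
// every remaining task whose dependencies are already done.
export interface DagTask { id: string; deps: string[] }

export function executionWaves(tasks: DagTask[]): string[][] {
  const done = new Set<string>();
  const pending = new Map(tasks.map((t) => [t.id, t]));
  const waves: string[][] = [];

  while (pending.size > 0) {
    const wave = [...pending.values()]
      .filter((t) => t.deps.every((d) => done.has(d)))
      .map((t) => t.id);
    if (wave.length === 0) throw new Error("cycle or unknown dependency");
    for (const id of wave) { done.add(id); pending.delete(id); }
    waves.push(wave); // every task in a wave can fan out concurrently
  }
  return waves;
}
```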

Evolution worker

Python or Node.js · NATS consumer
Background worker. Reads Gene usage data from NATS stream. Runs evolutionary parameter optimization on slippage, splitCount, maxWait. Writes optimized params back to Postgres.
Not on critical path. Runs hourly or triggered by enough new data. Can start as a cron job. Doesn't touch real-time repair flow.
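The worker's core loop reduces to: for each tunable parameter, compare observed outcomes per candidate value and only switch once there is enough data. A sketch in TypeScript for consistency with the rest of this document (the plan suggests Python for this worker); thresholds and field names are illustrative assumptions:

```typescript
// Pick the best-performing value for a parameter (e.g. slippage tolerance)
// from observed outcomes, requiring a minimum sample size before switching.
export interface ParamStats { value: number; successes: number; attempts: number }

export function bestParam(
  candidates: ParamStats[],
  current: number,
  minSamples = 20,
): number {
  let best = current;
  let bestRate = -1;
  for (const c of candidates) {
    if (c.attempts < minSamples) continue; // not enough data to trust yet
    const rate = c.successes / c.attempts;
    if (rate > bestRate) { bestRate = rate; best = c.value; }
  }
  return best; // falls back to the current value when nothing qualifies
}
```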

Predictive engine

Python · TimescaleDB
Time-series analysis on Gene Map data. Detects patterns: "Service X fails Mondays 2am." Pre-emptive routing before failure. TimescaleDB = Postgres extension for time-range queries.
No new database. TimescaleDB is a PostgreSQL extension. Add it to existing Postgres. No new ops burden.
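The "fails Mondays 2am" detection reduces to bucketing failures by hour-of-week (168 buckets) and flagging outliers — TimescaleDB's time-bucketing would produce the counts; the threshold below is an illustrative assumption:

```typescript
// Flag hour-of-week buckets whose failure count is far above the mean,
// e.g. "Service X fails Mondays 2am" → a spike in one of 168 buckets.
export function anomalousHours(
  countsByHourOfWeek: number[], // 168 buckets: failures per hour-of-week
  multiplier = 3,               // flag buckets above multiplier × mean
): number[] {
  const mean =
    countsByHourOfWeek.reduce((a, b) => a + b, 0) / countsByHourOfWeek.length;
  const flagged: number[] = [];
  countsByHourOfWeek.forEach((count, hour) => {
    if (count > multiplier * mean) flagged.push(hour);
  });
  return flagged; // hours worth pre-emptive routing around
}
```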

Complete system map

User objective ("Pay 100 employees")
→ ODE + Claude (intent → task DAG)
→ Temporal.io (DAG orchestration)
→ PCEC per task (self-heal at each step)
→ Gene Map (learn + evolve)

Key decisions

Build NLU or use LLM? → Claude/GPT API. Intent parsing goes from 12 weeks to 2 weeks. The LLM understands "pay 100 employees in 5 currencies"; you validate + execute.
Build orchestrator or Temporal? → Temporal.io. DAG scheduling is a solved problem. Building your own: 3 months. Temporal integration: 2 weeks. Plus a free visibility UI.
TimescaleDB or InfluxDB? → TimescaleDB. It's a Postgres extension. No new database, no new ops, the same SQL you already know.
Evolution: Python or Node? → Python. Better ML ecosystem (numpy, scipy for parameter optimization). But Node works too if you prefer one language.
Infrastructure summary

Component        Choice
Servers          5-7 (API, Registry, ODE, Temporal, Evolution, Predictive, Mesh)
Database         PostgreSQL + TimescaleDB (primary + 2 replicas)
Cache            Redis Cluster (3+ nodes)
Message queue    NATS JetStream (Gene broadcast + events + evolution data)
Workflow engine  Temporal.io (2 servers + worker pool)
Smart contracts  $HELIX + Gene anchor + staking + governance
Monitoring       Grafana + Prometheus
Deploy           Kubernetes cluster (15-20 pods, 3 namespaces)

Monthly infrastructure cost: ~$2,000. Breakdown: 7 servers ($400) + Postgres cluster ($150) + Redis ($80) + NATS ($50) + Temporal ($200) + LLM API ($200) + Tempo gas ($100) + K8s ($400) + monitoring ($50). Scales to ~$3,000 at 1,000 agents.