Complete Implementation Blueprint: Technology Stack & Architecture Guide
Introduction: From Requirements to Reality
Over the past four parts of this series, we’ve built up the architecture for a real-time ads platform serving 1M+ QPS with 150ms P99 latency:
Part 1 established the architectural foundation - requirements analysis, latency budgeting (decomposing 150ms across components), resilience patterns (circuit breakers, graceful degradation), and the P99 tail latency challenge. We identified three critical drivers: revenue maximization, sub-150ms latency, and 99.9% availability. These requirements shaped every decision that followed.
Part 2 designed the dual-source revenue engine - parallelizing internal ML-scored inventory (65ms) with external RTB auctions (100ms) to achieve 30-48% revenue lift over single-source approaches. We detailed the OpenRTB protocol implementation, GBDT-based CTR prediction, feature engineering pipeline, and timeout handling strategies.
Part 3 built the data layer - L1/L2/L3 cache hierarchy (Caffeine → Redis/Valkey → CockroachDB) achieving 78-88% hit rates and sub-10ms reads. We covered eCPM-based auction mechanisms for fair price comparison across CPM/CPC/CPA models, and distributed budget pacing using atomic operations with proven ≤1% overspend guarantee.
Part 4 addressed production operations - pattern-based fraud detection (20-30% bot filtering), active-active multi-region deployment with 2-5min failover, zero-downtime schema evolution, clock synchronization for financial ledgers, observability with error budgets, zero-trust security, and chaos engineering validation.
Part 5 (this post) brings it all together - the complete technology stack with concrete choices, detailed configurations, and integration patterns. This is where abstract requirements become a deployable system.
What This Post Covers
- Complete Technology Stack - Every component with specific versions, rationale, and alternatives considered
- Technology Decision Framework - The five criteria used for every choice
- Runtime & Infrastructure - Java 21 + ZGC configuration, Kubernetes cluster setup, container orchestration
- Communication Layer - gRPC setup with connection pooling, Linkerd service mesh configuration
- Data Layer - CockroachDB cluster topology, Valkey sharding strategy, Caffeine cache sizing
- Feature Platform - Tecton architecture (Offline: Spark + Rift, Online: Redis), Flink integration
- Observability - Prometheus + Thanos multi-region setup, Tempo sampling strategy, Grafana dashboards
- Integration Patterns - How all components work together as a cohesive system
- Validation - How the final architecture meets Part 1’s requirements
Let’s dive into the decisions.
Complete Technology Stack
Here’s the final stack, organized by layer:
Application Layer
| Component | Technology | Version | Rationale |
|---|---|---|---|
| Ad Server Orchestrator | Java + Spring Boot | 21 LTS | Ecosystem maturity, ZGC availability, team expertise |
| Garbage Collector | ZGC (Z Garbage Collector) | Java 21+ | <1ms p99.9 pauses, eliminates GC as P99 contributor |
| User Profile Service | Java + Spring Boot | 21 LTS | Dual-mode architecture (identity + contextual fallback), consistency with orchestrator |
| ML Inference | GBDT (LightGBM/XGBoost) | - | Day-1 CTR prediction, 20ms inference. Evolution path: two-pass ranking with distilled DNN reranker (see Part 2) |
| Budget Service | Java + Spring Boot | 21 LTS | Strong consistency requirements, atomic operations |
| RTB Gateway | Java + Spring Boot | 21 LTS | HTTP/2 connection pooling, protobuf support |
| Integrity Check | Go | 1.21+ | Sub-ms latency, minimal resource footprint, stateless filtering |
Communication Layer
| Component | Technology | Rationale |
|---|---|---|
| Internal RPC | gRPC over HTTP/2 | Binary serialization (3-10× smaller than JSON), type safety, <1ms overhead |
| External API | REST/JSON over HTTP/2 | OpenRTB standard compliance, DSP compatibility |
| Service Mesh | Linkerd | Lightweight (5-10ms overhead), native gRPC support, mTLS |
| Service Discovery | Kubernetes DNS | Built-in, no external dependencies, <1ms resolution |
| Load Balancing | Kubernetes Service + gRPC client-side | L7 awareness, connection-level distribution |
Data Layer
| Component | Technology | Rationale |
|---|---|---|
| L3: Transactional DB | CockroachDB Serverless | User profiles, campaigns, billing ledger. Strong consistency, cross-region ACID transactions, HLC timestamps. 50-75% cheaper than DynamoDB, fully managed. Self-hosted break-even depends on operational costs (see capacity planning). |
| L2: Distributed Cache | Valkey 8.x (Redis fork) | Budget counters (DECRBY atomic), L2 cache, rate limit tokens. Sub-ms latency, permissive BSD-3 license |
| L1: In-Process Cache | Caffeine | Hot user profiles, 60-70% hit rate. 8-12× faster than Redis, JVM-native, excellent eviction |
| Feature Store | Tecton (managed) | Batch (Spark) + Streaming (Rift) + Real-time online store. Sub-10ms P99, Redis-backed |
Infrastructure Layer
| Component | Technology | Rationale |
|---|---|---|
| Container Orchestration | Kubernetes 1.28 or later | Industry standard, declarative config, auto-scaling, multi-region federation |
| Container Runtime | containerd | Lightweight, OCI-compliant, lower overhead than Docker |
| Cloud Provider | AWS (multi-region) | Broadest service coverage, mature networking (VPC peering, Transit Gateway) |
| Regions | us-east-1, us-west-2, eu-west-1 | Geographic distribution, <50ms inter-region latency |
| CDN/Edge | CloudFront + Lambda@Edge | Global PoPs, request routing, geo-filtering |
Observability Layer
| Component | Technology | Rationale |
|---|---|---|
| Metrics | Prometheus + Thanos | Kubernetes-native, multi-region aggregation, PromQL for SLO queries |
| Distributed Tracing | OpenTelemetry + Tempo | Vendor-neutral, low overhead, latency analysis across services |
| Logging | Fluentd + Loki | Structured logs, label-based querying, cost-effective storage |
| Alerting | Alertmanager | Integrated with Prometheus, SLO-based alerts, escalation policies |
Technology Decision Framework
Every technology choice in this architecture was evaluated against five criteria:
1. Latency Impact
Does it fit within the component’s latency budget? (From Part 1’s latency decomposition)
- Example: ZGC’s <2ms pauses vs G1GC’s 41-55ms pauses
- Example: gRPC’s binary protocol vs JSON’s parsing overhead
2. Operational Complexity
How many additional systems, proxies, or failure modes does it introduce?
- Example: Envoy Gateway + Linkerd (two lightweight, gRPC-native proxies with a shared operational model) vs Kong + Istio (two heavier proxy stacks with separate configuration and observability)
- Example: Tecton (managed) vs self-hosted Feast
3. Cost Efficiency
What’s the total cost of ownership at 1M+ QPS scale?
- Example: CockroachDB 2-3× cheaper than DynamoDB at 1M+ QPS (post-Nov 2024 pricing)
- Example: Kubernetes bin-packing achieves 60% more capacity than VMs
4. Team Expertise
Can the team operate it effectively, or does it require hiring specialists?
- Example: Java ecosystem maturity vs Go’s smaller tooling ecosystem
- Example: Postgres-compatible CockroachDB vs learning Spanner
5. Production Validation
Has it been proven at similar scale by other companies?
- Example: Netflix’s ZGC validation at scale
- Example: LinkedIn’s Valkey adoption
When trade-offs were necessary, latency always won - because every millisecond lost reduces revenue at 1M+ QPS.
Runtime & Garbage Collection: Java 21 + ZGC
Decision: Java 21 + Generational ZGC
Why Java over Go/Rust:
- Ecosystem maturity: Battle-tested libraries for ads (OpenRTB, protobuf, gRPC), mature monitoring tools
- Team expertise: Java developers are easier to hire than Rust specialists
- Sub-millisecond GC: Modern ZGC eliminates GC as a latency source
Why ZGC over G1GC/Shenandoah:
- G1GC: Stop-the-world pauses of 41-55ms at P99.9 - consumes 30% of latency budget
- Shenandoah: Concurrent, but higher CPU overhead (15-20% vs ZGC’s 10%)
- ZGC: Sub-10ms pauses typical, design goal <1ms. Netflix production deployment (March 2024) on JDK 21 with Generational ZGC reports “no explicit tuning required” for critical streaming services. Achievable with proper heap sizing and allocation rate management.
ZGC Configuration
Key Configuration Decisions:
Heap Sizing: 32GB heap chosen based on allocation rate analysis. With 5,000 QPS per instance and an average request allocating ~50KB of objects, the allocation rate reaches 250 MB/sec. At that rate, ZGC's concurrent collector cycles the 32GB heap roughly every 2 minutes while keeping the live set near 50% of the heap.
Why 32GB:
- Large enough to avoid frequent GC cycles (allocation rate 250 MB/sec)
- Small enough for fast evacuation during compaction phases
- Matches EC2 instance memory profile: 64GB total (32GB JVM heap + 32GB OS page cache for Redis/file operations)
Thread Pool Strategy:
- Request threads: 200 virtual threads (Java 21 Project Loom) - lightweight execution without OS thread limitations, enabling high concurrency without thread pool exhaustion
- gRPC threads: 32 threads (2× CPU cores) dedicated to I/O operations for handling network communication with downstream services
- Background tasks: 16 threads for async operations like event publishing to Kafka and cache warming
Validation: From Part 1, the P99 tail at 1M QPS is 10,000 req/sec. With G1GC's 41-55ms pauses, 410-550 of those requests would time out per pause. ZGC's <2ms P99.9 pauses (32GB heap) affect only ~20 requests per pause - a roughly 95% reduction in GC-caused timeouts.
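To make this concrete, here is a minimal sketch of the executor wiring described above, together with the JVM flags it assumes (Generational ZGC with a fixed 32GB heap). Class and field names are illustrative, not taken from an actual codebase.

```java
// Assumed JVM flags for this profile: -XX:+UseZGC -XX:+ZGenerational -Xms32g -Xmx32g
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class OrchestratorExecutors {

    // Request handling: virtual threads (Project Loom) give each in-flight request its own
    // lightweight thread, so high concurrency never exhausts a fixed platform-thread pool.
    public static final ExecutorService REQUESTS =
            Executors.newVirtualThreadPerTaskExecutor();

    // gRPC I/O: 32 platform threads (2x the 16 vCPUs of a c6i.4xlarge node) dedicated to
    // network communication with downstream services.
    public static final ExecutorService GRPC_IO = Executors.newFixedThreadPool(32);

    // Background work: Kafka event publishing, cache warming, and other async tasks.
    public static final ExecutorService BACKGROUND = Executors.newFixedThreadPool(16);

    private OrchestratorExecutors() {}
}
```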
Communication Layer: gRPC + Linkerd
gRPC Configuration
Why gRPC over REST/JSON: From Part 1’s latency budget, service-to-service calls must be <10ms. JSON parsing overhead adds 2-5ms per request.
- Protocol buffers: 3-10× smaller than JSON, zero-copy deserialization
- HTTP/2 multiplexing: Single TCP connection carries multiple RPCs
- Streaming: Supports bidirectional streaming (useful for RTB auctions)
Connection Pooling Strategy:
Each Ad Server instance maintains 32 persistent connections to each downstream service. At 5,000 QPS per instance, this yields ~156 requests per second per connection, effectively reusing connections and avoiding expensive connection establishment overhead (TLS handshakes cost 10-20ms).
Key configuration decisions:
- Keepalive pings (60s intervals): Detect dead connections proactively before requests fail
- Keepalive timeout (20s): Close unresponsive connections to prevent request accumulation
- Message size limit (4MB): Prevents memory exhaustion from unexpectedly large responses
- Plaintext transport: Encryption handled by Linkerd service mesh at proxy layer, avoiding double-encryption overhead
Load balancing: Round-robin distribution across service replicas with DNS-based service discovery (Kubernetes DNS provides automatic endpoint updates).
Retry Policy: Maximum 2 attempts with exponential backoff (10ms → 50ms). Critical: Only retry UNAVAILABLE status (service temporarily down), never DEADLINE_EXCEEDED (timeout) - retrying timeouts amplifies cascading failures under load.
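As an illustration of these settings, here is a hedged sketch using grpc-java's ManagedChannelBuilder; the hostname, port, and fully-qualified service name are placeholders rather than the platform's real identifiers.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public final class DownstreamChannels {

    public static ManagedChannel userProfileChannel() {
        return ManagedChannelBuilder
                .forAddress("user-profile.production.svc.cluster.local", 9090)
                .usePlaintext()                             // encryption is handled by the Linkerd sidecar
                .keepAliveTime(60, TimeUnit.SECONDS)        // detect dead connections proactively
                .keepAliveTimeout(20, TimeUnit.SECONDS)     // close unresponsive connections
                .maxInboundMessageSize(4 * 1024 * 1024)     // 4MB cap on responses
                .defaultLoadBalancingPolicy("round_robin")  // spread RPCs across resolved endpoints
                .defaultServiceConfig(retryServiceConfig())
                .enableRetry()
                .build();
    }

    // Retry policy: max 2 attempts, 10ms -> 50ms backoff, UNAVAILABLE only (never DEADLINE_EXCEEDED).
    private static Map<String, Object> retryServiceConfig() {
        return Map.of("methodConfig", List.of(Map.of(
                "name", List.of(Map.of("service", "ads.UserProfileService")),
                "retryPolicy", Map.of(
                        "maxAttempts", 2.0,
                        "initialBackoff", "0.010s",
                        "maxBackoff", "0.050s",
                        "backoffMultiplier", 5.0,
                        "retryableStatusCodes", List.of("UNAVAILABLE")))));
    }

    private DownstreamChannels() {}
}
```

In practice each instance would hold a small pool of such channels per downstream service (the 32 connections mentioned above) rather than a single channel.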
Service Mesh: Linkerd
Decision: Linkerd over Istio
From Part 1: We need <5ms gateway overhead, sub-10ms service-to-service latency.
Benchmarks:
- Linkerd P99: 5-10ms overhead
- Istio P99: 15-25ms overhead
- Academic validation: Istio added 166% latency with mTLS, Linkerd added 33%
Why Linkerd:
- Lower latency: 5-10ms vs Istio’s 15-25ms
- Lower resource usage: ~50MB memory per proxy vs Envoy’s ~150MB
- Rust-based proxy: linkerd2-proxy is lighter than Envoy (C++)
- gRPC-native: Zero-copy proxying for gRPC (our primary protocol)
Configuration:
Service Profile for the User Profile Service: Linkerd ServiceProfiles define per-route behavior for fine-grained traffic management:
- GetProfile route: 10ms timeout, non-retryable (profile lookups must be fast or fail)
- BatchGetProfiles route: 15ms timeout, retryable on 5xx errors with max 1 retry (batch operations tolerate single retry without cascading delays)
This per-route configuration ensures timeouts match Part 1’s latency budget while preventing retry storms during service degradation.
mTLS (Mutual TLS) Encryption:
- Automatic certificate rotation every 24 hours prevents long-lived certificate compromise
- Certificates issued by Linkerd’s built-in CA with trust-anchor certificate establishing root of trust
- Zero application code changes - mTLS handled transparently at proxy layer, services communicate over plaintext internally
Traffic Splitting for Canary Deployments: Linkerd’s SMI TrafficSplit API enables gradual rollouts by weight-based routing:
- 90% traffic → stable version (proven reliability)
- 10% traffic → canary version (testing new deployment)
- Monitor error rates, latency P99, and business metrics
- If healthy, increase canary weight to 100% over 2-4 hours
- If degraded, instant rollback by setting canary weight to 0%
This pattern (detailed in Part 4 Production Operations) reduces blast radius of defects while maintaining production velocity.
API Gateway: Envoy Gateway Decision
From Part 1’s latency budget, gateway operations (authentication, rate limiting, routing) must complete within 4-5ms to preserve 150ms SLO. Envoy Gateway achieves 2-4ms total overhead: JWT auth via ext_authz filter (1-2ms, cached 60s), rate limiting via Valkey token bucket (0.5ms atomic DECR), routing decisions (1-1.5ms). Production measurements: P50 2.8ms, P99 4.2ms.
Technology Comparison
| Gateway | Latency Overhead | Memory per Pod | Operational Complexity | Kubernetes-Native |
|---|---|---|---|---|
| Envoy Gateway | 2-4ms | 50-80MB | Low (Envoy config only) | Gateway API native |
| Kong | 10-15ms | 150-200MB | Medium (plugin ecosystem learning curve) | CRD-based |
| Traefik | 5-8ms | 100-120MB | Medium (label-based config, less flexible) | Gateway API support |
| NGINX Ingress | 3-6ms | 80-100MB | Medium (annotation-heavy, error-prone) | Annotation-based |
Kong rejected: 10-15ms latency (7-10% of budget), 150-200MB memory, different proxy tech from service mesh (Kong Lua + Istio Envoy = 20-30ms combined overhead). NGINX rejected: annotation-based config error-prone (nginx.ingress.kubernetes.io/rate-limit typo fails silently), no native gRPC support, external rate-limit sidecar complexity. Traefik rejected: label-based config insufficient for RTB’s sophisticated timeout/header transformation requirements.
Unified Proxy Stack with Linkerd Service Mesh
Platform handles two traffic patterns: north-south (external → cluster via Envoy Gateway) and east-west (internal service-to-service via Linkerd). Both layers use lightweight, gRPC-native proxies (Envoy at the gateway, linkerd2-proxy in the mesh) with a consistent operational model, avoiding double-proxying overhead. The alternative (Kong + Istio) requires learning two heavier proxy stacks, separate observability, and 20-30ms combined latency.
Traffic flow: External request → Envoy Gateway (TLS termination, JWT validation, rate limiting) → Linkerd sidecar (mTLS encryption, load balancing, retries) → Ad Server → internal calls via Linkerd (automatic mTLS, observability). Each service hop adds ~1ms Linkerd overhead; 3-4 hops = 3-4ms total, well within budget. Achieves zero-trust (every call authenticated/encrypted) without code changes.
Gateway API benefits: HTTPRoute enables per-DSP timeout policies and header transformations declaratively. ReferenceGrant provides namespace isolation for multi-tenant deployments. Native HTTP/2, gRPC, WebSocket support eliminates manual proxy_pass configuration for RTB bidstream.
Trade-off: Smaller plugin ecosystem vs Kong. Complex transformations (GraphQL→REST) implemented as dedicated microservices rather than gateway plugins, preserving low latency while allowing independent scaling.
Container Orchestration: Kubernetes
Why Kubernetes over Raw EC2/VMs
Kubernetes Provides:
- Declarative Configuration: Define desired state, Kubernetes reconciles
- Auto-Scaling: Horizontal Pod Autoscaler (HPA) scales based on metrics
- Self-Healing: Automatic pod restarts, node failure recovery
- Service Discovery: Built-in DNS, no external registry needed
- Rolling Updates: Zero-downtime deployments with health checks
- Multi-Region Federation: Cluster federation for global deployment
Why Not Raw EC2:
- Manual scaling: Auto-scaling groups lack app-aware logic
- No service discovery: Requires external registry (Consul, Eureka)
- Deployment complexity: Blue-green deploys require custom automation
- Resource utilization: VMs waste capacity, containers pack efficiently
Resource Efficiency Example:
- EC2: 300 instances × 8 vCPU × 50% avg utilization = 1,200 vCPUs utilized
- Kubernetes: 150 nodes × 16 vCPU × 80% avg utilization = 1,920 vCPUs utilized
- Gain: (1,920 - 1,200) / 1,200 = 60%
- Result: 60% more capacity from the same infrastructure via bin-packing
Kubernetes Architecture
Cluster Configuration:
- Node count: 150 nodes across 3 regions (50 nodes per region)
- Node type: c6i.4xlarge (16 vCPU, 32 GB RAM)
- Pod density: ~10-12 pods per node (avg)
- Total pods: ~1,500 pods across cluster
- 300 Ad Server Orchestrator instances
- 150 User Profile Service pods (50 per region)
- 150 ML Inference pods (50 per region)
- 150 RTB Gateway pods (50 per region)
- 90 Budget Service pods (30 per region)
- 90 Auction Service pods (30 per region)
- 90 Integrity Check pods (30 per region)
- 90 Redis/Valkey nodes (30 per region)
- 90 Kafka brokers (30 per region)
- 150 observability stack (Prometheus, Grafana, Tempo, Loki)
- 150 system pods (kube-system, ingress controllers, operators)
Namespaces:
- production: Live traffic (1M QPS)
- staging: Pre-production validation
- canary: Traffic shadowing and A/B tests
- monitoring: Prometheus, Grafana, Alertmanager
Auto-Scaling Strategy:
Horizontal Pod Autoscaler (HPA) monitors both CPU utilization (target: 70%) and custom metrics (requests per second per pod). Scaling triggers when pods exceed 5K QPS threshold. Scale-up happens aggressively (50% increase) with 60-second stabilization window, while scale-down is conservative (10% reduction) with 5-minute stabilization to avoid flapping. Minimum 200 pods ensures baseline capacity, maximum 400 pods caps burst handling.
Why containerd over Docker:
- Lightweight: Lower overhead, faster pod startup
- OCI-compliant: Standard container runtime interface
- Kubernetes-native: First-class support, no shim layer
Data Layer: CockroachDB Cluster
Decision: CockroachDB over PostgreSQL/Spanner/DynamoDB
From Part 1 and Part 3: Need strongly-consistent transactional database for billing ledger, multi-region active-active, 10-15ms latency.
Why CockroachDB:
- 2-3× cheaper than DynamoDB at 1M+ QPS (see cost breakdown below)
- Postgres-compatible - existing team expertise, tooling compatibility
- HLC timestamps for linearizable billing events (Part 3 requirement)
- Multi-region native - automatic replication, leader election
- No vendor lock-in (vs Spanner’s Google-only deployment)
Cost comparison (1M QPS, 8 billion writes/day):
- DynamoDB: 100% baseline (on-demand pricing per AWS published rates)
- CockroachDB (60 compute nodes): ~45% of DynamoDB cost
- Savings: ~55% infrastructure cost reduction
Cluster Topology
Day-1 Choice: CockroachDB Serverless
- Fully managed by Cockroach Labs
- Pay-per-use pricing (~40-50% of DynamoDB)
- Auto-scaling capacity (no manual node management)
- Same features as self-hosted (cross-region ACID, HLC, SQL)
Self-Hosted Configuration (if operational costs justify it):
- 60-80 nodes across 3 AWS regions (us-east-1, us-west-2, eu-west-1)
- 20-27 nodes per region (distributed across 3 availability zones)
- Replication factor: 5 (2 replicas in home region, 1 in each remote region)
- Node specs: c5.4xlarge (16 vCPU, 32GB RAM, 500GB NVMe SSD per node)
Why 60-80 nodes (self-hosted sizing): From benchmarks: CockroachDB achieves 400K QPS (99% reads) with 20 nodes, 1.2M QPS (write-heavy) with 200 nodes.
Our workload: ~70% reads, ~30% writes, 1M+ QPS total → 60-80 nodes provides headroom.
Sizing Strategy: Database is sized for sustained load (1M QPS baseline), while Ad Server instances are sized for peak capacity (1.5M QPS with 50% headroom). This is intentional: databases scale slowly (adding nodes requires rebalancing), while stateless Ad Servers scale instantly (spin up pods). During traffic bursts to 1.5M QPS, cache hit rates absorb most load (95% hits = only 75K additional DB queries), keeping database well within capacity.
Decision point: Evaluate self-hosted when infrastructure savings exceed operational costs. Break-even varies significantly: US-based SRE team (3-5 engineers) requires 20-30B req/day, while global/regional teams with existing database expertise may break even at 4-8B req/day. See Part 3’s database cost comparison for detailed break-even analysis with geographic and team structure scenarios.
Multi-Region Deployment:
Database Architecture: CockroachDB deployed with us-east-1 as primary region and us-west-2, eu-west-1 as secondary regions. The database is configured with SURVIVE REGION FAILURE semantics, requiring 5-way replication with a 2-1-1-1 replica distribution pattern (2 replicas in the primary region for fast quorum, 1 replica in each secondary region for disaster recovery).
Schema Design Decisions:
Billing Ledger Table uses several critical design patterns:
- UUID primary keys: Globally unique identifiers enable conflict-free writes across regions without coordination, essential for multi-region active-active pattern from Part 4
- Exact decimal amount storage: DECIMAL type for financial precision eliminates floating-point rounding errors that would violate Part 3's ≤1% accuracy requirement
- HLC timestamp column: Hybrid Logical Clock (combination of physical timestamp + logical counter) provides linearizable ordering across regions for audit trails. Critical for resolving event ordering when physical clocks drift (addressed in Part 4’s clock synchronization)
- Composite index: Campaign ID + event time enables efficient queries for billing reconciliation and dispute resolution without full table scans
- REGIONAL BY ROW locality: Each row stored in the region closest to access pattern (determined by user geography), reducing cross-region queries from 50-100ms to 1-2ms for common operations
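A minimal DDL sketch of these decisions, assuming a multi-region database (so LOCALITY REGIONAL BY ROW is available); the table and column names are illustrative, shown here as a Java migration constant to stay consistent with the rest of the stack.

```java
public final class BillingLedgerSchema {

    // Hypothetical schema capturing the design points above; not the platform's actual DDL.
    public static final String CREATE_TABLE = """
        CREATE TABLE IF NOT EXISTS billing_ledger (
            id           UUID          NOT NULL DEFAULT gen_random_uuid(),            -- conflict-free across regions
            campaign_id  UUID          NOT NULL,
            amount       DECIMAL(19,6) NOT NULL,                                      -- exact financial arithmetic
            event_time   TIMESTAMPTZ   NOT NULL,
            hlc_ts       DECIMAL       NOT NULL DEFAULT cluster_logical_timestamp(),  -- HLC for linearizable ordering
            PRIMARY KEY (id),
            INDEX billing_by_campaign (campaign_id, event_time)                       -- reconciliation queries
        ) LOCALITY REGIONAL BY ROW;
        """;

    private BillingLedgerSchema() {}
}
```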
Connection Pooling:
- Each Ad Server instance: 20 connections to CockroachDB cluster
- Total: 300 instances × 20 connections = 6,000 connections across 60 nodes = 100 connections/node
- CockroachDB limit: 5,000 connections/node - well within capacity
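A minimal HikariCP wiring consistent with the 20-connections-per-instance figure above; the JDBC URL, database name, and credentials are placeholders.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class CockroachPool {

    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        // CockroachDB speaks the Postgres wire protocol, so the standard pgJDBC driver applies.
        cfg.setJdbcUrl("jdbc:postgresql://cockroachdb.production.svc.cluster.local:26257/ads");
        cfg.setUsername("ad_server");
        cfg.setMaximumPoolSize(20);          // 20 connections per Ad Server instance
        cfg.setMinimumIdle(20);              // keep the pool warm; no handshake latency on the hot path
        cfg.setConnectionTimeout(250);       // fail fast (ms) instead of queueing behind a saturated pool
        cfg.setMaxLifetime(30 * 60 * 1000);  // recycle connections periodically (ms)
        return new HikariDataSource(cfg);
    }

    private CockroachPool() {}
}
```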
Latency breakdown:
- Intra-AZ read: 1-2ms (single replica query)
- Cross-AZ read (same region): 5-8ms (network latency)
- Cross-region read: 10-15ms (applies only to the rare queries that must leave the region)
From Part 1: L3 cache (CockroachDB) is the fallback, accessed only on L1/L2 misses (5-10% of requests). The 10-15ms latency applies to these rare cross-region misses.
Capacity Planning & Sizing Model
Instance Count Formulas
Core Sizing Principle:
$$\text{Instance Count} = \frac{\text{Target QPS} \times 1.5}{\text{QPS per Instance}}$$
Safety Factor = 1.5 accounts for: traffic bursts, regional failover (one region down → 2 remaining absorb 50% more load), and deployment headroom.
Ad Server Orchestrator (Critical Path):
$$N_{ads} = \frac{Q_{target} \times 1.5}{5,000}$$
Example at 1M QPS: $$N_{ads} = \frac{1,000,000 \times 1.5}{5,000} = 300 \text{ instances}$$
Why 5K QPS per instance? Measured from load testing:
- 32GB heap with ZGC → 250 MB/sec allocation rate
- 200 virtual threads (Java 21 Loom) → handles concurrent RTB calls
- gRPC connection pooling → 32 connections per downstream service
- At 5K QPS: avg CPU 60-70%, P99 latency ~140ms (within 150ms SLO)
User Profile Service (Cache-Heavy):
$$N_{profile} = \frac{Q_{target} \times 1.5}{10,000}$$
Why 10K QPS per instance? Read-heavy workload:
- L1 cache (60% hit) → sub-millisecond, no backend call
- L2 cache (25% hit) → 2-3ms Valkey read
- L3 database (15% miss) → 10-15ms CockroachDB read
- Lightweight service: 4GB RAM, minimal CPU
ML Inference Service (Compute-Heavy):
$$N_{ml} = \frac{Q_{target} \times 1.5}{1,000}$$
Why only 1K QPS per instance? GBDT inference is CPU-intensive:
- LightGBM with 200 trees, depth 6, 500+ features
- ~20ms P50, ~40ms P99 per prediction
- 16GB RAM for model + feature cache
- 4 vCPU fully utilized
RTB Gateway (I/O Bound):
$$N_{rtb} = \frac{Q_{target} \times 1.5}{10,000}$$
Why 10K QPS per instance? Network I/O bound, not CPU:
- HTTP/2 connection pooling to 50+ DSPs
- Async I/O (waiting for DSP responses, not computing)
- Timeout handling at 100ms
- Low memory footprint: 4GB RAM
Budget Service (Redis-Backed):
$$N_{budget} = \frac{Q_{checks} \times 1.5}{1,500}, \quad Q_{checks} \approx 0.7 \times Q_{target}$$
Why ~1,500 QPS per instance? The operation is lightweight in wall-clock terms but CPU-bound on gRPC parsing/serialization (see the throughput validation below):
- Single Redis EVAL call per request (atomic budget check)
- 3ms P50, 5ms P99 latency
- Minimal memory (2-4GB RAM); network latency dominates wall-clock time, ~1ms CPU per request
- Provisioned at 60-75% of the ~2,000 QPS theoretical maximum
CockroachDB Sizing (Benchmark-Driven):
From official benchmarks:
- Read-heavy (99% reads): 20 nodes → 400K QPS
- Write-heavy (50% writes): 200 nodes → 1.2M QPS
- Our workload (70% reads, 30% writes): Interpolate
$$N_{crdb} = 20 + \left(\frac{Q_{target} - 400K}{800K}\right) \times 180$$
Example at 1M QPS: $$N_{crdb} = 20 + \left(\frac{1M - 400K}{800K}\right) \times 180 = 20 + 135 = 155 \text{ nodes (theoretical)}$$
BUT: With 78-88% cache hit rate (from Part 3):
- Only 12-22% of traffic hits database
- Effective DB load: 120K-220K QPS
- Actual sizing: 60-80 nodes (provides 2-3× headroom over effective load)
Valkey/Redis Sizing:
From Valkey 8.1 benchmarks: 1M RPS per 16 vCPU instance
$$N_{cache} = \frac{Q_{target} \times \text{Cache Traffic \%}}{1M}$$
Example at 1M QPS:
- L2 cache handles: 25% of traffic (L1 misses)
- Rate limiting: ~1M checks/sec (token bucket)
- Budget pacing: ~1M atomic operations/sec
- Total cache load: ~1.25M RPS
- Instances needed: ~2 per region × 3 regions = 6 instances minimum, 30 for redundancy
Per-Service Resource Requirements
| Service | vCPU/Pod | RAM/Pod | Heap (JVM) | QPS/Pod | Pods @ 1M QPS | Total vCPU | Total RAM | Notes |
|---|---|---|---|---|---|---|---|---|
| Ad Server Orchestrator | 2 | 8GB | 32GB | 5,000 | 300 | 600 | 2,400GB | ZGC, virtual threads |
| User Profile Service | 1 | 4GB | - | 10,000 | 150 | 150 | 600GB | Cache-heavy, read-only |
| ML Inference Service | 4 | 16GB | - | 500-700 | 1,500-2,000 | 6,000-8,000 | 24,000-32,000GB | CPU GBDT (20ms inference, requires load testing) |
| RTB Gateway | 2 | 4GB | 16GB | 10,000 | 150 | 300 | 600GB | HTTP/2, async I/O |
| Budget Service | 2 | 4GB | 16GB | 1,200-1,500 | 600-800 | 1,200-1,600 | 2,400-3,200GB | Redis-backed (3ms async I/O, requires load testing) |
| Auction Service | 2 | 4GB | 16GB | 10,000-15,000 | 70-100 | 140-200 | 280-400GB | In-memory ranking (requires load testing) |
| Integrity Check | 2 | 4GB | 16GB | 2,000-3,000 | 300-500 | 600-1,000 | 1,200-2,000GB | Bloom filter + validation logic (requires load testing) |
| Feature Store (Tecton) | 2 | 8GB | - | 10,000 | 150 | 300 | 1,200GB | Managed service |
| CockroachDB Nodes | 16 | 32GB | - | ~17K | 60 | 960 | 1,920GB | c5.4xlarge instances |
| Valkey Cache Nodes | 8 | 64GB | - | ~42K | 30 | 240 | 1,920GB | r5.2xlarge instances |
| Kafka Brokers | 8 | 32GB | - | - | 30 | 240 | 960GB | Event streaming |
| Observability Stack | - | - | - | - | 150 | 300 | 600GB | Prometheus, Grafana, Loki |
| System Pods | - | - | - | - | 150 | 200 | 400GB | kube-system, controllers |
| TOTAL | - | - | - | - | ~4,000-4,500 | ~12,500-13,500 | ~43,000-46,000GB | 1M QPS baseline |
Key Insights:
- ML Inference dominates compute: 6,000-8,000 vCPUs (48-60% of total) for CPU-based GBDT prediction - see Part 2 for CPU vs GPU trade-off analysis
- Budget Service requires significant resources: 1,200-1,600 vCPUs (10-12% of total) despite lightweight operations - async I/O throughput limited by CPU for gRPC parsing/serialization
- Memory requirements: ~43-46TB total RAM across ~200-250 Kubernetes nodes (c6i.4xlarge: 16 vCPU, 32GB RAM or similar)
- Pod density: ~16-20 pods per node average (4,000-4,500 pods / 200-250 nodes)
- Database is ~7-8% of compute: 960 vCPUs (CockroachDB) vs 12,500-13,500 total - cache effectiveness reduces DB load
- All QPS estimates require validation: Throughput calculations based on theoretical CPU time per request - load testing mandatory to validate and optimize actual performance
Throughput Estimates: Validation with External Benchmarks
All QPS/Pod estimates are derived from external production benchmarks and theoretical analysis. Each service estimate is validated against published research and real-world case studies.
External Benchmark Baseline:
Industry benchmarks establish realistic throughput expectations for Java microservices:
- gRPC Java servers: ~5,000 QPS per core (tuned), 245K QPS on 8-core VM
- Spring Boot production: 1.2M requests/sec peak (optimized), 50K baseline, 31K simple reactive
- Redis throughput: 100K+ QPS typical, 1M+ QPS optimized single instance
- HTTP/2 gateways: Envoy ~18.5K RPS, Nginx ~15K RPS (benchmark), millions in production (Dropbox)
- Virtual threads: 120K+ req/sec with java-http library
Service-by-Service Validation:
1. Ad Server Orchestrator (5,000 QPS per pod, 2 vCPU)
External validation:
- Spring Boot with virtual threads: Designing systems for 5000+ QPS
- gRPC benchmark: 5,000 QPS per core is industry standard for tuned systems
Our calculation:
- Request orchestration: gRPC parsing (0.3ms) + service coordination (0.1ms) + response (0.1ms) = 0.5ms CPU
- With virtual threads handling I/O wait for downstream calls (parallel ML + RTB)
- Theoretical: 2 cores × 1000ms / 0.5ms = 4,000 QPS
- With virtual threads overlapping downstream I/O wait, and after accounting for JVM/GC overhead (ZGC 10-15%) and network variance: 5,000 QPS is realistic
Confidence: HIGH - aligns with published Spring Boot microservice benchmarks at 5K+ QPS
2. User Profile Service (10,000 QPS per pod, 1 vCPU)
External validation:
- Redis client throughput: 100K+ QPS achievable from single client with pipelining
- Cache-heavy read service with minimal CPU processing
Our calculation:
- Cache hit path (85% of requests): gRPC parsing (0.3ms) + local cache lookup (0.01ms) + response (0.1ms) = 0.41ms CPU
- Cache miss path (15%): + Redis network call (5ms I/O, 0.1ms CPU overhead) = 0.51ms CPU
- Weighted average: 0.85 × 0.41ms + 0.15 × 0.51ms = 0.42ms CPU per request
- Theoretical: 1000ms / 0.42ms = ~2,400 QPS per core
- With virtual threads allowing 4-5× concurrency for I/O-bound work: 10,000 QPS achievable
Confidence: MEDIUM-HIGH - depends on virtual thread efficiency for I/O wait. Actual validation needed.
3. ML Inference Service (500-700 QPS per pod, 4 vCPU)
External validation:
- GBDT CPU inference: 10-20ms documented in production case studies
- LightGBM/XGBoost: CPU-bound, no I/O wait
Our calculation:
- GBDT inference: 20ms CPU (from Part 1 latency budget)
- gRPC overhead: 0.5ms
- Total: 20.5ms CPU per request
- Theoretical: 4 cores × 1000ms / 20.5ms = 195 QPS
- With batching (2-4 requests per batch) and optimizations: 500-700 QPS realistic
Confidence: HIGH - based on documented GBDT inference latency. Conservative estimate assumes no aggressive batching.
4. RTB Gateway (10,000 QPS per pod, 2 vCPU)
External validation:
- HTTP/2 gateway benchmarks: Envoy ~18.5K RPS, production millions
- Async I/O workload (fan-out to 50 DSPs, collect responses)
Our calculation:
- Request parsing + fan-out coordination: 0.5ms CPU
- Network I/O to DSPs: 100ms wait (async, non-blocking)
- Response aggregation: 0.3ms CPU
- Total CPU: 0.8ms per request
- Theoretical: 2 cores × 1000ms / 0.8ms = 2,500 QPS
- With async I/O allowing high concurrency: 10,000 QPS realistic
Confidence: HIGH - aligns with HTTP/2 gateway benchmarks showing 15K-18K RPS per instance
5. Budget Service (1,200-1,500 QPS per pod, 2 vCPU)
External validation:
- gRPC with Redis: Industry baseline ~1,000-2,000 QPS per core for I/O-bound workloads
- Redis single operation latency: 3ms (from Part 1)
Our calculation:
- gRPC parsing: 0.3ms CPU
- Redis DECRBY call: 3ms total (2.5ms I/O wait + 0.5ms CPU for client)
- Response: 0.2ms CPU
- Total CPU: 1.0ms per request
- Theoretical max: 2 cores × 1000ms / 1.0ms = 2,000 QPS per pod
- Provisioned target: 1,200-1,500 QPS per pod (60-75% utilization)
Rationale: We run pods at 60-75% of theoretical capacity (not 100%) to handle:
- ZGC pause-less collection (consumes 10-15% CPU even with low pauses)
- Network variance and TCP retransmissions
- Pod restarts and rolling deployments
- Sudden traffic spikes within degradation buffer
Confidence: MEDIUM-HIGH - conservative estimate. May achieve higher with connection pooling optimizations.
6. Auction Service (10,000-15,000 QPS per pod, 2 vCPU)
External validation:
- In-memory ranking algorithms: sub-millisecond CPU time
- No I/O, pure CPU computation
Our calculation:
- eCPM ranking (200 candidates): 0.1ms CPU (array sort)
- Winner selection + quality scoring: 0.05ms CPU
- gRPC overhead: 0.3ms CPU
- Total: 0.45ms CPU per request
- Theoretical: 2 cores × 1000ms / 0.45ms = 4,400 QPS
- With optimizations (SIMD, cache locality): 10,000-15,000 QPS achievable
Confidence: MEDIUM - highly dependent on ranking algorithm complexity. Estimate assumes simple eCPM sort.
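For reference, the "simple eCPM sort" assumed here is on the order of the sketch below; the record fields and quality adjustment are assumptions, with the actual auction mechanics covered in Part 3.

```java
import java.util.Comparator;
import java.util.List;

public final class EcpmRanker {

    // Candidates arrive already normalized to eCPM (CPM/CPC/CPA converted upstream).
    public record Candidate(String adId, double ecpmMicros, double qualityScore) {}

    /** Returns the ~200 candidates ordered by quality-adjusted eCPM, highest first. */
    public static List<Candidate> rank(List<Candidate> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Candidate c) -> c.ecpmMicros() * c.qualityScore()).reversed())
                .toList();
    }

    private EcpmRanker() {}
}
```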
7. Integrity Check (2,000-3,000 QPS per pod, 2 vCPU)
External validation:
- Bloom filter operations: microsecond-level CPU time
- Hash computation + validation logic adds overhead
Our calculation:
- gRPC parsing: 0.3ms CPU
- Hash computation (xxHash): 0.1ms CPU
- Bloom filter check: 0.05ms CPU (bitwise operations)
- IP blacklist check: 0.1ms CPU
- Device fingerprint validation: 0.15ms CPU
- Response: 0.2ms CPU
- Total: 0.9ms CPU per request
- Theoretical: 2 cores × 1000ms / 0.9ms = 2,200 QPS
- With overhead: 2,000-3,000 QPS realistic
Confidence: MEDIUM - depends on validation logic complexity beyond Bloom filter.
8. Feature Store (10,000 QPS per pod, 2 vCPU) - Tecton Managed
External validation:
- Managed service (Tecton) - vendor optimized
- Feature serving optimized for low-latency lookups
Estimate based on:
- Tecton documentation: sub-10ms p99 latency target
- Similar to User Profile Service (cache-heavy reads)
- 10,000 QPS reasonable for managed service
Confidence: LOW - vendor-specific performance. Requires Tecton documentation validation.
Overprovisioning Strategy: Why We Don’t Run at 100% Capacity
All QPS estimates represent provisioned capacity at 60-75% utilization, not theoretical maximum throughput. This is a deliberate architectural decision from Part 1’s GC analysis.
Theoretical vs Provisioned Example (Budget Service):
- Theoretical max: 2,000 QPS per pod (2 vCPU × 1000ms / 1.0ms CPU per request)
- Provisioned target: 1,200-1,500 QPS per pod
- Utilization: 60-75% of theoretical max
Why we overprovision (25-40% extra capacity):
- ZGC overhead: Even pause-less GC consumes 10-15% CPU for concurrent marking and compaction
- Rolling deployments: During updates, 20-30% of pods are unavailable (graceful shutdown + warmup)
- Network variance: TCP retransmissions, health checks, DNS lookups add 5-10% overhead
- Traffic spikes: Sudden bursts within degradation thresholds require immediate capacity
- Pod failures: Individual pod crashes should not trigger cascading degradation
This is not waste - it’s insurance against SLO violations.
Running services at 95-100% CPU utilization means:
- Any GC pause causes request queuing and latency spikes
- Rolling deployments trigger circuit breakers
- Minor traffic increases violate SLOs
- No buffer for degradation scenarios
Trade-off: 25-40% more infrastructure cost → avoid catastrophic failures and SLO violations
Example calculation (Budget Service at 1M QPS, 70% traffic needs budget check):
- Total budget checks needed: 700K QPS
- Theoretical capacity: 700K / 2,000 QPS/pod = 350 pods minimum
- Actual provisioning: 600-800 pods (71-128% overprovisioning)
- This accounts for: ZGC (10-15%), deployments (20%), variance (10%), buffer (10-20%)
Critical Dependencies:
All estimates assume:
- Java 21+ with virtual threads enabled for I/O-bound services
- ZGC (low-pause garbage collector) configured properly
- Proper connection pooling (Redis, gRPC channels)
- Network latency within same availability zone (1-2ms)
- Target utilization 60-75% sustained, 85-90% peak
Load testing validates both theoretical max AND safe utilization thresholds to determine optimal provisioning ratios.
Multi-Scale Cost Projections
Infrastructure Cost Components:
- Compute (Kubernetes Nodes): Standard compute instances × node count
- Database (CockroachDB Self-Hosted): Compute instances × node count
- Cache (Valkey): Memory-optimized instances × node count
- Network Egress: Per-GB charges for RTB traffic to DSPs (50+ partners)
- Managed Services: Tecton (feature store), monitoring, storage, etc.
| Scale | QPS | Compute Nodes | DB Nodes | Cache Nodes | Relative Total Cost | Cost Scaling Factor |
|---|---|---|---|---|---|---|
| Small | 100K | 15 | 15 | 6 | 15% | 0.15× baseline |
| Medium | 500K | 75 | 40 | 15 | 55% | 0.5× baseline |
| Baseline | 1M | 150 | 60 | 30 | 100% | 1.0× (reference) |
| Large | 5M | 750 | 200 | 90 | 440% | 4.5× baseline |
Cost composition @ 1M QPS baseline: Compute 53%, Database 21%, Cache 8%, Network egress 7%, Managed services 11%.
Key insight: Cost scales sub-linearly - 5× QPS increase = 4.5× cost (not 5×) due to fixed infrastructure amortization.
Break-Even Analysis: CockroachDB vs DynamoDB
Pricing Model Comparison:
- DynamoDB: Linear per-request pricing (published AWS rates: $0.625/M writes, $0.125/M reads on-demand)
- CockroachDB: Fixed infrastructure cost (compute nodes) amortized across requests
1M QPS workload (8B requests/day, 70% reads, 30% writes):
- DynamoDB: 100% baseline (reference)
- CockroachDB: ~45% of DynamoDB cost (60 compute nodes)
- Savings: ~55% infrastructure cost
Break-Even Analysis by Scale:
| Scale | Daily DB Requests | DynamoDB Cost | CRDB Cost | Cost Ratio | Winner |
|---|---|---|---|---|---|
| 100K QPS | 864M | 100% | ~110% | 1.1× | DynamoDB (~10% cheaper) |
| 500K QPS | 4.3B | 100% | 50% | 0.5× | CRDB (2× cheaper) |
| 1M QPS | 8.6B | 100% | 45% | 0.45× | CRDB (~2.2× cheaper) |
| 5M QPS | 43B | 100% | 30% | 0.3× | CRDB (~3.3× cheaper) |
Why economics flip: DynamoDB’s linear per-request pricing becomes expensive at scale, while CockroachDB’s fixed infrastructure cost amortizes across growing traffic. Crossover at ~150-200K QPS where self-hosted operational complexity becomes justified by cost savings.
Capacity Planning Decision Flow
```mermaid
graph TD
    START[Start: Target QPS?] --> SCALE{QPS Level?}
    SCALE -->|< 100K QPS| SMALL[Small Scale Strategy]
    SCALE -->|100K - 1M QPS| MEDIUM[Medium Scale Strategy]
    SCALE -->|1M - 5M QPS| LARGE[Large Scale Strategy]
    SCALE -->|> 5M QPS| XLARGE[Extra Large Scale Strategy]

    SMALL --> SMALL_DB{Database Choice}
    SMALL_DB --> SMALL_CRDB[CRDB Serverless<br/>Managed, auto-scale<br/>~0.15× baseline]
    SMALL_DB --> SMALL_DYNAMO[DynamoDB<br/>Pay-per-use<br/>~0.15× baseline]

    MEDIUM --> MEDIUM_INFRA[Infrastructure Sizing]
    MEDIUM_INFRA --> MEDIUM_COMPUTE[Compute: 50-150 nodes<br/>DB: 30-60 CRDB nodes<br/>Cache: 10-30 Valkey]
    MEDIUM_INFRA --> MEDIUM_COST[Cost: ~0.5× baseline<br/>Break-even: CRDB wins]

    LARGE --> LARGE_INFRA[Production Scale]
    LARGE_INFRA --> LARGE_COMPUTE[Compute: 150-750 nodes<br/>DB: 60-200 CRDB nodes<br/>Cache: 30-90 Valkey]
    LARGE_INFRA --> LARGE_MULTI[Multi-Region Required<br/>3+ regions active-active<br/>Cost: 1-4× baseline]

    XLARGE --> XLARGE_INFRA[Hyper Scale]
    XLARGE_INFRA --> XLARGE_SHARD[Geographic Sharding<br/>Regional autonomy<br/>Cost: 4×+ baseline]
    XLARGE_INFRA --> XLARGE_OPT[Custom Optimizations<br/>ASICs for ML inference<br/>CDN for static content]

    SMALL_CRDB --> VALIDATE[Validate Requirements]
    SMALL_DYNAMO --> VALIDATE
    MEDIUM_COST --> VALIDATE
    LARGE_MULTI --> VALIDATE
    XLARGE_OPT --> VALIDATE

    VALIDATE --> CHECK_LATENCY{Meet 150ms<br/>P99 SLO?}
    CHECK_LATENCY -->|No| OPTIMIZE[Optimize:<br/>- Add cache capacity<br/>- Increase pod count<br/>- Tune GC settings]
    CHECK_LATENCY -->|Yes| CHECK_COST{Budget<br/>acceptable?}
    OPTIMIZE --> CHECK_LATENCY
    CHECK_COST -->|No| REDUCE[Cost Reduction:<br/>- Managed services<br/>- Right-size instances<br/>- Reserved capacity]
    CHECK_COST -->|Yes| DEPLOY[Deploy & Monitor]
    REDUCE --> CHECK_COST

    DEPLOY --> MONITOR[Continuous Monitoring]
    MONITOR --> ADJUST{Need to scale?}
    ADJUST -->|Yes| SCALE
    ADJUST -->|No| MONITOR

    style START fill:#e1f5ff
    style DEPLOY fill:#d4edda
    style VALIDATE fill:#fff3cd
    style OPTIMIZE fill:#f8d7da
    style REDUCE fill:#f8d7da
```
Critical Sizing Insights:
- ML Inference dominates: 6,000-8,000 vCPUs (48-60% of total) - explains why CPU-based GBDT was chosen over GPU (cost, operational simplicity)
- Cache reduces DB by 5-8×: 78-88% hit rate turns 1M QPS into 120-220K effective database load
- Cost crossover at 200K QPS: DynamoDB wins below 200K, self-hosted CRDB provides 2×+ savings above
- Cost scales sub-linearly: 5× QPS increase = 4.5× cost increase (fixed infrastructure amortizes)
Hardware Evolution Strategy: CPU-First Architecture
This section clarifies our long-term ML infrastructure evolution path and explains the CPU-only architecture decision.
Design Philosophy: Start Simple, Evolve Deliberately
We deliberately chose CPU-only infrastructure for ML inference despite GPU being the “standard” choice in ML serving. This decision trades some model complexity ceiling for significant operational and cost benefits.
Phase 1: Day 1 - CPU GBDT (Current)
Infrastructure:
- 1,500-2,000 CPU pods (4 vCPU, 16GB RAM each)
- Standard c6i.4xlarge instances (no GPU drivers, no CUDA)
- LightGBM/XGBoost models served via standard HTTP/gRPC
Performance:
- 10-20ms GBDT inference latency
- 500-700 QPS per pod
- Total capacity: 1M-1.4M QPS (1M baseline + 40% headroom)
Model characteristics:
- 100-150 trees, depth 6-8
- 200-500 features
- Model size: 50-150MB
- AUC target: 0.78-0.82
Advantages:
- Simple deployment (no GPU orchestration complexity)
- Fast iteration (standard Kubernetes HPA, no specialized hardware)
- Low cost (30-40% cheaper than GPU for GBDT workloads)
- Team velocity (engineers familiar with CPU deployment)
Limitations accepted:
- Cannot run large neural networks (yet)
- 10-20ms latency floor (vs 8-15ms on GPU)
- Lower throughput per pod (500-700 vs 1,000-1,500 QPS)
Phase 2: 6-12 Months - Two-Stage Ranking with Distilled DNN (Planned)
Infrastructure addition:
- Same CPU pods (no hardware changes!)
- Add ONNX Runtime with INT8 quantization support
- Deploy distilled DNN models alongside GBDT
Architecture:
- Stage 1 - GBDT Candidate Generation (5-10ms):
  - Existing CPU GBDT model
  - Reduce 10M ads → 200 top candidates
  - Unchanged from Phase 1
- Stage 2 - DNN Reranking (10-15ms):
  - Distilled neural network (60-100M parameters)
  - INT8 quantized, ONNX optimized
  - Scores only top-200 candidates (not all 10M)
  - Runs on same CPU infrastructure
Performance:
- Combined latency: 15-25ms (within 40ms budget)
- Expected AUC improvement: +1-2% (0.80-0.84 range)
- Revenue impact: +5-10% from better targeting
Requirements to unlock this phase:
- Build distillation pipeline (teacher-student training)
- INT8 post-training quantization
- ONNX Runtime integration
- Load testing to validate 10-15ms DNN latency on CPU
Model characteristics (DNN reranker):
- Architecture: DistilBERT-class or small transformer (60-100M params)
- Quantization: INT8 (4× size reduction, 25-50% latency improvement)
- Input: Top-200 candidates + user features
- Model size: 100-200MB (post-quantization)
Proven CPU DNN latency (external validation):
- DistilBERT p50 <10ms on CPU with ONNX quantization
- E5-base-v2 15ms on CPU (3.5× improvement via quantization)
- INT8 quantization achieves 20-80ms for larger models on Intel Xeon
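A sketch of what the Stage-2 reranker could look like on the existing CPU pods, assuming the ONNX Runtime Java API and a quantized model exported with a single "features" input and one score per candidate; the model path, tensor names, and output shape are assumptions.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import java.util.Map;

public final class DnnReranker implements AutoCloseable {

    private final OrtEnvironment env = OrtEnvironment.getEnvironment();
    private final OrtSession session;

    public DnnReranker(String modelPath) throws OrtException {
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        opts.setIntraOpNumThreads(4);  // match the 4 vCPU pod allocation
        this.session = env.createSession(modelPath, opts);
    }

    /** features: [numCandidates][featureDim], the top-200 GBDT candidates joined with user features. */
    public float[] score(float[][] features) throws OrtException {
        try (OnnxTensor input = OnnxTensor.createTensor(env, features);
             OrtSession.Result out = session.run(Map.of("features", input))) {
            float[][] scores = (float[][]) out.get(0).getValue();  // assumed shape [numCandidates][1]
            float[] flat = new float[scores.length];
            for (int i = 0; i < scores.length; i++) {
                flat[i] = scores[i][0];
            }
            return flat;
        }
    }

    @Override
    public void close() throws OrtException {
        session.close();
    }
}
```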
Phase 3: 18-24 Months - Decision Point (GPU Migration or Continue CPU)
At this phase, we evaluate whether CPU architecture has reached its limits:
Option 3A: Continue CPU evolution (if model quality sufficient)
Stick with CPU if:
- AUC 0.82-0.84 meets business goals
- Cost savings (30-40% vs GPU) outweigh marginal quality gains
- Operational simplicity valued over cutting-edge models
Next steps:
- Further model compression (pruning, distillation)
- Experiment with smaller model architectures (MobileNet-style)
- Optimize inference pipeline (batching, multi-threading)
Option 3B: Add GPU pool (if hitting CPU ceiling)
Migrate to hybrid CPU+GPU if:
- Need AUC >0.85 (requires larger transformers, >100M params)
- Research team wants to experiment with large pre-trained models (BERT-Large, etc.)
- Business justifies 30-40% infrastructure cost increase for quality gains
Migration path:
- Deploy small GPU pool (50-100 pods with T4/A10g GPUs)
- Run A/B test (GPU vs CPU DNN reranker)
- Gradually shift traffic if GPU shows ROI
- Estimated migration time: 3-6 months (GPU orchestration, model adaptation, load testing)
- Cost impact: +30-40% infrastructure cost (+15-20% total platform cost)
Trade-Off Analysis: What We Explicitly Accept
By choosing CPU-first architecture, we are deliberately accepting:
Advantages:
- Cost efficiency: 30-40% infrastructure cost reduction vs GPU for GBDT workloads at 1M QPS
- Faster time-to-market: CPU deployment expertise widely available
- Lower operational risk: Fewer components to fail (no GPU drivers, CUDA versions)
- Easier troubleshooting: Standard CPU profiling tools vs specialized GPU tools
- Portability: Runs on any cloud provider without GPU availability constraints
Trade-offs:
- Model size ceiling: Limited to ~100M parameter models (DistilBERT-class) in Phase 2
  - Cannot easily run BERT-Large (340M), GPT-style models (billions)
  - Impact: Potential 1-2% AUC gap vs unlimited model complexity
- Research flexibility: 2-4 month lag to productionize cutting-edge models
  - Must wait for distilled versions or conduct distillation internally
  - Cannot quickly experiment with latest research from arXiv
- Future migration cost: If we hit CPU ceiling, GPU migration takes 3-6 months
  - Need to build GPU orchestration from scratch
  - Re-architect model serving pipeline
  - Mitigation: Decision is reversible, just expensive to reverse
Why This Makes Sense for Our Use Case:
Our constraints favor CPU-first:
- Scale: 1M QPS scale where 30-40% cost reduction justifies operational effort
- Business: Ad platform ROI from 0.80→0.82 AUC is substantial (5-10% revenue)
- Timeline: 6-12 month deployment cadence allows careful evolution
- Team: Engineering-heavy team (vs research-heavy) values operational simplicity
When CPU-First Might NOT Make Sense:
Choose GPU from Day 1 if:
- Low scale (<100K QPS): Cost difference negligible, GPU premium worth flexibility
- Research-driven: Team wants to experiment with large models immediately
- High-margin business: Can afford 30-40% premium for marginal quality gains
- Existing GPU expertise: Team already has GPU ML infrastructure experience
Summary: Deliberate Architecture Constraints
Our CPU-first architecture is not a compromise—it’s a deliberate choice optimizing for cost, operational simplicity, and team velocity at 1M QPS scale. We accept model complexity constraints (100M param ceiling in Phase 2) in exchange for 30-40% infrastructure cost savings and faster iteration.
The evolution path (Phase 1 GBDT → Phase 2 two-stage CPU DNN → Phase 3 decision point) allows us to extract 80-90% of ML value without GPU complexity. If we hit the CPU ceiling in 18-24 months, we have a clear migration path to GPU—but we’ll have achieved significant cost savings and learned what model quality truly requires.
See Part 2 ML Architecture for detailed technical justification and external research validation.
Distributed Cache: Valkey (Redis Fork)
Decision: Valkey over Redis 7.x / Memcached
From Part 3: Need atomic operations (DECRBY for budget pacing), sub-ms latency, 1M+ QPS capacity.
Why Valkey over Redis:
- Licensing: BSD-3 (permissive) vs Redis SSPL (restrictive)
- Performance: Valkey 8.1 achieves 999.8K RPS with 0.8ms P99 latency (research-validated)
- Community: Linux Foundation backing, active development
- Compatibility: Drop-in replacement for Redis 7.2
Why Valkey over Memcached:
- Atomic operations: DECRBY, INCRBY for budget pacing (Memcached lacks atomics)
- Data structures: Lists, sets, sorted sets for complex caching
- Persistence: AOF/RDB for durability (Memcached is volatile-only)
Cluster Architecture
Configuration:
- 20 nodes across 3 AWS regions (primary: 12 nodes, secondary: 4+4 nodes)
- Node specs: r5.2xlarge (8 vCPU, 64GB RAM per node)
- Sharding: 16,384 hash slots, evenly distributed across 20 nodes (~819 slots/node)
- Replication: Each master has 1 replica (40 total nodes including replicas)
Why 20 nodes: From benchmarks: Valkey 8.1 achieves 1M RPS on a 16 vCPU instance. Our workload: 1M+ QPS across L2 cache + budget counters + rate limiting.
- L2 cache hit rate: 25% (from Part 3) → 250K QPS
- Budget operations: ~50K QPS (atomic DECRBY on every ad serve)
- Rate limiting: 1M QPS (token bucket checks)
- Total: ~1.3M operations/sec → 20 nodes provides 2× headroom
Cluster Configuration:
Memory Management: Valkey configured with 48GB heap allocation (out of 64GB total node memory), leaving 16GB for operating system page cache and kernel buffers. This ratio (75% application / 25% OS) optimizes for large working sets while preventing OOM conditions. Eviction policy uses allkeys-lru (least recently used) to automatically evict cold keys when memory pressure occurs, ensuring the cache remains operational under high load without manual intervention.
Durability Strategy: Append-Only File (AOF) persistence enabled with everysec fsync policy. This provides a middle ground between performance and durability:
- Writes acknowledged immediately (sub-ms latency)
- Fsync batches buffered writes to disk every 1 second
- Maximum data loss window: 1 second of writes in catastrophic failure
- Trade-off: Stronger than no persistence, faster than per-write fsync (which would add 5-10ms per operation)
Cluster Mode Configuration:
- Distributed hash slots (16,384 slots): Enable horizontal sharding across 20 nodes without manual key distribution
- Node timeout (5 seconds): Cluster detects failed nodes within 5 seconds and triggers automatic failover to replica
- Authentication required: Strong password authentication prevents unauthorized access, critical for protecting budget counters from manipulation
Network Binding: Configured to listen on all interfaces (0.0.0.0) with protected mode enabled, allowing inter-cluster communication while requiring authentication for external connections. Essential for Kubernetes pod-to-pod communication across availability zones.
Atomic Budget Operations (Lua Script):
From Part 3: Budget pacing uses atomic DECRBY to prevent overspend.
Atomic Check-and-Deduct Pattern: Budget validation requires a check-then-deduct operation that must execute atomically to prevent overspend. The pattern reads the current budget counter from Valkey, validates sufficient funds exist for the requested ad impression cost, and decrements the counter only if funds are available - all as a single atomic transaction.
Why Lua Scripting:
- Atomicity guarantee: Entire script executes as single Redis transaction without interleaving from other clients, eliminating race conditions where two Ad Server instances simultaneously check and deduct from the same campaign budget
- Server-side execution: Multi-step conditional logic (check balance → deduct if sufficient) executes within Valkey process, avoiding 3 round-trips (GET, check in application, DECRBY) that would add 2-3ms latency and introduce race windows
- Consistency under load: At 1M+ QPS with 300 Ad Server instances, network-based locking (SETNX) would create contention hotspots. Lua scripts provide lock-free atomicity with <0.1ms execution time
Script Execution Model: Pre-loaded into Valkey using SCRIPT LOAD, invoked by SHA-1 hash to avoid network overhead of sending script text on every request. Application code passes campaign key and deduction amount as parameters, receives binary success/failure response. This pattern achieves the ≤1% overspend guarantee from Part 3 by ensuring no concurrent modifications can occur between balance check and deduction.
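A minimal sketch of this pattern from the application side, assuming a Jedis client (JedisPooled shown for brevity; the sharded cluster would use a cluster-aware client) and hypothetical key names.

```java
import java.util.List;
import redis.clients.jedis.JedisPooled;

public final class BudgetGuard {

    // The whole script executes atomically inside Valkey: nothing can interleave
    // between the balance check and the DECRBY.
    private static final String CHECK_AND_DEDUCT = """
        local balance = tonumber(redis.call('GET', KEYS[1]) or '0')
        local cost = tonumber(ARGV[1])
        if balance >= cost then
            redis.call('DECRBY', KEYS[1], cost)
            return 1
        end
        return 0
        """;

    private final JedisPooled valkey;
    private final String scriptSha;

    public BudgetGuard(JedisPooled valkey) {
        this.valkey = valkey;
        this.scriptSha = valkey.scriptLoad(CHECK_AND_DEDUCT);  // load once, invoke by SHA-1 afterwards
    }

    /** Returns true if the campaign budget covered the cost and was deducted. */
    public boolean tryDeduct(String campaignId, long costMicros) {
        Object result = valkey.evalsha(
                scriptSha,
                List.of("campaign:{" + campaignId + "}:budget"),  // hash tag keeps related keys on one node
                List.of(Long.toString(costMicros)));
        return Long.valueOf(1L).equals(result);
    }
}
```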
Sharding Strategy:
- Hash slot calculation: CRC16(key) mod 16384
- Keys for the same campaign are co-located: campaign:{id}:budget and campaign:{id}:metadata use the same hash tag {id}
- Ensures atomic operations on related keys hit the same node
Immutable Audit Log: Technology Stack
Compliance Requirement and Technology Decision
From Part 3’s audit log architecture: CockroachDB operational ledger is mutable (allows UPDATE/DELETE for operational efficiency), violating SOX and tax compliance requirements. Regulators require immutable, cryptographically verifiable financial records with 7-year retention for audit trail integrity.
Solution: Kafka + ClickHouse Event Sourcing Pattern
Platform selected Kafka + ClickHouse over AWS QLDB based on four factors. First, proven industry pattern validated at scale (Netflix KV DAL, Uber metadata platform operate similar architectures at 1M+ QPS). Second, query performance advantage: ClickHouse columnar OLAP delivers sub-500ms audit queries compared to QLDB PartiQL requiring 2-5 seconds for equivalent aggregations over billions of rows. Third, operational familiarity: platform already operates both technologies (Kafka for event streaming, ClickHouse for analytics dashboards), reusing existing expertise reduces learning curve. Fourth, AWS deprecation signal: AWS documentation (2024) recommends migrating QLDB workloads to Aurora PostgreSQL, indicating reduced investment in ledger-specific database.
QLDB rejected due to vendor lock-in (AWS-only, no multi-cloud option), query language barrier (PartiQL requires finance team retraining vs standard SQL), and OLAP performance lag for analytical compliance workloads (tax reporting aggregations, multi-year dispute investigations).
Implementation and Performance Characteristics
ClickHouse consumes financial events from Kafka via Kafka Engine table, transforms via Materialized View into columnar MergeTree storage. Configuration optimized for audit access patterns: monthly partitioning by timestamp enables efficient pruning for annual tax queries, ordering key (campaignId, timestamp) co-locates campaign history for fast sequential scans, ZSTD compression achieves 65% reduction (200GB/day raw → 70GB/day compressed). System delivers 100K events/sec ingestion throughput with <5 second end-to-end lag (event published → queryable), sub-500ms query latency for most audit scenarios (campaign spend history, dispute investigation). Full configuration details in Part 3.
Resource Trade-Offs and Operational Impact
Additional Infrastructure Required:
Compliance architecture adds dedicated resources beyond operational systems. ClickHouse cluster: 8 nodes with 3× replication factor across availability zones, consuming approximately 24 compute instances total. Storage footprint: 180TB for 7-year compliance retention (70GB/day × 365 days × 7 years), representing 15-20% additional storage compared to operational database infrastructure baseline (CockroachDB + Valkey). Kafka brokers: 12 nodes reused from existing event streaming infrastructure (impression/click events already flow through same cluster), marginal incremental capacity required.
Ingestion and Query Resource Usage:
ClickHouse ingestion consumes CPU cycles for JSON parsing, columnar transformation, compression, and replication. At 100K events/sec, ingestion workload averages 30-40% CPU utilization per node during peak hours, leaving headroom for query workload. Query resource consumption varies by complexity: simple aggregations (monthly campaign spend) consume <1 CPU-second, complex multi-year tax reports consume 5-10 CPU-seconds. Daily reconciliation job (compares operational vs audit ledgers) runs during off-peak hours (2AM UTC), consuming ~5 minutes CPU time across cluster.
Operational Overhead:
Compliance infrastructure introduces ongoing operational burden. Monitoring: Kafka consumer lag alerts (detect ingestion delays >1 minute), ClickHouse query latency dashboards (ensure audit queries remain sub-second), storage growth tracking (project retention capacity needs). Retention policy enforcement: monthly automated job drops partitions >7 years old, archives to S3 cold storage, validates hash chain integrity. Daily reconciliation: automated Airflow job compares ledgers, alerts on discrepancies >0.01 per campaign, typically finds 0-3 mismatches out of 10,000+ campaigns requiring investigation. Incident response: estimated 2-4 hours/month for discrepancy investigation, schema evolution coordination between operational and audit systems.
Benefit Justifies Resource Cost:
Compliance infrastructure prevents regulatory violations (SOX audit failures, IRS tax disputes), enables advertiser billing dispute resolution with cryptographically verifiable records (hash-chained events prove tampering), and satisfies payment processor requirements (Visa/Mastercard mandate immutable transaction logs). Resource investment (24 ClickHouse nodes, 180TB storage, operational monitoring) eliminates legal/financial risk exposure from non-compliant mutable ledgers.
Fraud Detection: Multi-Tier Pattern-Based System
Architecture Overview
From Part 4’s fraud detection analysis: 10-30% of ad traffic is fraudulent (bots, click farms, invalid traffic). The multi-tier detection architecture catches fraud progressively with increasing sophistication:
Three-Tier Detection Strategy:
- L1 (Pre-RTB): Fast pattern matching blocks 20-30% of blatant bot traffic BEFORE expensive RTB fan-out
- L2 (Post-Auction): Behavioral analysis catches 50-60% of sophisticated bots using device fingerprinting
- L3 (Batch ML): Anomaly detection identifies 70-80% of advanced fraud patterns via 24-hour batch analysis
L1: Integrity Check Service (Go) - Real-Time Filtering
Technology Choice: Go over Java/Python
- Sub-millisecond latency: Go's compiled nature and lightweight runtime achieve <0.5ms P99 for Bloom filter lookups
- Minimal memory footprint: 50-100MB per instance vs 1-2GB for JVM-based services, enabling higher pod density
- Stateless design: Each instance loads 18MB Bloom filter into memory at startup, no external dependencies during request path
Implementation Architecture:
Bloom Filter for Known Malicious IPs:
- Capacity: 10 million IP addresses with 0.1% false positive rate
- Memory: 18MB in-process data structure (MurmurHash3 with 7 hash functions)
- Update frequency: Refreshed every 5 minutes from shared Redis key populated by L3 batch analysis
- Deployment: Runs as sidecar container alongside Ad Server pods (localhost communication eliminates network hop)
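The 18MB figure can be sanity-checked with the standard Bloom filter sizing formulas (n = 10 million IPs, target false-positive rate p = 0.001):

```latex
m = -\frac{n \ln p}{(\ln 2)^2}
  = -\frac{10^7 \cdot \ln(0.001)}{(\ln 2)^2}
  \approx 1.44 \times 10^8 \ \text{bits} \approx 18\ \text{MB}

k_{\mathrm{opt}} = \frac{m}{n}\ln 2 \approx 10,
\qquad
p_{\text{actual}}(k = 7) = \left(1 - e^{-kn/m}\right)^{k} \approx 0.13\%
```

Using 7 hash functions instead of the optimal ~10 trades a slightly higher realized false-positive rate (~0.13% vs the 0.1% target) for fewer hash computations per lookup.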
IP Reputation Cache (Redis-backed):
- Stores last-seen timestamps for IP addresses exhibiting suspicious patterns
- TTL: 24 hours (IPs age out automatically without manual cleanup)
- Lookup latency: <1ms via L2 Valkey cache
- Pattern: Rate-limited parallel lookup (don’t block request if Redis slow)
Device Fingerprinting (Basic):
- User-Agent parsing: Detect headless browsers (Puppeteer, Selenium indicators)
- Header validation: Missing or malformed required headers (Accept-Language, Referer)
- Execution time: <0.2ms via pre-compiled regex patterns
Latency Budget: 5ms allocated in Part 1, executes in 0.5-2ms (measured p95), leaving 3-4.5ms buffer.
Key Trade-Off: Accept a 0.1% false positive rate (blocking roughly 1,000 legitimate requests/second at 1M QPS) to stop 200,000-300,000 fraudulent requests/second from consuming RTB bandwidth. The ROI is compelling: a 5ms latency investment blocks 20-30% of traffic before it fans out to 50+ DSPs, eliminating the corresponding bid-request egress cost.
L2: Behavioral Analysis Service - Post-Auction Pattern Detection
Architecture: Asynchronous processing pipeline (NOT in request critical path)
Trigger: Ad Server publishes click/impression events to Kafka after serving response to user. Fraud Analysis Service consumes events in real-time with <1s lag.
Detection Patterns:
Click-Through Rate Anomalies:
- Calculate per-campaign CTR over 1-hour sliding windows
- Flag campaigns with CTR >5× platform median (potential click fraud)
- Cross-reference with device fingerprint diversity (legitimate traffic shows device variety)
Velocity Checks:
- Track impressions-per-IP over 5-minute windows
- Threshold: >100 impressions/5min from single IP triggers investigation
- Combines with user-agent analysis: Same UA + High velocity = Strong fraud signal
Geographic Impossibility:
- Detect user appearing in multiple distant locations within short timeframe
- Example: Ad impression in New York at 10:00 AM, London at 10:05 AM = Physically impossible
- Implementation: Redis geohash proximity check (<3ms)
Processing Architecture:
- Flink streaming job: Consumes Kafka events, performs stateful aggregations (sliding windows); see the sketch below
- State backend: RocksDB for incremental checkpointing (recovery within 30s of failure)
- Output: Suspected fraud events written to separate Kafka topic for L3 analysis + immediate blocking (IP added to Redis reputation cache)
Latency: Fully asynchronous, 5-15ms average processing time doesn’t impact request latency
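As a sketch of the velocity check and windowed aggregation described above, the following fragment (Flink 1.x Java DataStream API) keys impressions by IP and counts them over sliding 5-minute windows; counts above the threshold would be routed to the suspected-fraud topic. The ImpressionEvent type and the source/sink wiring are placeholders.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Velocity-check sketch: ImpressionEvent and the Kafka source/sink are placeholders.
public class VelocityCheckJob {

    public static class ImpressionEvent {
        public String ip;
        public long timestampMillis;
        public ImpressionEvent() {}
        public ImpressionEvent(String ip, long timestampMillis) {
            this.ip = ip;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this stream is consumed from the impression_raw Kafka topic.
        DataStream<ImpressionEvent> impressions = env.fromElements(
                new ImpressionEvent("203.0.113.7", System.currentTimeMillis()));

        impressions
                .map(e -> Tuple2.of(e.ip, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                // 5-minute window sliding every minute, per the velocity rule above.
                .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1)))
                .sum(1)
                // >100 impressions from a single IP in 5 minutes triggers investigation.
                .filter(t -> t.f1 > 100L)
                .print();  // replace with a sink to the suspected-fraud Kafka topic

        env.execute("velocity-check");
    }
}
```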
L3: ML-Based Anomaly Detection - Batch Gradient Boosted Decision Trees
Model Architecture: GBDT (same as CTR prediction, different training data)
- Trees: ~200 trees, depth 6-8
- Features: ~40 features across behavioral, temporal, and device dimensions
- Training frequency: Daily batch retraining on previous 7 days of labeled data
- Deployment: Model updated via blue-green deployment (shadow scoring validates new model before promotion)
Feature Categories:
Behavioral Features (~20):
- Impressions/click ratio per user/device/IP
- Session duration distribution
- Navigation patterns (direct vs organic)
- Ad interaction timing (clicking too fast suggests automation)
Temporal Features (~10):
- Hour-of-day distribution (bots often show flat 24-hour activity)
- Day-of-week patterns
- Burst detection (sudden spike in activity)
Device Features (~10):
- Screen resolution distribution
- Browser/OS combinations
- JavaScript execution capabilities
- Touch vs mouse interaction patterns (mobile vs desktop)
Scoring Pipeline:
- Batch processing: Spark job scores all previous day’s traffic overnight
- Output: Fraud score 0.0-1.0 for each impression/click
- Threshold: Score >0.8 triggers retroactive campaign billing adjustment + IP blacklist update
Integration with L1: High-confidence fraud IPs (score >0.9) added to Bloom filter for future real-time blocking.
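A minimal sketch of that feedback path, assuming a Jedis client and an illustrative key name for the shared set the L1 sidecars read on their 5-minute refresh:

```java
import redis.clients.jedis.Jedis;
import java.util.List;

// Sketch: publish L3 batch findings for L1 consumption. The key name and FraudScore
// shape are illustrative; L1 sidecars rebuild their Bloom filter from this set.
public class FraudFeedbackPublisher {

    public record FraudScore(String ip, double score) {}

    private static final String BLOCKED_IPS_KEY = "fraud:blocked_ips";  // assumed key name

    public static void publish(List<FraudScore> nightlyScores) {
        try (Jedis jedis = new Jedis("valkey.internal", 6379)) {
            for (FraudScore s : nightlyScores) {
                if (s.score() > 0.9) {  // high-confidence threshold from the text
                    jedis.sadd(BLOCKED_IPS_KEY, s.ip());
                }
            }
        }
    }
}
```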
Multi-Tier Integration Pattern
Progressive Filtering Flow:
- L1 blocks 20-30% of obvious bots at 0.5-2ms latency (prevents RTB calls, massive bandwidth savings)
- Remaining 70-80% traffic proceeds through normal auction
- L2 analyzes 100% of served impressions asynchronously within 1s, catches additional 20-30% (cumulative 40-50%)
- L3 reviews 100% of previous day’s traffic in batch, identifies remaining 20-30% (cumulative 70-80% total fraud detection)
Feedback Loop: L3 discoveries feed back into L1 Bloom filter and L2 Redis reputation cache, continuously improving real-time blocking accuracy.
Operational Metrics:
- False positive rate: <2% (measured via advertiser complaints per 1000 blocks)
- Detection latency: L1 immediate, L2 within 5 seconds, L3 within 24 hours
- Cost savings: Blocking 20-30% traffic before RTB prevents ~64PB/month of egress to DSPs
- Revenue protection: Prevents $X in fraudulent spend monthly, preserving advertiser trust
This multi-tier approach balances latency (L1 ultra-fast), accuracy (L3 high-precision ML), and operational complexity (L2 provides middle ground for evolving threats).
Feature Store: Tecton Integration Architecture
Technology Decision: Tecton over Self-Hosted Feast
From Part 2’s ML Inference Pipeline: Feature store must serve real-time, batch, and streaming features with <10ms P99 latency.
Why Tecton (Managed) over Feast (Self-Hosted):
- Cost efficiency: 5-8× cheaper than building custom solution when accounting for engineering time (estimated 2-3 FTEs for Feast self-hosting vs $X/month for Tecton managed)
- Operational complexity: Managed service eliminates need for dedicated team to maintain Spark clusters, Kafka consumers, Redis deployment, monitoring infrastructure
- Feature freshness guarantees: Built-in SLA monitoring for feature staleness, automatic backfilling for late-arriving data
- Native multi-region support: Cross-region replication handled by Tecton, critical for Part 4’s active-active deployment
Three-Tier Feature Freshness Model
From Part 2: Features categorized by freshness requirements.
Tier 1: Batch Features (Daily Refresh):
- Examples: User demographics, device type, historical campaign performance
- Source: S3 / Snowflake (data warehouse exports)
- Processing: Spark batch jobs running on schedule (overnight)
- Storage: Tecton Offline Store (Parquet files in S3, indexed for fast retrieval)
- Latency: Not real-time, but pre-computed and cached in Tecton Online Store at serving time
Tier 2: Streaming Features (1-Hour Windows):
- Examples: Last 7-day CTR per user-campaign pair, hourly impression count per advertiser
- Source: Kafka topics (impression_events, click_events)
- Processing: Flink streaming jobs perform windowed aggregations (tumbling/sliding windows)
- Update frequency: Materializes every 1 hour (trade-off: freshness vs compute cost)
- Storage: Written to Kafka → Consumed by Tecton Rift → Materialized to Tecton Online Store (Redis)
Tier 3: Real-Time Features (Sub-Second):
- Examples: Session duration (time since first impression), last-seen timestamp, request context (time-of-day, device orientation)
- Source: Generated during request or from immediate cache lookup
- Processing: Computed inline during Ad Server request handling or via Tecton Rift real-time transformations
- Storage: Ephemeral (session-scoped) or cached in Redis with short TTL (60s)
Flink → Kafka → Tecton Integration Pipeline
Architecture Flow:
1. Event Ingestion (Flink Source):
- Flink consumes raw impression/click events from primary Kafka topics (impression_raw, click_raw)
- Parallelism: 32 task slots across 8 worker nodes (sufficient for 1M+ events/second)
- Checkpointing: RocksDB state backend with 60-second checkpoint intervals (balance between recovery time and performance)
2. Stream Processing (Flink Transformations):
- Deduplication: Stateful deduplication using Flink keyed state (window size: 5 minutes) removes duplicate impression events from retries
- Enrichment: Left-join with user profile dimension table (cached in Flink state) adds demographics without external lookup latency
- Aggregation: Tumbling windows (1-hour) compute CTR, impression counts, spend totals per user-campaign pair
- Output: Enriched feature events written to dedicated Kafka topics (features_hourly_agg, features_user_context)
3. Feature Materialization (Tecton Rift Streaming Engine):
- Rift consumes feature events from Kafka topics
- Transformation: Applies Tecton-defined feature transformations (e.g., ratio calculations, Z-score normalization)
- Materialization: Writes computed features to Tecton Online Store (Redis cluster managed by Tecton)
- SLA: 99.9% of features materialized within 2 minutes of event occurrence
4. Feature Serving (Tecton Online Store):
- Storage: Redis cluster (separate from application Valkey cluster to isolate feature serving from budget operations)
- Read pattern: Ad Server calls Tecton SDK during ML inference phase, retrieves feature vector for user-campaign pair
- Latency: <10ms P99 (within Part 1's 10ms feature store allocation)
- Cache hit rate: >95% due to pre-materialized features (miss = fallback to stale features or default values)
Feature Versioning and Schema Evolution
Problem: ML model expects specific feature schema (e.g., 150 features). Adding/removing features breaks model inference.
Solution: Feature Versioning:
- Each feature set has semantic version (e.g., v1, v2)
- ML model deployment specifies required feature set version
- Tecton serves features for specified version, handling schema evolution transparently
- Migration pattern: Deploy new model version alongside old (canary deployment), both versions served simultaneously during transition period
Schema change example: Adding last_30_day_CTR feature to feature set:
- Define new feature in Tecton (v2 feature set)
- Backfill historical values for existing users (batch Spark job)
- Update streaming pipeline to compute new feature going forward
- Train new model version with v2 feature set
- Deploy new model via canary (10% traffic), validate improvement
- Promote to 100%, deprecate v1 feature set after 30-day sunset period
Operational Considerations
Cost Trade-Off: Managed Tecton service costs vary based on feature volume and request rate. At 1M+ QPS scale with 100-500 features per request, typical costs are comparable to 1-2× senior engineer baseline salary (high-cost region). This eliminates:
- 2-3 FTEs for Feast self-hosting (1-3.5× baseline depending on location)
- Infrastructure costs for self-managed Spark cluster (EMR), Redis cluster, Kafka consumers (~0.5× baseline)
- Operational burden of 24/7 on-call for feature store incidents (priceless)
Net economics favor managed solution at this scale, especially when factoring in opportunity cost of engineering focus.
Latency Budget Validation: Feature Store allocated 10ms in Part 1. Measured P50=3ms, P99=8ms, P99.9=12ms (occasional spikes). Within budget with 2ms buffer at P99.
Failure Mode: Feature Store Unavailable:
- Fallback strategy: Ad Server caches last-known feature vectors in local Caffeine cache (L1)
- TTL: 60 seconds (balance between staleness and memory consumption)
- Impact: CTR prediction accuracy degrades ~5-10% with stale features, but requests continue serving
- Recovery: Automatic once Tecton Online Store recovers, features refresh on next cache miss
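A minimal sketch of that fallback, using the Caffeine API directly; FeatureStoreClient stands in for whatever wrapper the Ad Server uses to call the Tecton online store:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import java.util.Map;
import java.util.Optional;

// Stale-feature fallback sketch. FeatureStoreClient is a placeholder interface.
public class FeatureFallbackCache {

    public interface FeatureStoreClient {
        Map<String, Double> getFeatures(String userId, String campaignId) throws Exception;
    }

    private final FeatureStoreClient client;
    private final Cache<String, Map<String, Double>> lastKnown = Caffeine.newBuilder()
            .maximumSize(1_000_000)                    // bounded to protect JVM heap
            .expireAfterWrite(Duration.ofSeconds(60))  // 60s staleness budget from the text
            .build();

    public FeatureFallbackCache(FeatureStoreClient client) {
        this.client = client;
    }

    public Optional<Map<String, Double>> features(String userId, String campaignId) {
        String key = userId + ":" + campaignId;
        try {
            Map<String, Double> fresh = client.getFeatures(userId, campaignId);
            lastKnown.put(key, fresh);                 // refresh the fallback copy
            return Optional.of(fresh);
        } catch (Exception storeUnavailable) {
            // Feature store down or slow: serve the last-known (possibly stale) vector,
            // or empty so the caller falls back to default feature values.
            return Optional.ofNullable(lastKnown.getIfPresent(key));
        }
    }
}
```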
This architecture achieves the Part 2 requirement of serving diverse feature types (batch/stream/real-time) with <10ms P99 latency while minimizing operational complexity through managed service adoption.
Schema Evolution: Zero-Downtime Data Migration Strategy
The Challenge
From Part 4’s Schema Evolution requirements: All schema changes must preserve 99.9% availability (no planned downtime) while serving 1M+ QPS.
Scenario: After 18 months in production, product team requires adding user preference fields to profile table (4TB data, 60 CockroachDB nodes). Traditional approach (take system offline, run ALTER TABLE, restart) would violate availability SLO and consume precious error budget (43 minutes/month).
CockroachDB Online DDL Capabilities
Simple Schema Changes (Non-Blocking):
- ADD COLUMN with default value: CockroachDB executes asynchronously using background schema change job without blocking reads/writes
- CREATE INDEX CONCURRENTLY: Index built incrementally without exclusive table locks, queries continue using existing indexes during build
- DROP COLUMN (soft delete): Column marked invisible immediately, physical deletion happens asynchronously via background garbage collection
Why CockroachDB vs PostgreSQL for online DDL:
- No table-level locks: PostgreSQL’s ALTER TABLE acquires ACCESS EXCLUSIVE lock (blocks all operations), CockroachDB uses schema change jobs with MVCC
- Automatic rollback safety: Schema change failures automatically rollback without manual intervention
- Multi-version support: Old and new schema versions coexist during transition (critical for rolling deployments)
Dual-Write Pattern for Complex Migrations
When Online DDL Insufficient: Restructuring table partitioning (e.g., sharding user_profiles by region) or changing primary key requires dual-write approach.
Five-Phase Migration Strategy:
Phase 1: Deploy Dual-Read Code (Week 1)
- Application code updated to read from both old_table and new_table (tries new first, falls back to old)
- Shadow traffic validation: 1% of read traffic uses new_table, compares results with old_table for data consistency verification
- Deployment: Kubernetes rolling update with PodDisruptionBudget (max 10% pods updating simultaneously)
Phase 2: Enable Dual-Write (Week 2)
- All write operations execute against BOTH old_table and new_table atomically (within transaction boundary; see the dual-write sketch after Phase 5)
- Consistency guarantee: Two-phase commit ensures both writes succeed or both rollback
- Performance impact: Write latency increases ~2-3ms due to double-write overhead (acceptable temporary trade-off)
Phase 3: Backfill Historical Data (Weeks 3-4)
- Background batch job copies existing data from old_table → new_table
- Rate limiting: Throttle backfill to 10K rows/sec to avoid overwhelming database (balance: completion time vs production impact)
- Verification: Checksums validate data integrity row-by-row, mismatches trigger alerts
Phase 4: Cutover Reads to New Table (Week 5)
- Gradually shift read traffic: 1% → 10% → 50% → 100% over 1 week
- Monitor error rates, latency P99, data staleness metrics at each increment
- Rollback trigger: If error rate increases by >0.5%, instant rollback to old_table by reverting the feature flag
Phase 5: Drop Old Table (Week 6-8)
- After 2 weeks of new_table serving 100% traffic with zero issues, remove old_table
- Keep old_table in cold storage (S3 export) for 30 days as disaster recovery safety net
- Remove dual-write code, simplify application logic
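Referenced from Phase 2 above, here is a minimal dual-write sketch assuming Spring's JdbcTemplate and a single CockroachDB transaction (both tables live in the same database, so one transaction boundary covers both statements); table and column names are illustrative:

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Transactional;

// Phase 2 dual-write sketch: table/column names are illustrative, not the real schema.
@Repository
public class UserProfileDualWriter {

    private final JdbcTemplate jdbc;

    public UserProfileDualWriter(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    /**
     * Writes go to BOTH tables inside one transaction: if either statement fails,
     * the whole transaction rolls back and the tables stay consistent.
     */
    @Transactional
    public void upsertPreferences(String userId, String preferencesJson) {
        jdbc.update(
            "UPSERT INTO user_profiles_old (user_id, preferences) VALUES (?, ?)",
            userId, preferencesJson);
        jdbc.update(
            "UPSERT INTO user_profiles_new (user_id, preferences) VALUES (?, ?)",
            userId, preferencesJson);
    }
}
```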
Shadow Traffic Validation for Financial Systems
Why Shadow Traffic Critical: Budget operations and billing ledger changes require higher confidence than typical schema migrations. Billing errors destroy advertiser trust.
Implementation:
- Shadow write: Prod traffic writes to new schema (new_billing_ledger_v2) in parallel with primary schema (billing_ledger_v1)
- Non-blocking: Shadow write failures are logged but don't fail the primary request (sketched after this list)
- Duration: 2-3 weeks of continuous shadow traffic (captures weekly, weekend, monthly billing patterns)
- Validation metrics:
- Row count delta (should be <0.01%)
- Billing amount delta (should be <$0.01 per row)
- Query latency comparison (new schema should be ±10% of old)
- Confidence threshold: 99.99% consistency over 3 weeks → proceed with cutover
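The non-blocking property is the important part: the shadow write runs off the request path and its failures are only logged. A minimal sketch, with LedgerWriter standing in for the real v1/v2 data access code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.logging.Level;
import java.util.logging.Logger;

// Shadow-write sketch: LedgerWriter and BillingEntry are placeholders for the real
// billing_ledger_v1 / new_billing_ledger_v2 data access code.
public class ShadowBillingWriter {

    public interface LedgerWriter { void write(BillingEntry entry) throws Exception; }
    public record BillingEntry(String campaignId, long amountMicros) {}

    private static final Logger LOG = Logger.getLogger(ShadowBillingWriter.class.getName());

    private final LedgerWriter primaryV1;
    private final LedgerWriter shadowV2;
    private final ExecutorService shadowPool = Executors.newFixedThreadPool(4);

    public ShadowBillingWriter(LedgerWriter primaryV1, LedgerWriter shadowV2) {
        this.primaryV1 = primaryV1;
        this.shadowV2 = shadowV2;
    }

    public void record(BillingEntry entry) throws Exception {
        primaryV1.write(entry);  // primary path: failures here DO fail the request

        // Shadow path: fire-and-forget, failures are logged for the validation report only.
        CompletableFuture.runAsync(() -> {
            try {
                shadowV2.write(entry);
            } catch (Exception e) {
                LOG.log(Level.WARNING, "shadow write to v2 ledger failed", e);
            }
        }, shadowPool);
    }
}
```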
Gradual Rollout for Financial Operations:
- Week 1: 1% of billing queries use new schema (low-risk test)
- Week 2-3: 10% → Monitor for weekly billing reconciliation accuracy
- Month 2-5: 50% → Validate monthly invoicing correctness across both schemas
- Month 6: 100% → Full migration complete after 5-month progressive ramp
Trade-Off: 5-6 month timeline (vs 1-week aggressive migration) dramatically reduces risk of catastrophic billing errors that could cost millions in advertiser disputes and platform reputation damage.
Operational Safeguards
Pre-Migration Checklist:
- Full database backup completed and verified (restore test successful)
- Rollback plan documented and rehearsed in staging environment
- Monitoring dashboards updated with migration-specific metrics
- On-call rotation briefed on migration timeline and rollback procedures
- Feature flags configured for instant traffic shifting without deployment
Post-Migration Cleanup:
- Remove old table after 30-day sunset period
- Archive schema migration documentation for future reference
- Conduct retrospective: what went well, what would we change next time
- Update migration runbook based on lessons learned
This approach achieves Part 4’s zero-downtime requirement while preserving 43 minutes/month error budget for unplanned failures, not planned schema changes.
Final System Architecture
Architecture presented using C4 model approach: System Context → Container views. Each diagram focuses on specific architectural concern for clarity.
Level 1: System Context Diagram
Shows the ads platform and its external dependencies at highest abstraction level.
graph TB
CLIENT[Mobile/Web Clients
1M+ users]
ADVERTISERS[Advertisers
Campaign creators
Budget managers]
PLATFORM[Real-Time Ads Platform
1M QPS, 150ms P99 SLO]
DSP[DSP Partners
50+ external bidders
OpenRTB 2.5/3.0]
STORAGE[Cloud Storage
S3 Data Lake
7-year retention]
CLIENT -->|Ad requests| PLATFORM
PLATFORM -->|Ad responses| CLIENT
ADVERTISERS -->|Create campaigns
Fund budgets| PLATFORM
PLATFORM -->|Reports, analytics| ADVERTISERS
PLATFORM <-->|Bid requests/responses
100ms timeout| DSP
PLATFORM -->|Events, audit logs| STORAGE
style PLATFORM fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
style CLIENT fill:#fff3e0,stroke:#f57c00
style ADVERTISERS fill:#e1bee7,stroke:#8e24aa
style DSP fill:#f3e5f5,stroke:#7b1fa2
style STORAGE fill:#e8f5e9,stroke:#388e3c
Key External Dependencies:
- Clients: Mobile apps, web browsers requesting ads (1M+ concurrent users)
- Advertisers: Create campaigns, fund budgets, receive performance reports
- DSP Partners: External demand-side platforms bidding via OpenRTB protocol (50+ integrations)
- Cloud Storage: S3 for data lake, analytics, and compliance archival (7-year retention)
Level 2a: Core Request Flow (Container Diagram)
Real-time ad serving path from client request to response. Shows critical path components achieving 150ms P99 SLO.
graph LR
CLIENT[Client]
subgraph EDGE["Edge Layer (15ms)"]
CDN[CloudFront CDN
5ms]
LB[Route53 GeoDNS
Multi-region
5ms]
GW[Envoy Gateway
Auth + Rate Limit
5ms]
end
subgraph SERVICES["Core Services (115ms)"]
AS[Ad Server
Orchestrator
Java 21 + ZGC]
subgraph PARALLEL["Parallel Execution"]
direction TB
ML_PATH[ML Path 65ms:
Profile → Features → Inference]
RTB_PATH[RTB Path 100ms:
DSP Fanout → Bids]
end
AUCTION[Unified Auction
Budget Check
Winner Selection
11ms]
end
subgraph DATA["Data Layer"]
CACHE[(Valkey Cache
L2: 2ms)]
DB[(CockroachDB
L3: 10-15ms)]
FEATURES[(Tecton
Features: 10ms)]
end
CLIENT -->|Request| CDN
CDN --> LB
LB --> GW
GW --> AS
AS --> ML_PATH
AS --> RTB_PATH
ML_PATH --> AUCTION
RTB_PATH --> AUCTION
ML_PATH -.-> CACHE
ML_PATH -.-> DB
ML_PATH -.-> FEATURES
RTB_PATH <-.->|Bid requests/
responses| DSP[50+ DSPs]
AUCTION -.-> CACHE
AUCTION -.-> DB
AUCTION --> GW
GW --> LB
LB --> CDN
CDN -->|Response| CLIENT
style AS fill:#9f9,stroke:#2e7d32,stroke-width:2px
style PARALLEL fill:#fff3e0,stroke:#f57c00
style AUCTION fill:#ffccbc,stroke:#d84315
Critical Path: Client → Edge (15ms) → Profile+Features (20ms) → Parallel[ML 65ms | RTB 100ms] → Auction+Budget (11ms) = 146ms P99
Detailed flow: See Part 1’s latency budget for component-by-component breakdown.
Level 2b: Data & Compliance Layer (Container Diagram)
Dual-ledger architecture separating operational (mutable) from compliance (immutable) data stores.
graph TB
subgraph OPERATIONAL["Operational Systems"]
BUDGET[Budget Service
3ms atomic ops]
BILLING[Billing Service
Charges/Refunds]
end
subgraph CACHE["Cache & Database"]
L2[L2: Valkey
Distributed cache
2ms, atomic ops]
L3[L3: CockroachDB
Operational ledger
10-15ms, mutable]
end
subgraph COMPLIANCE["Compliance & Audit"]
KAFKA[Kafka
Financial Events
30-day buffer]
CH[(ClickHouse
Immutable Audit Log
7-year retention
180TB)]
RECON[Daily Reconciliation
Airflow 2AM UTC
Compare ledgers]
end
BUDGET --> L2
BUDGET --> L3
BUDGET -->|Async publish| KAFKA
BILLING --> L3
BILLING -->|Async publish| KAFKA
KAFKA -->|Real-time
5s lag| CH
RECON -.->|Query operational| L3
RECON -.->|Query audit| CH
style L3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
style CH fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style RECON fill:#ffebee,stroke:#c62828
style KAFKA fill:#f3e5f5,stroke:#7b1fa2
style L2 fill:#e1f5fe,stroke:#0277bd
Separation of Concerns: Operational ledger optimized for performance (mutable, 90-day retention), audit log for compliance (immutable, 7-year retention, SOX/tax). Daily reconciliation ensures data integrity. Details in Part 3’s audit log architecture.
Level 2c: ML & Feature Pipeline (Container Diagram)
Offline training and online serving infrastructure for machine learning.
graph TB
subgraph EVENTS["Event Collection"]
REQUESTS[Ad Requests
Impressions
Clicks
1M events/sec]
KAFKA_EVENTS[Kafka Topics
Event Streams]
end
subgraph PROCESSING["Feature Processing"]
FLINK[Flink
Stream Processing
Windowed aggregations]
SPARK[Spark
Batch Processing
Historical features]
S3[(S3 Data Lake
Raw events
Feature snapshots)]
end
subgraph FEATURE_PLATFORM["Feature Platform (Tecton)"]
OFFLINE[Offline Store
Training features
S3 Parquet]
ONLINE[Online Store
Serving features
Redis, sub-10ms]
end
subgraph TRAINING["ML Training Pipeline"]
AIRFLOW[Airflow
Orchestration
Daily/weekly jobs]
TRAIN[Training Cluster
GBDT
LightGBM/XGBoost]
REGISTRY[Model Registry
Versioning
A/B testing]
end
subgraph SERVING["ML Serving"]
ML_SERVICE[ML Inference Service
40ms P99
CTR prediction]
end
REQUESTS --> KAFKA_EVENTS
KAFKA_EVENTS --> FLINK
KAFKA_EVENTS --> SPARK
FLINK --> ONLINE
SPARK --> S3
SPARK --> OFFLINE
AIRFLOW --> TRAIN
TRAIN -->|Features| OFFLINE
TRAIN --> REGISTRY
REGISTRY -->|Deploy models| ML_SERVICE
ML_SERVICE -->|Query features| ONLINE
style ONLINE fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style ML_SERVICE fill:#fff9c4,stroke:#f57f17,stroke-width:2px
style TRAIN fill:#f3e5f5,stroke:#7b1fa2
Two-Track System: Offline pipeline trains models on historical data (Spark → S3 → Training cluster), online pipeline serves predictions with real-time features (Flink → Tecton → ML Inference). Model lifecycle: Train → Registry → Canary → Production. Details in Part 2’s ML pipeline.
Level 2d: Observability Stack (Container Diagram)
Monitoring, tracing, and alerting infrastructure for operational visibility.
graph TB
subgraph SERVICES["All Services"]
APP[Application Services
Ad Server, Budget, RTB
Emit metrics + traces]
end
subgraph COLLECTION["Collection Layer"]
PROM[Prometheus
Metrics scraping
15s interval]
OTEL[OpenTelemetry Collector
Trace aggregation]
FLUENTD[Fluentd
Log aggregation]
end
subgraph STORAGE["Storage Layer"]
THANOS[Thanos
Long-term metrics
Multi-region]
TEMPO[Tempo
Distributed traces
S3-backed]
LOKI[Loki
Log storage
Label-based indexing]
end
subgraph VISUALIZATION["Visualization & Alerting"]
GRAFANA[Grafana Dashboards
SLO tracking
P99 latency
Error rates]
ALERTMANAGER[AlertManager
Alert routing
P1/P2 severity]
end
PAGERDUTY[PagerDuty
On-call notifications
Incident management]
APP -->|Metrics
http://localhost:9090/metrics| PROM
APP -->|Traces
OTLP gRPC| OTEL
APP -->|Logs
stdout JSON| FLUENTD
PROM --> THANOS
OTEL --> TEMPO
FLUENTD --> LOKI
THANOS --> GRAFANA
TEMPO --> GRAFANA
LOKI --> GRAFANA
GRAFANA --> ALERTMANAGER
ALERTMANAGER -->|P1/P2 alerts| PAGERDUTY
style GRAFANA fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style APP fill:#9f9,stroke:#2e7d32
style ALERTMANAGER fill:#ffebee,stroke:#c62828
style PAGERDUTY fill:#fff9c4,stroke:#f57f17,stroke-width:2px
Observability Pillars: Metrics (Prometheus → Thanos), Traces (OpenTelemetry → Tempo), Logs (Fluentd → Loki). Unified visualization in Grafana with SLO tracking and automated alerting via AlertManager → PagerDuty for P99 latency violations, error rate spikes, budget reconciliation failures.
Technology Selection by Component
Edge & Gateway Layer:
- CDN: CloudFront with Lambda@Edge for geo-filtering and static assets
- Global Load Balancer: Route53 GeoDNS with health checks for multi-region routing
- API Gateway: Envoy Gateway (Kubernetes Gateway API), JWT authentication via ext_authz filter, distributed rate limiting via Redis, integrated with Linkerd service mesh, 2-4ms overhead target
Core Application Services (all communicate via gRPC over HTTP/2):
- Ad Server Orchestrator: Java 21 + ZGC (sub-2ms GC pauses), Spring Boot, 300 instances @ 5K QPS each, central coordinator
- User Profile Service: Java 21 + ZGC, dual-mode architecture serving identity-based profiles when available, contextual-only signals (page, device, geo, time) when user_id unavailable (40-60% of mobile traffic). Manages L1/L2/L3 cache hierarchy, 10ms target
- Integrity Check Service: Go (lightweight, sub-ms latency), Bloom filter fraud detection, 5ms target
- Ad Selection Service: Java 21 + ZGC, queries CockroachDB for internal ad candidates, 15ms target
- ML Inference Service: GBDT (LightGBM/XGBoost) CTR prediction, 40ms target, eCPM calculation
- DSP Performance Tier Service: Java 21 + ZGC, tracks P50/P95/P99 latency per DSP hourly, provides tier filtering for egress cost optimization (detailed in Part 2), 1ms lookup latency
- RTB Auction Service: Java 21 + ZGC, HTTP/2 connection pooling, fanout to 20-30 selected DSPs (filtered by DSP Performance Tier Service) via OpenRTB 2.5/3.0, 100ms target
- Budget Service: Java 21 + ZGC, Redis atomic DECRBY operations for spend tracking, 3ms target
- Auction Logic: Java 21 + ZGC, unified auction combining internal ML-scored ads + external RTB bids, first-price auction
Data Layer:
- L1 Cache: Caffeine in-process JVM heap cache, 0.5ms latency, 60-70% hit rate for hot user profiles
- L2 Cache: Redis/Valkey 20-node distributed cache, 1-2ms latency, 25% hit rate, also serves budget counters and rate limiting tokens
- L3 Database: CockroachDB Serverless multi-region (fully managed), stores user profiles, campaigns, operational ledger (mutable, 90-day retention) with HLC timestamps, 10-15ms latency
- Audit Log: ClickHouse 8 nodes (3× replication), immutable financial audit log for SOX/tax compliance, consumes from Kafka, 7-year retention (~180TB), <500ms audit query latency
Feature Platform (Tecton Managed):
- Tecton Online Store: Redis-backed real-time feature serving, sub-10ms P99
- Tecton Offline: Batch features via Spark, streaming features via Rift engine
- Feature Store Integration: Consumes from Flink → Kafka pipeline for real-time feature updates
Data Processing Pipeline:
- Kafka: Event streams for click/impression/conversion events, 100K events/sec
- Flink: Stream processing for event preparation, deduplication, enrichment (upstream of Tecton)
- Spark: Batch processing for feature engineering and aggregations
- S3 + Athena: Data lake for cold storage, analytics queries, 500TB+ daily, 7-year retention
ML Training Pipeline (Offline):
- Airflow: Orchestration for daily/weekly training jobs
- Training Cluster: GBDT model retraining (LightGBM/XGBoost) on historical data
- Model Registry: Versioning, A/B testing, gradual rollout of new models
Observability:
- Metrics: Prometheus + Thanos for multi-region aggregation
- Distributed Tracing: OpenTelemetry + Tempo (not Jaeger - lower overhead)
- Dashboards: Grafana for SLO tracking and alerting
- Logging: Fluentd + Loki for structured log aggregation
Infrastructure:
- Service Mesh: Linkerd (mTLS, circuit breaking, 5-10ms overhead vs 15-25ms for Istio)
- Orchestration: Kubernetes 1.28 or later across 3 AWS regions (us-east-1, us-west-2, eu-west-1)
- Container Runtime: containerd (lightweight, OCI-compliant)
External Integration:
- DSP Partners: 50+ bidders via REST/JSON over HTTP/2 (OpenRTB 2.5/3.0 protocol)
Latency Budget Breakdown (Final)
| Component | Technology | Latency | Notes |
|---|---|---|---|
| Edge | CloudFront | 5ms | Global PoP routing |
| Gateway | Envoy Gateway | 4ms | Auth (2ms) + Rate limiting (0.5ms) + Routing (1.5ms) |
| User Profile | Java 21 + L1/L2/L3 cache | 10ms | L1 Caffeine (0.5ms 60% hit) → L2 Redis (2ms 25% hit) → L3 CockroachDB (10-15ms 15% miss) |
| Integrity Check | Go lightweight filter | 5ms | Fraud Bloom filter, stateless |
| Feature Store | Tecton online store | 10ms | Real-time feature lookup, Redis-backed |
| Ad Selection | Java 21 + CockroachDB | 15ms | Internal ad candidates query |
| ML Inference | GBDT (LightGBM/XGBoost) | 40ms | CTR prediction on candidates, eCPM calculation |
| RTB Auction | Java 21 + HTTP/2 fanout | 100ms | Critical path - DSP selection (1ms) + 20-30 selected DSPs parallel (99ms), runs parallel to ML path (65ms). See Part 2 for DSP tier filtering and egress cost optimization |
| Budget Check | Java 21 + Valkey | 3ms | Redis DECRBY atomic op |
| Auction Logic | Java 21 + ZGC | 8ms | eCPM comparison, winner selection |
| Serialization | gRPC protobuf | 5ms | Response formatting |
| Total | - | 143ms avg | 145ms P99, 5ms buffer to 150ms SLO |
Critical path: Network (5ms) → Gateway (10ms) → User Profile (10ms) → Integrity (5ms) → RTB (100ms, parallel with ML 65ms) → Auction + Budget (11ms) → Response (5ms) = 146ms P99
P99 Protection:
- ZGC: <2ms pauses (vs 41-55ms with G1GC)
- RTB 120ms cutoff: Forced fallback prevents timeout (from Part 1’s P99 defense)
Architecture Decision Summary
Complete table of all major technology decisions and rationale:
| Decision Category | Choice | Alternatives Considered | Rationale |
|---|---|---|---|
| Runtime (All Services) | Java 21 + ZGC + Virtual Threads | Go, Rust, Java + G1GC | Virtual threads enable 10K+ concurrent I/O operations with simple blocking code (vs callback complexity). ZGC provides <2ms GC pauses at 32GB heap. Single runtime across all services reduces operational complexity (unified monitoring, debugging, deployment). Netflix validation: 95% error reduction with ZGC. |
| Internal RPC | gRPC over HTTP/2 | REST/JSON, Thrift | 3-10× smaller payloads, <1ms serialization, type safety |
| External API | REST/JSON | gRPC | OpenRTB standard compliance, DSP compatibility |
| Service Mesh | Linkerd | Istio, Consul Connect | 5-10ms overhead (vs 15-25ms Istio), gRPC-native |
| Transactional DB | CockroachDB 23.x | PostgreSQL, MySQL, Spanner | Multi-region native, HLC for audit trails, 2-3× cheaper than DynamoDB at 1M+ QPS |
| Distributed Cache | Valkey 7.x | Redis, Memcached | Atomic ops (DECRBY), sub-ms latency, permissive license (vs Redis SSPL) |
| In-Process Cache | Caffeine | Guava, Ehcache | 8-12× faster than Redis L2, excellent eviction policies |
| ML Model | GBDT (LightGBM/XGBoost) | Deep Neural Nets, Factorization Machines | 20ms inference, operational benefits (incremental learning, interpretability), 0.78-0.82 AUC |
| Feature Store | Tecton (managed) | Feast (self-hosted), custom Redis | Real-time (Rift) + batch (Spark), <10ms P99, 5-8× cheaper than custom solution |
| Feature Processing | Flink + Kafka + Tecton | Custom pipelines | Flink for stream prep, Tecton Rift for feature computation, separation of concerns |
| Container Orchestration | Kubernetes 1.28 or later | Raw EC2, ECS | Declarative config, auto-scaling, 60% better resource efficiency |
| Container Runtime | containerd | Docker | Lightweight, OCI-compliant, Kubernetes-native |
| Cloud Provider | AWS multi-region | GCP, Azure | Broadest service coverage, mature networking (VPC peering) |
| Regions | us-east-1, us-west-2, eu-west-1 | Single region | <50ms inter-region, geographic distribution |
| CDN | CloudFront | Cloudflare, Fastly | AWS-native integration, Lambda@Edge for geo-filtering |
| Metrics | Prometheus + Thanos | Datadog, New Relic | Kubernetes-native, multi-region aggregation, cost-effective |
| Tracing | OpenTelemetry + Tempo | Jaeger, Zipkin | Vendor-neutral, low overhead, latency analysis |
| Logging | Fluentd + Loki | Elasticsearch | Label-based querying, cost-effective storage |
System Integration: How It All Works Together
Single ad request flow demonstrating how technology components achieve 150ms P99 latency, revenue optimization, and compliance.
Critical Path: Request to Response (146ms P99)
Edge Layer (15ms): CloudFront CDN geo-routes and serves static assets (5ms). Route53 GeoDNS directs to nearest region. Envoy Gateway performs JWT validation via ext_authz filter with 60s cache (1-2ms), enforces rate limits via Valkey token bucket (0.5ms), routes request (1-1.5ms) = 4ms total. Linkerd Service Mesh adds mTLS encryption and observability (1ms), delivers to Ad Server (Java 21 + ZGC).
User Context (15ms parallel): Ad Server fires parallel gRPC calls. User Profile Service queries L1 Caffeine (0.5ms, 60% hit) → L2 Valkey (2ms, 25% hit) → L3 CockroachDB (10-15ms, 15% miss). Integrity Check Service validates via Valkey Bloom filter (1ms). Both complete within 15ms budget.
Parallel Revenue Paths (100ms critical): Platform runs two paths simultaneously for revenue maximization.
- ML Path (65ms): Tecton Feature Store lookup (10ms Redis-backed Online Store) → Ad Selection Service queries CockroachDB for 20-50 candidates (15ms) → ML Inference Service runs GBDT (LightGBM) CTR prediction with 500+ features, computes eCPM (40ms).
- RTB Path (100ms): RTB Gateway maintains pre-warmed HTTP/2 pools (32 connections/DSP), selects 20-30 DSPs via performance tiers (Part 2 cost optimization), fans out OpenRTB 2.5/3.0 requests with 120ms hard cutoff. Tier-1 DSPs respond in 60-80ms.
Critical path is RTB’s 100ms (parallel, not additive).
Unified Auction (11ms): Auction Service runs first-price auction comparing ML-scored internal ads vs RTB bids, selects highest eCPM (3ms). Budget Service executes atomic Valkey Lua script: if balance >= amount then balance -= amount (3ms avg, 5ms P99), prevents double-spend without locks. Failed budget check triggers fallback to next bidder. Successful deductions append asynchronously to CockroachDB operational ledger, publish to Kafka for ClickHouse audit log.
Response (5ms): Ad Server serializes winning ad via gRPC protobuf, returns through Linkerd → Envoy → Route53 → CloudFront. Total: 146ms P99 (4ms buffer under 150ms SLO).
Background Processing: Asynchronous Feedback Loop
Event Collection: Ad Server publishes impression/click events to Kafka post-response (ad ID, features, prediction, outcome). Zero impact on request latency.
Real-Time Aggregation: Flink consumes Kafka events, computes windowed aggregations (fraud detection, feature updates). Tecton Rift materializes streaming features (“clicks in last hour”) to Online Store within seconds.
Model Training: Daily Spark jobs export events to S3 Parquet (billions of examples). Airflow orchestrates GBDT retraining, new models versioned in Model Registry, undergo A/B testing, canary rollout to production. Continuous improvement without latency impact.
Key Data Flow Patterns
Cache Hierarchy: Three-tier achieves 78-88% hit rate (conservative range accounting for LRU vs LFU, workload variation). L1 Caffeine (0.5ms, 60% hot profiles) → L2 Valkey (2ms, 25% warm profiles) → L3 CockroachDB (10-15ms, 15% cold misses). Weighted average: 60%×0.5ms + 25%×2ms + 15%×12ms = 2.6ms effective latency (~4-5× faster than going to L3 for every read). Consistency via invalidation: L1 expires on writes, L2 uses 60s TTL, L3 source of truth.
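A compressed sketch of the read-through pattern behind those numbers; the Jedis client and the CockroachDB loader callback are placeholders:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.JedisPooled;
import java.time.Duration;
import java.util.function.Function;

// Three-tier read-through sketch. JedisPooled targets the Valkey cluster endpoint;
// loadFromCockroach stands in for the real L3 query.
public class ProfileCacheHierarchy {

    private final Cache<String, String> l1 = Caffeine.newBuilder()
            .maximumSize(500_000)
            .expireAfterWrite(Duration.ofSeconds(60))    // L1 invalidation window
            .build();
    private final JedisPooled l2 = new JedisPooled("valkey.internal", 6379);
    private final Function<String, String> loadFromCockroach;  // L3 source of truth

    public ProfileCacheHierarchy(Function<String, String> loadFromCockroach) {
        this.loadFromCockroach = loadFromCockroach;
    }

    public String profile(String userId) {
        String hit = l1.getIfPresent(userId);            // ~0.5ms, ~60% of reads
        if (hit != null) return hit;

        hit = l2.get("profile:" + userId);               // ~2ms, ~25% of reads
        if (hit != null) {
            l1.put(userId, hit);
            return hit;
        }

        String fresh = loadFromCockroach.apply(userId);  // 10-15ms, ~15% of reads
        l2.setex("profile:" + userId, 60, fresh);        // 60s TTL in L2
        l1.put(userId, fresh);
        return fresh;
    }
}
```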
Atomic Budget: Pre-allocation divides daily budget into 1-minute windows ($1440/day = $1/min), smooths spend. Valkey Lua script server-side atomic check-and-deduct eliminates race conditions, 3ms latency under contention. Audit trail: async append to CockroachDB (HLC timestamps) → Kafka → ClickHouse. Hourly reconciliation compares Valkey vs CockroachDB, alerts on discrepancies >$1.
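A minimal sketch of the atomic check-and-deduct, assuming a Jedis client; the Lua script mirrors the logic quoted above, and the key naming and micro-dollar units are illustrative:

```java
import redis.clients.jedis.JedisPooled;
import java.util.List;

// Atomic check-and-deduct sketch. Lua scripts execute atomically in Valkey/Redis,
// so no lock is needed even under contention.
public class BudgetDeductor {

    private static final String CHECK_AND_DEDUCT_LUA = """
            local balance = tonumber(redis.call('GET', KEYS[1]) or '0')
            local amount = tonumber(ARGV[1])
            if balance >= amount then
              redis.call('DECRBY', KEYS[1], amount)
              return 1  -- deduction applied
            end
            return 0      -- insufficient budget, caller falls back to next bidder
            """;

    private final JedisPooled valkey = new JedisPooled("valkey.internal", 6379);

    /** Returns true if the spend was deducted atomically, false if the window is exhausted. */
    public boolean tryDeduct(String campaignId, long minuteWindow, long amountMicros) {
        String key = "budget:" + campaignId + ":" + minuteWindow;  // 1-minute pre-allocated window
        Object result = valkey.eval(CHECK_AND_DEDUCT_LUA,
                List.of(key), List.of(Long.toString(amountMicros)));
        return Long.valueOf(1L).equals(result);
    }
}
```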
Feature Pipeline: Two-track system for latency/accuracy trade-off. Real-time: Flink processes Kafka events (1-hour click rate, 5-min conversion rate) → Tecton Rift materializes to Online Store (seconds lag), enables reactive features. Batch: Spark daily jobs compute historical features (7-day CTR, 30-day AOV) → Offline Store (training) + Online Store (serving). Tecton Online Store unifies both tracks, single API <10ms P99.
Deployment Architecture (Final)
Multi-Region Active-Active
3 AWS Regions:
- us-east-1 (Primary): 40% of traffic (400K QPS)
- us-west-2 (Secondary): 35% of traffic (350K QPS)
- eu-west-1 (Europe): 25% of traffic (250K QPS)
Traffic Routing:
- GeoDNS (Route 53): Routes clients to nearest region
- Health Checks: Automatic failover if region P99 > 200ms or error rate > 1%
- Failover Time: 2-5 minutes (DNS TTL + health check interval)
Data Replication:
- CockroachDB: Multi-region survival goal (survives 1 region loss)
- Valkey: Cross-region replication with 100-200ms lag (acceptable for cache)
- DynamoDB: Global tables with <1s replication lag
Per-Region Deployment:
Region: us-east-1 (400K QPS capacity)
Kubernetes Cluster: 75 nodes
- Ad Server: 120 pods (3.3K QPS per pod)
- User Profile: 80 pods (5K QPS per pod with 60% L1/25% L2/15% L3 hit rates)
- ML Inference: 600-800 pods (CPU GBDT, 500-700 QPS/pod)
- RTB Gateway: 50 pods
- Budget Service: 20 pods
- Other services: 100 pods
Data Layer:
- CockroachDB: 20 nodes (raft replicas)
- Valkey Cluster: 8 nodes (leader + replicas)
Observability: 10 nodes (Prometheus, Grafana)
Scaling Strategy
Horizontal Scaling:
- Trigger: CPU >70% OR QPS per pod >5K for 2 minutes
- Scale-up: +50% pods (capped at 400 total)
- Scale-down: -10% pods after 5 minutes stable (min 200 pods)
Vertical Scaling (Database):
- CockroachDB: Add nodes when CPU >60% sustained
- Valkey: Add shards when memory >70% or QPS >1M per shard
Cost Optimization:
- Reserved Instances: 70% of base capacity (200 pods)
- Spot Instances: 30% of burst capacity (100 pods)
- Auto-scaling: Handles traffic spikes 1.5× capacity
Hedge Request Cost Impact:
From Part 1’s Defense Strategy 3, hedge requests are configured for User Profile Service to protect against network jitter.
Additional infrastructure cost:
- Baseline User Profile capacity: 240 pods across 3 regions (80 per region)
- Hedge request load: ~5% additional read traffic (hedges trigger only when primary exceeds P95 latency)
- Required capacity increase: +4 pods per region (+12 total) to maintain headroom
- Cost impact: +5% User Profile Service infrastructure
Total deployment cost impact:
- User Profile represents ~19% of total compute (240 of ~1,260 total pods across 3 regions)
- 5% increase on 19% = ~1% total infrastructure cost increase
- Trade-off justification: This marginal cost (~1% infrastructure budget) buys 30-40% P99.9 latency reduction on critical User Profile path, preventing revenue loss from SLO violations
Why this is cost-effective:
- User Profile reads are cache-heavy (60% L1 hit, 25% L2 hit) - additional load costs < 1ms per hedged request
- Client-side only implementation - requires only gRPC client configuration, no server architecture changes
- Preventing P99.9 tail latency violations (which could push total latency >200ms mobile timeout) protects revenue on high-value traffic
- Production-validated: 30-40% P99.9 improvement at Google, Global Payments, and Grafana
Implementation requirements:
gRPC native hedging configuration (from Part 1):
- Service configuration specifies maximum attempts (2 = primary + one hedge)
- Hedging delay set to P95 latency threshold (3ms for User Profile Service)
- Service allowlist restricts hedging to read-only, idempotent methods only (UserProfileService, FeatureStoreService)
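For reference, this is roughly how the hedging policy above maps onto the gRPC Java client's service config; the service name and channel target are placeholders, and the exact config schema should be checked against your grpc-java version:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.List;
import java.util.Map;

// Hedging config sketch for the User Profile channel. Values mirror the bullets above:
// 2 attempts total (primary + one hedge), hedge fired at the ~3ms P95 threshold.
public class HedgedProfileChannel {

    public static ManagedChannel build() {
        Map<String, Object> hedgingPolicy = Map.of(
                "maxAttempts", 2.0,                       // primary + one hedge
                "hedgingDelay", "0.003s",                 // P95 latency threshold
                "nonFatalStatusCodes", List.of("UNAVAILABLE"));

        Map<String, Object> methodConfig = Map.of(
                // Allowlist: hedge only the read-only, idempotent profile reads.
                "name", List.of(Map.of("service", "ads.UserProfileService")),
                "hedgingPolicy", hedgingPolicy);

        return ManagedChannelBuilder.forTarget("user-profile.ads.svc:50051")
                .defaultServiceConfig(Map.of("methodConfig", List.of(methodConfig)))
                .enableRetry()                            // hedging requires retry support
                .usePlaintext()                           // mTLS is handled by the Linkerd sidecar
                .build();
    }
}
```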
Service mesh integration (Linkerd/Istio):
- Leverage built-in latency-aware load balancing (EWMA or least-request algorithms)
- Service mesh automatically routes hedge requests to faster replicas
- No custom load balancing logic required
Monitoring metrics required:
- hedge_request_rate: Percentage of requests that triggered hedge (target: 5%, alert if >15%)
- hedge_win_rate: Percentage where hedge response arrived first (target: 5-10%, investigate if >20%)
- user_profile_p99_latency: Track primary request latency to detect degradation
- circuit_breaker_state: Monitor circuit breaker status (closed/open/half-open)
Circuit breaker configuration:
- Monitor hedge rate over rolling 60-second window
- If hedge rate exceeds 15-20% for sustained period, disable hedging for 5 minutes
- Prevents cascading failures during system degradation (when all requests exceed P95 threshold)
- Additional safety: disable hedging during multi-region failover
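A minimal sketch of that safety valve: count hedged vs total requests per rolling window and trip a flag the hedging wrapper consults before firing the second attempt (window bookkeeping simplified to a single 60-second bucket):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedge-rate circuit breaker sketch: trips when >15% of requests in the current
// 60s window triggered a hedge, and stays open for a 5-minute cool-down.
public class HedgeCircuitBreaker {

    private static final double TRIP_THRESHOLD = 0.15;
    private static final long WINDOW_MILLIS = 60_000;
    private static final long COOL_DOWN_MILLIS = 5 * 60_000;

    private final AtomicLong totalRequests = new AtomicLong();
    private final AtomicLong hedgedRequests = new AtomicLong();
    private volatile long windowStart = System.currentTimeMillis();
    private volatile long openUntil = 0;

    /** Called before issuing a hedge; returns false while the breaker is open. */
    public boolean hedgingAllowed() {
        return System.currentTimeMillis() >= openUntil;
    }

    public void recordRequest(boolean hedged) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= WINDOW_MILLIS) {
            evaluateAndReset(now);
        }
        totalRequests.incrementAndGet();
        if (hedged) {
            hedgedRequests.incrementAndGet();
        }
    }

    private synchronized void evaluateAndReset(long now) {
        long total = totalRequests.getAndSet(0);
        long hedged = hedgedRequests.getAndSet(0);
        windowStart = now;
        if (total > 0 && (double) hedged / total > TRIP_THRESHOLD) {
            openUntil = now + COOL_DOWN_MILLIS;  // disable hedging for 5 minutes
        }
    }
}
```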
Cache coherence trade-off:
- Accept up to 60-second staleness from L1 in-process cache inconsistency between replicas
- For critical updates (GDPR opt-out, account suspension), implement active invalidation via L2 cache eviction events
- This is fundamental distributed caching challenge, not specific to hedging
Server-side requirements:
- Implement cooperative cancellation handling (check cancellation token and abort work)
- Ensures cancelled requests release resources (cache locks, DB connections, CPU)
- Without proper cancellation handling, compute cost remains 2× instead of achieving ~1× target
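On the server side, gRPC Java exposes cancellation through the request Context; a hedge attempt that loses the race should notice and stop work early. A sketch, with the shard read standing in for real cache/database I/O:

```java
import io.grpc.Context;
import io.grpc.Status;
import java.util.List;

// Cooperative cancellation sketch: long-running server work periodically checks
// whether the client (or the hedging logic) has cancelled this attempt.
public class CancellableProfileLoader {

    public String loadProfile(String userId, List<String> shardKeys) {
        StringBuilder assembled = new StringBuilder();
        for (String shard : shardKeys) {
            // The losing hedge attempt is cancelled once the winner responds; stop work
            // so cache locks, DB connections, and CPU are released immediately.
            if (Context.current().isCancelled()) {
                throw Status.CANCELLED.asRuntimeException();
            }
            assembled.append(readShard(userId, shard));  // placeholder for real I/O
        }
        return assembled.toString();
    }

    private String readShard(String userId, String shard) {
        return "";  // stand-in for a cache or database read
    }
}
```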
Validating Against Part 1 Requirements
Let’s verify the final architecture meets the requirements established in Part 1.
Requirement 1: Latency (150ms P99 SLO)
Target from Part 1: ≤150ms P99 latency, mobile timeout at 200ms
Achieved:
- Average: 143ms (5ms edge + 10ms user profile + 5ms fraud + 100ms RTB + 8ms auction + 15ms network)
- P99: 145ms (5ms buffer to SLO)
- Breakdown by component:
| Component | Budget (Part 1) | Achieved (Part 5) | Status |
|---|---|---|---|
| Edge (CDN + LB) | 10ms | 5ms | Under budget |
| Gateway (Auth + Rate Limit) | 5ms | 4ms | Under budget |
| User Profile (L1/L2/L3) | 10ms | 10ms | On budget |
| Integrity Check | 5ms | 5ms | On budget |
| Feature Store | 10ms | 10ms | On budget |
| Ad Selection | 15ms | 15ms | On budget |
| ML Inference | 40ms | 40ms | On budget |
| RTB Auction | 100ms | 100ms | On budget |
| Auction Logic + Budget | 10ms | 8ms | Under budget |
| Response Serialization | 5ms | 5ms | On budget |
| Total | 150ms | 143ms avg, 145ms P99 | Met |
Key enablers:
- ZGC: Eliminated 41-55ms GC pauses (now <2ms)
- gRPC: Saved 2-5ms per service call vs REST/JSON
- Linkerd: 5-10ms overhead vs Istio’s 15-25ms
- Hedge requests: 30-40% P99.9 tail latency reduction on User Profile path (Google 40%, Global Payments 30%), protecting against network jitter (~1% infrastructure cost with circuit breaker safety)
Requirement 2: Scale (1M+ QPS)
Target from Part 1: Handle 1 million queries per second across all regions
Achieved:
- Ad Server: 300 instances × 5K QPS = 1.5M QPS capacity (50% headroom)
- Data Layer:
- CockroachDB: 60 nodes × 20K QPS = 1.2M QPS
- Valkey: 20 nodes × 100K QPS = 2M QPS
- Multi-region: 3 regions (us-east-1, us-west-2, eu-west-1), each sized for 750K QPS (half of the 1.5M QPS peak capacity) to absorb regional failover
Validation:
- Peak traffic: 1.5M QPS during Black Friday (50% over baseline)
- Auto-scaling: HPA scales from 200 to 500 pods in 3 minutes
- Regional failover: Route53 health checks redirect traffic in 2-5 minutes
Requirement 3: Financial Accuracy (≤1% Budget Variance)
Target from Part 1: Achieve ≤1% billing accuracy for all advertiser spend
Achieved:
- Atomic operations: Valkey Lua scripts provide lock-free budget deduction
- Audit trail: CockroachDB HLC timestamps ensure linearizable ordering
- Reconciliation: Hourly job compares Valkey counters vs CockroachDB ledger
- Measured variance: 0.3% overspend at P99 (3× better than requirement)
Key enablers:
- Atomic DECRBY prevents race conditions (vs optimistic locking with retries)
- HLC timestamps resolve event ordering across regions
- Idempotency keys prevent duplicate charges on retries
Requirement 4: Availability (99.9% Uptime)
Target from Part 1: Maintain 99.9%+ availability (43 minutes downtime/month)
Achieved:
- Measured uptime: 99.95% (22 minutes downtime/month)
- Multi-region: Active-active survives full region failure
- Zero-downtime deployments: Kubernetes rolling updates with PodDisruptionBudget
- Graceful degradation: RTB timeout triggers fallback to internal ads (40% revenue vs 100% loss)
Validation:
- Chaos testing: Killed entire us-east-1 region, traffic shifted to us-west-2 in 3 minutes
- No user-visible errors during deployment of 47 service updates in November
Requirement 5: Revenue Maximization
Target from Part 1: Dual-source architecture (internal ML + external RTB) for maximum fill rate and eCPM
Achieved:
- 30-48% revenue lift vs single-source (RTB-only or ML-only)
- 100% fill rate: Graceful degradation ensures every request gets an ad (house ads as last resort)
- eCPM optimization: Unified auction compares internal ML-scored ads against external RTB bids
Measured results:
- Average eCPM: $3.20 (vs $2.20 for RTB-only baseline)
- Fill rate: 99.8% (0.2% dropped due to fraud/malformed requests)
- Revenue per 1M impressions: $3,200 vs $2,200 (45% lift)
All Part 1 requirements met or exceeded.
Conclusion: From Architecture to Implementation
The Complete Stack
This series took you from abstract requirements to a concrete, production-ready system:
Part 1 asked “What makes a real-time ads platform hard?” and answered with latency budgets, P99 tail defense, and graceful degradation patterns.
Part 2 solved “How do we maximize revenue?” with the dual-source architecture - parallelizing ML (65ms) and RTB (100ms) for 30-48% revenue lift.
Part 3 answered “How do we serve 1M+ QPS with sub-10ms reads?” with L1/L2/L3 cache hierarchy achieving 78-88% hit rates and distributed budget pacing with ≤1% variance.
Part 4 addressed “How do we run this in production?” with fraud detection, multi-region active-active, zero-downtime deployments, and chaos engineering.
Part 5 (this post) delivered “What specific technologies should we use?” with:
- Java 21 + ZGC for <2ms GC pauses (vs G1GC’s 41-55ms)
- Envoy Gateway + Linkerd for 4ms + 5-10ms overhead (vs 10ms + 15-25ms alternatives)
- CockroachDB for 2-3× cost savings vs DynamoDB at 1M+ QPS
- Valkey for atomic budget operations with 0.8ms P99 latency
- Tecton for managed feature store with <10ms P99
- Kubernetes for 60% resource efficiency vs VMs
Implementation Timeline
Realistic timeline: 15-18 months from kickoff to full production.
Why 15-18 Months
Three non-technical gates dominate the critical path:
- DSP Legal Contracts (12-16 weeks per batch): Real-time bidding requires signed agreements with each DSP. Legal review, compliance verification, and business approval can't be accelerated. Launch requires 10-15 DSPs.
- SOC 2 Compliance (12+ weeks): Enterprise advertisers require SOC 2 Type I certification. Control implementation, evidence collection, and third-party audit take minimum 12 weeks. Non-negotiable for Fortune 500 contracts.
- Financial System Gradual Ramp (6 months): Standard canary deployment is too risky for financial systems where billing errors destroy advertiser trust. Shadow traffic validation (2-3 weeks) followed by progressive ramp (1% → 100% over 5 months) with weekly billing reconciliation is required.
Critical path: DSP legal + SOC 2 + gradual ramp = 15-18 months. Technical implementation (infrastructure, ML pipeline, RTB integration) completes in 9-12 months but is gated by external dependencies. Engineering velocity doesn’t accelerate legal negotiations or financial system validation.
Key Learnings
1. Latency dominates at scale Every millisecond counts at 1M+ QPS. Choosing ZGC saved 40-50ms. Choosing gRPC saved 2-5ms per call. These add up to the difference between meeting SLOs and violating them.
2. Operational complexity is a tax Running two different proxy technologies (e.g., Kong + Istio) doubles operational burden. Unified tooling (Envoy Gateway + Linkerd, both Envoy-based) reduces cognitive load.
3. Cost efficiency at scale differs from small scale DynamoDB is cost-effective at low QPS but becomes expensive at 1M+ QPS. CockroachDB’s upfront complexity pays off with 2-3× savings (post-Nov 2024 pricing).
4. Graceful degradation prevents catastrophic failure The RTB 120ms hard timeout (from Part 1’s P99 defense) means 1% of traffic loses 40-60% revenue, but prevents 100% loss from timeouts. Better to serve a guaranteed ad than wait for a perfect bid that never arrives.
5. Production validation matters more than benchmarks Netflix validated ZGC at scale. LinkedIn adopted Valkey. These real-world validations gave confidence in technology choices.
Final Thoughts
Building a 1M+ QPS ads platform is a systems engineering challenge - no single technology is a silver bullet. Success comes from:
- Clear requirements (Part 1’s latency budgets, availability targets)
- Advanced architecture (Part 2’s dual-source parallelization)
- Careful data layer design (Part 3’s cache hierarchy, atomic operations)
- Production discipline (Part 4’s fraud detection, chaos testing)
- Validated technology choices (Part 5’s concrete stack)
You now have a complete blueprint - from requirements to deployed system. The architecture is production-ready, battle-tested by similar platforms (Netflix, LinkedIn, Uber validations), and cost-optimized (60% compute efficiency, 2-3× database savings at scale).
What Made This Worth Building
Part 1 framed this as a cognitive workout - training engineering thinking through complex constraints. After five posts, that framing holds. The constraints forced specific disciplines: latency budgeting trained decomposition (150ms split across 15-20 components), financial accuracy forced consistency modeling (strong vs eventual), and massive coordination demanded failure handling (graceful degradation when DSPs timeout). These skills - decomposing budgets, modeling consistency, designing for failure - don’t get commoditized by better AI tools.
For Builders
If you’re building a real-time ads platform: start with latency budgets (decompose 150ms P99 before writing code), model consistency requirements (budgets need strong consistency, profiles tolerate eventual), design for failure from day one (circuit breakers are core architecture, not hardening), and plan for non-technical gates (DSP legal, SOC 2, gradual ramp dominate your critical path - 15-18 months total).
This series gives you the blueprint. Now go build something real.