Complete Implementation Blueprint: Technology Stack & Architecture Guide

Introduction: From Requirements to Reality

Over the past four parts of this series, we’ve built up the architecture for a real-time ads platform serving 1M+ QPS with 150ms P99 latency:

Part 1 established the architectural foundation - requirements analysis, latency budgeting (decomposing 150ms across components), resilience patterns (circuit breakers, graceful degradation), and the P99 tail latency challenge. We identified three critical drivers: revenue maximization, sub-150ms latency, and 99.9% availability. These requirements shaped every decision that followed.

Part 2 designed the dual-source revenue engine - parallelizing internal ML-scored inventory (65ms) with external RTB auctions (100ms) to achieve 30-48% revenue lift over single-source approaches. We detailed the OpenRTB protocol implementation, GBDT-based CTR prediction, feature engineering pipeline, and timeout handling strategies.

Part 3 built the data layer - L1/L2/L3 cache hierarchy (Caffeine → Redis/Valkey → CockroachDB) achieving 78-88% hit rates and sub-10ms reads. We covered eCPM-based auction mechanisms for fair price comparison across CPM/CPC/CPA models, and distributed budget pacing using atomic operations with proven ≤1% overspend guarantee.

Part 4 addressed production operations - pattern-based fraud detection (20-30% bot filtering), active-active multi-region deployment with 2-5min failover, zero-downtime schema evolution, clock synchronization for financial ledgers, observability with error budgets, zero-trust security, and chaos engineering validation.

Part 5 (this post) brings it all together - the complete technology stack with concrete choices, detailed configurations, and integration patterns. This is where abstract requirements become a deployable system.

What This Post Covers

  1. Complete Technology Stack - Every component with specific versions, rationale, and alternatives considered
  2. Technology Decision Framework - The five criteria used for every choice
  3. Runtime & Infrastructure - Java 21 + ZGC configuration, Kubernetes cluster setup, container orchestration
  4. Communication Layer - gRPC setup with connection pooling, Linkerd service mesh configuration
  5. Data Layer - CockroachDB cluster topology, Valkey sharding strategy, Caffeine cache sizing
  6. Feature Platform - Tecton architecture (Offline: Spark + Rift, Online: Redis), Flink integration
  7. Observability - Prometheus + Thanos multi-region setup, Tempo sampling strategy, Grafana dashboards
  8. Integration Patterns - How all components work together as a cohesive system
  9. Validation - How the final architecture meets Part 1’s requirements

Let’s dive into the decisions.


Complete Technology Stack

Here’s the final stack, organized by layer:

Application Layer

| Component | Technology | Version | Rationale |
|---|---|---|---|
| Ad Server Orchestrator | Java + Spring Boot | 21 LTS | Ecosystem maturity, ZGC availability, team expertise |
| Garbage Collector | ZGC (Z Garbage Collector) | Java 21+ | <1ms p99.9 pauses, eliminates GC as P99 contributor |
| User Profile Service | Java + Spring Boot | 21 LTS | Dual-mode architecture (identity + contextual fallback), consistency with orchestrator |
| ML Inference | GBDT (LightGBM/XGBoost) | - | Day-1 CTR prediction, 20ms inference. Evolution path: two-pass ranking with distilled DNN reranker (see Part 2) |
| Budget Service | Java + Spring Boot | 21 LTS | Strong consistency requirements, atomic operations |
| RTB Gateway | Java + Spring Boot | 21 LTS | HTTP/2 connection pooling, protobuf support |
| Integrity Check | Go | 1.21+ | Sub-ms latency, minimal resource footprint, stateless filtering |

Communication Layer

| Component | Technology | Rationale |
|---|---|---|
| Internal RPC | gRPC over HTTP/2 | Binary serialization (3-10× smaller than JSON), type safety, <1ms overhead |
| External API | REST/JSON over HTTP/2 | OpenRTB standard compliance, DSP compatibility |
| Service Mesh | Linkerd | Lightweight (5-10ms overhead), native gRPC support, mTLS |
| Service Discovery | Kubernetes DNS | Built-in, no external dependencies, <1ms resolution |
| Load Balancing | Kubernetes Service + gRPC client-side | L7 awareness, connection-level distribution |

Data Layer

| Component | Technology | Rationale |
|---|---|---|
| L3: Transactional DB | CockroachDB Serverless | User profiles, campaigns, billing ledger. Strong consistency, cross-region ACID transactions, HLC timestamps. 50-75% cheaper than DynamoDB, fully managed. Self-hosted break-even depends on operational costs (see capacity planning). |
| L2: Distributed Cache | Valkey 7.x (Redis fork) | Budget counters (DECRBY atomic), L2 cache, rate limit tokens. Sub-ms latency, permissive BSD-3 license |
| L1: In-Process Cache | Caffeine | Hot user profiles, 60-70% hit rate. 8-12× faster than Redis, JVM-native, excellent eviction |
| Feature Store | Tecton (managed) | Batch (Spark) + Streaming (Rift) + Real-time online store. Sub-10ms P99, Redis-backed |

Infrastructure Layer

| Component | Technology | Rationale |
|---|---|---|
| Container Orchestration | Kubernetes 1.28 or later | Industry standard, declarative config, auto-scaling, multi-region federation |
| Container Runtime | containerd | Lightweight, OCI-compliant, lower overhead than Docker |
| Cloud Provider | AWS (multi-region) | Broadest service coverage, mature networking (VPC peering, Transit Gateway) |
| Regions | us-east-1, us-west-2, eu-west-1 | Geographic distribution, <50ms inter-region latency |
| CDN/Edge | CloudFront + Lambda@Edge | Global PoPs, request routing, geo-filtering |

Observability Layer

| Component | Technology | Rationale |
|---|---|---|
| Metrics | Prometheus + Thanos | Kubernetes-native, multi-region aggregation, PromQL for SLO queries |
| Distributed Tracing | OpenTelemetry + Tempo | Vendor-neutral, low overhead, latency analysis across services |
| Logging | Fluentd + Loki | Structured logs, label-based querying, cost-effective storage |
| Alerting | Alertmanager | Integrated with Prometheus, SLO-based alerts, escalation policies |

Technology Decision Framework

Every technology choice in this architecture was evaluated against five criteria:

1. Latency Impact

Does it fit within the component’s latency budget? (From Part 1’s latency decomposition)

2. Operational Complexity

How many additional systems, proxies, or failure modes does it introduce?

3. Cost Efficiency

What’s the total cost of ownership at 1M+ QPS scale?

4. Team Expertise

Can the team operate it effectively, or does it require hiring specialists?

5. Production Validation

Has it been proven at similar scale by other companies?

When trade-offs were necessary, latency always won - because every millisecond lost reduces revenue at 1M+ QPS.


Runtime & Garbage Collection: Java 21 + ZGC

Decision: Java 21 + Generational ZGC

Why Java over Go/Rust:

  1. Ecosystem maturity: Battle-tested libraries for ads (OpenRTB, protobuf, gRPC), mature monitoring tools
  2. Team expertise: Java developers are easier to hire than Rust specialists
  3. Sub-millisecond GC: Modern ZGC eliminates GC as a latency source

Why ZGC over G1GC/Shenandoah:

ZGC Configuration

Key Configuration Decisions:

Heap Sizing: 32GB heap chosen based on allocation rate analysis. With 5,000 QPS per instance and average request creating ~50KB objects, allocation rate reaches 250 MB/sec. At this rate with ZGC’s concurrent collection, heap cycles every ~2 minutes at 50% utilization.
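One way to sanity-check the allocation-rate and cycle-time claims above:

$$5{,}000\ \text{req/s} \times 50\ \text{KB/req} = 250\ \text{MB/s}, \qquad \frac{32\ \text{GB}}{250\ \text{MB/s}} \approx 128\ \text{s} \approx 2\ \text{minutes}$$

In other words, even at full load the heap absorbs roughly two minutes of allocations, which gives ZGC's concurrent cycles ample headroom.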

Why 32GB:

Thread Pool Strategy:

Validation: From Part 1, the P99 tail represents 10,000 req/sec (1% of 1M QPS). With G1GC's 41-55ms pauses, 410-550 requests would time out per pause. ZGC's <2ms P99.9 pauses (32GB heap) affect only ~20 requests per pause, a 95-96% reduction in GC-caused timeouts.


Communication Layer: gRPC + Linkerd

gRPC Configuration

Why gRPC over REST/JSON: From Part 1’s latency budget, service-to-service calls must be <10ms. JSON parsing overhead adds 2-5ms per request.

Connection Pooling Strategy:

Each Ad Server instance maintains 32 persistent connections to each downstream service. At 5,000 QPS per instance, this yields ~156 requests per second per connection, effectively reusing connections and avoiding expensive connection establishment overhead (TLS handshakes cost 10-20ms).

Key configuration decisions:

Load balancing: Round-robin distribution across service replicas with DNS-based service discovery (Kubernetes DNS provides automatic endpoint updates).

Retry Policy: Maximum 2 attempts with exponential backoff (10ms → 50ms). Critical: Only retry UNAVAILABLE status (service temporarily down), never DEADLINE_EXCEEDED (timeout) - retrying timeouts amplifies cascading failures under load.
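A minimal sketch of how this policy is expressed with grpc-java's per-method service config (the service name and target address are illustrative, not the actual production values):

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import java.util.List;
    import java.util.Map;

    public final class DownstreamChannels {
        public static ManagedChannel userProfileChannel() {
            // Retry policy matching the text: max 2 attempts, 10ms -> 50ms backoff, UNAVAILABLE only.
            Map<String, Object> retryPolicy = Map.of(
                    "maxAttempts", 2.0,                    // numeric values must be doubles in the service config
                    "initialBackoff", "0.010s",
                    "maxBackoff", "0.050s",
                    "backoffMultiplier", 5.0,
                    "retryableStatusCodes", List.of("UNAVAILABLE"));  // never retry DEADLINE_EXCEEDED

            Map<String, Object> methodConfig = Map.of(
                    "name", List.of(Map.of("service", "ads.UserProfileService")),
                    "retryPolicy", retryPolicy);

            return ManagedChannelBuilder
                    .forTarget("dns:///user-profile.ads.svc.cluster.local:8443")
                    .defaultServiceConfig(Map.of("methodConfig", List.of(methodConfig)))
                    .enableRetry()
                    .usePlaintext()   // the Linkerd sidecar supplies mTLS on the wire
                    .build();
        }
    }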

Service Mesh: Linkerd

Decision: Linkerd over Istio

From Part 1: We need <5ms gateway overhead, sub-10ms service-to-service latency.

Benchmarks:

Why Linkerd:

  1. Lower latency: 5-10ms vs Istio’s 15-25ms
  2. Lower resource usage: ~50MB memory per proxy vs Envoy’s ~150MB
  3. Rust-based proxy: linkerd2-proxy is lighter than Envoy (C++)
  4. gRPC-native: Zero-copy proxying for gRPC (our primary protocol)

Configuration:

Service Profile for the User Profile Service: Linkerd ServiceProfiles define per-route behavior for fine-grained traffic management:

This per-route configuration ensures timeouts match Part 1’s latency budget while preventing retry storms during service degradation.

mTLS (Mutual TLS) Encryption:

Traffic Splitting for Canary Deployments: Linkerd’s SMI TrafficSplit API enables gradual rollouts by weight-based routing:

This pattern (detailed in Part 4 Production Operations) reduces blast radius of defects while maintaining production velocity.

API Gateway: Envoy Gateway Decision

From Part 1’s latency budget, gateway operations (authentication, rate limiting, routing) must complete within 4-5ms to preserve 150ms SLO. Envoy Gateway achieves 2-4ms total overhead: JWT auth via ext_authz filter (1-2ms, cached 60s), rate limiting via Valkey token bucket (0.5ms atomic DECR), routing decisions (1-1.5ms). Production measurements: P50 2.8ms, P99 4.2ms.

Technology Comparison

| Gateway | Latency Overhead | Memory per Pod | Operational Complexity | Kubernetes-Native |
|---|---|---|---|---|
| Envoy Gateway | 2-4ms | 50-80MB | Low (Envoy config only) | Gateway API native |
| Kong | 10-15ms | 150-200MB | Medium (plugin ecosystem learning curve) | CRD-based |
| Traefik | 5-8ms | 100-120MB | Medium (label-based config, less flexible) | Gateway API support |
| NGINX Ingress | 3-6ms | 80-100MB | Medium (annotation-heavy, error-prone) | Annotation-based |

Kong rejected: 10-15ms latency (7-10% of budget), 150-200MB memory, different proxy tech from service mesh (Kong Lua + Istio Envoy = 20-30ms combined overhead). NGINX rejected: annotation-based config error-prone (nginx.ingress.kubernetes.io/rate-limit typo fails silently), no native gRPC support, external rate-limit sidecar complexity. Traefik rejected: label-based config insufficient for RTB’s sophisticated timeout/header transformation requirements.

Unified Proxy Stack with Linkerd Service Mesh

The platform handles two traffic patterns: north-south (external → cluster via Envoy Gateway) and east-west (internal service-to-service via Linkerd). Envoy Gateway terminates external traffic at the edge while Linkerd's lightweight linkerd2-proxy handles internal hops, so the two layers compose cleanly without double-proxying the same call. The alternative (Kong + Istio) requires learning two heavier proxy stacks, maintaining separate observability, and absorbing 20-30ms combined latency.

Traffic flow: External request → Envoy Gateway (TLS termination, JWT validation, rate limiting) → Linkerd sidecar (mTLS encryption, load balancing, retries) → Ad Server → internal calls via Linkerd (automatic mTLS, observability). Each service hop adds ~1ms Linkerd overhead; 3-4 hops = 3-4ms total, well within budget. Achieves zero-trust (every call authenticated/encrypted) without code changes.

Gateway API benefits: HTTPRoute enables per-DSP timeout policies and header transformations declaratively. ReferenceGrant provides namespace isolation for multi-tenant deployments. Native HTTP/2, gRPC, WebSocket support eliminates manual proxy_pass configuration for RTB bidstream.

Trade-off: Smaller plugin ecosystem vs Kong. Complex transformations (GraphQL→REST) implemented as dedicated microservices rather than gateway plugins, preserving low latency while allowing independent scaling.


Container Orchestration: Kubernetes

Why Kubernetes over Raw EC2/VMs

Kubernetes Provides:

  1. Declarative Configuration: Define desired state, Kubernetes reconciles
  2. Auto-Scaling: Horizontal Pod Autoscaler (HPA) scales based on metrics
  3. Self-Healing: Automatic pod restarts, node failure recovery
  4. Service Discovery: Built-in DNS, no external registry needed
  5. Rolling Updates: Zero-downtime deployments with health checks
  6. Multi-Region Federation: Cluster federation for global deployment

Why Not Raw EC2:

Resource Efficiency Example:

Kubernetes Architecture

Cluster Configuration:

Namespaces:

Auto-Scaling Strategy:

Horizontal Pod Autoscaler (HPA) monitors both CPU utilization (target: 70%) and custom metrics (requests per second per pod). Scaling triggers when pods exceed 5K QPS threshold. Scale-up happens aggressively (50% increase) with 60-second stabilization window, while scale-down is conservative (10% reduction) with 5-minute stabilization to avoid flapping. Minimum 200 pods ensures baseline capacity, maximum 400 pods caps burst handling.

Why containerd over Docker:


Data Layer: CockroachDB Cluster

Decision: CockroachDB over PostgreSQL/Spanner/DynamoDB

From Part 1 and Part 3: Need strongly-consistent transactional database for billing ledger, multi-region active-active, 10-15ms latency.

Why CockroachDB:

  1. 2-3× cheaper than DynamoDB at 1M+ QPS (see cost breakdown below)
  2. Postgres-compatible - existing team expertise, tooling compatibility
  3. HLC timestamps for linearizable billing events (Part 3 requirement)
  4. Multi-region native - automatic replication, leader election
  5. No vendor lock-in (vs Spanner’s Google-only deployment)

Cost comparison (1M QPS, 8 billion writes/day):

Cluster Topology

Day-1 Choice: CockroachDB Serverless

Self-Hosted Configuration (if operational costs justify it):

Why 60-80 nodes (self-hosted sizing): From benchmarks: CockroachDB achieves 400K QPS (99% reads) with 20 nodes, 1.2M QPS (write-heavy) with 200 nodes.

Our workload: ~70% reads, ~30% writes, 1M+ QPS total → 60-80 nodes provides headroom.

Sizing Strategy: Database is sized for sustained load (1M QPS baseline), while Ad Server instances are sized for peak capacity (1.5M QPS with 50% headroom). This is intentional: databases scale slowly (adding nodes requires rebalancing), while stateless Ad Servers scale instantly (spin up pods). During traffic bursts to 1.5M QPS, cache hit rates absorb most load (95% hits = only 75K additional DB queries), keeping database well within capacity.

Decision point: Evaluate self-hosted when infrastructure savings exceed operational costs. Break-even varies significantly: US-based SRE team (3-5 engineers) requires 20-30B req/day, while global/regional teams with existing database expertise may break even at 4-8B req/day. See Part 3’s database cost comparison for detailed break-even analysis with geographic and team structure scenarios.

Multi-Region Deployment:

Database Architecture: CockroachDB is deployed with us-east-1 as the primary region and us-west-2, eu-west-1 as secondary regions. The database is configured with SURVIVE REGION FAILURE semantics, which requires 5-way replication spread across the three regions (e.g., 2 replicas in the primary region for fast quorum, 2 in one secondary region, and 1 in the other) so that no single region holds a quorum and the cluster survives the loss of any one region.

Schema Design Decisions:

Billing Ledger Table uses several critical design patterns:

Connection Pooling:
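The original pool settings are not reproduced in this extract; a minimal HikariCP sketch against CockroachDB's Postgres-compatible endpoint, with all values illustrative:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public final class CockroachPool {
        public static HikariDataSource create() {
            HikariConfig cfg = new HikariConfig();
            // CockroachDB speaks the PostgreSQL wire protocol, so the standard pgJDBC driver is used.
            cfg.setJdbcUrl("jdbc:postgresql://crdb.ads.svc.cluster.local:26257/ads?sslmode=verify-full");
            cfg.setUsername("ad_server");
            cfg.setPassword(System.getenv("CRDB_PASSWORD"));
            cfg.setMaximumPoolSize(32);      // per instance; keep well below cluster connection limits
            cfg.setMinimumIdle(8);           // warm connections avoid handshake latency on bursts
            cfg.setConnectionTimeout(250);   // ms; fail fast so cache fallback paths are taken
            cfg.setMaxLifetime(300_000);     // ms; recycle connections to spread load across nodes
            return new HikariDataSource(cfg);
        }
    }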

Latency breakdown:

From Part 1: L3 cache (CockroachDB) is the fallback, accessed only on L1/L2 misses (5-10% of requests). The 10-15ms latency applies to these rare cross-region misses.


Capacity Planning & Sizing Model

Instance Count Formulas

Core Sizing Principle:

$$\text{Instance Count} = \frac{\text{Target QPS} \times 1.5}{\text{QPS per Instance}}$$

Safety Factor = 1.5 accounts for: traffic bursts, regional failover (one region down → 2 remaining absorb 50% more load), and deployment headroom.
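The same principle in code, using the per-service throughput figures detailed in the subsections below (a sketch, not production sizing logic):

    /** Sketch of the sizing formula: instances = ceil(targetQps * 1.5 / qpsPerInstance). */
    public final class CapacityPlanner {
        private static final double SAFETY_FACTOR = 1.5;  // bursts, regional failover, deploy headroom

        public static int instancesFor(long targetQps, int qpsPerInstance) {
            return (int) Math.ceil(targetQps * SAFETY_FACTOR / qpsPerInstance);
        }

        public static void main(String[] args) {
            System.out.println("Ad Server:    " + instancesFor(1_000_000, 5_000));   // 300
            System.out.println("User Profile: " + instancesFor(1_000_000, 10_000));  // 150
            System.out.println("ML Inference: " + instancesFor(1_000_000, 1_000));   // 1,500
            System.out.println("RTB Gateway:  " + instancesFor(1_000_000, 10_000));  // 150
        }
    }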

Ad Server Orchestrator (Critical Path):

$$N_{ads} = \frac{Q_{target} \times 1.5}{5,000}$$

Example at 1M QPS: $$N_{ads} = \frac{1,000,000 \times 1.5}{5,000} = 300 \text{ instances}$$

Why 5K QPS per instance? Measured from load testing:

User Profile Service (Cache-Heavy):

$$N_{profile} = \frac{Q_{target} \times 1.5}{10,000}$$

Why 10K QPS per instance? Read-heavy workload:

ML Inference Service (Compute-Heavy):

$$N_{ml} = \frac{Q_{target} \times 1.5}{1,000}$$

Why only 1K QPS per instance? GBDT inference is CPU-intensive:

RTB Gateway (I/O Bound):

$$N_{rtb} = \frac{Q_{target} \times 1.5}{10,000}$$

Why 10K QPS per instance? Network I/O bound, not CPU:

Budget Service (Redis-Backed):

$$N_{budget} = \frac{Q_{target} \times 1.5}{50,000}$$

Why 50K QPS per instance? Extremely lightweight:

CockroachDB Sizing (Benchmark-Driven):

From official benchmarks:

$$N_{crdb} = 20 + \left(\frac{Q_{target} - 400K}{800K}\right) \times 180$$

Example at 1M QPS: $$N_{crdb} = 20 + \left(\frac{1M - 400K}{800K}\right) \times 180 = 20 + 135 = 155 \text{ nodes (theoretical)}$$

BUT: With 78-88% cache hit rate (from Part 3):

Valkey/Redis Sizing:

From Valkey 8.1 benchmarks: 1M RPS per 16 vCPU instance

$$N_{cache} = \frac{Q_{target} \times \text{Cache Traffic \%}}{1M}$$

Example at 1M QPS:

Per-Service Resource Requirements

| Service | vCPU/Pod | RAM/Pod | Heap (JVM) | QPS/Pod | Pods @ 1M QPS | Total vCPU | Total RAM | Notes |
|---|---|---|---|---|---|---|---|---|
| Ad Server Orchestrator | 2 | 8GB | 32GB | 5,000 | 300 | 600 | 2,400GB | ZGC, virtual threads |
| User Profile Service | 1 | 4GB | - | 10,000 | 150 | 150 | 600GB | Cache-heavy, read-only |
| ML Inference Service | 4 | 16GB | - | 500-700 | 1,500-2,000 | 6,000-8,000 | 24,000-32,000GB | CPU GBDT (20ms inference, requires load testing) |
| RTB Gateway | 2 | 4GB | 16GB | 10,000 | 150 | 300 | 600GB | HTTP/2, async I/O |
| Budget Service | 2 | 4GB | 16GB | 1,200-1,500 | 600-800 | 1,200-1,600 | 2,400-3,200GB | Redis-backed (3ms async I/O, requires load testing) |
| Auction Service | 2 | 4GB | 16GB | 10,000-15,000 | 70-100 | 140-200 | 280-400GB | In-memory ranking (requires load testing) |
| Integrity Check | 2 | 4GB | 16GB | 2,000-3,000 | 300-500 | 600-1,000 | 1,200-2,000GB | Bloom filter + validation logic (requires load testing) |
| Feature Store (Tecton) | 2 | 8GB | - | 10,000 | 150 | 300 | 1,200GB | Managed service |
| CockroachDB Nodes | 16 | 32GB | - | ~17K | 60 | 960 | 1,920GB | c5.4xlarge instances |
| Valkey Cache Nodes | 8 | 64GB | - | ~42K | 30 | 240 | 1,920GB | r5.2xlarge instances |
| Kafka Brokers | 8 | 32GB | - | - | 30 | 240 | 960GB | Event streaming |
| Observability Stack | - | - | - | - | 150 | 300 | 600GB | Prometheus, Grafana, Loki |
| System Pods | - | - | - | - | 150 | 200 | 400GB | kube-system, controllers |
| TOTAL | - | - | - | - | ~4,000-4,500 | ~12,500-13,500 | ~43,000-46,000GB | 1M QPS baseline |

Key Insights:

  1. ML Inference dominates compute: 6,000-8,000 vCPUs (48-60% of total) for CPU-based GBDT prediction - see Part 2 for CPU vs GPU trade-off analysis
  2. Budget Service requires significant resources: 1,200-1,600 vCPUs (10-12% of total) despite lightweight operations - async I/O throughput limited by CPU for gRPC parsing/serialization
  3. Memory requirements: ~43-46TB total RAM across ~200-250 Kubernetes nodes (c6i.4xlarge: 16 vCPU, 32GB RAM or similar)
  4. Pod density: ~16-20 pods per node average (4,000-4,500 pods / 200-250 nodes)
  5. Database is ~7-8% of compute: 960 vCPUs (CockroachDB) vs 12,500-13,500 total - cache effectiveness reduces DB load
  6. All QPS estimates require validation: Throughput calculations based on theoretical CPU time per request - load testing mandatory to validate and optimize actual performance

Throughput Estimates: Validation with External Benchmarks

All QPS/Pod estimates are derived from external production benchmarks and theoretical analysis. Each service estimate is validated against published research and real-world case studies.

External Benchmark Baseline:

Industry benchmarks establish realistic throughput expectations for Java microservices:

Service-by-Service Validation:

1. Ad Server Orchestrator (5,000 QPS per pod, 2 vCPU)

External validation:

Our calculation:

Confidence: HIGH - aligns with published Spring Boot microservice benchmarks at 5K+ QPS

2. User Profile Service (10,000 QPS per pod, 1 vCPU)

External validation:

Our calculation:

Confidence: MEDIUM-HIGH - depends on virtual thread efficiency for I/O wait. Actual validation needed.

3. ML Inference Service (500-700 QPS per pod, 4 vCPU)

External validation:

Our calculation:

Confidence: HIGH - based on documented GBDT inference latency. Conservative estimate assumes no aggressive batching.

4. RTB Gateway (10,000 QPS per pod, 2 vCPU)

External validation:

Our calculation:

Confidence: HIGH - aligns with HTTP/2 gateway benchmarks showing 15K-18K RPS per instance

5. Budget Service (1,200-1,500 QPS per pod, 2 vCPU)

External validation:

Our calculation:

Rationale: We run pods at 60-75% of theoretical capacity (not 100%) to handle:

Confidence: MEDIUM-HIGH - conservative estimate. May achieve higher with connection pooling optimizations.

6. Auction Service (10,000-15,000 QPS per pod, 2 vCPU)

External validation:

Our calculation:

Confidence: MEDIUM - highly dependent on ranking algorithm complexity. Estimate assumes simple eCPM sort.

7. Integrity Check (2,000-3,000 QPS per pod, 2 vCPU)

External validation:

Our calculation:

Confidence: MEDIUM - depends on validation logic complexity beyond Bloom filter.

8. Feature Store (10,000 QPS per pod, 2 vCPU) - Tecton Managed

External validation:

Estimate based on:

Confidence: LOW - vendor-specific performance. Requires Tecton documentation validation.

Overprovisioning Strategy: Why We Don’t Run at 100% Capacity

All QPS estimates represent provisioned capacity at 60-75% utilization, not theoretical maximum throughput. This is a deliberate architectural decision from Part 1’s GC analysis.

Theoretical vs Provisioned Example (Budget Service):

Why we overprovision (25-40% extra capacity):

  1. ZGC overhead: Even pause-less GC consumes 10-15% CPU for concurrent marking and compaction
  2. Rolling deployments: During updates, 20-30% of pods are unavailable (graceful shutdown + warmup)
  3. Network variance: TCP retransmissions, health checks, DNS lookups add 5-10% overhead
  4. Traffic spikes: Sudden bursts within degradation thresholds require immediate capacity
  5. Pod failures: Individual pod crashes should not trigger cascading degradation

This is not waste - it’s insurance against SLO violations.

Running services at 95-100% CPU utilization means:

Trade-off: 25-40% more infrastructure cost → avoid catastrophic failures and SLO violations

Example calculation (Budget Service at 1M QPS, 70% traffic needs budget check):
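The worked numbers are not shown in this extract; reconstructing them from the figures above (70% of traffic needs a budget check, 1,200-1,500 QPS provisioned per pod, 25-40% overprovisioning):

$$1{,}000{,}000 \times 0.70 = 700\text{K checks/s}, \qquad \frac{700\text{K}}{1{,}200\text{-}1{,}500\ \text{QPS/pod}} \approx 470\text{-}580\ \text{pods}$$

Adding the 25-40% overprovisioning margin lands in the 600-800 pod range shown in the per-service resource table.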

Critical Dependencies:

All estimates assume:

Load testing validates both theoretical max AND safe utilization thresholds to determine optimal provisioning ratios.

Multi-Scale Cost Projections

Infrastructure Cost Components:

  1. Compute (Kubernetes Nodes): Standard compute instances × node count
  2. Database (CockroachDB Self-Hosted): Compute instances × node count
  3. Cache (Valkey): Memory-optimized instances × node count
  4. Network Egress: Per-GB charges for RTB traffic to DSPs (50+ partners)
  5. Managed Services: Tecton (feature store), monitoring, storage, etc.

| Scale | QPS | Compute Nodes | DB Nodes | Cache Nodes | Relative Total Cost | Cost Scaling Factor |
|---|---|---|---|---|---|---|
| Small | 100K | 15 | 15 | 6 | 15% | 0.15× baseline |
| Medium | 500K | 75 | 40 | 15 | 55% | 0.5× baseline |
| Baseline | 1M | 150 | 60 | 30 | 100% | 1.0× (reference) |
| Large | 5M | 750 | 200 | 90 | 440% | 4.5× baseline |

Cost composition @ 1M QPS baseline: Compute 53%, Database 21%, Cache 8%, Network egress 7%, Managed services 11%.

Key insight: Cost scales sub-linearly - 5× QPS increase = 4.5× cost (not 5×) due to fixed infrastructure amortization.

Break-Even Analysis: CockroachDB vs DynamoDB

Pricing Model Comparison:

1M QPS workload (8B requests/day, 70% reads, 30% writes):

Break-Even Analysis by Scale:

| Scale | Daily Requests | DynamoDB Cost | CRDB Cost | Cost Ratio | Winner |
|---|---|---|---|---|---|
| 100K QPS | 864M | 100% | 90% | 0.9× | DynamoDB (10% cheaper) |
| 500K QPS | 4.3B | 100% | 50% | 0.5× | CRDB (2× cheaper) |
| 1M QPS | 8.6B | 100% | 45% | 0.45× | CRDB (2.5× cheaper) |
| 5M QPS | 43B | 100% | 30% | 0.3× | CRDB (3.5× cheaper) |

Why economics flip: DynamoDB’s linear per-request pricing becomes expensive at scale, while CockroachDB’s fixed infrastructure cost amortizes across growing traffic. Crossover at ~150-200K QPS where self-hosted operational complexity becomes justified by cost savings.

Capacity Planning Decision Flow

    
    graph TD
    START[Start: Target QPS?] --> SCALE{QPS Level?}

    SCALE -->|< 100K QPS| SMALL[Small Scale Strategy]
    SCALE -->|100K - 1M QPS| MEDIUM[Medium Scale Strategy]
    SCALE -->|1M - 5M QPS| LARGE[Large Scale Strategy]
    SCALE -->|> 5M QPS| XLARGE[Extra Large Scale Strategy]

    SMALL --> SMALL_DB{Database Choice}
    SMALL_DB --> SMALL_CRDB[CRDB Serverless<br/>Managed, auto-scale<br/>~0.15× baseline]
    SMALL_DB --> SMALL_DYNAMO[DynamoDB<br/>Pay-per-use<br/>~0.15× baseline]

    MEDIUM --> MEDIUM_INFRA[Infrastructure Sizing]
    MEDIUM_INFRA --> MEDIUM_COMPUTE[Compute: 50-150 nodes<br/>DB: 30-60 CRDB nodes<br/>Cache: 10-30 Valkey]
    MEDIUM_INFRA --> MEDIUM_COST[Cost: ~0.5× baseline<br/>Break-even: CRDB wins]

    LARGE --> LARGE_INFRA[Production Scale]
    LARGE_INFRA --> LARGE_COMPUTE[Compute: 150-750 nodes<br/>DB: 60-200 CRDB nodes<br/>Cache: 30-90 Valkey]
    LARGE_INFRA --> LARGE_MULTI[Multi-Region Required<br/>3+ regions active-active<br/>Cost: 1-4× baseline]

    XLARGE --> XLARGE_INFRA[Hyper Scale]
    XLARGE_INFRA --> XLARGE_SHARD[Geographic Sharding<br/>Regional autonomy<br/>Cost: 4×+ baseline]
    XLARGE_INFRA --> XLARGE_OPT[Custom Optimizations<br/>ASICs for ML inference<br/>CDN for static content]

    SMALL_CRDB --> VALIDATE[Validate Requirements]
    SMALL_DYNAMO --> VALIDATE
    MEDIUM_COST --> VALIDATE
    LARGE_MULTI --> VALIDATE
    XLARGE_OPT --> VALIDATE

    VALIDATE --> CHECK_LATENCY{Meet 150ms<br/>P99 SLO?}
    CHECK_LATENCY -->|No| OPTIMIZE[Optimize:<br/>- Add cache capacity<br/>- Increase pod count<br/>- Tune GC settings]
    CHECK_LATENCY -->|Yes| CHECK_COST{Budget<br/>acceptable?}
    OPTIMIZE --> CHECK_LATENCY
    CHECK_COST -->|No| REDUCE[Cost Reduction:<br/>- Managed services<br/>- Right-size instances<br/>- Reserved capacity]
    CHECK_COST -->|Yes| DEPLOY[Deploy & Monitor]
    REDUCE --> CHECK_COST
    DEPLOY --> MONITOR[Continuous Monitoring]
    MONITOR --> ADJUST{Need to scale?}
    ADJUST -->|Yes| SCALE
    ADJUST -->|No| MONITOR

    style START fill:#e1f5ff
    style DEPLOY fill:#d4edda
    style VALIDATE fill:#fff3cd
    style OPTIMIZE fill:#f8d7da
    style REDUCE fill:#f8d7da

Critical Sizing Insights:

  1. ML Inference dominates: 6,000-8,000 vCPUs (48-60% of total) - explains why CPU-based GBDT was chosen over GPU (cost, operational simplicity)
  2. Cache reduces DB by 5-8×: 78-88% hit rate turns 1M QPS into 120-220K effective database load
  3. Cost crossover at 200K QPS: DynamoDB wins below 200K, self-hosted CRDB provides 2×+ savings above
  4. Cost scales sub-linearly: 5× QPS increase = 4.5× cost increase (fixed infrastructure amortizes)

Hardware Evolution Strategy: CPU-First Architecture

This section clarifies our long-term ML infrastructure evolution path and explains the CPU-only architecture decision.

Design Philosophy: Start Simple, Evolve Deliberately

We deliberately chose CPU-only infrastructure for ML inference despite GPU being the “standard” choice in ML serving. This decision trades some model complexity ceiling for significant operational and cost benefits.

Phase 1: Day 1 - CPU GBDT (Current)

Infrastructure:

Performance:

Model characteristics:

Advantages:

Limitations accepted:

Phase 2: 6-12 Months - Two-Stage Ranking with Distilled DNN (Planned)

Infrastructure addition:

Architecture:

  1. Stage 1 - GBDT Candidate Generation (5-10ms):

    • Existing CPU GBDT model
    • Reduce 10M ads → 200 top candidates
    • Unchanged from Phase 1
  2. Stage 2 - DNN Reranking (10-15ms):

    • Distilled neural network (60-100M parameters)
    • INT8 quantized, ONNX optimized
    • Scores only top-200 candidates (not all 10M)
    • Runs on same CPU infrastructure

Performance:

Requirements to unlock this phase:

Model characteristics (DNN reranker):

Proven CPU DNN latency (external validation):

Phase 3: 18-24 Months - Decision Point (GPU Migration or Continue CPU)

At this phase, we evaluate whether CPU architecture has reached its limits:

Option 3A: Continue CPU evolution (if model quality sufficient)

Stick with CPU if:

Next steps:

Option 3B: Add GPU pool (if hitting CPU ceiling)

Migrate to hybrid CPU+GPU if:

Migration path:

Trade-Off Analysis: What We Explicitly Accept

By choosing CPU-first architecture, we are deliberately accepting:

Advantages:

Trade-offs:

Why This Makes Sense for Our Use Case:

Our constraints favor CPU-first:

  1. Scale: 1M QPS scale where 30-40% cost reduction justifies operational effort
  2. Business: Ad platform ROI from 0.80→0.82 AUC is substantial (5-10% revenue)
  3. Timeline: 6-12 month deployment cadence allows careful evolution
  4. Team: Engineering-heavy team (vs research-heavy) values operational simplicity

When CPU-First Might NOT Make Sense:

Choose GPU from Day 1 if:

Summary: Deliberate Architecture Constraints

Our CPU-first architecture is not a compromise—it’s a deliberate choice optimizing for cost, operational simplicity, and team velocity at 1M QPS scale. We accept model complexity constraints (100M param ceiling in Phase 2) in exchange for 30-40% infrastructure cost savings and faster iteration.

The evolution path (Phase 1 GBDT → Phase 2 two-stage CPU DNN → Phase 3 decision point) allows us to extract 80-90% of ML value without GPU complexity. If we hit the CPU ceiling in 18-24 months, we have a clear migration path to GPU—but we’ll have achieved significant cost savings and learned what model quality truly requires.

See Part 2 ML Architecture for detailed technical justification and external research validation.


Distributed Cache: Valkey (Redis Fork)

Decision: Valkey over Redis 7.x / Memcached

From Part 3: Need atomic operations (DECRBY for budget pacing), sub-ms latency, 1M+ QPS capacity.

Why Valkey over Redis:

  1. Licensing: BSD-3 (permissive) vs Redis SSPL (restrictive)
  2. Performance: Valkey 8.1 achieves 999.8K RPS with 0.8ms P99 latency (research-validated)
  3. Community: Linux Foundation backing, active development
  4. Compatibility: Drop-in replacement for Redis 7.2

Why Valkey over Memcached:

Cluster Architecture

Configuration:

Why 20 nodes: From benchmarks: Valkey 8.1 achieves 1M RPS on a 16 vCPU instance. Our workload: 1M+ QPS across L2 cache + budget counters + rate limiting.

Cluster Configuration:

Memory Management: Valkey configured with 48GB heap allocation (out of 64GB total node memory), leaving 16GB for operating system page cache and kernel buffers. This ratio (75% application / 25% OS) optimizes for large working sets while preventing OOM conditions. Eviction policy uses allkeys-lru (least recently used) to automatically evict cold keys when memory pressure occurs, ensuring the cache remains operational under high load without manual intervention.

Durability Strategy: Append-Only File (AOF) persistence enabled with everysec fsync policy. This provides a middle ground between performance and durability:

Cluster Mode Configuration:

Network Binding: Configured to listen on all interfaces (0.0.0.0) with protected mode enabled, allowing inter-cluster communication while requiring authentication for external connections. Essential for Kubernetes pod-to-pod communication across availability zones.

Atomic Budget Operations (Lua Script):

From Part 3: Budget pacing uses atomic DECRBY to prevent overspend.

Atomic Check-and-Deduct Pattern: Budget validation requires a check-then-deduct operation that must execute atomically to prevent overspend. The pattern reads the current budget counter from Valkey, validates sufficient funds exist for the requested ad impression cost, and decrements the counter only if funds are available - all as a single atomic transaction.

Why Lua Scripting:

Script Execution Model: Pre-loaded into Valkey using SCRIPT LOAD, invoked by SHA-1 hash to avoid network overhead of sending script text on every request. Application code passes campaign key and deduction amount as parameters, receives binary success/failure response. This pattern achieves the ≤1% overspend guarantee from Part 3 by ensuring no concurrent modifications can occur between balance check and deduction.
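A minimal sketch of this pattern with the Jedis client (the script body and key naming are illustrative; the production script is described in Part 3 and not reproduced here):

    import redis.clients.jedis.JedisPooled;
    import java.util.List;

    public final class BudgetDeductor {
        // Atomic check-and-deduct: decrements only if the balance covers the requested amount.
        private static final String CHECK_AND_DEDUCT_LUA =
            "local balance = tonumber(redis.call('GET', KEYS[1]) or '0') " +
            "local amount = tonumber(ARGV[1]) " +
            "if balance >= amount then " +
            "  redis.call('DECRBY', KEYS[1], amount) " +
            "  return 1 " +
            "else " +
            "  return 0 " +
            "end";

        private final JedisPooled valkey;
        private final String scriptSha;

        public BudgetDeductor(JedisPooled valkey) {
            this.valkey = valkey;
            // Load once at startup; later calls send only the SHA-1, not the script text.
            this.scriptSha = valkey.scriptLoad(CHECK_AND_DEDUCT_LUA);
        }

        /** @return true if the campaign had budget and it was deducted atomically. */
        public boolean tryDeduct(String campaignId, long amountMicros) {
            Object result = valkey.evalsha(
                    scriptSha,
                    List.of("budget:" + campaignId),
                    List.of(Long.toString(amountMicros)));
            return Long.valueOf(1L).equals(result);
        }
    }

Because the script executes entirely inside Valkey, the balance check and the DECRBY are serialized with respect to every other command on that key, which is what eliminates the read-then-write race.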

Sharding Strategy:


Immutable Audit Log: Technology Stack

Compliance Requirement and Technology Decision

From Part 3’s audit log architecture: CockroachDB operational ledger is mutable (allows UPDATE/DELETE for operational efficiency), violating SOX and tax compliance requirements. Regulators require immutable, cryptographically verifiable financial records with 7-year retention for audit trail integrity.

Solution: Kafka + ClickHouse Event Sourcing Pattern

Platform selected Kafka + ClickHouse over AWS QLDB based on four factors. First, proven industry pattern validated at scale (Netflix KV DAL, Uber metadata platform operate similar architectures at 1M+ QPS). Second, query performance advantage: ClickHouse columnar OLAP delivers sub-500ms audit queries compared to QLDB PartiQL requiring 2-5 seconds for equivalent aggregations over billions of rows. Third, operational familiarity: platform already operates both technologies (Kafka for event streaming, ClickHouse for analytics dashboards), reusing existing expertise reduces learning curve. Fourth, AWS deprecation signal: AWS documentation (2024) recommends migrating QLDB workloads to Aurora PostgreSQL, indicating reduced investment in ledger-specific database.

QLDB rejected due to vendor lock-in (AWS-only, no multi-cloud option), query language barrier (PartiQL requires finance team retraining vs standard SQL), and OLAP performance lag for analytical compliance workloads (tax reporting aggregations, multi-year dispute investigations).

Implementation and Performance Characteristics

ClickHouse consumes financial events from Kafka via Kafka Engine table, transforms via Materialized View into columnar MergeTree storage. Configuration optimized for audit access patterns: monthly partitioning by timestamp enables efficient pruning for annual tax queries, ordering key (campaignId, timestamp) co-locates campaign history for fast sequential scans, ZSTD compression achieves 65% reduction (200GB/day raw → 70GB/day compressed). System delivers 100K events/sec ingestion throughput with <5 second end-to-end lag (event published → queryable), sub-500ms query latency for most audit scenarios (campaign spend history, dispute investigation). Full configuration details in Part 3.

Resource Trade-Offs and Operational Impact

Additional Infrastructure Required:

Compliance architecture adds dedicated resources beyond operational systems. ClickHouse cluster: 8 nodes with 3× replication factor across availability zones, consuming approximately 24 compute instances total. Storage footprint: 180TB for 7-year compliance retention (70GB/day × 365 days × 7 years), representing 15-20% additional storage compared to operational database infrastructure baseline (CockroachDB + Valkey). Kafka brokers: 12 nodes reused from existing event streaming infrastructure (impression/click events already flow through same cluster), marginal incremental capacity required.

Ingestion and Query Resource Usage:

ClickHouse ingestion consumes CPU cycles for JSON parsing, columnar transformation, compression, and replication. At 100K events/sec, ingestion workload averages 30-40% CPU utilization per node during peak hours, leaving headroom for query workload. Query resource consumption varies by complexity: simple aggregations (monthly campaign spend) consume <1 CPU-second, complex multi-year tax reports consume 5-10 CPU-seconds. Daily reconciliation job (compares operational vs audit ledgers) runs during off-peak hours (2AM UTC), consuming ~5 minutes CPU time across cluster.

Operational Overhead:

Compliance infrastructure introduces ongoing operational burden. Monitoring: Kafka consumer lag alerts (detect ingestion delays >1 minute), ClickHouse query latency dashboards (ensure audit queries remain sub-second), storage growth tracking (project retention capacity needs). Retention policy enforcement: monthly automated job drops partitions >7 years old, archives to S3 cold storage, validates hash chain integrity. Daily reconciliation: automated Airflow job compares ledgers, alerts on discrepancies >0.01 per campaign, typically finds 0-3 mismatches out of 10,000+ campaigns requiring investigation. Incident response: estimated 2-4 hours/month for discrepancy investigation, schema evolution coordination between operational and audit systems.

Benefit Justifies Resource Cost:

Compliance infrastructure prevents regulatory violations (SOX audit failures, IRS tax disputes), enables advertiser billing dispute resolution with cryptographically verifiable records (hash-chained events prove tampering), and satisfies payment processor requirements (Visa/Mastercard mandate immutable transaction logs). Resource investment (24 ClickHouse nodes, 180TB storage, operational monitoring) eliminates legal/financial risk exposure from non-compliant mutable ledgers.


Fraud Detection: Multi-Tier Pattern-Based System

Architecture Overview

From Part 4’s fraud detection analysis: 10-30% of ad traffic is fraudulent (bots, click farms, invalid traffic). The multi-tier detection architecture catches fraud progressively with increasing sophistication:

Three-Tier Detection Strategy:

L1: Integrity Check Service (Go) - Real-Time Filtering

Technology Choice: Go over Java/Python

Implementation Architecture:

Bloom Filter for Known Malicious IPs:

IP Reputation Cache (Redis-backed):

Device Fingerprinting (Basic):

Latency Budget: 5ms allocated in Part 1, executes in 0.5-2ms (measured p95), leaving 3-4.5ms buffer.

Key Trade-Off: Accept 0.1% false positive rate (blocking ~1,000 legitimate requests/second at 1M QPS) to prevent 200,000-300,000 fraudulent requests from consuming RTB bandwidth. The ROI is compelling: 5ms latency investment blocks 20-30% traffic, saving massive egress costs to 50+ DSPs.
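The production filter is implemented in Go as noted above; purely as an illustration of the pattern, here is an equivalent sketch with Guava's BloomFilter (the 10M-entry sizing is an assumption):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public final class IpBlocklist {
        // Sized for 10M known-bad IPs at the 0.1% false-positive rate discussed above.
        private final BloomFilter<String> badIps =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

        /** Called from the L3 batch feedback loop when an IP scores as high-confidence fraud. */
        public void block(String ip) {
            badIps.put(ip);
        }

        /** Fast-path check: a "true" may be a false positive (~0.1%); a "false" is definitive. */
        public boolean probablyMalicious(String ip) {
            return badIps.mightContain(ip);
        }
    }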

L2: Behavioral Analysis Service - Post-Auction Pattern Detection

Architecture: Asynchronous processing pipeline (NOT in request critical path)

Trigger: Ad Server publishes click/impression events to Kafka after serving response to user. Fraud Analysis Service consumes events in real-time with <1s lag.

Detection Patterns:

Click-Through Rate Anomalies:

Velocity Checks:

Geographic Impossibility:

Processing Architecture:

Latency: Fully asynchronous, 5-15ms average processing time doesn’t impact request latency
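The consumer code isn't shown in this series extract; a minimal Flink sketch of the velocity-check idea (source wiring, the 100-clicks-per-minute threshold, and the sink are illustrative):

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public final class ClickVelocityJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // In production this source is the Kafka click topic published by the Ad Server;
            // a stubbed source keeps the sketch self-contained.
            DataStream<Tuple2<String, Long>> clicksByIp = env
                    .fromElements(Tuple2.of("203.0.113.7", 1L), Tuple2.of("203.0.113.7", 1L))
                    .returns(Types.TUPLE(Types.STRING, Types.LONG));

            clicksByIp
                    .keyBy(t -> t.f0, Types.STRING)                              // key by source IP
                    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))   // 1-minute velocity window
                    .sum(1)                                                      // clicks per IP per window
                    .filter(t -> t.f1 > 100)                                     // illustrative threshold
                    .print();  // production sink: write flagged IPs to the Valkey reputation cache

            env.execute("click-velocity-check");
        }
    }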

L3: ML-Based Anomaly Detection - Batch Gradient Boosted Decision Trees

Model Architecture: GBDT (same as CTR prediction, different training data)

Feature Categories:

Behavioral Features (~20):

Temporal Features (~10):

Device Features (~10):

Scoring Pipeline:

Integration with L1: High-confidence fraud IPs (score >0.9) added to Bloom filter for future real-time blocking.

Multi-Tier Integration Pattern

Progressive Filtering Flow:

  1. L1 blocks 20-30% of obvious bots at 0.5-2ms latency (prevents RTB calls, massive bandwidth savings)
  2. Remaining 70-80% traffic proceeds through normal auction
  3. L2 analyzes 100% of served impressions asynchronously within 1s, catches additional 20-30% (cumulative 40-50%)
  4. L3 reviews 100% of previous day’s traffic in batch, identifies remaining 20-30% (cumulative 70-80% total fraud detection)

Feedback Loop: L3 discoveries feed back into L1 Bloom filter and L2 Redis reputation cache, continuously improving real-time blocking accuracy.

Operational Metrics:

This multi-tier approach balances latency (L1 ultra-fast), accuracy (L3 high-precision ML), and operational complexity (L2 provides middle ground for evolving threats).


Feature Store: Tecton Integration Architecture

Technology Decision: Tecton over Self-Hosted Feast

From Part 2’s ML Inference Pipeline: Feature store must serve real-time, batch, and streaming features with <10ms P99 latency.

Why Tecton (Managed) over Feast (Self-Hosted):

Three-Tier Feature Freshness Model

From Part 2: Features categorized by freshness requirements.

Tier 1: Batch Features (Daily Refresh):

Tier 2: Streaming Features (1-Hour Windows):

Tier 3: Real-Time Features (Sub-Second):

Architecture Flow:

1. Event Ingestion (Flink Source):

2. Stream Processing (Flink Transformations):

3. Feature Materialization (Tecton Rift Streaming Engine):

4. Feature Serving (Tecton Online Store):

Feature Versioning and Schema Evolution

Problem: ML model expects specific feature schema (e.g., 150 features). Adding/removing features breaks model inference.

Solution: Feature Versioning:

Schema change example: Adding last_30_day_CTR feature to feature set:

  1. Define new feature in Tecton (v2 feature set)
  2. Backfill historical values for existing users (batch Spark job)
  3. Update streaming pipeline to compute new feature going forward
  4. Train new model version with v2 feature set
  5. Deploy new model via canary (10% traffic), validate improvement
  6. Promote to 100%, deprecate v1 feature set after 30-day sunset period

Operational Considerations

Cost Trade-Off: Managed Tecton service costs vary based on feature volume and request rate. At 1M+ QPS scale with 100-500 features per request, typical costs are comparable to 1-2× senior engineer baseline salary (high-cost region). This eliminates:

Net economics favor managed solution at this scale, especially when factoring in opportunity cost of engineering focus.

Latency Budget Validation: Feature Store allocated 10ms in Part 1. Measured P50=3ms, P99=8ms, P99.9=12ms (occasional spikes). Within budget with 2ms buffer at P99.

Failure Mode: Feature Store Unavailable:

This architecture achieves the Part 2 requirement of serving diverse feature types (batch/stream/real-time) with <10ms P99 latency while minimizing operational complexity through managed service adoption.


Schema Evolution: Zero-Downtime Data Migration Strategy

The Challenge

From Part 4’s Schema Evolution requirements: All schema changes must preserve 99.9% availability (no planned downtime) while serving 1M+ QPS.

Scenario: After 18 months in production, product team requires adding user preference fields to profile table (4TB data, 60 CockroachDB nodes). Traditional approach (take system offline, run ALTER TABLE, restart) would violate availability SLO and consume precious error budget (43 minutes/month).

CockroachDB Online DDL Capabilities

Simple Schema Changes (Non-Blocking):

Why CockroachDB vs PostgreSQL for online DDL:

Dual-Write Pattern for Complex Migrations

When Online DDL Insufficient: Restructuring table partitioning (e.g., sharding user_profiles by region) or changing primary key requires dual-write approach.

Five-Phase Migration Strategy:

Phase 1: Deploy Dual-Read Code (Week 1)

Phase 2: Enable Dual-Write (Week 2)

Phase 3: Backfill Historical Data (Weeks 3-4)

Phase 4: Cutover Reads to New Table (Week 5)

Phase 5: Drop Old Table (Week 6-8)
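A minimal sketch of the dual-read/dual-write switch that Phases 1-4 toggle via feature flags (class, table, and flag names are illustrative):

    /**
     * Dual-write phase: every write lands in both tables, reads stay on the old table until
     * backfill and reconciliation pass. The old table remains the source of truth throughout.
     */
    public final class UserProfileRepository {
        private final ProfileTable oldTable;      // user_profiles (current source of truth)
        private final ProfileTable newTable;      // user_profiles_v2 (restructured layout)
        private final boolean dualWriteEnabled;   // flipped per region in Phase 2
        private final boolean readFromNewTable;   // flipped in Phase 4 after validation

        public UserProfileRepository(ProfileTable oldTable, ProfileTable newTable,
                                     boolean dualWriteEnabled, boolean readFromNewTable) {
            this.oldTable = oldTable;
            this.newTable = newTable;
            this.dualWriteEnabled = dualWriteEnabled;
            this.readFromNewTable = readFromNewTable;
        }

        public void save(UserProfile profile) {
            oldTable.upsert(profile);
            if (dualWriteEnabled) {
                try {
                    newTable.upsert(profile);  // discrepancies are caught by backfill/reconciliation
                } catch (RuntimeException e) {
                    // Log and continue: a failed shadow write must never fail the user-facing request.
                }
            }
        }

        public UserProfile load(String userId) {
            return readFromNewTable ? newTable.get(userId) : oldTable.get(userId);
        }

        /** Minimal abstraction over the two CockroachDB tables. */
        public interface ProfileTable {
            void upsert(UserProfile profile);
            UserProfile get(String userId);
        }

        public record UserProfile(String userId, String preferencesJson) {}
    }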

Shadow Traffic Validation for Financial Systems

Why Shadow Traffic Critical: Budget operations and billing ledger changes require higher confidence than typical schema migrations. Billing errors destroy advertiser trust.

Implementation:

Gradual Rollout for Financial Operations:

Trade-Off: 5-6 month timeline (vs 1-week aggressive migration) dramatically reduces risk of catastrophic billing errors that could cost millions in advertiser disputes and platform reputation damage.

Operational Safeguards

Pre-Migration Checklist:

Post-Migration Cleanup:

This approach achieves Part 4’s zero-downtime requirement while preserving 43 minutes/month error budget for unplanned failures, not planned schema changes.


Final System Architecture

Architecture presented using C4 model approach: System Context → Container views. Each diagram focuses on specific architectural concern for clarity.

Level 1: System Context Diagram

Shows the ads platform and its external dependencies at highest abstraction level.

    
    graph TB
    CLIENT[Mobile/Web Clients<br/>1M+ users]
    ADVERTISERS[Advertisers<br/>Campaign creators<br/>Budget managers]
    PLATFORM[Real-Time Ads Platform<br/>1M QPS, 150ms P99 SLO]
    DSP[DSP Partners<br/>50+ external bidders<br/>OpenRTB 2.5/3.0]
    STORAGE[Cloud Storage<br/>S3 Data Lake<br/>7-year retention]

    CLIENT -->|Ad requests| PLATFORM
    PLATFORM -->|Ad responses| CLIENT
    ADVERTISERS -->|Create campaigns<br/>Fund budgets| PLATFORM
    PLATFORM -->|Reports, analytics| ADVERTISERS
    PLATFORM <-->|Bid requests/responses<br/>100ms timeout| DSP
    PLATFORM -->|Events, audit logs| STORAGE

    style PLATFORM fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style CLIENT fill:#fff3e0,stroke:#f57c00
    style ADVERTISERS fill:#e1bee7,stroke:#8e24aa
    style DSP fill:#f3e5f5,stroke:#7b1fa2
    style STORAGE fill:#e8f5e9,stroke:#388e3c

Key External Dependencies:

Level 2a: Core Request Flow (Container Diagram)

Real-time ad serving path from client request to response. Shows critical path components achieving 150ms P99 SLO.

    
    graph LR
    CLIENT[Client]

    subgraph EDGE["Edge Layer (15ms)"]
        CDN[CloudFront CDN<br/>5ms]
        LB[Route53 GeoDNS<br/>Multi-region<br/>5ms]
        GW[Envoy Gateway<br/>Auth + Rate Limit<br/>5ms]
    end

    subgraph SERVICES["Core Services (115ms)"]
        AS[Ad Server<br/>Orchestrator<br/>Java 21 + ZGC]
        subgraph PARALLEL["Parallel Execution"]
            direction TB
            ML_PATH[ML Path 65ms:<br/>Profile → Features → Inference]
            RTB_PATH[RTB Path 100ms:<br/>DSP Fanout → Bids]
        end
        AUCTION[Unified Auction<br/>Budget Check<br/>Winner Selection<br/>11ms]
    end

    subgraph DATA["Data Layer"]
        CACHE[(Valkey Cache<br/>L2: 2ms)]
        DB[(CockroachDB<br/>L3: 10-15ms)]
        FEATURES[(Tecton<br/>Features: 10ms)]
    end

    CLIENT -->|Request| CDN
    CDN --> LB
    LB --> GW
    GW --> AS
    AS --> ML_PATH
    AS --> RTB_PATH
    ML_PATH --> AUCTION
    RTB_PATH --> AUCTION
    ML_PATH -.-> CACHE
    ML_PATH -.-> DB
    ML_PATH -.-> FEATURES
    RTB_PATH <-.->|Bid requests/<br/>responses| DSP[50+ DSPs]
    AUCTION -.-> CACHE
    AUCTION -.-> DB
    AUCTION --> GW
    GW --> LB
    LB --> CDN
    CDN -->|Response| CLIENT

    style AS fill:#9f9,stroke:#2e7d32,stroke-width:2px
    style PARALLEL fill:#fff3e0,stroke:#f57c00
    style AUCTION fill:#ffccbc,stroke:#d84315

Critical Path: Client → Edge (15ms) → Profile+Features (20ms) → Parallel[ML 65ms | RTB 100ms] → Auction+Budget (11ms) = 146ms P99

Detailed flow: See Part 1’s latency budget for component-by-component breakdown.

Level 2b: Data & Compliance Layer (Container Diagram)

Dual-ledger architecture separating operational (mutable) from compliance (immutable) data stores.

    
    graph TB
    subgraph OPERATIONAL["Operational Systems"]
        BUDGET[Budget Service<br/>3ms atomic ops]
        BILLING[Billing Service<br/>Charges/Refunds]
    end

    subgraph CACHE["Cache & Database"]
        L2[L2: Valkey<br/>Distributed cache<br/>2ms, atomic ops]
        L3[L3: CockroachDB<br/>Operational ledger<br/>10-15ms, mutable]
    end

    subgraph COMPLIANCE["Compliance & Audit"]
        KAFKA[Kafka<br/>Financial Events<br/>30-day buffer]
        CH[(ClickHouse<br/>Immutable Audit Log<br/>7-year retention<br/>180TB)]
        RECON[Daily Reconciliation<br/>Airflow 2AM UTC<br/>Compare ledgers]
    end

    BUDGET --> L2
    BUDGET --> L3
    BUDGET -->|Async publish| KAFKA
    BILLING --> L3
    BILLING -->|Async publish| KAFKA
    KAFKA -->|Real-time<br/>5s lag| CH
    RECON -.->|Query operational| L3
    RECON -.->|Query audit| CH

    style L3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style CH fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style RECON fill:#ffebee,stroke:#c62828
    style KAFKA fill:#f3e5f5,stroke:#7b1fa2
    style L2 fill:#e1f5fe,stroke:#0277bd

Separation of Concerns: Operational ledger optimized for performance (mutable, 90-day retention), audit log for compliance (immutable, 7-year retention, SOX/tax). Daily reconciliation ensures data integrity. Details in Part 3’s audit log architecture.

Level 2c: ML & Feature Pipeline (Container Diagram)

Offline training and online serving infrastructure for machine learning.

    
    graph TB
    subgraph EVENTS["Event Collection"]
        REQUESTS[Ad Requests<br/>Impressions<br/>Clicks<br/>1M events/sec]
        KAFKA_EVENTS[Kafka Topics<br/>Event Streams]
    end

    subgraph PROCESSING["Feature Processing"]
        FLINK[Flink<br/>Stream Processing<br/>Windowed aggregations]
        SPARK[Spark<br/>Batch Processing<br/>Historical features]
        S3[(S3 Data Lake<br/>Raw events<br/>Feature snapshots)]
    end

    subgraph FEATURE_PLATFORM["Feature Platform (Tecton)"]
        OFFLINE[Offline Store<br/>Training features<br/>S3 Parquet]
        ONLINE[Online Store<br/>Serving features<br/>Redis, sub-10ms]
    end

    subgraph TRAINING["ML Training Pipeline"]
        AIRFLOW[Airflow<br/>Orchestration<br/>Daily/weekly jobs]
        TRAIN[Training Cluster<br/>GBDT<br/>LightGBM/XGBoost]
        REGISTRY[Model Registry<br/>Versioning<br/>A/B testing]
    end

    subgraph SERVING["ML Serving"]
        ML_SERVICE[ML Inference Service<br/>40ms P99<br/>CTR prediction]
    end

    REQUESTS --> KAFKA_EVENTS
    KAFKA_EVENTS --> FLINK
    KAFKA_EVENTS --> SPARK
    FLINK --> ONLINE
    SPARK --> S3
    SPARK --> OFFLINE
    AIRFLOW --> TRAIN
    TRAIN -->|Features| OFFLINE
    TRAIN --> REGISTRY
    REGISTRY -->|Deploy models| ML_SERVICE
    ML_SERVICE -->|Query features| ONLINE

    style ONLINE fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style ML_SERVICE fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style TRAIN fill:#f3e5f5,stroke:#7b1fa2

Two-Track System: Offline pipeline trains models on historical data (Spark → S3 → Training cluster), online pipeline serves predictions with real-time features (Flink → Tecton → ML Inference). Model lifecycle: Train → Registry → Canary → Production. Details in Part 2’s ML pipeline.

Level 2d: Observability Stack (Container Diagram)

Monitoring, tracing, and alerting infrastructure for operational visibility.

    
    graph TB
    subgraph SERVICES["All Services"]
        APP[Application Services<br/>Ad Server, Budget, RTB<br/>Emit metrics + traces]
    end

    subgraph COLLECTION["Collection Layer"]
        PROM[Prometheus<br/>Metrics scraping<br/>15s interval]
        OTEL[OpenTelemetry Collector<br/>Trace aggregation]
        FLUENTD[Fluentd<br/>Log aggregation]
    end

    subgraph STORAGE["Storage Layer"]
        THANOS[Thanos<br/>Long-term metrics<br/>Multi-region]
        TEMPO[Tempo<br/>Distributed traces<br/>S3-backed]
        LOKI[Loki<br/>Log storage<br/>Label-based indexing]
    end

    subgraph VISUALIZATION["Visualization & Alerting"]
        GRAFANA[Grafana Dashboards<br/>SLO tracking<br/>P99 latency<br/>Error rates]
        ALERTMANAGER[AlertManager<br/>Alert routing<br/>P1/P2 severity]
    end

    PAGERDUTY[PagerDuty<br/>On-call notifications<br/>Incident management]

    APP -->|Metrics<br/>http://localhost:9090/metrics| PROM
    APP -->|Traces<br/>OTLP gRPC| OTEL
    APP -->|Logs<br/>stdout JSON| FLUENTD
    PROM --> THANOS
    OTEL --> TEMPO
    FLUENTD --> LOKI
    THANOS --> GRAFANA
    TEMPO --> GRAFANA
    LOKI --> GRAFANA
    GRAFANA --> ALERTMANAGER
    ALERTMANAGER -->|P1/P2 alerts| PAGERDUTY

    style GRAFANA fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style APP fill:#9f9,stroke:#2e7d32
    style ALERTMANAGER fill:#ffebee,stroke:#c62828
    style PAGERDUTY fill:#fff9c4,stroke:#f57f17,stroke-width:2px

Observability Pillars: Metrics (Prometheus → Thanos), Traces (OpenTelemetry → Tempo), Logs (Fluentd → Loki). Unified visualization in Grafana with SLO tracking and automated alerting via AlertManager → PagerDuty for P99 latency violations, error rate spikes, budget reconciliation failures.

Technology Selection by Component

Edge & Gateway Layer:

Core Application Services (all communicate via gRPC over HTTP/2):

Data Layer:

Feature Platform (Tecton Managed):

Data Processing Pipeline:

ML Training Pipeline (Offline):

Observability:

Infrastructure:

External Integration:

Latency Budget Breakdown (Final)

| Component | Technology | Latency | Notes |
|---|---|---|---|
| Edge | CloudFront | 5ms | Global PoP routing |
| Gateway | Envoy Gateway | 4ms | Auth (2ms) + Rate limiting (0.5ms) + Routing (1.5ms) |
| User Profile | Java 21 + L1/L2/L3 cache | 10ms | L1 Caffeine (0.5ms, 60% hit) → L2 Redis (2ms, 25% hit) → L3 CockroachDB (10-15ms, 15% miss) |
| Integrity Check | Go lightweight filter | 5ms | Fraud Bloom filter, stateless |
| Feature Store | Tecton online store | 10ms | Real-time feature lookup, Redis-backed |
| Ad Selection | Java 21 + CockroachDB | 15ms | Internal ad candidates query |
| ML Inference | GBDT (LightGBM/XGBoost) | 40ms | CTR prediction on candidates, eCPM calculation |
| RTB Auction | Java 21 + HTTP/2 fanout | 100ms | Critical path - DSP selection (1ms) + 20-30 selected DSPs parallel (99ms), runs parallel to ML path (65ms). See Part 2 for DSP tier filtering and egress cost optimization |
| Budget Check | Java 21 + Valkey | 3ms | Redis DECRBY atomic op |
| Auction Logic | Java 21 + ZGC | 8ms | eCPM comparison, winner selection |
| Serialization | gRPC protobuf | 5ms | Response formatting |
| Total | - | 143ms avg | 145ms P99, 5ms buffer to 150ms SLO |

Critical path: Network (5ms) → Gateway (10ms) → User Profile (10ms) → Integrity (5ms) → RTB (100ms, parallel with ML 65ms) → Auction + Budget (11ms) → Response (5ms) = 146ms P99

P99 Protection:


Architecture Decision Summary

Complete table of all major technology decisions and rationale:

| Decision Category | Choice | Alternatives Considered | Rationale |
|---|---|---|---|
| Runtime (All Services) | Java 21 + ZGC + Virtual Threads | Go, Rust, Java + G1GC | Virtual threads enable 10K+ concurrent I/O operations with simple blocking code (vs callback complexity). ZGC provides <2ms GC pauses at 32GB heap. Single runtime across all services reduces operational complexity (unified monitoring, debugging, deployment). Netflix validation: 95% error reduction with ZGC. |
| Internal RPC | gRPC over HTTP/2 | REST/JSON, Thrift | 3-10× smaller payloads, <1ms serialization, type safety |
| External API | REST/JSON | gRPC | OpenRTB standard compliance, DSP compatibility |
| Service Mesh | Linkerd | Istio, Consul Connect | 5-10ms overhead (vs 15-25ms Istio), gRPC-native |
| Transactional DB | CockroachDB 23.x | PostgreSQL, MySQL, Spanner | Multi-region native, HLC for audit trails, 2-3× cheaper than DynamoDB at 1M+ QPS |
| Distributed Cache | Valkey 7.x | Redis, Memcached | Atomic ops (DECRBY), sub-ms latency, permissive license (vs Redis SSPL) |
| In-Process Cache | Caffeine | Guava, Ehcache | 8-12× faster than Redis L2, excellent eviction policies |
| ML Model | GBDT (LightGBM/XGBoost) | Deep Neural Nets, Factorization Machines | 20ms inference, operational benefits (incremental learning, interpretability), 0.78-0.82 AUC |
| Feature Store | Tecton (managed) | Feast (self-hosted), custom Redis | Real-time (Rift) + batch (Spark), <10ms P99, 5-8× cheaper than custom solution |
| Feature Processing | Flink + Kafka + Tecton | Custom pipelines | Flink for stream prep, Tecton Rift for feature computation, separation of concerns |
| Container Orchestration | Kubernetes 1.28 or later | Raw EC2, ECS | Declarative config, auto-scaling, 60% better resource efficiency |
| Container Runtime | containerd | Docker | Lightweight, OCI-compliant, Kubernetes-native |
| Cloud Provider | AWS multi-region | GCP, Azure | Broadest service coverage, mature networking (VPC peering) |
| Regions | us-east-1, us-west-2, eu-west-1 | Single region | <50ms inter-region, geographic distribution |
| CDN | CloudFront | Cloudflare, Fastly | AWS-native integration, Lambda@Edge for geo-filtering |
| Metrics | Prometheus + Thanos | Datadog, New Relic | Kubernetes-native, multi-region aggregation, cost-effective |
| Tracing | OpenTelemetry + Tempo | Jaeger, Zipkin | Vendor-neutral, low overhead, latency analysis |
| Logging | Fluentd + Loki | Elasticsearch | Label-based querying, cost-effective storage |

System Integration: How It All Works Together

Single ad request flow demonstrating how technology components achieve 150ms P99 latency, revenue optimization, and compliance.

Critical Path: Request to Response (146ms P99)

Edge Layer (15ms): CloudFront CDN geo-routes and serves static assets (5ms). Route53 GeoDNS directs to nearest region. Envoy Gateway performs JWT validation via ext_authz filter with 60s cache (1-2ms), enforces rate limits via Valkey token bucket (0.5ms), routes request (1-1.5ms) = 4ms total. Linkerd Service Mesh adds mTLS encryption and observability (1ms), delivers to Ad Server (Java 21 + ZGC).

User Context (15ms parallel): Ad Server fires parallel gRPC calls. User Profile Service queries L1 Caffeine (0.5ms, 60% hit) → L2 Valkey (2ms, 25% hit) → L3 CockroachDB (10-15ms, 15% miss). Integrity Check Service validates via Valkey Bloom filter (1ms). Both complete within 15ms budget.

Parallel Revenue Paths (100ms critical): Platform runs two paths simultaneously for revenue maximization.

Critical path is RTB’s 100ms (parallel, not additive).

Unified Auction (11ms): Auction Service runs first-price auction comparing ML-scored internal ads vs RTB bids, selects highest eCPM (3ms). Budget Service executes atomic Valkey Lua script: if balance >= amount then balance -= amount (3ms avg, 5ms P99), prevents double-spend without locks. Failed budget check triggers fallback to next bidder. Successful deductions append asynchronously to CockroachDB operational ledger, publish to Kafka for ClickHouse audit log.
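A greatly simplified sketch of that selection loop (pricing, RTB billing, and degradation handling omitted; names are illustrative):

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;
    import java.util.stream.Stream;

    /** Sketch of the unified first-price selection: highest eCPM wins, budget-checked in order. */
    public final class UnifiedAuction {
        public record Candidate(String campaignId, double ecpm, boolean isRtbBid) {}

        public interface BudgetChecker {
            /** Atomic Valkey check-and-deduct; returns false if the campaign is out of budget. */
            boolean tryDeduct(String campaignId, double price);
        }

        private final BudgetChecker budget;

        public UnifiedAuction(BudgetChecker budget) {
            this.budget = budget;
        }

        /** Internal ML-scored ads and external RTB bids compete in one ranked list. */
        public Optional<Candidate> select(List<Candidate> internalAds, List<Candidate> rtbBids) {
            return Stream.concat(internalAds.stream(), rtbBids.stream())
                    .sorted(Comparator.comparingDouble(Candidate::ecpm).reversed())
                    // On a failed budget check, the stream simply falls through to the next bidder.
                    .filter(c -> budget.tryDeduct(c.campaignId(), c.ecpm()))
                    .findFirst();
        }
    }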

Response (5ms): Ad Server serializes winning ad via gRPC protobuf, returns through Linkerd → Envoy → Route53 → CloudFront. Total: 146ms P99 (4ms buffer under 150ms SLO).

Background Processing: Asynchronous Feedback Loop

Event Collection: Ad Server publishes impression/click events to Kafka post-response (ad ID, features, prediction, outcome). Zero impact on request latency.

Real-Time Aggregation: Flink consumes Kafka events, computes windowed aggregations (fraud detection, feature updates). Tecton Rift materializes streaming features (“clicks in last hour”) to Online Store within seconds.

Model Training: Daily Spark jobs export events to S3 Parquet (billions of examples). Airflow orchestrates GBDT retraining, new models versioned in Model Registry, undergo A/B testing, canary rollout to production. Continuous improvement without latency impact.

Key Data Flow Patterns

Cache Hierarchy: Three-tier achieves 78-88% hit rate (conservative range accounting for LRU vs LFU, workload variation). L1 Caffeine (0.5ms, 60% hot profiles) → L2 Valkey (2ms, 25% warm profiles) → L3 CockroachDB (10-15ms, 15% cold misses). Weighted average: 60%×0.5ms + 25%×2ms + 15%×12ms ≈ 2.6ms effective latency, roughly 5× faster than going to L3 for every read. Consistency via invalidation: L1 expires on writes, L2 uses 60s TTL, L3 is the source of truth.

Atomic Budget: Pre-allocation divides daily budget into 1-minute windows ($1440/day = $1/min), smooths spend. Valkey Lua script server-side atomic check-and-deduct eliminates race conditions, 3ms latency under contention. Audit trail: async append to CockroachDB (HLC timestamps) → Kafka → ClickHouse. Hourly reconciliation compares Valkey vs CockroachDB, alerts on discrepancies >$1.

Feature Pipeline: Two-track system for latency/accuracy trade-off. Real-time: Flink processes Kafka events (1-hour click rate, 5-min conversion rate) → Tecton Rift materializes to Online Store (seconds lag), enables reactive features. Batch: Spark daily jobs compute historical features (7-day CTR, 30-day AOV) → Offline Store (training) + Online Store (serving). Tecton Online Store unifies both tracks, single API <10ms P99.


Deployment Architecture (Final)

Multi-Region Active-Active

3 AWS Regions:

Traffic Routing:

Data Replication:

Per-Region Deployment:

Region: us-east-1 (400K QPS capacity)

Kubernetes Cluster: 75 nodes

Data Layer:

Observability: 10 nodes (Prometheus, Grafana)

Scaling Strategy

Horizontal Scaling:

Vertical Scaling (Database):

Cost Optimization:

Hedge Request Cost Impact:

Following Part 1's Defense Strategy 3, hedge requests are configured for the User Profile Service to protect against network jitter.

Additional infrastructure cost:

Total deployment cost impact:

Why this is cost-effective:

Implementation requirements:

gRPC native hedging configuration (from Part 1):
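The exact Part 1 settings aren't reproduced here, but a channel using gRPC's built-in hedging looks roughly like the following sketch; the service name, hedging delay, and attempt count are placeholders.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import java.util.List;
import java.util.Map;

// Illustrative channel setup with gRPC's native hedging for the User Profile Service.
public final class UserProfileChannelFactory {

    public static ManagedChannel create(String target) {
        // Hedging policy: up to 3 copies of the call, the second sent after 10ms,
        // first successful response wins. UNAVAILABLE is treated as non-fatal.
        Map<String, Object> hedgingPolicy = Map.of(
                "maxAttempts", 3.0,
                "hedgingDelay", "0.010s",
                "nonFatalStatusCodes", List.of("UNAVAILABLE"));

        Map<String, Object> methodConfig = Map.of(
                "name", List.of(Map.of("service", "ads.userprofile.UserProfileService")),
                "hedgingPolicy", hedgingPolicy);

        Map<String, Object> serviceConfig = Map.of("methodConfig", List.of(methodConfig));

        return ManagedChannelBuilder.forTarget(target)
                .defaultServiceConfig(serviceConfig)
                .enableRetry()       // also enables hedging support
                .usePlaintext()      // mTLS is handled by the Linkerd sidecar
                .build();
    }

    private UserProfileChannelFactory() {}
}
```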

Service mesh integration (Linkerd/Istio):

Monitoring metrics required:
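One plausible set of metrics, sketched with Micrometer (the default metrics facade in Spring Boot, scraped by Prometheus via the actuator endpoint): hedge attempts, hedge wins, and per-attempt latency with a P99 percentile. The metric names are illustrative.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

// Illustrative hedging metrics for the User Profile call path.
public class HedgingMetrics {

    private final Counter hedgedAttempts;
    private final Counter hedgeWins;
    private final Timer attemptLatency;

    public HedgingMetrics(MeterRegistry registry) {
        // Total extra (hedged) attempts sent to the User Profile Service.
        this.hedgedAttempts = Counter.builder("user_profile.hedge.attempts").register(registry);
        // Attempts where the hedged copy, not the original, returned first.
        this.hedgeWins = Counter.builder("user_profile.hedge.wins").register(registry);
        // Per-attempt latency with a P99 percentile for SLO tracking.
        this.attemptLatency = Timer.builder("user_profile.attempt.latency")
                .publishPercentiles(0.99)
                .register(registry);
    }

    public void recordAttempt(long millis, boolean hedged, boolean hedgeWon) {
        attemptLatency.record(Duration.ofMillis(millis));
        if (hedged) hedgedAttempts.increment();
        if (hedgeWon) hedgeWins.increment();
    }
}
```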

Circuit breaker configuration:
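A sketch using Resilience4j as the circuit breaker library (an assumption; any equivalent works): the breaker opens when the profile path degrades, so requests fall back to contextual targeting instead of stacking hedged calls onto an already-slow dependency. The threshold values are placeholders.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.function.Supplier;

// Illustrative breaker around the User Profile call with a contextual-targeting fallback.
public class UserProfileCircuitBreaker {

    private final CircuitBreaker breaker;

    public UserProfileCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                          // open at 50% failures
                .slowCallDurationThreshold(Duration.ofMillis(15))  // profile budget as the slow-call bar
                .slowCallRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(10))   // probe again after 10s
                .permittedNumberOfCallsInHalfOpenState(20)
                .build();
        this.breaker = CircuitBreakerRegistry.of(config).circuitBreaker("user-profile");
    }

    public String fetchProfileOrFallback(Supplier<String> profileCall, Supplier<String> contextualFallback) {
        try {
            return breaker.executeSupplier(profileCall);
        } catch (Exception e) {
            // Open breaker or downstream failure: degrade gracefully to contextual targeting.
            return contextualFallback.get();
        }
    }
}
```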

Cache coherence trade-off:

Server-side requirements:


Validating Against Part 1 Requirements

Let’s verify the final architecture meets the requirements established in Part 1.

Requirement 1: Latency (150ms P99 SLO)

Target from Part 1: ≤150ms P99 latency, mobile timeout at 200ms

Achieved:

| Component | Budget (Part 1) | Achieved (Part 5) | Status |
|---|---|---|---|
| Edge (CDN + LB) | 10ms | 5ms | Under budget |
| Gateway (Auth + Rate Limit) | 5ms | 4ms | Under budget |
| User Profile (L1/L2/L3) | 10ms | 10ms | On budget |
| Integrity Check | 5ms | 5ms | On budget |
| Feature Store | 10ms | 10ms | On budget |
| Ad Selection | 15ms | 15ms | On budget |
| ML Inference | 40ms | 40ms | On budget |
| RTB Auction | 100ms | 100ms | On budget |
| Auction Logic + Budget | 10ms | 8ms | Under budget |
| Response Serialization | 5ms | 5ms | On budget |
| Total (end-to-end) | 150ms | 143ms avg, 146ms P99 | Met |

Component budgets overlap because the ML and RTB paths run in parallel, so the end-to-end total is the measured request latency, not the sum of the rows.

Key enablers:

Requirement 2: Scale (1M+ QPS)

Target from Part 1: Handle 1 million queries per second across all regions

Achieved:

Validation:

Requirement 3: Financial Accuracy (≤1% Budget Variance)

Target from Part 1: Achieve ≤1% billing accuracy for all advertiser spend

Achieved:

Key enablers:

Requirement 4: Availability (99.9% Uptime)

Target from Part 1: Maintain 99.9%+ availability (43 minutes downtime/month)

Achieved:

Validation:

Requirement 5: Revenue Maximization

Target from Part 1: Dual-source architecture (internal ML + external RTB) for maximum fill rate and eCPM

Achieved:

Measured results:

All Part 1 requirements met or exceeded.


Conclusion: From Architecture to Implementation

The Complete Stack

This series took you from abstract requirements to a concrete, production-ready system:

Part 1 asked “What makes a real-time ads platform hard?” and answered with latency budgets, P99 tail defense, and graceful degradation patterns.

Part 2 solved “How do we maximize revenue?” with the dual-source architecture - parallelizing ML (65ms) and RTB (100ms) for 30-48% revenue lift.

Part 3 answered “How do we serve 1M+ QPS with sub-10ms reads?” with L1/L2/L3 cache hierarchy achieving 78-88% hit rates and distributed budget pacing with ≤1% variance.

Part 4 addressed “How do we run this in production?” with fraud detection, multi-region active-active, zero-downtime deployments, and chaos engineering.

Part 5 (this post) delivered “What specific technologies should we use?” with:

Implementation Timeline

Realistic timeline: 15-18 months from kickoff to full production.

Why 15-18 Months

Three non-technical gates dominate the critical path:

Critical path: DSP legal + SOC 2 + gradual ramp = 15-18 months. Technical implementation (infrastructure, ML pipeline, RTB integration) completes in 9-12 months but is gated by external dependencies. Engineering velocity doesn’t accelerate legal negotiations or financial system validation.

Key Learnings

1. Latency dominates at scale. Every millisecond counts at 1M+ QPS. Choosing ZGC saved 40-50ms. Choosing gRPC saved 2-5ms per call. These add up to the difference between meeting SLOs and violating them.

2. Operational complexity is a tax. Running two different proxy technologies (e.g., Kong + Istio) doubles the operational burden. Standardizing on a single gateway (Envoy Gateway) and a single lightweight mesh (Linkerd, with its purpose-built micro-proxy) reduces cognitive load.

3. Cost efficiency at scale differs from small scale. DynamoDB is cost-effective at low QPS but becomes expensive at 1M+ QPS. CockroachDB's upfront complexity pays off with 2-3× savings (post-Nov 2024 pricing).

4. Graceful degradation prevents catastrophic failure. The RTB 120ms hard timeout (from Part 1's P99 defense) means roughly 1% of traffic loses 40-60% of its revenue, but it prevents 100% loss from timeouts. Better to serve a guaranteed ad than wait for a perfect bid that never arrives.

5. Production validation matters more than benchmarks. Netflix validated ZGC at scale. LinkedIn adopted Valkey. These real-world validations gave confidence in the technology choices.

Final Thoughts

Building a 1M+ QPS ads platform is a systems engineering challenge - no single technology is a silver bullet. Success comes from:

You now have a complete blueprint - from requirements to deployed system. The architecture is production-ready, battle-tested by similar platforms (Netflix, LinkedIn, Uber validations), and cost-optimized (60% compute efficiency, 2-3× database savings at scale).

What Made This Worth Building

Part 1 framed this as a cognitive workout - training engineering thinking through complex constraints. After five posts, that framing holds. The constraints forced specific disciplines: latency budgeting trained decomposition (150ms split across 15-20 components), financial accuracy forced consistency modeling (strong vs eventual), and massive coordination demanded failure handling (graceful degradation when DSPs timeout). These skills - decomposing budgets, modeling consistency, designing for failure - don’t get commoditized by better AI tools.

For Builders

If you’re building a real-time ads platform: start with latency budgets (decompose 150ms P99 before writing code), model consistency requirements (budgets need strong consistency, profiles tolerate eventual), design for failure from day one (circuit breakers are core architecture, not hardening), and plan for non-technical gates (DSP legal, SOC 2, gradual ramp dominate your critical path - 15-18 months total).

This series gives you the blueprint. Now go build something real.

