Architecting Real-Time Ads Platforms: A Distributed Systems Engineer's Design Exercise

Author: Yuriy Polyulya

Introduction

Full disclosure: I haven’t built a production ads platform. As a distributed systems engineer, this problem represents a compelling intersection of my interests: real-time processing, distributed coordination, strict latency requirements, and financial accuracy at scale.

What draws me to this challenge is the complexity beneath a deceptively simple facade. A user opens an app, sees a relevant ad in under 100ms, and the advertiser gets billed accurately. Straightforward? Not when you’re coordinating auctions across dozens of bidding partners, running ML predictions in real-time, maintaining budget consistency across regions, and handling over a million queries per second.

Here’s the scale I’m designing for:

I’ll walk through my thought process on:

  1. Requirements modeling and performance constraints
  2. High-level architecture and latency budgets
  3. Real-Time Bidding (RTB) integration via OpenRTB
  4. ML inference pipelines constrained to <50ms
  5. Distributed caching strategies achieving 95%+ hit rates
  6. Auction mechanisms and game theory
  7. Advanced topics: budget pacing, fraud detection, multi-region failover
  8. What breaks and why - failure modes matter

Cost Pricing Caveat: Throughout this post, I’ll discuss technology cost comparisons. Please note that pricing varies significantly by cloud provider, region, contract negotiations, and changes frequently. The figures I mention are rough approximations to illustrate relative differences for decision-making, not exact pricing. Always verify current pricing for your specific use case.


Part 1: Requirements and Scale Analysis

Functional Requirements

Multi-Format Ad Delivery: Support story, video, and carousel ads across iOS, Android, and web. Creative assets served from CDN ensuring sub-100ms first-byte time.

Real-Time Bidding Integration: Implement OpenRTB 2.5+ to communicate with 50+ DSPs simultaneously. I budget a 30ms auction timeout (signalled to DSPs via the OpenRTB tmax field) - challenging when network RTTs to some DSPs approach 150ms. Handle both programmatic and guaranteed inventory with distinct SLAs.

ML-Powered Targeting:

Campaign Management: Real-time performance metrics, A/B testing framework, frequency capping, quality scoring, and policy compliance.

Non-Functional Requirements

Latency Distribution: $$P(\text{Latency} \leq 100\text{ms}) \geq 0.95$$

Total latency is the sum across the request path: $$T_{total} = \sum_{i=1}^{n} T_i$$

With 5 services each taking 20ms, you’re already at 100ms with zero margin. This is why ruthless latency budgeting is critical.

Throughput: $$Q_{peak} \geq 1.5 \times 10^6 \text{ QPS}$$

Little’s Law gives us server requirements. With service time $S$ and target utilization $U$, the required number of servers is: $$N = \frac{Q_{peak} \times S}{U}$$

I use $U = 0.7$ (70% utilization) for headroom. Running at 90%+ utilization invites disaster.
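To make the arithmetic concrete, here is a minimal sizing sketch using the illustrative numbers from this section (1.5M QPS peak, 20ms service time, 70% target utilization). Treat the output as concurrent worker slots rather than physical machines; the inputs are assumptions, not measured values.

```python
import math

def workers_needed(peak_qps: float, service_time_s: float, target_util: float) -> int:
    """N = (Q_peak * S) / U, rounded up (Little's Law style sizing)."""
    return math.ceil(peak_qps * service_time_s / target_util)

if __name__ == "__main__":
    # 1.5M QPS peak, 20ms average service time, 70% target utilization
    print(workers_needed(1.5e6, 0.020, 0.70))  # ~42,858 concurrent worker slots
```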

Availability: Targeting 99.95% uptime: $$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \geq 0.9995$$

This allows roughly 22 minutes of downtime per month. A bad deploy can consume that in seconds.

Consistency Requirements:

Not all data needs identical guarantees:

Scale Estimation

Data Volume:

Storage:

Cache Sizing: For 95% hit rate with Zipfian distribution (α = 1.0), cache ~20% of dataset:


Part 2: System Architecture and Technology Stack

High-Level Architecture

    
    graph TD
    %% Client Layer
    CLIENT[Mobile/Web Client<br/>iOS, Android, Browser]
    %% Edge Layer
    CLIENT -->|1. Creative Assets| CDN[CDN<br/>CloudFront/Fastly]
    CLIENT -->|2. Ad Request| GLB[Global Load Balancer<br/>GeoDNS + Health]
    %% Service Layer Entry
    GLB -->|3. Route to Region| GW[API Gateway<br/>Kong: 3-5ms<br/>Auth + Rate Limit]
    GW -->|4. Auth Complete| AS[Ad Server Orchestrator<br/>100ms latency budget]
    %% Parallel Service Fork
    AS -->|5a. Fetch User| UP[User Profile Service<br/>Target: 10ms]
    AS -->|5b. Retrieve Ads| AD_SEL[Ad Selection Service<br/>Target: 15ms]
    AS -->|5c. Predict CTR| ML[ML Inference Service<br/>Target: 40ms]
    AS -->|5d. Run Auction| RTB[RTB Auction Service<br/>Target: 30ms]
    %% External DSPs
    RTB -->|6d. Bid Request| EXTERNAL[50+ DSP Partners<br/>External Bidders]
    %% Data Layer - Redis
    UP -->|6a. Cache Lookup| REDIS[(Redis Cluster<br/>1000 nodes)]
    AD_SEL -->|6b. Cache Lookup| REDIS
    %% Data Layer - Cassandra
    UP -->|7a. On Cache Miss| CASS[(Cassandra<br/>RF=3, CL=QUORUM)]
    AD_SEL -->|7b. On Cache Miss| CASS
    %% Data Layer - Feature Store
    ML -->|6c. Feature Fetch| FEATURE[(Feature Store<br/>Tecton)]
    %% Event Streaming Pipeline
    AS -->|8. Log Events| KAFKA[Kafka<br/>100K events/sec]
    KAFKA -->|9a. Stream Events| FLINK[Flink<br/>Stream Processing]
    FLINK -->|10a. Update Cache| REDIS
    FLINK -->|10b. Archive| S3[(S3/HDFS<br/>Data Lake)]
    %% Batch Processing Pipeline
    S3 -->|9b. Daily Batch| SPARK[Spark<br/>Batch Processing]
    SPARK -->|10c. Update Features| FEATURE
    %% ML Training Pipeline
    S3 --> AIRFLOW[Airflow<br/>Orchestration]
    AIRFLOW -->|11. Schedule| TRAIN[Training Cluster<br/>Daily CTR Model]
    TRAIN -->|12. Store Model| REGISTRY[Model Registry<br/>Versioning]
    REGISTRY -->|13. Deploy| ML
    %% Observability
    AS -.->|Metrics| PROM[Prometheus<br/>Metrics]
    AS -.->|Traces| JAEGER[Jaeger<br/>Tracing]
    PROM -->|Visualize| GRAF[Grafana<br/>Dashboards]
    classDef client fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef edge fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef service fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef data fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stream fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px
    class CLIENT client
    class CDN,GLB edge
    class GW,AS,UP,AD_SEL,ML,RTB service
    class REDIS,CASS,FEATURE data
    class KAFKA,FLINK,SPARK,S3,AIRFLOW,TRAIN,REGISTRY stream
    class EXTERNAL external
    class PROM,JAEGER,GRAF stream

Diagram explanation - Request Flow:

Steps 1-4 (Client → Service Layer):

  1. Creative Assets: Client fetches ad images/videos from CDN (separate from ad selection)
  2. Ad Request: Client sends request to Global Load Balancer
  3. Route to Region: GLB uses GeoDNS to route to nearest healthy region
  4. Authentication: Kong gateway authenticates (JWT), rate limits, and forwards to Ad Server

Steps 5a-5d (Parallel Service Calls): The Ad Server Orchestrator makes 4 parallel calls with independent latency budgets:

Steps 6a-6d (Data Access - Cache First): Each service first checks cache before hitting persistent storage:

Steps 7a-7b (Cache Miss → Database): On cache miss, services read from Cassandra (20ms p99). This happens for ~5% of requests with 95% cache hit rate.

Steps 8-10 (Data Pipeline - Post-Request): After serving the ad (asynchronously):

Steps 11-13 (ML Training Loop - Daily):

Observability (Continuous):

Latency Budget Decomposition

For 100ms total budget:

$$T_{total} = T_{network} + T_{gateway} + T_{services} + T_{serialization}$$

Network (10ms):

API Gateway (5ms):

Service Layer (75ms): Parallel execution bounded by slowest service:

With parallelization: 40ms (ML inference)

Remaining:

Total: 65ms with 35ms headroom for variance.

Technology Stack Decisions

API Gateway Selection

For a 5ms latency budget, the API gateway choice is critical. Here’s my detailed analysis:

| Technology | Latency | Throughput | Rate Limiting | Auth Methods | Ops Complexity |
|---|---|---|---|---|---|
| Kong | 3-5ms | 100K/node | Plugin-based | JWT, OAuth2, LDAP | Medium |
| AWS API Gateway | 5-10ms | 10K/endpoint | Built-in | IAM, Cognito, Lambda | Low (managed) |
| NGINX Plus | 1-3ms | 200K/node | Lua scripting | Custom modules | High |
| Envoy | 2-4ms | 150K/node | Extension filters | External auth | High |

Decision: Kong

Why Kong won:

The breakdown within my 5ms budget:

Why I was tempted by NGINX Plus:

Why I ruled out AWS API Gateway:

At 1M QPS (~86B requests/day), AWS’s per-request pricing becomes problematic:

Cost comparison (approximate, varies by region/contract):

Kong self-hosted:

AWS API Gateway:

The cost difference at high sustained throughput is significant - potentially enough to fund multiple engineer salaries.

Event Streaming Platform Selection

Before choosing stream processing frameworks, I need the right event streaming backbone. This is a foundational decision:

| Technology | Throughput/Partition | Latency (p99) | Durability | Ordering | Scalability |
|---|---|---|---|---|---|
| Kafka | 100MB/sec | 5-15ms | Disk-based replication | Per-partition | Horizontal (add brokers/partitions) |
| Pulsar | 80MB/sec | 10-20ms | BookKeeper (distributed log) | Per-partition | Horizontal (separate compute/storage) |
| RabbitMQ | 20MB/sec | 5-10ms | Optional persistence | Per-queue | Vertical (limited) |
| AWS Kinesis | 1MB/sec/shard | 200-500ms | S3-backed | Per-shard | Manual shard management |

Decision: Kafka

Here’s my reasoning:

Partitioning strategy:

For 100K events/sec across 100 partitions: 1,000 events/sec per partition

Partition key: user_id % 100 ensures:
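A minimal sketch of what key-based partitioning buys us, assuming a plain hash-mod scheme for illustration (Kafka's default partitioner actually uses murmur2 on the key): events for the same user always land on the same partition, which preserves per-user ordering while spreading users across partitions.

```python
import zlib

NUM_PARTITIONS = 100

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key, modulo the partition count.
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

for user_id, event in [("user_42", "click"), ("user_42", "view"), ("user_7", "impression")]:
    # user_42's events map to the same partition every time; user_7 may land elsewhere.
    print(f"{event:<10} {user_id} -> partition {partition_for(user_id)}")
```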

Pulsar consideration:

Pulsar’s architecture is genuinely elegant with storage/compute separation, but:

Why I didn’t pick RabbitMQ:

I like RabbitMQ for certain use cases, but:

Cost comparison (approximate, varies by region/contract):

Kafka self-hosted:

AWS Kinesis alternative:

The cost difference is substantial at high sustained throughput. Kinesis might make sense for bursty or low-volume workloads where operational simplicity matters more than cost.

Stream Processing: Flink

Batch Processing: Spark

Feature Store: Tecton

L2 Cache: Redis Cluster

Persistent Store: Cassandra

Container Orchestration Selection

Container orchestration is critical for handling GPU scheduling and auto-scaling. Here’s my comparison:

| Technology | Learning Curve | Ecosystem | Auto-scaling | Multi-cloud | Networking |
|---|---|---|---|---|---|
| Kubernetes | Steep | Massive (CNCF) | HPA, VPA, Cluster Autoscaler | Yes (portable) | Advanced (CNI, Service Mesh) |
| AWS ECS | Medium | AWS-native | Target tracking, step scaling | No (AWS-only) | AWS VPC |
| Docker Swarm | Easy | Limited | Basic (replicas) | Yes (portable) | Overlay networking |
| Nomad | Medium | HashiCorp ecosystem | Auto-scaling plugins | Yes (portable) | Consul integration |

Decision: Kubernetes

Look, Kubernetes is the obvious choice, but let me explain why this isn’t just cargo-culting:

Unpopular opinion: Kubernetes is overly complex for 80% of use cases, but if you need GPU orchestration and multi-cloud portability, you’re kind of stuck with it.

Kubernetes-specific features critical for ads platform:

1. Horizontal Pod Autoscaler (HPA):

2. GPU Node Affinity:

3. Service Mesh with Sidecar Pattern (Istio/Envoy):

The sidecar pattern deploys a lightweight proxy (Envoy) alongside each service pod, intercepting all network traffic for observability, security, and resilience—without modifying application code.

    
    graph TD
    subgraph POD1["Pod: Ad Server (Zone A)"]
        APP1[Ad Server<br/>Application]
        SIDECAR1[Envoy Sidecar<br/>50-100MB]
        APP1 -->|"1. localhost:15001<br/>0.1ms"| SIDECAR1
    end
    subgraph POD2["Pod: ML Inference (Zone A)"]
        SIDECAR2[Envoy Sidecar<br/>Connection Pool]
        APP2[ML Service<br/>Application]
        SIDECAR2 -->|"5. localhost<br/>0.1ms"| APP2
    end
    subgraph POD3["Pod: ML Inference (Zone B)"]
        SIDECAR3[Envoy Sidecar<br/>Connection Pool]
        APP3[ML Service<br/>Application]
        SIDECAR3 -->|localhost| APP3
    end
    subgraph POD4["Pod: User Profile (Zone A)"]
        SIDECAR4[Envoy Sidecar]
        APP4[User Profile<br/>Application]
        SIDECAR4 -->|localhost| APP4
    end
    SIDECAR1 -->|"2. HTTP/2 pooled<br/>same-zone: 1ms"| SIDECAR2
    SIDECAR1 -.->|"cross-zone: 3ms<br/>(avoided by locality)"| SIDECAR3
    SIDECAR1 -->|"3. HTTP/2 pooled<br/>1ms"| SIDECAR4
    SIDECAR2 -->|"4. Response<br/>aggregated"| SIDECAR1
    CONTROL[Istio Control Plane<br/>istiod]
    CONTROL -.->|"Config: routes,<br/>policies, certs"| SIDECAR1
    CONTROL -.->|Config| SIDECAR2
    CONTROL -.->|Config| SIDECAR3
    CONTROL -.->|Config| SIDECAR4
    SIDECAR1 -.->|"Telemetry:<br/>traces, metrics"| OBSERV[Prometheus<br/>Jaeger]
    classDef pod fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef sidecar fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef app fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef control fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class POD1,POD2,POD3,POD4 pod
    class SIDECAR1,SIDECAR2,SIDECAR3,SIDECAR4 sidecar
    class APP1,APP2,APP3,APP4 app
    class CONTROL,OBSERV control

Diagram explanation - Sidecar Network Flow:

  1. Application to Local Sidecar (0.1ms): Ad Server makes request to localhost:15001, hitting local Envoy sidecar via loopback interface—no network latency
  2. Sidecar to Sidecar (1ms same-zone): Envoy uses persistent HTTP/2 connection pool to downstream service’s sidecar. Connection already established, no TCP handshake.
  3. Parallel Requests: Single HTTP/2 connection multiplexes multiple requests (ML inference + User Profile fetch) simultaneously
  4. Response Path: Responses return through same pooled connections
  5. Sidecar to Application (0.1ms): Downstream sidecar forwards to local application via localhost

Without Sidecar (Direct Service Calls):

With Sidecar:

Network Optimization Benefits:

Operational Benefits:

Latency overhead: Sidecar adds 1-2ms per hop (network stack traversal: app → sidecar → network), but connection pooling and retry logic often save 3-5ms, resulting in net latency improvement of 1-3ms for typical request patterns.

Resource cost: Each Envoy sidecar consumes 50-100MB memory + 0.1-0.2 CPU cores. For 500 pods, this is 25-50GB RAM + 50-100 cores—significant but justified by operational benefits (automatic mTLS, retries, observability) that would otherwise require custom code in every service.

Why not AWS ECS?

ECS is tempting because it’s managed and cheaper:

Why not Docker Swarm?

I wish Swarm had succeeded. It’s so much simpler than Kubernetes. But:

Cost trade-off (rough comparison for ~100 nodes):

Kubernetes (managed EKS):

AWS ECS (Fargate):

So why choose Kubernetes despite slightly higher costs?

GPU support and multi-cloud portability matter for this use case. ECS Fargate has limited GPU support, and I prefer not being locked into AWS. The premium (perhaps 10-20% higher monthly costs) acts as insurance against vendor lock-in and provides proper GPU scheduling for ML workloads.

Model Serving: TensorFlow Serving

Observability Stack:

Deployment Strategy: Kubernetes with Autoscaling

Critical Path Analysis

    
    graph TD
    A[Request Arrives<br/>t=0ms]
    A -->|5ms: Auth + Rate Limit| B[Gateway<br/>t=5ms]
    B --> FORK{Parallel Fork}
    FORK -->|10ms budget| C[User Profile<br/>Cache: 5ms<br/>DB miss: 20ms]
    FORK -->|15ms budget| D[Ad Selection<br/>Filter 1M ads → 100]
    FORK -->|40ms budget| E[ML Inference<br/>BOTTLENECK<br/>Feature: 10ms<br/>Model: 30ms]
    FORK -->|30ms budget| F[RTB Auction<br/>50 DSPs parallel]
    C -->|Completes t=15ms| G[Join Point<br/>Wait for all<br/>t=45ms]
    D -->|Completes t=20ms| G
    E -->|Completes t=45ms| G
    F -->|Completes t=35ms| G
    G -->|"5ms: Rank by eCPM<br/>Select winner"| H[Auction Logic<br/>t=50ms]
    H -->|"5ms: Serialize JSON<br/>Return to client"| I[Response Sent<br/>t=55ms]
    classDef normal fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef bottleneck fill:#ffebee,stroke:#d32f2f,stroke-width:3px
    classDef gateway fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef join fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef fork fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    class A,B,H,I gateway
    class C,D,F normal
    class E bottleneck
    class G join
    class FORK fork

Diagram explanation - Critical Path Timing:

Request Arrives (t=0ms): User opens app, triggers ad request to Global Load Balancer

Gateway (t=0→5ms):

Parallel Service Calls (t=5ms → t=45ms): Ad Server forks 4 parallel requests. Total latency bounded by slowest service (not sum):

  1. User Profile (10ms target, finishes t=15ms):

    • Redis cache hit (95%): 5ms
    • Cassandra on miss (5%): 20ms
    • Weighted average: 0.95×5 + 0.05×20 = 5.75ms
  2. Ad Selection (15ms target, finishes t=20ms):

    • Retrieve ~1M candidate ads from Cassandra
    • Apply filters (user interests, ad categories): 5ms
    • Reduce to top 100 candidates: 10ms
    • Total: 15ms
  3. ML Inference (40ms target, finishes t=45ms) - CRITICAL PATH:

    • Feature fetching: 10ms (from Tecton + Redis)
      • Real-time features (time, device): 0ms
      • Near-RT features (last hour CTR): Redis 5ms
      • Batch features (user segments): Tecton 5ms
    • Model inference: 30ms (TensorFlow Serving on GPU)
      • Load 150-dim feature vector
      • GBDT forward pass through 100 trees
      • Return CTR prediction for each of 100 ad candidates
    • Total: 40ms ← This determines overall latency
  4. RTB Auction (30ms target, finishes t=35ms):

    • Send OpenRTB requests to 50 DSPs in parallel
    • Wait for responses (hard 30ms timeout)
    • Collect ~40 valid bids (10 DSPs timeout)
    • Total: 30ms

Join Point (t=45ms): Ad Server waits for all 4 parallel calls to complete. Since ML Inference takes longest (40ms), join completes at t=45ms. The other services finish earlier but must wait.

Auction Logic (t=45ms → t=50ms):

Response (t=50ms → t=55ms):

Total end-to-end latency: 55ms (45ms headroom within 100ms SLA)
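A toy asyncio sketch of the fork/join shape above: the join completes when the slowest branch (ML inference) does, so total latency tracks the 40ms budget rather than the sum of all four calls. The timings are the illustrative per-service budgets from this section, not measurements.

```python
import asyncio
import time

async def call(name: str, ms: float) -> str:
    # Simulated downstream call taking `ms` milliseconds.
    await asyncio.sleep(ms / 1000)
    return name

async def serve_ad() -> None:
    start = time.perf_counter()
    results = await asyncio.gather(
        call("user_profile", 10),
        call("ad_selection", 15),
        call("ml_inference", 40),   # critical path
        call("rtb_auction", 30),
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"joined {results} after ~{elapsed_ms:.0f}ms")  # ~40ms, not 95ms

asyncio.run(serve_ad())
```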

Why ML Inference is the bottleneck: Even though ML has the largest budget (40ms), it’s still the critical path because:

Optimization priority:

  1. Reduce ML feature fetch: Pre-warm cache, optimize feature store queries
  2. Accelerate model inference: Model quantization (FP32→INT8), batch optimization
  3. Parallelize within ML: Fetch features while loading model weights

Any improvement to ML Inference directly improves overall latency. Optimizing other services (e.g., User Profile 10ms→5ms) has no impact since they’re not on critical path.

Circuit Breaker Pattern

Prevent cascading failures with three states:

State transitions: $$\text{If } E(t) > 0.10 \text{ for } \Delta t \geq 60s, \text{ trip circuit}$$

$$T_{backoff} = T_{base} \times 2^{n}, \quad T_{base} = 30s$$
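A rough sketch of the three-state breaker with the thresholds above (trip when the error rate exceeds 10% over a 60s window; back off 30s, doubling per trip). The window handling and minimum call count are deliberately simplified assumptions.

```python
import time

class CircuitBreaker:
    """Toy CLOSED -> OPEN -> HALF_OPEN breaker; a sketch, not production code."""

    def __init__(self, threshold=0.10, window_s=60, base_backoff_s=30, min_calls=20):
        self.threshold, self.window_s, self.base = threshold, window_s, base_backoff_s
        self.min_calls = min_calls
        self.state, self.trips, self.opened_at = "CLOSED", 0, None
        self.calls, self.errors, self.window_start = 0, 0, time.time()

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            backoff = self.base * (2 ** max(self.trips - 1, 0))  # 30s, 60s, 120s, ...
            if time.time() - self.opened_at >= backoff:
                self.state = "HALF_OPEN"   # let a single probe through
                return True
            return False
        return True

    def record(self, success: bool) -> None:
        if time.time() - self.window_start >= self.window_s:
            self.calls, self.errors, self.window_start = 0, 0, time.time()
        self.calls += 1
        self.errors += 0 if success else 1
        if self.state == "HALF_OPEN":
            if success:
                self.state, self.trips = "CLOSED", 0
            else:
                self.state, self.trips, self.opened_at = "OPEN", self.trips + 1, time.time()
        elif self.calls >= self.min_calls and self.errors / self.calls > self.threshold:
            self.state, self.trips, self.opened_at = "OPEN", self.trips + 1, time.time()
```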


Part 3: Real-Time Bidding Integration

OpenRTB Protocol

    
    sequenceDiagram
    participant AS as Ad Server
    participant DSP1 as DSP #1
    participant DSP2 as DSP #2-50
    participant Auction as Auction Logic

    Note over AS,Auction: 100ms Total Budget

    AS->>AS: 1. Construct BidRequest<br/>OpenRTB 2.x format<br/>User context + Ad slot
    par Parallel DSP Calls (30ms timeout each)
        AS->>DSP1: 2a. HTTP POST /bid<br/>OpenRTB BidRequest
        activate DSP1
        Note over DSP1: DSP evaluates user<br/>Checks budget<br/>Runs own auction
        DSP1-->>AS: 3a. BidResponse<br/>Price: $5.50<br/>Creative ID
        deactivate DSP1
    and
        AS->>DSP2: 2b. Broadcast to 50 DSPs<br/>Parallel HTTP/2 connections
        activate DSP2
        Note over DSP2: Each DSP responds<br/>independently<br/>Some may timeout
        DSP2-->>AS: 3b. Multiple BidResponses<br/>[$3.20, $4.80, ...]
        deactivate DSP2
    end
    Note over AS: 4. Timeout enforcement (30ms):<br/>Discard late responses<br/>Collected ~40 bids from 50 DSPs
    AS->>Auction: 5. Collected bids +<br/>ML CTR predictions
    activate Auction
    Note over Auction: 6. Compute eCPM for each bid:<br/>eCPM = bid × CTR × 1000<br/>Sort by eCPM descending
    Auction->>Auction: 7. Run GSP Auction<br/>Winner pays 2nd price
    Note over Auction: 8. Calculate price:<br/>p = eCPM₂ / (CTR_winner × 1000)
    Auction-->>AS: 9. Winner + Price<br/>DSP #1, $3.33
    deactivate Auction
    AS-->>DSP1: 10. Win notification<br/>(async, best-effort)<br/>Creative URL
    Note over AS,Auction: Total elapsed: ~35ms<br/>Within 100ms budget
    rect rgb(255, 240, 240)
        Note over AS,DSP2: Failure handling:<br/>- 10 DSPs timeout (discarded)<br/>- Circuit breaker trips if >50% fail<br/>- Fallback to guaranteed inventory
    end

Diagram explanation - RTB Auction Flow:

Step 1 (Construct BidRequest): Ad Server builds OpenRTB 2.x compliant BidRequest containing:

Steps 2a-2b (Parallel Broadcast to DSPs): Ad Server simultaneously sends BidRequests to 50+ DSPs using:

Steps 3a-3b (DSP Responses): Each DSP independently:

Typical response rate: 80% of DSPs respond within 30ms, 20% timeout

Step 4 (Timeout Enforcement): After exactly 30ms, Ad Server:

Step 5 (Combine with ML CTR): Ad Server merges:

Steps 6-8 (GSP Auction): Auction logic executes Generalized Second-Price auction:

  1. Compute eCPM: For each bid, calculate effective CPM = bid × CTR × 1000
  2. Sort: Rank all bids by eCPM descending
  3. Select winner: Highest eCPM wins the impression
  4. Calculate price: Winner pays minimum needed to beat 2nd place (second-price auction)
    • Example: Winner bid $5.50 with CTR 0.15 (eCPM $825)
    • Second place eCPM $600
    • Winner pays: $600 / (0.15 × 1000) = $4.00 (less than their $5.50 bid)

Step 9 (Return Winner): Auction returns winning DSP ID and computed price to Ad Server

Step 10 (Win Notification): Ad Server sends async win notification to winning DSP (best-effort delivery):

Failure Handling (bottom note):

BidRequest (simplified):

{
  "id": "req_12345",
  "imp": [{
    "id": "1",
    "banner": {"w": 320, "h": 50},
    "bidfloor": 2.50
  }],
  "user": {
    "id": "hashed_user_id",
    "geo": {"country": "USA"}
  },
  "tmax": 30
}

Timeout Handling: Adaptive Strategy

Maintain per-DSP latency histograms (H_{dsp}). Set per-DSP timeout: $$T_{dsp} = \text{min}\left(P_{95}(H_{dsp}), 30ms\right)$$
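A small illustration of that rule: compute min(p95, 30ms) from a rolling window of per-DSP latency samples. The sample data and window size here are assumptions for the sketch.

```python
from collections import deque

HARD_CAP_MS = 30.0

class DspTimeouts:
    def __init__(self, window: int = 1000):
        self.samples = {}              # dsp_id -> recent latencies in ms
        self.window = window

    def record(self, dsp_id: str, latency_ms: float) -> None:
        self.samples.setdefault(dsp_id, deque(maxlen=self.window)).append(latency_ms)

    def timeout_for(self, dsp_id: str) -> float:
        hist = sorted(self.samples.get(dsp_id, []))
        if not hist:
            return HARD_CAP_MS
        p95 = hist[min(int(0.95 * len(hist)), len(hist) - 1)]
        return min(p95, HARD_CAP_MS)   # fast DSPs get tighter timeouts

timeouts = DspTimeouts()
for ms in [12, 14, 18, 25, 40, 15, 16]:
    timeouts.record("dsp_1", ms)
print(timeouts.timeout_for("dsp_1"))   # p95 capped at the 30ms hard limit
```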

Progressive Auction:

HTTP/2 Connection Pooling

Connection pool sizing: $$P = \frac{Q \times L}{N}$$

Example: 1000 QPS, 30ms latency, 10 servers → 3 connections/server

HTTP/2 benefits:

Geographic Distribution

Network latency bounded by speed of light: $$T_{propagation} \geq \frac{d}{c \times 0.67}$$

Example: NY to London (5,585 km) ≈ 28ms propagation - nearly entire RTB budget!

Solution: Regional deployment in US-East, US-West, EU, APAC reduces max distance from 10,000km to ~1,000km: $$T_{propagation} \approx 5ms$$

Savings: 23ms enables including 10+ additional DSPs within budget.


Part 4: ML Inference Pipeline

Feature Engineering Architecture

Features fall into three categories with different freshness requirements:

Real-Time (computed per request):

Near-Real-Time (pre-computed, 10s TTL):

Batch (pre-computed daily):

The feature engineering architecture supports three distinct freshness tiers, each optimized for different latency and accuracy requirements:

    
    graph TB
    subgraph "Real-Time Feature Pipeline"
        REQ[Ad Request] --> PARSE[Request Parser]
        PARSE --> CONTEXT[Context Features<br/>time, location, device<br/>Latency: 5ms]
        PARSE --> SESSION[Session Features<br/>user actions<br/>Latency: 10ms]
    end
    subgraph "Feature Store"
        CONTEXT --> MERGE[Feature Vector Assembly]
        SESSION --> MERGE
        REDIS_RT[(Redis<br/>Near-RT Features<br/>TTL: 10s)] --> MERGE
        REDIS_BATCH[(Redis<br/>Batch Features<br/>TTL: 24h)] --> MERGE
    end
    subgraph "Stream Processing Pipeline"
        EVENTS[User Events<br/>clicks, views] --> KAFKA[Kafka<br/>100K events/sec]
        KAFKA --> FLINK[Flink<br/>Windowed Aggregation<br/>last hour CTR]
        FLINK --> REDIS_RT
    end
    subgraph "Batch Processing Pipeline"
        S3[S3 Data Lake<br/>Historical Data] --> SPARK[Spark Jobs<br/>Daily]
        SPARK --> FEATURE_GEN[Feature Generation<br/>User segments, 30-day CTR]
        FEATURE_GEN --> REDIS_BATCH
    end
    MERGE --> INFERENCE[ML Inference<br/>TensorFlow Serving<br/>Latency: 40ms]
    INFERENCE --> PREDICTION[CTR Prediction<br/>0.0 - 1.0]
    classDef realtime fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef store fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stream fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef batch fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef inference fill:#fff9c4,stroke:#fbc02d,stroke-width:3px
    class REQ,PARSE,CONTEXT,SESSION realtime
    class REDIS_RT,REDIS_BATCH,MERGE store
    class EVENTS,KAFKA,FLINK stream
    class S3,SPARK,FEATURE_GEN batch
    class INFERENCE,PREDICTION inference

Diagram explanation:

Real-Time Feature Pipeline (top-left): When an ad request arrives, the Request Parser extracts immediate context (time of day, device type, location) in ~5ms and session features (recent browsing history) in ~10ms. These features have zero staleness but limited richness.

Stream Processing Pipeline (bottom-left): User events (clicks, views) flow through Kafka at 100K events/sec to Flink, which computes windowed aggregations like “user’s CTR in the last hour.” These features update every 10 seconds (TTL) in Redis, balancing freshness with computational cost.

Batch Processing Pipeline (bottom-center): Daily Spark jobs process historical data from S3 to generate complex features like user demographic segments and 30-day CTR averages. These features update daily (TTL: 24h) but provide deep insights not possible in real-time.

Feature Store (center): The Feature Vector Assembly component merges all three tiers - real-time context, near-real-time behavioral signals, and batch-computed segments - into a single 150-dimensional vector. This assembled vector feeds ML Inference.

ML Inference (right): TensorFlow Serving receives the 150-dim feature vector and predicts CTR in <40ms, the critical path bottleneck in our system.

Feature Freshness Guarantees:

Latency SLA: $$P(\text{FeatureLookup} \leq 10ms) \geq 0.99$$

Achieved with:

Feature Vector Construction

$$x = [x_{user}, x_{ad}, x_{context}, x_{cross}]$$

Total: 150 features
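An illustrative assembly of the 150-dimensional vector from the three tiers. The dimension split (64 user + 48 ad + 3 context + 35 cross features) and the feature names are assumptions chosen to sum to 150, not the actual schema.

```python
import numpy as np

def assemble_features(user: dict, ad: dict, context: dict) -> np.ndarray:
    x_user = np.asarray(user["embedding"])            # e.g. 64 dims from the batch pipeline
    x_ad = np.asarray(ad["embedding"])                # e.g. 48 dims
    x_context = np.asarray([context["hour_of_day"] / 23.0,
                            context["is_mobile"],
                            context["recent_ctr"]])    # near-real-time signals
    x_cross = np.outer(x_user[:5], x_ad[:7]).ravel()   # 35 simple cross features
    return np.concatenate([x_user, x_ad, x_context, x_cross])

x = assemble_features(
    {"embedding": np.random.rand(64)},
    {"embedding": np.random.rand(48)},
    {"hour_of_day": 14, "is_mobile": 1.0, "recent_ctr": 0.021},
)
print(x.shape)  # (150,)
```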

Model Architecture Decision

Technology Selection: ML Model Architecture

For CTR prediction within a 40ms latency budget, the model family choice balances accuracy, inference speed, and operational complexity:

| Model Family | Inference Latency | Infrastructure | Accuracy | Operational Complexity |
|---|---|---|---|---|
| Tree-based (GBDT) | 5-15ms (CPU) | Standard compute | High (0.78-0.82 AUC) | Low - feature importance aids debugging |
| Deep Learning (DNN) | 20-40ms (GPU) | GPU clusters | Highest (0.80-0.84 AUC) | High - black box, slow iteration |
| Linear (FM/FFM) | 3-8ms (CPU) | Standard compute | Moderate (0.75-0.78 AUC) | Low - interpretable weights |

Selection rationale: Tree-based models (LightGBM/XGBoost) offer the best trade-off—meeting latency requirements on CPU infrastructure while delivering strong accuracy. Deep learning provides 2-3% AUC gains but requires GPU infrastructure (50-100× cost increase at 1M+ QPS) and exceeds our latency budget. Linear models are fastest but sacrifice too much accuracy for the marginal latency improvement.

Framework Comparison for Production Serving:

| Framework | Latency (p99) | Throughput | GPU Support | Deployment | Best For |
|---|---|---|---|---|---|
| TensorFlow Serving | 15-25ms | High (batching) | Excellent | gRPC/REST | DNN models, GPU inference |
| ONNX Runtime | 8-15ms | Very High | Good | Embedded/REST | Cross-framework, CPU optimization |
| Triton Inference Server | 10-20ms | Highest | Excellent | Multi-model batching | Heterogeneous GPU workloads |
| Native (LightGBM/XGBoost) | 5-12ms | High | Limited | Custom service | Tree models, lowest latency |

Deployment approach: For GBDT models, native LightGBM serving wrapped in a lightweight gRPC service provides optimal latency. For future DNN models, TensorFlow Serving offers production-grade features (model versioning, A/B testing, request batching) with acceptable overhead. ONNX Runtime serves as a middle ground, offering hardware acceleration across CPU/GPU with framework portability.

Scaling Strategy:

Horizontal autoscaling based on request throughput using Kubernetes HPA. Target utilization: 70-80% to handle traffic spikes. GPU workloads require careful node selection to avoid scheduling on CPU-only nodes. Request batching (accumulate 5-10ms, batch size 32-64) amortizes overhead and improves GPU utilization by 3-5×.

Optimization techniques: Model quantization (FP32 → INT8) reduces memory footprint by 4× and improves throughput by 2-3× with minimal accuracy loss (<1% AUC degradation). Critical for cost efficiency at scale.


Part 5: Distributed Caching

Multi-Tier Cache Technology Selection

To achieve 95%+ cache hit rate with sub-10ms latency, I need three cache tiers with different characteristics:

L1 Cache (In-Process) Options:

| Technology | Latency | Throughput | Memory | Pros | Cons |
|---|---|---|---|---|---|
| Caffeine (JVM) | ~1μs | 10M ops/sec | In-heap | Window TinyLFU eviction, lock-free reads | JVM-only, GC pressure |
| Guava Cache | ~2μs | 5M ops/sec | In-heap | Simple API, widely used | LRU only, lower hit rate |
| Ehcache | ~1.5μs | 8M ops/sec | In/off-heap | Off-heap option reduces GC | More complex configuration |

Decision: Caffeine - Superior eviction algorithm (Window TinyLFU) yields 10-15% higher hit rates than LRU-based alternatives. Benchmarks show 2× throughput vs Guava.

L2 Cache (Distributed) Options:

| Technology | Latency (p99) | Throughput | Clustering | Data Structures | Pros | Cons |
|---|---|---|---|---|---|---|
| Redis Cluster | 5ms | 100K ops/sec/node | Native sharding | Rich (lists, sets, sorted sets) | Lua scripting, atomic ops | More memory than Memcached |
| Memcached | 3ms | 150K ops/sec/node | Client-side sharding | Key-value only | Lower memory, simpler | No atomic ops, no persistence |
| Hazelcast | 8ms | 50K ops/sec/node | Native clustering | Rich data structures | Java integration | Higher latency, less mature |

Decision: Redis Cluster

Memcached typically costs ~30% less than Redis for equivalent capacity, but Redis offers atomic operations and richer data structures that justify the premium for this use case.

L3 Persistent Store Options:

| Technology | Read Latency (p99) | Write Throughput | Scalability | Consistency | Pros | Cons |
|---|---|---|---|---|---|---|
| Cassandra | 20ms | 500K writes/sec | Linear (peer-to-peer) | Tunable (CL=QUORUM) | Multi-DC, no SPOF | No JOINs, eventual consistency |
| PostgreSQL | 15ms | 50K writes/sec | Vertical + sharding | ACID transactions | SQL, JOINs, strong consistency | Manual sharding complex |
| MongoDB | 18ms | 200K writes/sec | Horizontal sharding | Tunable | Flexible schema | Less mature than Cassandra |
| DynamoDB | 10ms | 1M writes/sec | Fully managed | Eventual/strong | Auto-scaling, no ops | Vendor lock-in, cost at scale |

Decision: Cassandra

PostgreSQL limitation: Vertical scaling hits ceiling around 50-100TB. Sharding (e.g., Citus) adds operational complexity comparable to Cassandra, but without peer-to-peer resilience.

DynamoDB cost analysis: At 8B requests/day and roughly $1.25 per million requests: $$\text{Cost}_{DynamoDB} = 8 \times 10^9 \times \frac{\$1.25}{10^6} \approx \$10\text{K/day} \approx \$300\text{K/month}$$

Cassandra on 100 nodes @ $500/node/month: $50K/month (6× cheaper at this scale).

Three-Tier Cache Hierarchy

    
    graph TB
    REQ[Cache Request<br/>user_id: 12345]
    REQ --> L1[L1: Caffeine JVM Cache<br/>10-second TTL<br/>~1μs lookup<br/>100MB per server]
    L1 --> L1_HIT{Hit?}
    L1_HIT -->|60% Hit| RESP1[Response ~1μs]
    L1_HIT -->|40% Miss| L2[L2: Redis Cluster<br/>30-second TTL<br/>~5ms lookup<br/>1000 nodes × 16GB]
    L2 --> L2_HIT{Hit?}
    L2_HIT -->|35% Hit| POPULATE_L1[Populate L1]
    POPULATE_L1 --> RESP2[Response ~5ms]
    L2_HIT -->|5% Miss| L3[L3: Cassandra Ring<br/>Multi-DC Replication<br/>~20ms read<br/>Petabyte scale]
    L3 --> POPULATE_L2[Populate L2 + L1]
    POPULATE_L2 --> RESP3[Response ~20ms]
    MONITOR[Stream Processor<br/>Flink/Spark<br/>Count-Min Sketch] -.->|0.1% sampling| L2
    MONITOR -.->|Detect hot keys| REPLICATE[Dynamic Replication<br/>3x copies for hot keys]
    REPLICATE -.->|Replicate to nodes| L2
    PERF[Total Hit Rate: 95%<br/>Average Latency: 2.75ms<br/>p99 Latency: 25ms]
    classDef l1cache fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef l2cache fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef l3cache fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef response fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef monitor fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stats fill:#e0f7fa,stroke:#00838f,stroke-width:3px
    class REQ,RESP1,RESP2,RESP3 response
    class L1,L1_HIT l1cache
    class L2,L2_HIT l2cache
    class L3 l3cache
    class MONITOR,REPLICATE monitor
    class PERF stats
    class POPULATE_L1,POPULATE_L2 response

Diagram explanation:

L1 (In-Process Cache): Caffeine cache stores the hottest 10% of data (100MB per server) with 10-second TTL. Sub-microsecond lookups serve 60% of requests. When L1 misses, the request cascades to L2.

L2 (Distributed Cache): Redis Cluster with 1000 nodes (16GB each) stores 20% of total data with 30-second TTL. Serves 35% of total requests (87.5% of L1 misses) in ~5ms. Consistent hashing distributes keys across nodes.

L3 (Persistent Store): Cassandra provides the source of truth with multi-datacenter replication (RF=3). Handles remaining 5% of requests in ~20ms. On cache miss, data populates both L2 and L1 for subsequent requests.

Hot Key Detection: Stream processor samples 0.1% of traffic using Count-Min Sketch to detect celebrity users generating 100× normal load. Hot keys are dynamically replicated 3× across Redis nodes to prevent single-node saturation.

Overall Performance: 95% total hit rate with 2.75ms average latency, meeting our <10ms SLA.

Hit Rate Calculation: with $H_1 = 0.60$ and $H_2 = 0.875$ (L2 resolves 87.5% of the requests that miss L1, i.e., 35% of total traffic): $$H_{cache} = H_1 + (1-H_1)H_2 = 0.60 + 0.40 \times 0.875 = 0.95$$

Average Latency: $$\mathbb{E}[L] = 0.60 \times 0.001 + 0.35 \times 5 + 0.05 \times 20 = 2.75ms$$
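A minimal read-through sketch of the three-tier lookup, with plain dictionaries standing in for Caffeine, Redis, and Cassandra; misses populate the upper tiers so subsequent reads stay on the fast path. TTLs and key names are illustrative.

```python
import time

class TieredCache:
    def __init__(self, l2_store: dict, l3_store: dict, l1_ttl_s: float = 10.0):
        self.l1, self.l2, self.l3 = {}, l2_store, l3_store
        self.l1_ttl_s = l1_ttl_s

    def get(self, key):
        entry = self.l1.get(key)
        if entry and time.time() - entry[1] < self.l1_ttl_s:
            return entry[0]                       # L1 hit (~1us path)
        if key in self.l2:
            value = self.l2[key]                  # L2 hit (~5ms path)
            self.l1[key] = (value, time.time())
            return value
        value = self.l3[key]                      # L3 read (~20ms, source of truth)
        self.l2[key] = value
        self.l1[key] = (value, time.time())
        return value

cache = TieredCache(l2_store={}, l3_store={"user:12345": {"segments": ["sports"]}})
print(cache.get("user:12345"))   # falls through to L3, then populates L2 and L1
print(cache.get("user:12345"))   # served from L1
```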

Redis Cluster Sharding

Hash slot assignment: $$\text{slot}(k) = \text{CRC16}(k) \mod 16384$$

1000 nodes, 16,384 slots → ~16 slots per node

Load variance with virtual nodes: $$\text{Var}[\text{load}] = \frac{\mu}{n \times v}$$

1000 QPS, 1000 nodes, 16 virtual nodes → standard deviation ≈ 25% of mean

Hot Partition Mitigation: Count-Min Sketch

Why Count-Min Sketch?

Track frequency of millions of user IDs without gigabytes of memory.

| Approach | Memory | Accuracy | Use Case |
|---|---|---|---|
| Exact HashMap | O(n) = GBs | 100% | Too expensive |
| Sampling | Minimal | Misses rare keys | Ineffective |
| Count-Min Sketch | 5.4 KB | 99% ± 1% | Perfect fit |

Configuration: width=272, depth=5

Error bound: $$\text{estimate}(k) \leq \text{true}(k) + \frac{e \times N}{w}$$

For 100M requests/hour the additive error is about $e \times 10^8 / 272 \approx 10^6$ counts/hour (~280 req/sec), well under the >1000 req/sec threshold we use to flag hot keys.

Dynamic Replication: When frequency >100 req/sec:

  1. Replicate key to (R=3) nodes
  2. Update routing table
  3. Client load-balances reads across replicas

Per-replica load: $\lambda_{replica} = \lambda / R = 1000 / 3 \approx 333$ req/sec
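A compact Count-Min Sketch with the width and depth from above (272 × 5). The hashing scheme is a simple illustration; the important property is that the estimate is always an upper bound on the true count, which makes it safe for hot-key detection.

```python
import zlib

class CountMinSketch:
    def __init__(self, width: int = 272, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Salt the key with the row number to get `depth` independent-ish hashes.
        return zlib.crc32(f"{row}:{key}".encode()) % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1500):
    cms.add("user:celebrity")          # hot key
for i in range(2000):
    cms.add(f"user:{i}")               # background traffic
hot_count = cms.estimate("user:celebrity")
print(hot_count, hot_count > 1000)     # estimate >= true count; flag for replication
```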

Workload Isolation

Prevent batch workloads from interfering with serving:

Cassandra data-center aware replication:

CREATE KEYSPACE user_profiles
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us_east': 2,      // Serving replicas
  'eu_central': 1    // Batch replica
};

Batch writes: CONSISTENCY LOCAL_ONE (EU-Central) Serving reads: CONSISTENCY LOCAL_QUORUM (US-East)

Cost: Dedicate 33% of replicas to batch, but serving latency unaffected.

Cache Invalidation

Hybrid Strategy:


Part 6: Auction Mechanisms

Generalized Second-Price (GSP)

Effective bid (eCPM): $$\text{eCPM}_i = b_i \times \text{CTR}_i \times 1000$$

Winner selection: $$w = \arg\max_{i} \text{eCPM}_i$$

Price (second-price): $$p_w = \frac{\text{eCPM}_{2nd}}{\text{CTR}_w \times 1000} + \epsilon$$

Example:

| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | 5.00 | 0.10 | 500 | 2 |
| B | 4.00 | 0.15 | 600 | 1 ← Winner |
| C | 6.00 | 0.05 | 300 | 3 |

Winner B pays: $\frac{500}{0.15 \times 1000} = 3.33$ (bid 4.00 but only pays enough to beat A)
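The same auction in a few lines of Python, reproducing the worked example above:

```python
def gsp_auction(bids):
    """bids: list of (advertiser, bid_dollars, predicted_ctr).
    Rank by eCPM = bid * CTR * 1000; winner pays just enough to beat the runner-up."""
    ranked = sorted(bids, key=lambda b: b[1] * b[2] * 1000, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    runner_ecpm = runner_up[1] * runner_up[2] * 1000
    price = runner_ecpm / (winner[2] * 1000)      # plus a small epsilon in practice
    return winner[0], round(price, 2)

print(gsp_auction([("A", 5.00, 0.10), ("B", 4.00, 0.15), ("C", 6.00, 0.05)]))
# ('B', 3.33) -- matches the table above
```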

Game Theory: GSP vs VCG

GSP is not truthful (proven by counterexample in auction theory literature). Advertisers can increase profit by bidding below true value. However:

Why use GSP?

  1. Higher revenue than VCG in practice
  2. Simpler to explain
  3. Industry standard (Google Ads)
  4. Nash equilibrium exists

Computational Complexity:

For 50 DSPs: GSP=282 ops, VCG=14,100 ops

Recommendation: GSP for real-time, VCG for offline optimization.

Reserve Price Optimization

Optimal reserve price (Myerson 1981): $$r^* = \arg\max_r \left[ r \times \left(1 - F(r)\right) \right]$$

Balance: high reserve → more revenue/ad but fewer filled; low reserve → more filled but lower revenue/ad.

For uniform distribution (F(v) = v/v_{max}): $$r^* = \frac{v_{max}}{2}$$

Empirical approach: Use 50th percentile (median) of historical bids as reserve.
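A tiny sketch of that empirical heuristic: take the median of historical bids as the reserve. For values drawn uniformly from [0, v_max] this converges to v_max / 2, matching the closed-form result above. The bid values are made up for the example.

```python
import statistics

def empirical_reserve(historical_bids):
    # Median of observed bids as a simple, robust reserve price.
    return statistics.median(historical_bids)

bids = [1.2, 2.5, 3.1, 0.9, 4.4, 2.8, 3.6]
print(empirical_reserve(bids))   # 2.8
```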


Part 7: Advanced Topics

Budget Pacing: Distributed Spend Control

Problem: Advertisers set daily budgets (e.g., $10,000/day). In a distributed system serving 1M QPS, how do we prevent over-delivery without centralizing every spend decision?

Challenge:

Centralized approach (single database tracks spend):

Solution: Pre-Allocation with Periodic Reconciliation

    
    graph TD
    ADV[Advertiser X<br/>Daily Budget: $10,000]
    ADV --> BUDGET[Budget Controller]
    BUDGET --> REDIS[(Redis<br/>Atomic Counters)]
    BUDGET --> POSTGRES[(PostgreSQL<br/>Audit Log)]
    BUDGET -->|Allocate $50| AS1[Ad Server 1]
    BUDGET -->|Allocate $75| AS2[Ad Server 2]
    BUDGET -->|Allocate $100| AS3[Ad Server 3]
    AS1 -->|"Spent: $42<br/>Return: $8"| BUDGET
    AS2 -->|"Spent: $68<br/>Return: $7"| BUDGET
    AS3 -->|"Spent: $95<br/>Return: $5"| BUDGET
    BUDGET -->|Hourly reconciliation| POSTGRES
    TIMEOUT[Timeout Monitor<br/>5min intervals] -.->|"Release stale<br/>allocations"| REDIS
    REDIS -->|Budget < 10%| THROTTLE[Dynamic Throttle]
    THROTTLE -.->|"Reduce allocation<br/>size $100→$10"| BUDGET
    classDef advertiser fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef controller fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef server fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef storage fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef monitor fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    class ADV advertiser
    class BUDGET,THROTTLE controller
    class AS1,AS2,AS3 server
    class REDIS,POSTGRES storage
    class TIMEOUT monitor

Diagram explanation:

Budget Controller (center): Manages the advertiser’s $10,000 daily budget across hundreds of ad servers. Uses Redis for fast atomic operations and PostgreSQL for durable audit logs.

Pre-Allocation (Budget → Ad Servers): Instead of checking budget on every impression, ad servers request budget chunks upfront (e.g., $50-$100). The Budget Controller atomically decrements the remaining budget using Redis DECRBY, preventing race conditions across distributed servers.

Return Unused Budget (Ad Servers → Budget): After serving ads, servers report actual spend. If they allocated $100 but only spent $95, they return $5 using Redis INCRBY. This prevents over-locking budget.

Timeout Monitor: Every 5 minutes, scans for allocations older than 10 minutes with no spend report. These likely represent crashed servers holding budget hostage. Automatically returns their allocations to the pool.

Dynamic Throttle: When budget drops below 10%, reduces allocation chunk size from $100 → $10. This prevents over-delivery when approaching budget exhaustion.

PostgreSQL Audit Log: All allocations and spends are logged to PostgreSQL for billing reconciliation and debugging, providing a durable audit trail.

Budget Allocation Algorithm:

The core algorithm has three operations:

1. Request Allocation:

2. Report Spend:

3. Timeout Monitor (Background):

Key design decisions:

Over-delivery bound: $$\text{OverDelivery}_{max} = S \times A$$

where $S$ = number of servers and $A$ = allocation per server

Example: 100 servers × 1% daily budget allocation = 100% max over-delivery (entire daily budget)

Mitigation: When budget <10%, reduce allocation: $$A_{new} = \frac{B_r}{S \times 10}$$

Reduces max over-delivery to ~1%.
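A simplified sketch of the allocation flow: chunks shrink from 1% to 0.1% of the daily budget once less than 10% remains. An in-memory counter with a lock stands in for the Redis DECRBY/INCRBY atomic operations described above; the chunk fractions are assumptions consistent with the example.

```python
import threading

class BudgetController:
    def __init__(self, daily_budget_cents: int):
        self.daily = daily_budget_cents
        self.remaining = daily_budget_cents
        self.lock = threading.Lock()        # Redis gives this atomicity for free

    def allocate(self) -> int:
        with self.lock:
            # Throttle: shrink chunks from 1% to 0.1% of the daily budget near exhaustion.
            frac = 0.001 if self.remaining < 0.10 * self.daily else 0.01
            chunk = min(int(self.daily * frac), self.remaining)
            self.remaining -= chunk         # DECRBY in the real system
            return chunk

    def report_spend(self, allocated: int, spent: int) -> None:
        with self.lock:
            self.remaining += allocated - spent   # INCRBY the unused portion back

ctrl = BudgetController(daily_budget_cents=1_000_000)   # $10,000/day
chunk = ctrl.allocate()                                  # $100 chunk while budget is healthy
ctrl.report_spend(allocated=chunk, spent=chunk - 500)    # return $5 unused
print(chunk, ctrl.remaining)
```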

Fraud Detection: Multi-Tier Filtering

Problem: Detect and block fraudulent ad clicks in real-time without adding significant latency.

Fraud Types:

  1. Click Farms: Bots or paid humans generating fake clicks
  2. SDK Spoofing: Fake app installations reporting ad clicks
  3. Domain Spoofing: Fraudulent publishers misrepresenting site content
  4. Ad Stacking: Multiple ads layered, only top visible but all “viewed”

Detection Strategy: Multi-Tier Filtering

    
    graph TB
    REQ[Ad Request/Click] --> L1{L1: Bloom Filter<br/>Simple Rules<br/>0ms overhead}
    L1 -->|Known bad IP| BLOCK1[Block<br/>Bloom Filter]
    L1 -->|Pass| L2{L2: Behavioral<br/>5ms latency}
    L2 -->|Suspicious pattern| PROB[Probabilistic Block<br/>50% traffic]
    L2 -->|Pass| L3{L3: ML Model<br/>Async, 20ms}
    L3 -->|Fraud score > 0.8| BLOCK3[Post-Facto Block<br/>Refund advertiser]
    L3 -->|Pass| SERVE[Serve Ad]
    PROB --> SERVE
    SERVE -.->|Log for analysis| OFFLINE[Offline Analysis<br/>Update models]
    OFFLINE -.->|Update rules| L1
    OFFLINE -.->|Retrain model| L3
    classDef request fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef filter fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef block fill:#ffebee,stroke:#d32f2f,stroke-width:2px
    classDef warning fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef success fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef async fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class REQ request
    class L1,L2,L3 filter
    class BLOCK1,BLOCK3 block
    class PROB warning
    class SERVE success
    class OFFLINE async

Diagram explanation:

L1 (Bloom Filter): Every request first checks against a Bloom filter containing 100K known fraudulent IPs. This adds virtually zero latency (~1 microsecond) and immediately blocks obvious repeat offenders. The filter has 0.01% false positive rate but zero false negatives - we never accidentally allow known fraudsters.

L2 (Behavioral Heuristics): Requests that pass L1 enter behavioral scoring with <5ms overhead. This synchronous layer uses simple rules to score suspicious patterns without ML complexity.

L3 (ML Model): After serving the ad (to maintain low latency), an async ML model analyzes the request with 50+ features. High fraud scores (>0.8) trigger post-facto blocking and advertiser refunds.

Offline Analysis: Logs from all tiers feed back into model retraining and rule updates, creating a continuous improvement loop.

L1: Bloom Filter for IP Blocklist

Why Bloom filters are ideal for fraud IP blocking:

At 1M+ QPS, we need to check every incoming request against a blocklist of known fraudulent IPs. A naive hash table would require significant memory (several MB per server) and introduce cache misses.

Bloom filters solve this perfectly:

Space efficiency: 10⁷ bits (only 1.25 MB) can represent 100K IPs with a 0.01% false positive rate - far more compact than a hash table storing full IP addresses with per-entry overhead. The entire data structure fits in L2 cache, enabling sub-microsecond lookups.

Zero false negatives: If a Bloom filter says an IP is NOT in the blocklist, it’s guaranteed correct. We never accidentally allow known fraudsters through. The only risk is false positives (0.01% chance of blocking a legitimate IP), which is acceptable - we have L2/L3 filters to catch legitimate users.

Constant-time lookups: O(k) hash operations regardless of set size. With k=7 hash functions, we can check membership in <1 microsecond - adding virtually zero latency to the request path.

Lock-free reads: Multiple threads can query simultaneously without contention. Critical for handling 1M+ QPS across hundreds of cores.
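A minimal Bloom filter with the parameters above (m = 10⁷ bits, k = 7 hash functions), using salted SHA-256 as the hash family purely for illustration:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 10**7, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False => definitely not in the set (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

blocklist = BloomFilter()
blocklist.add("203.0.113.7")
print(blocklist.might_contain("203.0.113.7"))   # True (was added)
print(blocklist.might_contain("198.51.100.9"))  # False, or rarely a false positive
```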

L2: Behavioral Heuristics (<5ms overhead)

Scoring (0-1 scale, >0.6 triggers blocking):

Why not ML here? L2 must be synchronous (runs before ad serving). ML models (20ms) run async post-serving to avoid adding latency.

L3: ML Fraud Score (Async)

GBDT model with 50+ features:

Handling class imbalance (1% fraud):

Economic trade-off: $$\text{Cost} = C_{fp} \times FP + C_{fn} \times FN$$

Since $C_{fn} \gg C_{fp}$ (an advertiser refund costs far more than the revenue lost by blocking a legitimate click), we optimize for low false negatives: catch fraud even if we occasionally block legitimate traffic.

Multi-Region Failover

Timeline when US-East (60% traffic) fails:

T+10s: Load balancer detects failure
T+30s: US-West sees 3× traffic, CPU 40%→85%
T+60s: Auto-scaling triggered
T+90s: Cache hit rate drops 60%→45% (different user distribution)
T+150s: New instances online, capacity restored

Queuing Theory Insight:

Server utilization: $\rho = \lambda / (c \mu)$

When $\rho > 0.8$, queue length grows sharply (roughly as $1/(1-\rho)$). When $\rho \geq 1.0$, the system is unstable.

Example:

Mitigation: Graceful Degradation

  1. Increase cache TTL: 30s → 300s
  2. Disable expensive features (skip ML, use rule-based targeting)
  3. Load shedding: Reject low-value traffic

Load Shedding Strategy:

When utilization >90%:

Impact: Preserves ~97.5% revenue while shedding 47.5% traffic.
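An illustrative shedding policy: below 90% utilization nothing is dropped, then progressively lower-value tiers are rejected as utilization climbs. The tier names and cut-offs are assumptions for the sketch, not the actual policy.

```python
def should_shed(request_value_tier: str, utilization: float) -> bool:
    """value tiers: 'high' (premium campaigns), 'medium', 'low' (remnant / low eCPM)."""
    if utilization <= 0.90:
        return False                                    # normal operation
    if utilization <= 0.95:
        return request_value_tier == "low"              # shed remnant traffic first
    return request_value_tier in ("low", "medium")      # keep only high-value traffic

for tier in ("high", "medium", "low"):
    print(tier, should_shed(tier, utilization=0.93))
```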

Standby capacity calculation: $$c_{standby} = c_{normal} \times \left(\frac{\lambda_{spike}}{\lambda_{normal}} - 1\right)$$

For 3× spike: 100 servers × (3-1) = 200 servers (200% over-provisioning)

Cost: Expensive. Alternative: accept degraded performance during rare failures.


Part 8: Observability and Operations

Service Level Indicators and Objectives

Key SLIs:

| Service | SLI | Target | Why |
|---|---|---|---|
| Ad API | Availability | 99.9% | Revenue tied to successful serves |
| Ad API | Latency | p95 <100ms, p99 <150ms | Mobile timeouts above 150ms |
| ML | Accuracy | AUC >0.78 | Below 0.75 = 15%+ revenue drop |
| RTB | Response Rate | >80% DSPs within 30ms | <80% = remove from rotation |
| Budget | Consistency | Over-delivery <1% | >5% = refunds + complaints |

Error Budget Policy (99.9% = 43 min/month):

When budget exhausted:

  1. Freeze feature launches (critical fixes only)
  2. Focus on reliability work
  3. Mandatory root cause analysis
  4. Next month: 99.95% target to rebuild trust

Incident Response Dashboard

When ML latency alert fires at 2 AM, show:

Top-line (what’s broken):

Resource utilization (service health):

Dependency latencies (find bottleneck):

Downstream health:

Historical context:

Auto-suggested runbook:

  1. Check nodetool cfstats for compactions
  2. If storm detected: nodetool stop compaction
  3. Tune compaction_throughput_mb_per_sec

Distributed Tracing

Single user reports “ad not loading” among 1M+ req/sec:

Request ID: 7f3a8b2c...
Total: 287ms (VIOLATED SLO)

├─ API Gateway: 2ms
├─ User Profile: 45ms
│  └─ Redis: 43ms (normally 5ms)
│     └─ TCP timeout: 38ms
│        └─ Cause: Node failure, awaiting replica
├─ ML Inference: 156ms
│  └─ Batch incomplete: 8/32
│     └─ Cause: Low traffic (Redis failure reduced QPS)
└─ RTB: 84ms

Root cause: Redis node failure → cascading slowdown. Trace shows exactly why.

Security and Compliance

Service-to-Service Auth (mTLS):

# Istio PeerAuthentication
mtls:
  mode: STRICT  # Reject non-mTLS

# Authorization: Ad Server can't call Budget DB directly
principals: ["cluster.local/ns/prod/sa/ad-server"]
paths: ["/predict"]  # ML inference only

PII Protection:

Secrets: Vault with Dynamic Credentials

ML Data Poisoning Protection:

    
    graph TD
    RAW[Raw Events] --> VALIDATE{Validation}

    VALIDATE --> CHECK1{CTR >3σ?}
    VALIDATE --> CHECK2{Low IP entropy?}
    VALIDATE --> CHECK3{Uniform timing?}

    CHECK1 -->|Yes| QUARANTINE[Quarantine]
    CHECK2 -->|Yes| QUARANTINE
    CHECK3 -->|Yes| QUARANTINE

    CHECK1 -->|No| CLEAN[Training Data]

    CLEAN --> TRAIN[Training] --> SIGN[GPG Sign]
    SIGN --> REGISTRY[Model Registry]

    REGISTRY --> INFERENCE[Inference]
    INFERENCE --> VERIFY{Verify GPG}

    VERIFY -->|Valid| LOAD[Load Model]
    VERIFY -->|Invalid| REJECT[Reject + Alert]

    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef check fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef danger fill:#ffebee,stroke:#d32f2f,stroke-width:2px
    classDef safe fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef process fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px

    class RAW input
    class VALIDATE,CHECK1,CHECK2,CHECK3,VERIFY check
    class QUARANTINE,REJECT danger
    class CLEAN,LOAD safe
    class TRAIN,SIGN,REGISTRY,INFERENCE process

Three validation checks (rough code stand-ins follow the list):

  1. CTR anomaly: >3σ spike (e.g., 2%→8%)
  2. IP entropy: <2.0 (clicks from narrow range = botnet)
  3. Temporal pattern: Uniform intervals (human=bursty, bot=mechanical)
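Rough stand-ins for the three checks; the thresholds mirror the list, while the sample data and tolerance values are made up for the example.

```python
import math
import statistics

def ctr_anomaly(ctr_history, current_ctr) -> bool:
    # Flag if current CTR is more than 3 sigma above the historical mean.
    mu, sigma = statistics.mean(ctr_history), statistics.stdev(ctr_history)
    return current_ctr > mu + 3 * sigma

def ip_entropy(ip_counts: dict) -> float:
    # Shannon entropy of the click-source IP distribution (low = narrow botnet range).
    total = sum(ip_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in ip_counts.values())

def uniform_timing(click_timestamps, tolerance: float = 0.05) -> bool:
    # Mechanical bots click at near-constant intervals; humans are bursty.
    gaps = [b - a for a, b in zip(click_timestamps, click_timestamps[1:])]
    return statistics.pstdev(gaps) < tolerance * statistics.mean(gaps)

print(ctr_anomaly([0.020, 0.022, 0.019, 0.021], 0.08))          # True: 2% -> 8% spike
print(ip_entropy({"10.0.0.1": 900, "10.0.0.2": 100}) < 2.0)     # True: botnet-like
print(uniform_timing([0.0, 1.0, 2.0, 3.0, 4.0]))                # True: mechanical cadence
```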

Model integrity: GPG signing prevents loading tampered models even if storage is compromised.

Data Lifecycle and GDPR

Retention policies:

| Data | Retention | Rationale |
|---|---|---|
| Raw events | 7 days | Real-time only; archive to S3 |
| Aggregated metrics | 90 days | Dashboard queries |
| Model training data | 30 days | Older data less predictive |
| User profiles | 365 days | GDPR; inactive purged |
| Audit logs | 7 years | Legal compliance |

GDPR “Right to be Forgotten”:

Deletion across 10+ systems in parallel:

Verification: All systems confirm → send deletion certificate to user within 48h.


Part 9: Production Operations at Scale

Progressive Rollout Strategy

    
    graph TD
    CODE[Main Branch] --> BUILD[CI Build]
    BUILD --> CANARY{Canary<br/>1%, 15min}
    CANARY -->|OK| STAGE1[10%<br/>US-East, 30min]
    CANARY -->|Fail| ROLLBACK1[Auto-rollback]
    STAGE1 -->|OK| STAGE2[50%<br/>US-East+West, 1hr]
    STAGE1 -->|Fail| ROLLBACK2[Auto-rollback]
    STAGE2 -->|OK| STAGE3[100%<br/>All regions, 24hr]
    STAGE2 -->|Fail| ROLLBACK3[Auto-rollback]
    STAGE3 --> COMPLETE[Complete]
    classDef build fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef stage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef rollback fill:#ffebee,stroke:#d32f2f,stroke-width:2px
    classDef success fill:#e8f5e9,stroke:#4caf50,stroke-width:3px
    classDef decision fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class CODE,BUILD build
    class STAGE1,STAGE2,STAGE3 stage
    class ROLLBACK1,ROLLBACK2,ROLLBACK3 rollback
    class COMPLETE success
    class CANARY decision

Automated gates:

  1. Error rate: <0.1% increase
  2. Latency: p95 <110ms, p99 <150ms
  3. Revenue drop: <2%
  4. Circuit breakers: no new opens
  5. Database health: latency within range

Feature flags for blast radius control:

ml_inference_v2:
  enabled: true
  rollout: 5%
  targeting: ["internal_testing"]
  kill_switch: true

Benefits:

Deep Observability: Error Budgets in Practice

Error budget example (99.9% = 43 min/month):

Burn rate alert: if the error budget is being consumed at more than 10× the sustainable rate, it will be exhausted in days rather than lasting the month → page on-call.

Why 99.9% not 99.99%?

Cost Management at Scale

Resource attribution by team:

| Resource | Owner | Charge-back Model |
|---|---|---|
| Redis (1000 nodes) | Platform | GB-hours per service |
| Cassandra (500 nodes) | Data | Storage + IOPS per table |
| ML GPUs (100× A100) | ML | GPU-hours (training vs inference) |
| Kafka (200 brokers) | Data | GB ingested + stored per topic |
| K8s (5000 pods) | Compute | vCPU-hours + RAM-hours per namespace |

Cost optimization:

  1. Right-sizing: CPU usage should average 70% (not 30%)
  2. Spot instances for ML training: 70% cheaper, tolerate interruptions
  3. Tiered storage: Hot (7d) → Warm (30d) → Cold (365d) = 67% savings
  4. Reserved instances: 60% baseline reserved = 0.6× rate, 40% on-demand

Relative cost relationships:

Key metrics:


Part 10: Strategic Considerations

Architecture to Business Value

Every technical decision maps to business KPIs:

| Architecture | Technical Metric | Business KPI | Impact |
|---|---|---|---|
| Sub-100ms latency | p95 <100ms | CTR | +100ms = -1-2% CTR |
| Multi-tier caching | 95% hit rate | Infrastructure cost | 20× database load reduction |
| ML CTR (AUC >0.78) | Model accuracy | Revenue/impression | AUC 0.75→0.80 = +10-15% revenue |
| Budget pacing | Over-delivery <1% | Advertiser retention | >5% causes churn |
| Fraud detection | False positive <0.1% | Platform reputation | 1% invalid traffic = complaints |
| Progressive rollouts | Deployment failure <1% | Velocity | Saves 500 engineer-hours/year |
| Error budgets | 99.9% SLO | SLA compliance | Breach = 10-30% penalty |
| GDPR deletion | <48h SLA | Legal compliance | Violation = 4% revenue fine |

Key insights:

  1. Latency is revenue: Sub-100ms is competitive advantage
  2. ML quality compounds: 0.05 AUC improvement = 10-15% revenue
  3. Operational excellence enables velocity: Ship 3-5× faster with 10× lower risk
  4. Cost efficiency unlocks scale: Optimizing vCPU-ms per request compounds significantly at 1M+ QPS
  5. Compliance is existential: GDPR fines can be 4% of global revenue

Evolutionary Architecture

Scenario 1: Video Ads

New requirements: 500KB-2MB files, multi-bitrate streaming, VAST 4.0

Changes:

Stays same:

Migration: 6-month phased rollout, 1%→50%→100% traffic

Cost impact: Video CDN 10-20× more expensive. At 10M impressions/day × 1MB = 10TB/day CDN egress.

Scenario 2: Transformer Models

New requirements: GBDT → Transformer for 15-20% CTR improvement

Changes:

Stays same:

A/B test requirement:

| Metric | Control (GBDT) | Treatment (Transformer) | Decision |
|---|---|---|---|
| Revenue/impression | Baseline | 1.143× baseline | +14.3% (good) |
| P99 latency | 87ms | 103ms | +18.4% (concerning) |

Don’t ship yet. Options:

  1. Optimize inference to <95ms
  2. Find 10ms savings elsewhere
  3. Negotiate new SLO: “103ms acceptable for 14% revenue lift”

Balancing User Trust vs Revenue

Core principle: Optimize for long-term user trust, which is the foundation for sustainable revenue.

Guardrail metrics (all experiments must pass):

| Guardrail | Threshold | Why |
|---|---|---|
| Session duration | Must not decrease >2% | Ads driving users away |
| Latency p99 | <100ms | Slow = reduced CTR + satisfaction |
| Ad block rate | Must not increase >0.5% | Poor quality signal |
| Error rate | <0.1% | Technical failures erode trust |

Example: Interstitial Ads Experiment

| Metric | Control | Treatment | Delta | Decision |
|---|---|---|---|---|
| Revenue/user | Baseline | 1.47× baseline | +47% | Good |
| Session duration | 12.3 min | 9.8 min | -20% | Violated |
| Ad block rate | 2.1% | 3.9% | +86% | Violated |

Do not ship. Short-term +47% revenue, but -20% session duration and +86% ad blocking signals unacceptable user experience. Invest in better ML models or native formats instead.

Designing for Failure: Chaos Engineering

SRE team mandate: Regularly inject failures into production during business hours.

Experiments:

Weekly: Pod termination

Bi-weekly: Latency injection

Monthly: Regional failover

Quarterly: Database failure

Culture:

Blast radius limits:

Managing Technical Debt

20% of engineering capacity per cycle allocated to debt paydown.

Why 20%?

Prioritization framework:

| Impact | Velocity | Risk | Priority | Example |
|---|---|---|---|---|
| Critical | High | High | P0 | DB migration with no rollback |
| High | High | Low | P1 | Monolithic service blocking 3 teams |
| Medium | Low | High | P1 | Flaky test (10% failure) |
| Low | Low | Low | P2 | Code style inconsistency |

Cultural norm: “Leave it better than you found it”


Part 11: Resilience and Failure Scenarios

A robust architecture must survive catastrophic failures, security breaches, and business model pivots. This section addresses three critical scenarios:

Catastrophic Regional Failure: When an entire AWS region fails, our semi-automatic failover mechanism combines Route53 health checks (2-minute detection) with manual runbook execution to promote secondary regions. The critical challenge is budget counter consistency—asynchronous Redis replication creates potential overspend windows during failover. We mitigate this through pre-allocation patterns that limit blast radius to allocated quotas per ad server, bounded by replication lag multiplied by allocation size.

Malicious Insider Attack: Defense-in-depth through zero-trust architecture (SPIFFE/SPIRE for workload identity), mutual TLS between all services, and behavioral anomaly detection on budget operations. Critical financial operations like budget allocations require cryptographic signing with Kafka message authentication, creating an immutable audit trail. Lateral movement is constrained through Istio authorization policies enforcing least-privilege service mesh access.

Business Model Pivot to Guaranteed Inventory: Transitioning from auction-based to guaranteed delivery requires strong consistency for impression quotas. Rather than replacing our eventual-consistent stack, we extend the existing pre-allocation pattern—PostgreSQL maintains source-of-truth counters while Redis provides fast-path allocation with periodic reconciliation. This hybrid approach adds only 10ms to the critical path for guaranteed campaigns while preserving sub-millisecond performance for auction traffic. The 12-month evolution path reuses 80% of existing infrastructure (ML pipeline, feature store, Kafka) while adding campaign management and SLA tracking layers.

These scenarios validate that the architecture is not merely elegant on paper, but battle-hardened for production realities: regional disasters, adversarial threats, and fundamental business transformations.


Part 12: Organizational Design

Team Structure

1. Ad Platform Team

2. ML Platform Team

3. Data Platform Team

4. SRE/Infrastructure Team

Cross-Team Collaboration

API contracts as team boundaries:

ML Team provides:
  POST /ml/predict/ctr
  SLA: p99 <20ms, 99.95% availability

Ad Team consumes without knowing:
  - Model type (GBDT vs Transformer)
  - Feature computation
  - GPU cluster location

Shared observability: All teams contribute to shared dashboards showing business, ML, and infrastructure metrics.

Quarterly roadmap alignment:

| Team | Q3 Goals | Dependencies |
|---|---|---|
| Ad Platform | Launch video ads | Data: Video CDN; ML: Engagement model |
| ML Platform | Improve AUC 0.78→0.82 | Data: Browsing history; SRE: 50 GPUs |
| Data Platform | GDPR deletion 48h→24h | SRE: Cassandra upgrade |
| SRE | Reduce cost 10% | All: Adopt spot instances |

Tiered on-call:

Example incident flow:

12:00 AM: Latency alert (250ms)
12:01 AM: L1 paged
12:02 AM: L1 sees ML latency spike → escalate L2
12:05 AM: L2 finds GPU throttling → escalate L3
12:10 AM: L3 identifies HVAC failure → failover
12:15 AM: Latency restored

Conclusion

This design exercise explored both Day 1 architecture and Day 2 operational realities for a large-scale ads platform:

Architecture:

Operations:

Key Takeaways:

Latency Budget Discipline: Every millisecond counts. Decompose the 100ms budget and optimize critical paths.

Caching is Critical: 95%+ hit rates achievable with three-tier architecture (L1 in-process + L2 Redis + L3 database).

Mathematical Foundations: From queueing theory to game theory to probability, rigorous analysis prevents costly production issues.

Distributed Systems Trade-Offs: Choose consistency level based on data type: strong for financial, eventual for preferences.

Graceful Degradation: Circuit breakers, timeouts, fallbacks prevent cascading failures. Load shedding protects during overload.

Operations Are Not an Afterthought: Progressive rollouts, observability, security, and cost governance are essential for running at scale.

Security and Compliance Are Design Constraints: PII protection and GDPR require architectural decisions from day one. Retrofitting is orders of magnitude harder.

Cost Scales Faster Than Revenue: Without active management, infrastructure spending spirals. Treat cost per request as a first-class metric.

While this remains a design exercise, the patterns explored represent well-established distributed systems approaches. The value was applying principles - queueing theory, game theory, consistency models, resilience patterns - to real-time ad serving constraints.

For anyone actually building such a system:

  1. Start simpler - most don’t need 1M QPS from day one
  2. Evolve based on real traffic rather than speculative requirements
  3. Invest in observability early - you can’t optimize what you can’t measure
  4. Build security/compliance into foundation - nearly impossible to retrofit
  5. Track costs from the beginning - they compound quickly at scale

The complexity documented here represents years of evolution. Understanding it helps make better early decisions that won’t need unwinding later.


References

Distributed Systems:

  1. Kleppmann, M. (2017). Designing Data-Intensive Applications. O’Reilly.
  2. Dean, J., & Barroso, L. A. (2013). The Tail at Scale. CACM, 56(2).

Caching:

  3. Fitzpatrick, B. (2004). Distributed Caching with Memcached. Linux Journal.
  4. Nishtala, R., et al. (2013). Scaling Memcache at Facebook. NSDI.

Auction Theory:

  5. Vickrey, W. (1961). Counterspeculation, Auctions, and Competitive Sealed Tenders. Journal of Finance.
  6. Edelman, B., et al. (2007). Internet Advertising and the Generalized Second-Price Auction. AER.
  7. Myerson, R. (1981). Optimal Auction Design. Mathematics of Operations Research.

Machine Learning:

  8. McMahan, H. B., et al. (2013). Ad Click Prediction: A View from the Trenches. KDD.
  9. He, X., et al. (2014). Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD.

Real-Time Bidding:

  10. IAB Technology Laboratory (2016). OpenRTB API Specification Version 2.5.
  11. Yuan, S., et al. (2013). Real-Time Bidding for Online Advertising: Measurement and Analysis. ADKDD.

Fraud Detection:

  12. Pearce, P., et al. (2014). Characterizing Large-Scale Click Fraud in ZeroAccess. ACM CCS.

Observability:

  13. Beyer, B., et al. (2016). Site Reliability Engineering. O’Reilly.
  14. Beyer, B., et al. (2018). The Site Reliability Workbook. O’Reilly.
  15. Sigelman, B. H., et al. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Technical Report.

Security & Compliance:

  16. GDPR Regulation (EU) 2016/679. General Data Protection Regulation.

Cost Optimization:

  17. AWS Well-Architected Framework (2024). Cost Optimization Pillar.
  18. Cloud FinOps Foundation (2024). FinOps Framework.

