Architecting Real-Time Ads Platforms: A Distributed Systems Engineer's Design Exercise
Author: Yuriy Polyulya
Introduction
Full disclosure: I haven’t built a production ads platform. As a distributed systems engineer, this problem represents a compelling intersection of my interests: real-time processing, distributed coordination, strict latency requirements, and financial accuracy at scale.
What draws me to this challenge is the complexity beneath a deceptively simple facade. A user opens an app, sees a relevant ad in under 100ms, and the advertiser gets billed accurately. Straightforward? Not when you’re coordinating auctions across dozens of bidding partners, running ML predictions in real-time, maintaining budget consistency across regions, and handling over a million queries per second.
Here’s the scale I’m designing for:
- 400M+ daily active users generating continuous ad requests
- 1M+ queries per second during peak traffic
- Sub-100ms p95 latency for the entire request lifecycle
- Real-time ML inference for click-through rate prediction
- Distributed auction mechanisms coordinating with 50+ external bidding partners
- Multi-region deployment with eventual consistency challenges
I’ll walk through my thought process on:
- Requirements modeling and performance constraints
- High-level architecture and latency budgets
- Real-Time Bidding (RTB) integration via OpenRTB
- ML inference pipelines constrained to <50ms
- Distributed caching strategies achieving 95%+ hit rates
- Auction mechanisms and game theory
- Advanced topics: budget pacing, fraud detection, multi-region failover
- What breaks and why - failure modes matter
Cost Pricing Caveat: Throughout this post, I’ll discuss technology cost comparisons. Please note that pricing varies significantly by cloud provider, region, contract negotiations, and changes frequently. The figures I mention are rough approximations to illustrate relative differences for decision-making, not exact pricing. Always verify current pricing for your specific use case.
Part 1: Requirements and Scale Analysis
Functional Requirements
Multi-Format Ad Delivery: Support story, video, and carousel ads across iOS, Android, and web. Creative assets served from CDN ensuring sub-100ms first-byte time.
Real-Time Bidding Integration: Implement OpenRTB 2.5+ to communicate with 50+ DSPs simultaneously. I enforce a 30ms auction timeout (the OpenRTB tmax field) - challenging when network RTTs to some DSPs approach 150ms. Handle both programmatic and guaranteed inventory with distinct SLAs.
ML-Powered Targeting:
- Real-time CTR prediction (serving random ads leaves money on the table)
- Conversion rate optimization
- Dynamic creative optimization
- Budget pacing algorithms (prevent advertisers from exhausting budgets in the first hour)
Campaign Management: Real-time performance metrics, A/B testing framework, frequency capping, quality scoring, and policy compliance.
Non-Functional Requirements
Latency Distribution: $$P(\text{Latency} \leq 100\text{ms}) \geq 0.95$$
Total latency is the sum across the request path: $$T_{total} = \sum_{i=1}^{n} T_i$$
With 5 services each taking 20ms, you’re already at 100ms with zero margin. This is why ruthless latency budgeting is critical.
Throughput: $$Q_{peak} \geq 1.5 \times 10^6 \text{ QPS}$$
Little’s Law gives us server requirements. With service time (S) and (N) servers: $$N = \frac{Q_{peak} \times S}{U}$$
I use (U = 0.7) (70% utilization) for headroom. Running at 90%+ utilization invites disaster.
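To make that concrete, here is a throwaway Python sketch of the calculation; the 50ms mean service time and 200-requests-per-server concurrency are assumptions I'm adding, not numbers from the design:

```python
import math

def capacity_units(peak_qps: float, service_time_s: float, utilization: float) -> int:
    """Little's Law: N = Q_peak * S / U, i.e. concurrent request slots needed."""
    return math.ceil(peak_qps * service_time_s / utilization)

# Assumed inputs: 1.5M peak QPS, 50ms mean service time, 70% target utilization.
slots = capacity_units(1.5e6, 0.050, 0.70)   # ~107,143 in-flight requests
servers = math.ceil(slots / 200)             # assuming ~200 concurrent requests per server
print(slots, servers)                        # 107143, 536
```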
Availability: Targeting 99.95% uptime: $$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \geq 0.9995$$
This allows roughly 22 minutes of downtime per month. A bad deploy can consume that in seconds.
Consistency Requirements:
Not all data needs identical guarantees:
- Financial data (billing): Strong consistency - no exceptions $$\forall t_1 < t_2: \text{Read}(t_2) \text{ observes } \text{Write}(t_1)$$
If an advertiser pays for 1000 impressions, they get exactly 1000. Not 999, not 1001.
- User preferences: Eventual consistency acceptable $$\lim_{t \to \infty} P(\text{AllReplicas consistent}) = 1$$
If a user updates interests and sees old targeting for 10-20 seconds during propagation, it’s not critical.
- Ad inventory: Bounded staleness (~30 seconds) $$|\text{Read}(t) - \text{TrueValue}(t)| \leq \epsilon \text{ for } t - t_{write} \leq 30s$$
Slightly stale counts are acceptable. Over-serving a campaign by 50 impressions out of 100K is negligible and can be reconciled later.
Scale Estimation
Data Volume:
- Daily requests: 400M users × 20 requests/user = 8B requests/day
- Daily logs (1KB each): 8TB/day
Storage:
- User profiles (10KB each): 4TB
- 30-day performance history (100B/impression): ~24TB
Cache Sizing: For 95% hit rate with Zipfian distribution (α = 1.0), cache ~20% of dataset:
- Required cache: 400M users × 20% × 10KB ≈ 800GB
- Distributed across 100 nodes: 8GB/node
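The same estimates as explicit arithmetic, in a disposable sketch (per-item sizes are the assumptions stated above):

```python
DAU = 400e6                     # daily active users
REQ_PER_USER = 20               # ad requests per user per day
LOG_BYTES = 1_000               # ~1KB per request log
PROFILE_BYTES = 10_000          # ~10KB per user profile

daily_requests = DAU * REQ_PER_USER                  # 8e9 requests/day
daily_log_tb = daily_requests * LOG_BYTES / 1e12     # ~8 TB/day of logs
profile_tb = DAU * PROFILE_BYTES / 1e12              # ~4 TB of user profiles
history_tb = daily_requests * 100 * 30 / 1e12        # 100B/impression * 30 days ≈ 24 TB
cache_gb = DAU * 0.20 * PROFILE_BYTES / 1e9          # ~800 GB cached for ~95% hit rate
print(daily_requests, daily_log_tb, profile_tb, history_tb, cache_gb, cache_gb / 100)
```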
Part 2: System Architecture and Technology Stack
High-Level Architecture
```mermaid
graph TD
    %% Client Layer
    CLIENT[Mobile/Web Client<br/>iOS, Android, Browser]
    %% Edge Layer
    CLIENT -->|1. Creative Assets| CDN[CDN<br/>CloudFront/Fastly]
    CLIENT -->|2. Ad Request| GLB[Global Load Balancer<br/>GeoDNS + Health]
    %% Service Layer Entry
    GLB -->|3. Route to Region| GW[API Gateway<br/>Kong: 3-5ms<br/>Auth + Rate Limit]
    GW -->|4. Auth Complete| AS[Ad Server Orchestrator<br/>100ms latency budget]
    %% Parallel Service Fork
    AS -->|5a. Fetch User| UP[User Profile Service<br/>Target: 10ms]
    AS -->|5b. Retrieve Ads| AD_SEL[Ad Selection Service<br/>Target: 15ms]
    AS -->|5c. Predict CTR| ML[ML Inference Service<br/>Target: 40ms]
    AS -->|5d. Run Auction| RTB[RTB Auction Service<br/>Target: 30ms]
    %% External DSPs
    RTB -->|6d. Bid Request| EXTERNAL[50+ DSP Partners<br/>External Bidders]
    %% Data Layer - Redis
    UP -->|6a. Cache Lookup| REDIS[(Redis Cluster<br/>1000 nodes)]
    AD_SEL -->|6b. Cache Lookup| REDIS
    %% Data Layer - Cassandra
    UP -->|7a. On Cache Miss| CASS[(Cassandra<br/>RF=3, CL=QUORUM)]
    AD_SEL -->|7b. On Cache Miss| CASS
    %% Data Layer - Feature Store
    ML -->|6c. Feature Fetch| FEATURE[(Feature Store<br/>Tecton)]
    %% Event Streaming Pipeline
    AS -->|8. Log Events| KAFKA[Kafka<br/>100K events/sec]
    KAFKA -->|9a. Stream Events| FLINK[Flink<br/>Stream Processing]
    FLINK -->|10a. Update Cache| REDIS
    FLINK -->|10b. Archive| S3[(S3/HDFS<br/>Data Lake)]
    %% Batch Processing Pipeline
    S3 -->|9b. Daily Batch| SPARK[Spark<br/>Batch Processing]
    SPARK -->|10c. Update Features| FEATURE
    %% ML Training Pipeline
    S3 --> AIRFLOW[Airflow<br/>Orchestration]
    AIRFLOW -->|11. Schedule| TRAIN[Training Cluster<br/>Daily CTR Model]
    TRAIN -->|12. Store Model| REGISTRY[Model Registry<br/>Versioning]
    REGISTRY -->|13. Deploy| ML
    %% Observability
    AS -.->|Metrics| PROM[Prometheus<br/>Metrics]
    AS -.->|Traces| JAEGER[Jaeger<br/>Tracing]
    PROM -->|Visualize| GRAF[Grafana<br/>Dashboards]
    classDef client fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef edge fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef service fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef data fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stream fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px
    class CLIENT client
    class CDN,GLB edge
    class GW,AS,UP,AD_SEL,ML,RTB service
    class REDIS,CASS,FEATURE data
    class KAFKA,FLINK,SPARK,S3,AIRFLOW,TRAIN,REGISTRY stream
    class EXTERNAL external
    class PROM,JAEGER,GRAF stream
```
Diagram explanation - Request Flow:
Steps 1-4 (Client → Service Layer):
- Creative Assets: Client fetches ad images/videos from CDN (separate from ad selection)
- Ad Request: Client sends request to Global Load Balancer
- Route to Region: GLB uses GeoDNS to route to nearest healthy region
- Authentication: Kong gateway authenticates (JWT), rate limits, and forwards to Ad Server
Steps 5a-5d (Parallel Service Calls): The Ad Server Orchestrator makes 4 parallel calls with independent latency budgets:
- 5a. User Profile (10ms): Fetch user demographics, interests, browsing history
- 5b. Ad Selection (15ms): Retrieve candidate ads matching user’s profile
- 5c. ML Inference (40ms): Predict CTR for each candidate ad (critical path bottleneck)
- 5d. RTB Auction (30ms): Send bid requests to 50+ external DSPs via OpenRTB protocol
Steps 6a-6d (Data Access - Cache First): Each service first checks cache before hitting persistent storage:
- 6a-6b. Redis Cache: User Profile and Ad Selection check Redis (5ms p99)
- 6c. Feature Store: ML fetches pre-computed features from Tecton (10ms p99)
- 6d. External DSPs: RTB waits max 30ms for external bidder responses
Steps 7a-7b (Cache Miss → Database): On cache miss, services read from Cassandra (20ms p99). This happens for ~5% of requests with 95% cache hit rate.
Steps 8-10 (Data Pipeline - Post-Request): After serving the ad (asynchronously):
- 8. Log Events: Ad Server publishes impression/click events to Kafka
- 9a. Stream Processing: Flink computes real-time aggregations (last hour CTR)
- 9b. Batch Processing: Spark processes historical data from S3 daily
- 10a-10c. Feature Updates: Both pipelines update feature stores (Redis + Tecton)
Steps 11-13 (ML Training Loop - Daily):
- 11. Schedule: Airflow triggers daily model retraining
- 12. Store Model: Trained models versioned in registry
- 13. Deploy: New models gradually deployed to inference service (canary rollout)
Observability (Continuous):
- Ad Server emits metrics (latency, error rate) to Prometheus
- Distributed traces sent to Jaeger for debugging
- Grafana visualizes both for dashboards and alerts
Latency Budget Decomposition
For 100ms total budget:
$$T_{total} = T_{network} + T_{gateway} + T_{services} + T_{serialization}$$
Network (10ms):
- Client to edge: 5ms
- Edge to service: 5ms
API Gateway (5ms):
- Authentication: 2ms
- Rate limiting: 1ms
- Request enrichment: 2ms
Service Layer (75ms): Parallel execution bounded by slowest service:
- User Profile: 10ms
- Ad Selection: 15ms
- ML Inference: 40ms ← bottleneck
- RTB Auction: 30ms
With parallelization: 40ms (ML inference)
Remaining:
- Auction logic: 5ms
- Serialization: 5ms
Total: 65ms with 35ms headroom for variance.
Technology Stack Decisions
API Gateway Selection
For a 5ms latency budget, the API gateway choice is critical. Here’s my detailed analysis:
Technology | Latency | Throughput | Rate Limiting | Auth Methods | Ops Complexity |
---|---|---|---|---|---|
Kong | 3-5ms | 100K/node | Plugin-based | JWT, OAuth2, LDAP | Medium |
AWS API Gateway | 5-10ms | 10K/endpoint | Built-in | IAM, Cognito, Lambda | Low (managed) |
NGINX Plus | 1-3ms | 200K/node | Lua scripting | Custom modules | High |
Envoy | 2-4ms | 150K/node | Extension filters | External auth | High |
Decision: Kong
Why Kong won:
- Plugin ecosystem: Rate limiting, auth, transformations work out of the box
- Latency overhead: 3-5ms fits comfortably within 5ms budget
- Cost efficiency: Self-hosted avoids per-request AWS charges
- Developer experience: Declarative config and OpenAPI integration
The breakdown within my 5ms budget:
- TLS termination: ~1ms
- Authentication (JWT verify): ~2ms
- Rate limiting (Redis lookup): ~1ms
- Request routing: ~1ms
- Total: 5ms - tight but workable
Why I was tempted by NGINX Plus:
- That 1-3ms latency is genuinely attractive - best in class
- 200K RPS/node is impressive
- But writing custom Lua for complex auth would add weeks to development
- Finding engineers who know both Lua and auth patterns is harder than it should be
Why I ruled out AWS API Gateway:
At 1M QPS (~86B requests/day), AWS’s per-request pricing becomes problematic:
- Rough cost: At high sustained throughput, can reach hundreds of thousands per month
- The 5-10ms latency overhead is also problematic when I need 2ms for auth
- Great for serverless/event-driven workloads, but not for sustained high throughput
Cost comparison (approximate, varies by region/contract):
Kong self-hosted:
- ~10 nodes with appropriate sizing
- Enterprise license if needed
- Ballpark: ~$5-15K/month depending on configuration
AWS API Gateway:
- Per-request pricing at billions of requests/month
- Ballpark: Could be 10-30× more expensive than self-hosted at this scale
The cost difference at high sustained throughput is significant - potentially enough to fund multiple engineer salaries.
Event Streaming Platform Selection
Before choosing stream processing frameworks, I need the right event streaming backbone. This is a foundational decision:
Technology | Throughput/Partition | Latency (p99) | Durability | Ordering | Scalability |
---|---|---|---|---|---|
Kafka | 100MB/sec | 5-15ms | Disk-based replication | Per-partition | Horizontal (add brokers/partitions) |
Pulsar | 80MB/sec | 10-20ms | BookKeeper (distributed log) | Per-partition | Horizontal (separate compute/storage) |
RabbitMQ | 20MB/sec | 5-10ms | Optional persistence | Per-queue | Vertical (limited) |
AWS Kinesis | 1MB/sec/shard | 200-500ms | S3-backed | Per-shard | Manual shard management |
Decision: Kafka
Here’s my reasoning:
- Throughput: 100MB/sec per partition handles peak load (100K events/sec × 1KB/event = 100MB/sec)
- Latency: 5-15ms p99 leaves headroom in my 100ms feature freshness budget
- Durability: Disk-based replication (RF=3) means no data loss when brokers fail
- Ecosystem: Everything works with Kafka - Flink, Spark, Kafka Connect. “Just works” is generous (you’ll tune GC and debug consumer rebalancing), but compared to alternatives, it’s solid
- Ordering: Per-partition ordering is critical for event causality (can’t process a click before the impression)
Partitioning strategy:
For 100K events/sec across 100 partitions: 1,000 events/sec per partition
Partition key: `user_id % 100` ensures:
- All events for a user go to same partition (maintains ordering)
- Balanced distribution (assuming uniform user distribution)
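As a sketch of that keying strategy (Python with kafka-python; the broker address and the `ad-events` topic name are placeholders):

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",            # placeholder broker address
    key_serializer=str.encode,                 # key bytes drive the partitioner
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",                                # wait for in-sync replicas (durability)
)

def log_event(user_id: int, event: dict) -> None:
    # Same key -> same partition -> a user's impression and click stay ordered.
    producer.send("ad-events", key=str(user_id), value=event)

log_event(12345, {"type": "impression", "ad_id": "ad_987", "ts": 1712345678})
log_event(12345, {"type": "click", "ad_id": "ad_987", "ts": 1712345681})
producer.flush()
```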
Pulsar consideration:
Pulsar’s architecture is genuinely elegant with storage/compute separation, but:
- The ecosystem maturity gap is real
- Separate BookKeeper layer adds operational complexity
- For my 7-day retention requirement, advanced storage architecture seems unnecessary
Why I didn’t pick RabbitMQ:
I like RabbitMQ for certain use cases, but:
- 20MB/sec throughput ceiling is way below what I need (100MB/sec)
- Scaling is painful - mostly vertical, horizontal gets messy
- Built for task queues, not event streaming
Cost comparison (approximate, varies by region/contract):
Kafka self-hosted:
- Several brokers with appropriate compute/storage
- NVMe SSDs for performance
- ZooKeeper cluster for coordination
- Ballpark: ~$3-5K/month for this workload
AWS Kinesis alternative:
- Per-shard hourly fees
- Per-PUT request fees at billions of PUTs per month
- Ballpark: Could be 20-50× more expensive than self-hosted Kafka
The cost difference is substantial at high sustained throughput. Kinesis might make sense for bursty or low-volume workloads where operational simplicity matters more than cost.
Stream Processing: Flink
- <100ms latency for windowed aggregations
- True streaming (not micro-batches)
- Exactly-once semantics critical for financial data
- Distributed snapshots for stateful processing
Batch Processing: Spark
- In-memory processing for feature engineering
- MLlib for statistical aggregations
- SQL interface for data scientists
Feature Store: Tecton
- <10ms p99 serving latency
- 100ms feature freshness
- Managed service reduces operational burden
- Trade-off: Vendor lock-in vs engineering time saved
L2 Cache: Redis Cluster
- 5ms p99 latency
- Atomic operations for budget counters
- Complex data structures (sorted sets, hashes)
- Persistence for crash recovery
- Trade-off: 30% higher memory vs Memcached for operational features
Persistent Store: Cassandra
- Linear scalability to 4TB+ user profiles
- Multi-DC replication built-in
- Tunable consistency (CL=QUORUM for balance)
- At 8B requests/day, self-hosted is 6× cheaper than DynamoDB
Container Orchestration Selection
Container orchestration is critical for handling GPU scheduling and auto-scaling. Here’s my comparison:
Technology | Learning Curve | Ecosystem | Auto-scaling | Multi-cloud | Networking |
---|---|---|---|---|---|
Kubernetes | Steep | Massive (CNCF) | HPA, VPA, Cluster Autoscaler | Yes (portable) | Advanced (CNI, Service Mesh) |
AWS ECS | Medium | AWS-native | Target tracking, step scaling | No (AWS-only) | AWS VPC |
Docker Swarm | Easy | Limited | Basic (replicas) | Yes (portable) | Overlay networking |
Nomad | Medium | HashiCorp ecosystem | Auto-scaling plugins | Yes (portable) | Consul integration |
Decision: Kubernetes
Look, Kubernetes is the obvious choice, but let me explain why this isn’t just cargo-culting:
- Ecosystem: 78% adoption means you can hire people who know it. Try hiring for Docker Swarm or Nomad
- Auto-scaling: HPA based on custom metrics (like inference queue depth) is exactly what I need
- GPU support: Native GPU scheduling with node affinity - critical for ML workloads
- Service mesh: Istio/Linkerd integration gives circuit breaking and traffic splitting
- Portability: Deploy to AWS, GCP, Azure, or on-prem without rewriting
Unpopular opinion: Kubernetes is overly complex for 80% of use cases, but if you need GPU orchestration and multi-cloud portability, you’re kind of stuck with it.
Kubernetes-specific features critical for ads platform:
1. Horizontal Pod Autoscaler (HPA):
- Scale ML inference pods based on custom metrics (inference queue depth)
- Formula: (\text{desired\_replicas} = \lceil \text{current} \times \frac{\text{current\_metric}}{\text{target\_metric}} \rceil)
- Example: Target 80% CPU, current 90% across 10 pods → (\lceil 10 \times \frac{90}{80} \rceil = 12) pods
2. GPU Node Affinity:
- Schedule ML inference pods only on GPU nodes using node selectors
- Prevents GPU resource waste by isolating GPU workloads
3. Service Mesh with Sidecar Pattern (Istio/Envoy):
The sidecar pattern deploys a lightweight proxy (Envoy) alongside each service pod, intercepting all network traffic for observability, security, and resilience—without modifying application code.
```mermaid
graph TD
    subgraph POD1["Pod: Ad Server (Zone A)"]
        APP1[Ad Server<br/>Application]
        SIDECAR1[Envoy Sidecar<br/>50-100MB]
        APP1 -->|"1. localhost:15001<br/>0.1ms"| SIDECAR1
    end
    subgraph POD2["Pod: ML Inference (Zone A)"]
        SIDECAR2[Envoy Sidecar<br/>Connection Pool]
        APP2[ML Service<br/>Application]
        SIDECAR2 -->|"5. localhost<br/>0.1ms"| APP2
    end
    subgraph POD3["Pod: ML Inference (Zone B)"]
        SIDECAR3[Envoy Sidecar<br/>Connection Pool]
        APP3[ML Service<br/>Application]
        SIDECAR3 -->|localhost| APP3
    end
    subgraph POD4["Pod: User Profile (Zone A)"]
        SIDECAR4[Envoy Sidecar]
        APP4[User Profile<br/>Application]
        SIDECAR4 -->|localhost| APP4
    end
    SIDECAR1 -->|"2. HTTP/2 pooled<br/>same-zone: 1ms"| SIDECAR2
    SIDECAR1 -.->|"cross-zone: 3ms<br/>(avoided by locality)"| SIDECAR3
    SIDECAR1 -->|"3. HTTP/2 pooled<br/>1ms"| SIDECAR4
    SIDECAR2 -->|"4. Response<br/>aggregated"| SIDECAR1
    CONTROL[Istio Control Plane<br/>istiod]
    CONTROL -.->|"Config: routes,<br/>policies, certs"| SIDECAR1
    CONTROL -.->|Config| SIDECAR2
    CONTROL -.->|Config| SIDECAR3
    CONTROL -.->|Config| SIDECAR4
    SIDECAR1 -.->|"Telemetry:<br/>traces, metrics"| OBSERV[Prometheus<br/>Jaeger]
    classDef pod fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef sidecar fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef app fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef control fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class POD1,POD2,POD3,POD4 pod
    class SIDECAR1,SIDECAR2,SIDECAR3,SIDECAR4 sidecar
    class APP1,APP2,APP3,APP4 app
    class CONTROL,OBSERV control
```
Diagram explanation - Sidecar Network Flow:
- Application to Local Sidecar (0.1ms): Ad Server makes request to `localhost:15001`, hitting the local Envoy sidecar via the loopback interface with no network latency
- Sidecar to Sidecar (1ms same-zone): Envoy uses a persistent HTTP/2 connection pool to the downstream service's sidecar. Connection already established, no TCP handshake.
- Parallel Requests: Single HTTP/2 connection multiplexes multiple requests (ML inference + User Profile fetch) simultaneously
- Response Path: Responses return through same pooled connections
- Sidecar to Application (0.1ms): Downstream sidecar forwards to local application via localhost
Without Sidecar (Direct Service Calls):
- Each request = new TCP connection (3-way handshake: 1-3ms)
- No locality awareness = random cross-zone calls (3ms vs 1ms)
- Manual retry logic in every service
- Total: ~5-8ms per request
With Sidecar:
- Connection pooling eliminates handshake overhead
- Locality-aware routing prefers same-zone (1ms)
- Automatic retries with circuit breaking
- Total: ~2-4ms per request (40-50% latency reduction)
Network Optimization Benefits:
- Connection pooling: Sidecars maintain persistent connections to downstream services, avoiding TCP handshake overhead (3-way handshake = 1-3ms per request). Application makes local calls to sidecar (0.1ms), sidecar reuses pooled connections.
- Protocol optimization: HTTP/2 multiplexing allows 100+ concurrent requests over single TCP connection, reducing connection overhead from 50+ connections to 2-3 per pod.
- Intelligent load balancing: Client-side load balancing with locality-aware routing (prefer same-zone pods, saving 1-2ms cross-AZ latency). Sidecar tracks real-time endpoint health, removing failed pods from rotation in <1s vs DNS TTL delays (30-60s).
- Retry budgets: Automatic retries with exponential backoff, but capped at 10% additional load to prevent cascading failures. Configurable per-route retry policies.
Operational Benefits:
- Traffic splitting: A/B test new ML model versions (90% traffic to v1, 10% to v2) with header-based routing
- Circuit breaking: Automatic failure detection with configurable thresholds (5xx rate >5% → open circuit, fast-fail requests)
- Observability: Automatic request tracing, RED metrics (Rate, Errors, Duration) per service without instrumentation
- Security: Mutual TLS between all services with automatic certificate rotation (SPIFFE/SPIRE identity)
Latency overhead: Sidecar adds 1-2ms per hop (network stack traversal: app → sidecar → network), but connection pooling and retry logic often save 3-5ms, resulting in net latency improvement of 1-3ms for typical request patterns.
Resource cost: Each Envoy sidecar consumes 50-100MB memory + 0.1-0.2 CPU cores. For 500 pods, this is 25-50GB RAM + 50-100 cores—significant but justified by operational benefits (automatic mTLS, retries, observability) that would otherwise require custom code in every service.
Why not AWS ECS?
ECS is tempting because it’s managed and cheaper:
- Vendor lock-in is real - can’t migrate to GCP/Azure without rewriting
- Auto-scaling limited to CPU/memory target tracking - no custom metrics
- GPU support is janky compared to Kubernetes
- Works great for simple microservices, but for complex ML infrastructure? Not so much
Why not Docker Swarm?
I wish Swarm had succeeded. It’s so much simpler than Kubernetes. But:
- Ecosystem is basically dead - 5% market share, stagnant development
- No GPU scheduling, basic auto-scaling, no service mesh
- Finding engineers who know Swarm is nearly impossible
- Docker Inc. has basically abandoned it in favor of Kubernetes
Cost trade-off (rough comparison for ~100 nodes):
Kubernetes (managed EKS):
- Control plane fees (managed)
- Worker node infrastructure
- Operational overhead (engineering time)
- Rough total: Varies widely depending on configuration
AWS ECS (Fargate):
- Per-vCPU and per-GB-memory pricing
- No control plane fees
- Lower operational overhead
- Generally 10-20% cheaper than Kubernetes for basic workloads
So why choose Kubernetes despite slightly higher costs?
GPU support and multi-cloud portability matter for this use case. ECS Fargate has limited GPU support, and I prefer not being locked into AWS. The premium (perhaps 10-20% higher monthly costs) acts as insurance against vendor lock-in and provides proper GPU scheduling for ML workloads.
Model Serving: TensorFlow Serving
- 30-40ms p99 latency
- Auto-batching for GPU efficiency
- gRPC support (5ms vs 15ms for REST)
- Model versioning for A/B testing
Observability Stack:
- Metrics: Prometheus + Grafana (self-hosted 5-10× cheaper than managed at scale)
- Tracing: Jaeger with Cassandra backend (adaptive sampling)
- Logs: ELK Stack for full-text search (Loki for high-volume app logs)
Deployment Strategy: Kubernetes with Autoscaling
- GPU instances: minimum 3, maximum 10 (roughly 50% of the cost of always-on capacity)
- 90-second warmup acceptable for burst traffic
Critical Path Analysis
```mermaid
graph TD
    A[Request Arrives<br/>t=0ms]
    A -->|5ms: Auth + Rate Limit| B[Gateway<br/>t=5ms]
    B --> FORK{Parallel Fork}
    FORK -->|10ms budget| C[User Profile<br/>Cache: 5ms<br/>DB miss: 20ms]
    FORK -->|15ms budget| D[Ad Selection<br/>Filter 1M ads → 100]
    FORK -->|40ms budget| E[ML Inference<br/>BOTTLENECK<br/>Feature: 10ms<br/>Model: 30ms]
    FORK -->|30ms budget| F[RTB Auction<br/>50 DSPs parallel]
    C -->|Completes t=15ms| G[Join Point<br/>Wait for all<br/>t=45ms]
    D -->|Completes t=20ms| G
    E -->|Completes t=45ms| G
    F -->|Completes t=35ms| G
    G -->|5ms: Rank by eCPM<br/>Select winner| H[Auction Logic<br/>t=50ms]
    H -->|5ms: Serialize JSON<br/>Return to client| I[Response Sent<br/>t=55ms]
    classDef normal fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef bottleneck fill:#ffebee,stroke:#d32f2f,stroke-width:3px
    classDef gateway fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef join fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef fork fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    class A,B,H,I gateway
    class C,D,F normal
    class E bottleneck
    class G join
    class FORK fork
```
Diagram explanation - Critical Path Timing:
Request Arrives (t=0ms): User opens app, triggers ad request to Global Load Balancer
Gateway (t=0→5ms):
- JWT token validation: 2ms
- Rate limiting check (Redis lookup): 1ms
- Request routing to Ad Server: 2ms
- Total: 5ms
Parallel Service Calls (t=5ms → t=45ms): Ad Server forks 4 parallel requests. Total latency bounded by slowest service (not sum):
- User Profile (10ms target, finishes t=15ms):
  - Redis cache hit (95%): 5ms
  - Cassandra on miss (5%): 20ms
  - Weighted average: 0.95×5 + 0.05×20 = 5.75ms
- Ad Selection (15ms target, finishes t=20ms):
  - Retrieve ~1M candidate ads from Cassandra
  - Apply filters (user interests, ad categories): 5ms
  - Reduce to top 100 candidates: 10ms
  - Total: 15ms
- ML Inference (40ms target, finishes t=45ms) - CRITICAL PATH:
  - Feature fetching: 10ms (from Tecton + Redis)
    - Real-time features (time, device): 0ms
    - Near-RT features (last hour CTR): Redis 5ms
    - Batch features (user segments): Tecton 5ms
  - Model inference: 30ms (TensorFlow Serving on GPU)
    - Load 150-dim feature vector
    - GBDT forward pass through 100 trees
    - Return CTR prediction for each of 100 ad candidates
  - Total: 40ms ← This determines overall latency
- RTB Auction (30ms target, finishes t=35ms):
  - Send OpenRTB requests to 50 DSPs in parallel
  - Wait for responses (hard 30ms timeout)
  - Collect ~40 valid bids (10 DSPs timeout)
  - Total: 30ms
Join Point (t=45ms): Ad Server waits for all 4 parallel calls to complete. Since ML Inference takes longest (40ms), join completes at t=45ms. The other services finish earlier but must wait.
Auction Logic (t=45ms → t=50ms):
- Compute eCPM for all candidates: bid × CTR × 1000
- Sort by eCPM descending
- Select highest eCPM as winner
- Calculate second-price: 2nd highest eCPM / (winner CTR × 1000)
- Total: 5ms
Response (t=50ms → t=55ms):
- Serialize winner ad to JSON
- Send HTTP response to client
- Total: 5ms
Total end-to-end latency: 55ms (45ms headroom within 100ms SLA)
Why ML Inference is the bottleneck: Even though ML has the largest budget (40ms), it’s still the critical path because:
- Feature fetching has inherent network latency (10ms)
- Model inference requires GPU computation (30ms)
- Other services finish faster: User Profile (10ms), Ad Selection (15ms), RTB (30ms)
Optimization priority:
- Reduce ML feature fetch: Pre-warm cache, optimize feature store queries
- Accelerate model inference: Model quantization (FP32→INT8), batch optimization
- Parallelize within ML: Fetch features while loading model weights
Any improvement to ML Inference directly improves overall latency. Optimizing other services (e.g., User Profile 10ms→5ms) has no impact since they’re not on critical path.
Circuit Breaker Pattern
Prevent cascading failures with three states:
- CLOSED: Normal operation, monitor error rate
- OPEN: Error rate >10% for 1min → block requests, serve from cache
- HALF-OPEN: After exponential backoff, send 1% test traffic
State transitions: $$\text{If } E(t) > 0.10 \text{ for } \Delta t \geq 60s, \text{ trip circuit}$$
$$T_{backoff} = T_{base} \times 2^{n}, \quad T_{base} = 30s$$
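A compact sketch of those three states and transitions (the thresholds match the ones above; everything else, including the sampling window, is illustrative):

```python
import random
import time

class CircuitBreaker:
    """CLOSED -> OPEN when error rate >10% over the last minute; HALF-OPEN probes with 1% of traffic."""

    def __init__(self, error_threshold=0.10, window_s=60, base_backoff_s=30):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.base_backoff_s = base_backoff_s
        self.state = "CLOSED"
        self.trips = 0                  # n in T_backoff = T_base * 2^n
        self.opened_at = 0.0
        self.events = []                # (timestamp, ok) samples

    def record(self, ok: bool) -> None:
        now = time.time()
        if self.state == "HALF_OPEN":
            # A good probe closes the circuit; a failed probe re-opens it with doubled backoff.
            if ok:
                self.state, self.events = "CLOSED", []
            else:
                self.state, self.opened_at, self.trips = "OPEN", now, self.trips + 1
            return
        self.events.append((now, ok))
        self.events = [(t, o) for t, o in self.events if now - t <= self.window_s]
        errors = sum(1 for _, o in self.events if not o)
        if self.state == "CLOSED" and errors / len(self.events) > self.error_threshold:
            self.state, self.opened_at, self.trips = "OPEN", now, self.trips + 1

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        backoff = self.base_backoff_s * (2 ** self.trips)
        if self.state == "OPEN" and time.time() - self.opened_at >= backoff:
            self.state = "HALF_OPEN"
        if self.state == "HALF_OPEN":
            return random.random() < 0.01   # 1% test traffic
        return False                        # OPEN: fast-fail and serve from cache instead
```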
Part 3: Real-Time Bidding Integration
OpenRTB Protocol
```mermaid
sequenceDiagram
    participant AS as Ad Server
    participant DSP1 as DSP #1
    participant DSP2 as DSP #2-50
    participant Auction as Auction Logic
    Note over AS,Auction: 100ms Total Budget
    AS->>AS: 1. Construct BidRequest<br/>OpenRTB 2.x format<br/>User context + Ad slot
    par Parallel DSP Calls (30ms timeout each)
        AS->>DSP1: 2a. HTTP POST /bid<br/>OpenRTB BidRequest
        activate DSP1
        Note over DSP1: DSP evaluates user<br/>Checks budget<br/>Runs own auction
        DSP1-->>AS: 3a. BidResponse<br/>Price: $5.50<br/>Creative ID
        deactivate DSP1
    and
        AS->>DSP2: 2b. Broadcast to 50 DSPs<br/>Parallel HTTP/2 connections
        activate DSP2
        Note over DSP2: Each DSP responds<br/>independently<br/>Some may timeout
        DSP2-->>AS: 3b. Multiple BidResponses<br/>[$3.20, $4.80, ...]
        deactivate DSP2
    end
    Note over AS: 4. Timeout enforcement (30ms):<br/>Discard late responses<br/>Collected ~40 bids from 50 DSPs
    AS->>Auction: 5. Collected bids +<br/>ML CTR predictions
    activate Auction
    Note over Auction: 6. Compute eCPM for each bid:<br/>eCPM = bid × CTR × 1000<br/>Sort by eCPM descending
    Auction->>Auction: 7. Run GSP Auction<br/>Winner pays 2nd price
    Note over Auction: 8. Calculate price:<br/>p = eCPM₂ / (CTR_winner × 1000)
    Auction-->>AS: 9. Winner + Price<br/>DSP #1, $3.33
    deactivate Auction
    AS-->>DSP1: 10. Win notification<br/>(async, best-effort)<br/>Creative URL
    Note over AS,Auction: Total elapsed: ~35ms<br/>Within 100ms budget
    rect rgb(255, 240, 240)
        Note over AS,DSP2: Failure handling:<br/>- 10 DSPs timeout (discarded)<br/>- Circuit breaker trips if >50% fail<br/>- Fallback to guaranteed inventory
    end
```
Diagram explanation - RTB Auction Flow:
Step 1 (Construct BidRequest): Ad Server builds OpenRTB 2.x compliant BidRequest containing:
- User demographics (age, gender, interests) - privacy-compliant hashed IDs
- Device info (iOS/Android, screen size)
- Ad slot specifications (320×50 banner, minimum bid floor $2.50)
- Maximum response time (tmax=30ms)
Steps 2a-2b (Parallel Broadcast to DSPs): Ad Server simultaneously sends BidRequests to 50+ DSPs using:
- HTTP/2 multiplexing: Single persistent connection per DSP, multiple concurrent requests
- Connection pooling: Reuse TCP connections to avoid handshake overhead (saves ~50ms per cold start)
- Geographic distribution: Regional RTB services in US-East, EU, APAC to minimize network latency
Steps 3a-3b (DSP Responses): Each DSP independently:
- Evaluates user value based on their own ML models
- Checks advertiser budgets and campaign constraints
- Runs internal auction across multiple advertisers
- Returns highest bid + creative ID (or no-bid if user doesn’t match)
Typical response rate: 80% of DSPs respond within 30ms, 20% timeout
Step 4 (Timeout Enforcement): After exactly 30ms, Ad Server:
- Closes bidding window (hard timeout)
- Discards any late responses (prevents auction manipulation)
- Typically collects 40-45 valid bids from 50 DSPs
Step 5 (Combine with ML CTR): Ad Server merges:
- External bids from DSPs (price offered)
- Internal ML CTR predictions (likelihood of click)
- This combination enables revenue optimization: bid × CTR = expected revenue
Steps 6-8 (GSP Auction): Auction logic executes Generalized Second-Price auction:
- Compute eCPM: For each bid, calculate effective CPM = bid × CTR × 1000
- Sort: Rank all bids by eCPM descending
- Select winner: Highest eCPM wins the impression
- Calculate price: Winner pays minimum needed to beat 2nd place (second-price auction)
- Example: Winner bid $5.50 with CTR 0.15 (eCPM $825)
- Second place eCPM $600
- Winner pays: $600 / (0.15 × 1000) = $4.00 (less than their $5.50 bid)
Step 9 (Return Winner): Auction returns winning DSP ID and computed price to Ad Server
Step 10 (Win Notification): Ad Server sends async win notification to winning DSP (best-effort delivery):
- Creative URL to render
- Final price charged
- Impression ID for tracking
Failure Handling (bottom note):
- Timeouts (20%): 10 DSPs out of 50 don’t respond in time → proceed with 40 bids
- Circuit breaker: If >50% of DSPs fail for 1 minute, trip circuit and fallback to guaranteed inventory
- Fallback: Serve house ads or guaranteed campaigns to avoid blank space
BidRequest (simplified):
```json
{
  "id": "req_12345",
  "imp": [{
    "id": "1",
    "banner": {"w": 320, "h": 50},
    "bidfloor": 2.50
  }],
  "user": {
    "id": "hashed_user_id",
    "geo": {"country": "USA"}
  },
  "tmax": 30
}
```
Timeout Handling: Adaptive Strategy
Maintain per-DSP latency histograms (H_{dsp}). Set per-DSP timeout: $$T_{dsp} = \text{min}\left(P_{95}(H_{dsp}), 30ms\right)$$
Progressive Auction:
- Preliminary auction at 20ms with available bids
- Update with late arrivals up to 30ms if they beat current winner
- Advantage: Low latency for fast DSPs, opportunity for high-value slow bids
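A sketch of the per-DSP timeout rule above (the 30ms cap is from the text; the rolling-sample "histogram" is a simplification):

```python
from collections import defaultdict, deque

MAX_TIMEOUT_MS = 30.0

class AdaptiveTimeouts:
    """Track recent response latencies per DSP and set timeout = min(p95, 30ms)."""

    def __init__(self, window: int = 1000):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def observe(self, dsp_id: str, latency_ms: float) -> None:
        self.samples[dsp_id].append(latency_ms)

    def timeout_for(self, dsp_id: str) -> float:
        hist = sorted(self.samples[dsp_id])
        if not hist:
            return MAX_TIMEOUT_MS                    # no data yet: fall back to the hard cap
        p95 = hist[int(0.95 * (len(hist) - 1))]
        return min(p95, MAX_TIMEOUT_MS)

timeouts = AdaptiveTimeouts()
for ms in (8, 9, 11, 12, 14, 40):                    # one slow outlier
    timeouts.observe("dsp_42", ms)
print(timeouts.timeout_for("dsp_42"))                # p95 of recent samples, capped at 30ms
```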
HTTP/2 Connection Pooling
Connection pool sizing: $$P = \frac{Q \times L}{N}$$
Example: 1000 QPS, 30ms latency, 10 servers → 3 connections/server
HTTP/2 benefits:
- Single connection, multiple concurrent requests
- HPACK compression reduces overhead ~70%
- Persistent connections avoid ~50ms of TCP/TLS setup on cold starts
Geographic Distribution
Network latency bounded by speed of light: $$T_{propagation} \geq \frac{d}{c \times 0.67}$$
Example: NY to London (5,585 km) ≈ 28ms propagation - nearly entire RTB budget!
Solution: Regional deployment in US-East, US-West, EU, APAC reduces max distance from 10,000km to ~1,000km: $$T_{propagation} \approx 5ms$$
Savings: 23ms enables including 10+ additional DSPs within budget.
Part 4: ML Inference Pipeline
Feature Engineering Architecture
Features fall into three categories with different freshness requirements:
Real-Time (computed per request):
- User context: time, location, device
- Session features: current browsing, last N actions
Near-Real-Time (pre-computed, 10s TTL):
- User interests: aggregated from last 24h activity
- Ad performance: click rates (last hour)
Batch (pre-computed daily):
- User segments: demographic clusters
- Long-term CTR: 30-day aggregated performance
The feature engineering architecture supports three distinct freshness tiers, each optimized for different latency and accuracy requirements:
graph TB subgraph "Real-Time Feature Pipeline" REQ[Ad Request] --> PARSE[Request Parser] PARSE --> CONTEXT[Context Features
time, location, device
Latency: 5ms] PARSE --> SESSION[Session Features
user actions
Latency: 10ms] end subgraph "Feature Store" CONTEXT --> MERGE[Feature Vector Assembly] SESSION --> MERGE REDIS_RT[(Redis
Near-RT Features
TTL: 10s)] --> MERGE REDIS_BATCH[(Redis
Batch Features
TTL: 24h)] --> MERGE end subgraph "Stream Processing Pipeline" EVENTS[User Events
clicks, views] --> KAFKA[Kafka
100K events/sec] KAFKA --> FLINK[Flink
Windowed Aggregation
last hour CTR] FLINK --> REDIS_RT end subgraph "Batch Processing Pipeline" S3[S3 Data Lake
Historical Data] --> SPARK[Spark Jobs
Daily] SPARK --> FEATURE_GEN[Feature Generation
User segments, 30-day CTR] FEATURE_GEN --> REDIS_BATCH end MERGE --> INFERENCE[ML Inference
TensorFlow Serving
Latency: 40ms] INFERENCE --> PREDICTION[CTR Prediction
0.0 - 1.0] classDef realtime fill:#e3f2fd,stroke:#1976d2,stroke-width:2px classDef store fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px classDef stream fill:#ffe0b2,stroke:#e65100,stroke-width:2px classDef batch fill:#e8f5e9,stroke:#4caf50,stroke-width:2px classDef inference fill:#fff9c4,stroke:#fbc02d,stroke-width:3px class REQ,PARSE,CONTEXT,SESSION realtime class REDIS_RT,REDIS_BATCH,MERGE store class EVENTS,KAFKA,FLINK stream class S3,SPARK,FEATURE_GEN batch class INFERENCE,PREDICTION inference
Diagram explanation:
Real-Time Feature Pipeline (top-left): When an ad request arrives, the Request Parser extracts immediate context (time of day, device type, location) in ~5ms and session features (recent browsing history) in ~10ms. These features have zero staleness but limited richness.
Stream Processing Pipeline (bottom-left): User events (clicks, views) flow through Kafka at 100K events/sec to Flink, which computes windowed aggregations like “user’s CTR in the last hour.” These features update every 10 seconds (TTL) in Redis, balancing freshness with computational cost.
Batch Processing Pipeline (bottom-center): Daily Spark jobs process historical data from S3 to generate complex features like user demographic segments and 30-day CTR averages. These features update daily (TTL: 24h) but provide deep insights not possible in real-time.
Feature Store (center): The Feature Vector Assembly component merges all three tiers - real-time context, near-real-time behavioral signals, and batch-computed segments - into a single 150-dimensional vector. This assembled vector feeds ML Inference.
ML Inference (right): TensorFlow Serving receives the 150-dim feature vector and predicts CTR in <40ms, the critical path bottleneck in our system.
Feature Freshness Guarantees:
- Batch features: (t_{fresh} \leq 24h)
- Stream features: (t_{fresh} \leq 100ms)
- Real-time features: (t_{fresh} = 0) (computed per request)
Latency SLA: $$P(\text{FeatureLookup} \leq 10ms) \geq 0.99$$
Achieved with:
- Redis p99 latency: 5ms
- Feature vector assembly: 2ms
Feature Vector Construction
$$x = [x_{user}, x_{ad}, x_{context}, x_{cross}]$$
- User features (50-dim): demographics, interests, historical CTR
- Ad features (30-dim): creative type, category, global CTR
- Context features (20-dim): time, device, placement
- Cross features (50-dim): user-ad interactions, interest alignment
Total: 150 features
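Assembling that vector is mostly concatenation plus shape checks. A sketch, assuming upstream helpers already return fixed-width numpy blocks:

```python
import numpy as np

USER_DIM, AD_DIM, CTX_DIM, CROSS_DIM = 50, 30, 20, 50   # 150 total, as above

def assemble_feature_vector(user_f: np.ndarray, ad_f: np.ndarray,
                            ctx_f: np.ndarray, cross_f: np.ndarray) -> np.ndarray:
    """x = [x_user, x_ad, x_context, x_cross]; shape checks catch upstream schema drift."""
    for block, dim in ((user_f, USER_DIM), (ad_f, AD_DIM), (ctx_f, CTX_DIM), (cross_f, CROSS_DIM)):
        assert block.shape == (dim,), f"unexpected block shape {block.shape}"
    return np.concatenate([user_f, ad_f, ctx_f, cross_f]).astype(np.float32)

# One row per ad candidate: scoring 100 candidates means a (100, 150) matrix for the model.
x = assemble_feature_vector(np.zeros(50), np.zeros(30), np.zeros(20), np.zeros(50))
print(x.shape)  # (150,)
```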
Model Architecture Decision
Technology Selection: ML Model Architecture
For CTR prediction within a 40ms latency budget, the model family choice balances accuracy, inference speed, and operational complexity:
Model Family | Inference Latency | Infrastructure | Accuracy | Operational Complexity |
---|---|---|---|---|
Tree-based (GBDT) | 5-15ms (CPU) | Standard compute | High (0.78-0.82 AUC) | Low - feature importance aids debugging |
Deep Learning (DNN) | 20-40ms (GPU) | GPU clusters | Highest (0.80-0.84 AUC) | High - black box, slow iteration |
Linear (FM/FFM) | 3-8ms (CPU) | Standard compute | Moderate (0.75-0.78 AUC) | Low - interpretable weights |
Selection rationale: Tree-based models (LightGBM/XGBoost) offer the best trade-off—meeting latency requirements on CPU infrastructure while delivering strong accuracy. Deep learning provides 2-3% AUC gains but requires GPU infrastructure (50-100× cost increase at 1M+ QPS) and exceeds our latency budget. Linear models are fastest but sacrifice too much accuracy for the marginal latency improvement.
Framework Comparison for Production Serving:
Framework | Latency (p99) | Throughput | GPU Support | Deployment | Best For |
---|---|---|---|---|---|
TensorFlow Serving | 15-25ms | High (batching) | Excellent | gRPC/REST | DNN models, GPU inference |
ONNX Runtime | 8-15ms | Very High | Good | Embedded/REST | Cross-framework, CPU optimization |
Triton Inference Server | 10-20ms | Highest | Excellent | Multi-model batching | Heterogeneous GPU workloads |
Native (LightGBM/XGBoost) | 5-12ms | High | Limited | Custom service | Tree models, lowest latency |
Deployment approach: For GBDT models, native LightGBM serving wrapped in a lightweight gRPC service provides optimal latency. For future DNN models, TensorFlow Serving offers production-grade features (model versioning, A/B testing, request batching) with acceptable overhead. ONNX Runtime serves as a middle ground, offering hardware acceleration across CPU/GPU with framework portability.
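A minimal sketch of the native-serving path (LightGBM's Python API; the model file name is hypothetical and the gRPC wrapper is omitted):

```python
import numpy as np
import lightgbm as lgb

# Hypothetical model artifact produced by the daily training job.
booster = lgb.Booster(model_file="ctr_gbdt_v42.txt")

def score_candidates(features: np.ndarray) -> np.ndarray:
    """features: (n_candidates, 150) float32 matrix; returns predicted CTR per candidate."""
    return booster.predict(features)

# Score the ~100 candidates for one request in a single batched call.
ctr = score_candidates(np.random.rand(100, 150).astype(np.float32))
print(ctr.shape, float(ctr.max()))
```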
Scaling Strategy:
Horizontal autoscaling based on request throughput using Kubernetes HPA. Target utilization: 70-80% to handle traffic spikes. GPU workloads require careful node selection to avoid scheduling on CPU-only nodes. Request batching (accumulate 5-10ms, batch size 32-64) amortizes overhead and improves GPU utilization by 3-5×.
Optimization techniques: Model quantization (FP32 → INT8) reduces memory footprint by 4× and improves throughput by 2-3× with minimal accuracy loss (<1% AUC degradation). Critical for cost efficiency at scale.
Part 5: Distributed Caching
Multi-Tier Cache Technology Selection
To achieve 95%+ cache hit rate with sub-10ms latency, I need three cache tiers with different characteristics:
L1 Cache (In-Process) Options:
Technology | Latency | Throughput | Memory | Pros | Cons |
---|---|---|---|---|---|
Caffeine (JVM) | ~1μs | 10M ops/sec | In-heap | Window TinyLFU eviction, lock-free reads | JVM-only, GC pressure |
Guava Cache | ~2μs | 5M ops/sec | In-heap | Simple API, widely used | LRU only, lower hit rate |
Ehcache | ~1.5μs | 8M ops/sec | In/off-heap | Off-heap option reduces GC | More complex configuration |
Decision: Caffeine - Superior eviction algorithm (Window TinyLFU) yields 10-15% higher hit rates than LRU-based alternatives. Benchmarks show 2× throughput vs Guava.
L2 Cache (Distributed) Options:
Technology | Latency (p99) | Throughput | Clustering | Data Structures | Pros | Cons |
---|---|---|---|---|---|---|
Redis Cluster | 5ms | 100K ops/sec/node | Native sharding | Rich (lists, sets, sorted sets) | Lua scripting, atomic ops | More memory than Memcached |
Memcached | 3ms | 150K ops/sec/node | Client-side sharding | Key-value only | Lower memory, simpler | No atomic ops, no persistence |
Hazelcast | 8ms | 50K ops/sec/node | Native clustering | Rich data structures | Java integration | Higher latency, less mature |
Decision: Redis Cluster
- Need atomic operations for budget counters (DECRBY, INCRBY)
- Complex data structures for ad metadata (sorted sets for recency, hashes for attributes)
- Persistence for crash recovery (avoid cold cache startup)
- Trade-off accepted: 30% higher memory usage vs Memcached for operational simplicity
Memcached typically costs ~30% less than Redis for equivalent capacity, but Redis offers atomic operations and richer data structures that justify the premium for this use case.
L3 Persistent Store Options:
Technology | Read Latency (p99) | Write Throughput | Scalability | Consistency | Pros | Cons |
---|---|---|---|---|---|---|
Cassandra | 20ms | 500K writes/sec | Linear (peer-to-peer) | Tunable (CL=QUORUM) | Multi-DC, no SPOF | No JOINs, eventual consistency |
PostgreSQL | 15ms | 50K writes/sec | Vertical + sharding | ACID transactions | SQL, JOINs, strong consistency | Manual sharding complex |
MongoDB | 18ms | 200K writes/sec | Horizontal sharding | Tunable | Flexible schema | Less mature than Cassandra |
DynamoDB | 10ms | 1M writes/sec | Fully managed | Eventual/strong | Auto-scaling, no ops | Vendor lock-in, cost at scale |
Decision: Cassandra
- Scale requirement: 400M users → 4TB+ user profiles
- Write-heavy: User profile updates, ad impression logs
- Multi-region: Built-in multi-datacenter replication (RF=3)
- Linear scalability: Add nodes without rebalancing entire cluster
PostgreSQL limitation: Vertical scaling hits ceiling around 50-100TB. Sharding (e.g., Citus) adds operational complexity comparable to Cassandra, but without peer-to-peer resilience.
DynamoDB cost analysis: At 8B requests/day, with write units priced around \$1.25 per million: $$\text{Cost}_{DynamoDB} = 8 \times 10^9 \times \$1.25 \times 10^{-6} \approx \$10K/day = \$300K/month$$
Cassandra on 100 nodes @ $500/node/month: $50K/month (6× cheaper at this scale).
Three-Tier Cache Hierarchy
```mermaid
graph TB
    REQ[Cache Request<br/>user_id: 12345]
    REQ --> L1[L1: Caffeine JVM Cache<br/>10-second TTL<br/>~1μs lookup<br/>100MB per server]
    L1 --> L1_HIT{Hit?}
    L1_HIT -->|60% Hit| RESP1[Response ~1μs]
    L1_HIT -->|40% Miss| L2[L2: Redis Cluster<br/>30-second TTL<br/>~5ms lookup<br/>1000 nodes × 16GB]
    L2 --> L2_HIT{Hit?}
    L2_HIT -->|35% Hit| POPULATE_L1[Populate L1]
    POPULATE_L1 --> RESP2[Response ~5ms]
    L2_HIT -->|5% Miss| L3[L3: Cassandra Ring<br/>Multi-DC Replication<br/>~20ms read<br/>Petabyte scale]
    L3 --> POPULATE_L2[Populate L2 + L1]
    POPULATE_L2 --> RESP3[Response ~20ms]
    MONITOR[Stream Processor<br/>Flink/Spark<br/>Count-Min Sketch] -.->|0.1% sampling| L2
    MONITOR -.->|Detect hot keys| REPLICATE[Dynamic Replication<br/>3x copies for hot keys]
    REPLICATE -.->|Replicate to nodes| L2
    PERF[Total Hit Rate: 95%<br/>Average Latency: 2.75ms<br/>p99 Latency: 25ms]
    classDef l1cache fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef l2cache fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef l3cache fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef response fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef monitor fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stats fill:#e0f7fa,stroke:#00838f,stroke-width:3px
    class REQ,RESP1,RESP2,RESP3 response
    class L1,L1_HIT l1cache
    class L2,L2_HIT l2cache
    class L3 l3cache
    class MONITOR,REPLICATE monitor
    class PERF stats
    class POPULATE_L1,POPULATE_L2 response
```
Diagram explanation:
L1 (In-Process Cache): Caffeine cache stores the hottest 10% of data (100MB per server) with 10-second TTL. Sub-microsecond lookups serve 60% of requests. When L1 misses, the request cascades to L2.
L2 (Distributed Cache): Redis Cluster with 1000 nodes (16GB each) stores 20% of total data with 30-second TTL. Serves 35% of total requests (87.5% of L1 misses) in ~5ms. Consistent hashing distributes keys across nodes.
L3 (Persistent Store): Cassandra provides the source of truth with multi-datacenter replication (RF=3). Handles remaining 5% of requests in ~20ms. On cache miss, data populates both L2 and L1 for subsequent requests.
Hot Key Detection: Stream processor samples 0.1% of traffic using Count-Min Sketch to detect celebrity users generating 100× normal load. Hot keys are dynamically replicated 3× across Redis nodes to prevent single-node saturation.
Overall Performance: 95% total hit rate with 2.75ms average latency, meeting our <10ms SLA.
Cache Hit Rate Calculation (the L2 rate is conditional on an L1 miss, i.e. 87.5%): $$H_{cache} = H_1 + (1-H_1)H_2 = 0.60 + 0.40 \times 0.875 = 0.95$$
Average Latency: $$\mathbb{E}[L] = 0.60 \times 0.001 + 0.35 \times 5 + 0.05 \times 20 = 2.75ms$$
Redis Cluster Sharding
Hash slot assignment: $$\text{slot}(k) = \text{CRC16}(k) \mod 16384$$
1000 nodes, 16,384 slots → ~16 slots per node
Load variance with virtual nodes: $$\text{Var}[\text{load}] = \frac{\mu}{n \times v}$$
1000 QPS, 1000 nodes, 16 virtual nodes → standard deviation ≈ 25% of mean
Hot Partition Mitigation: Count-Min Sketch
Why Count-Min Sketch?
Track frequency of millions of user IDs without gigabytes of memory.
Approach | Memory | Accuracy | Use Case |
---|---|---|---|
Exact HashMap | O(n) = GBs | 100% | Too expensive |
Sampling | Minimal | Misses rare keys | Ineffective |
Count-Min Sketch | 5.4 KB | 99% ± 1% | Perfect fit |
Configuration: width=272, depth=5
Error bound: $$\text{estimate}(k) \leq \text{true}(k) + \frac{e \times N}{w}$$
For 100M requests/hour: error <30%, sufficient for detecting >1000 req/sec hot keys.
Dynamic Replication: When frequency >100 req/sec:
- Replicate key to (R=3) nodes
- Update routing table
- Client load-balances reads across replicas
Per-replica load: (\lambda_{replica} = \lambda / R = 1000 / 3 ≈ 333) req/sec
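A toy Count-Min Sketch with the width and depth configured above (the hashing scheme and the example threshold are illustrative):

```python
import hashlib

class CountMinSketch:
    """w=272, d=5 as configured above: ~5.4KB of counters; point queries can only overestimate."""

    def __init__(self, width: int = 272, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1500):
    cms.add("user:celebrity")          # hot key
cms.add("user:normal")
print(cms.estimate("user:celebrity"))  # >= 1500; flag for replication once above the threshold
```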
Workload Isolation
Prevent batch workloads from interfering with serving:
Cassandra data-center aware replication:
```sql
CREATE KEYSPACE user_profiles
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us_east': 2,     // Serving replicas
  'eu_central': 1   // Batch replica
};
```
- Batch writes: `CONSISTENCY LOCAL_ONE` (EU-Central)
- Serving reads: `CONSISTENCY LOCAL_QUORUM` (US-East)
Cost: Dedicate 33% of replicas to batch, but serving latency unaffected.
Cache Invalidation
Hybrid Strategy:
- Normal updates: TTL=30s (passive)
- Critical updates (GDPR opt-out): Active invalidation via Kafka (15ms propagation)
- Version metadata: For tracking update history
Part 6: Auction Mechanisms
Generalized Second-Price (GSP)
Effective bid (eCPM): $$\text{eCPM}_i = b_i \times \text{CTR}_i \times 1000$$
Winner selection: $$w = \arg\max_{i} \text{eCPM}_i$$
Price (second-price): $$p_w = \frac{\text{eCPM}_{2nd}}{\text{CTR}_w \times 1000} + \epsilon$$
Example:
Advertiser | Bid | CTR | eCPM | Rank |
---|---|---|---|---|
A | 5.00 | 0.10 | 500 | 2 |
B | 4.00 | 0.15 | 600 | 1 ← Winner |
C | 6.00 | 0.05 | 300 | 3 |
Winner B pays: (\frac{500}{0.15 \times 1000} = 3.33) (bid 4.00 but only pays enough to beat A)
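The same auction as a small function (a sketch; bids are (advertiser, bid, CTR) tuples and the ε increment is dropped for readability):

```python
def run_gsp(bids):
    """bids: list of (advertiser, bid_dollars, predicted_ctr). Returns (winner, price_paid)."""
    ranked = sorted(bids, key=lambda b: b[1] * b[2] * 1000, reverse=True)  # sort by eCPM
    winner_id, winner_bid, winner_ctr = ranked[0]
    if len(ranked) == 1:
        return winner_id, winner_bid                   # no competition: pay own bid (or reserve)
    runner_up_ecpm = ranked[1][1] * ranked[1][2] * 1000
    price = runner_up_ecpm / (winner_ctr * 1000)       # just enough to beat 2nd place
    return winner_id, round(price, 2)

print(run_gsp([("A", 5.00, 0.10), ("B", 4.00, 0.15), ("C", 6.00, 0.05)]))  # ('B', 3.33)
```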
Game Theory: GSP vs VCG
GSP is not truthful (proven by counterexample in auction theory literature). Advertisers can increase profit by bidding below true value. However:
Why use GSP?
- Higher revenue than VCG in practice
- Simpler to explain
- Industry standard (Google Ads)
- Nash equilibrium exists
Computational Complexity:
- GSP: (O(N \log N)) (sort + select)
- VCG: (O(N^2 \log N)) (counterfactual allocations)
For 50 DSPs: GSP=282 ops, VCG=14,100 ops
Recommendation: GSP for real-time, VCG for offline optimization.
Reserve Price Optimization
Optimal reserve price (Myerson 1981): $$r^* = \arg\max_r \left[ r \times \left(1 - F(r)\right) \right]$$
Balance: high reserve → more revenue/ad but fewer filled; low reserve → more filled but lower revenue/ad.
For uniform distribution (F(v) = v/v_{max}): $$r^* = \frac{v_{max}}{2}$$
Empirical approach: Use 50th percentile (median) of historical bids as reserve.
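The empirical version is essentially one line (the bid history here is made up):

```python
import numpy as np

historical_bids = np.array([1.2, 2.5, 3.1, 4.0, 2.2, 5.5, 3.3, 2.9])  # illustrative CPM bids
reserve_price = float(np.percentile(historical_bids, 50))              # median bid as reserve
print(reserve_price)
```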
Part 7: Advanced Topics
Budget Pacing: Distributed Spend Control
Problem: Advertisers set daily budgets (e.g., $10,000/day). In a distributed system serving 1M QPS, how do we prevent over-delivery without centralizing every spend decision?
Challenge:
Centralized approach (single database tracks spend):
- Latency: ~10ms per spend check
- Throughput bottleneck: ~100K QPS max
- Single point of failure
Solution: Pre-Allocation with Periodic Reconciliation
```mermaid
graph TD
    ADV[Advertiser X<br/>Daily Budget: $10,000]
    ADV --> BUDGET[Budget Controller]
    BUDGET --> REDIS[(Redis<br/>Atomic Counters)]
    BUDGET --> POSTGRES[(PostgreSQL<br/>Audit Log)]
    BUDGET -->|Allocate $50| AS1[Ad Server 1]
    BUDGET -->|Allocate $75| AS2[Ad Server 2]
    BUDGET -->|Allocate $100| AS3[Ad Server 3]
    AS1 -->|Spent: $42<br/>Return: $8| BUDGET
    AS2 -->|Spent: $68<br/>Return: $7| BUDGET
    AS3 -->|Spent: $95<br/>Return: $5| BUDGET
    BUDGET -->|Hourly reconciliation| POSTGRES
    TIMEOUT[Timeout Monitor<br/>5min intervals] -.->|Release stale<br/>allocations| REDIS
    REDIS -->|Budget < 10%| THROTTLE[Dynamic Throttle]
    THROTTLE -.->|Reduce allocation<br/>size $100→$10| BUDGET
    classDef advertiser fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef controller fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef server fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef storage fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef monitor fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    class ADV advertiser
    class BUDGET,THROTTLE controller
    class AS1,AS2,AS3 server
    class REDIS,POSTGRES storage
    class TIMEOUT monitor
```
Diagram explanation:
Budget Controller (center): Manages the advertiser’s $10,000 daily budget across hundreds of ad servers. Uses Redis for fast atomic operations and PostgreSQL for durable audit logs.
Pre-Allocation (Budget → Ad Servers): Instead of checking budget on every impression, ad servers request budget chunks upfront (e.g., $50-$100). The Budget Controller atomically decrements the remaining budget using Redis `DECRBY`, preventing race conditions across distributed servers.
Return Unused Budget (Ad Servers → Budget): After serving ads, servers report actual spend. If they allocated $100 but only spent $95, they return $5 using Redis `INCRBY`. This prevents over-locking budget.
Timeout Monitor: Every 5 minutes, scans for allocations older than 10 minutes with no spend report. These likely represent crashed servers holding budget hostage. Automatically returns their allocations to the pool.
Dynamic Throttle: When budget drops below 10%, reduces allocation chunk size from $100 → $10. This prevents over-delivery when approaching budget exhaustion.
PostgreSQL Audit Log: All allocations and spends are logged to PostgreSQL for billing reconciliation and debugging, providing a durable audit trail.
Budget Allocation Algorithm:
The core algorithm has three operations:
1. Request Allocation:
- Ad server requests budget chunk (e.g., $100) from Budget Controller
- Controller atomically decrements remaining budget using Redis `DECRBY` (prevents race conditions)
- If budget is low (<10% remaining), reduce allocation size to prevent over-delivery
- Log allocation to PostgreSQL for audit trail
- Return allocated amount (or 0 if budget exhausted)
2. Report Spend:
- Ad server reports actual spend after serving ads
- If `actual < allocated`, return unused portion via Redis `INCRBY`
- Log actual spend to PostgreSQL for billing reconciliation
- Example: Allocated $100, spent $87 → return $13 to budget pool
3. Timeout Monitor (Background):
- Every 5 minutes, scan for allocations older than 10 minutes with no spend report
- These likely represent crashed servers holding budget hostage
- Automatically return their allocations to the budget pool via `INCRBY`
- Prevents budget being permanently locked by failed servers
Key design decisions:
- Redis for speed: Atomic counters provide strong consistency with <1ms latency
- PostgreSQL for durability: Audit log enables billing reconciliation and debugging
- Pre-allocation strategy: Servers get budget chunks upfront, avoiding per-request latency
- Dynamic sizing: Reduce allocation chunks when budget is low to minimize over-delivery risk
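A sketch of the allocate/report pair against Redis using redis-py; the key naming, the chunk-shrinking rule, and the stubbed PostgreSQL audit writes are illustrative, not the production protocol:

```python
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)   # placeholder connection

def allocate(campaign_id: str, chunk_cents: int, daily_budget_cents: int) -> int:
    """Atomically reserve a budget chunk; returns cents granted (0 if exhausted)."""
    key = f"budget:remaining:{campaign_id}"
    remaining = int(r.get(key) or 0)
    if remaining < daily_budget_cents * 0.10:          # low budget: shrink chunks 10x
        chunk_cents = max(chunk_cents // 10, 1)
    after = r.decrby(key, chunk_cents)                 # atomic reservation
    if after < 0:                                      # overshot: give the chunk back
        r.incrby(key, chunk_cents)
        return 0
    # audit_log.insert(campaign_id, "allocate", chunk_cents)  # PostgreSQL write, omitted here
    return chunk_cents

def report_spend(campaign_id: str, allocated_cents: int, spent_cents: int) -> None:
    """Return the unspent part of an allocation to the shared pool."""
    unused = allocated_cents - spent_cents
    if unused > 0:
        r.incrby(f"budget:remaining:{campaign_id}", unused)
    # audit_log.insert(campaign_id, "spend", spent_cents)     # PostgreSQL write, omitted here
```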
Over-delivery bound: $$\text{OverDelivery}_{max} = S \times A$$
Where (S) = number of servers, (A) = allocation per server
Example: 100 servers × 1% daily budget allocation = 100% max over-delivery (entire daily budget)
Mitigation: When budget <10%, reduce allocation: $$A_{new} = \frac{B_r}{S \times 10}$$
Reduces max over-delivery to ~1%.
Fraud Detection: Multi-Tier Filtering
Problem: Detect and block fraudulent ad clicks in real-time without adding significant latency.
Fraud Types:
- Click Farms: Bots or paid humans generating fake clicks
- SDK Spoofing: Fake app installations reporting ad clicks
- Domain Spoofing: Fraudulent publishers misrepresenting site content
- Ad Stacking: Multiple ads layered, only top visible but all “viewed”
Detection Strategy: Multi-Tier Filtering
```mermaid
graph TB
    REQ[Ad Request/Click] --> L1{L1: Bloom Filter<br/>Simple Rules<br/>0ms overhead}
    L1 -->|Known bad IP| BLOCK1[Block<br/>Bloom Filter]
    L1 -->|Pass| L2{L2: Behavioral<br/>5ms latency}
    L2 -->|Suspicious pattern| PROB[Probabilistic Block<br/>50% traffic]
    L2 -->|Pass| L3{L3: ML Model<br/>Async, 20ms}
    L3 -->|Fraud score > 0.8| BLOCK3[Post-Facto Block<br/>Refund advertiser]
    L3 -->|Pass| SERVE[Serve Ad]
    PROB --> SERVE
    SERVE -.->|Log for analysis| OFFLINE[Offline Analysis<br/>Update models]
    OFFLINE -.->|Update rules| L1
    OFFLINE -.->|Retrain model| L3
    classDef request fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef filter fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef block fill:#ffebee,stroke:#d32f2f,stroke-width:2px
    classDef warning fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef success fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef async fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class REQ request
    class L1,L2,L3 filter
    class BLOCK1,BLOCK3 block
    class PROB warning
    class SERVE success
    class OFFLINE async
```
Diagram explanation:
L1 (Bloom Filter): Every request first checks against a Bloom filter containing 100K known fraudulent IPs. This adds virtually zero latency (~1 microsecond) and immediately blocks obvious repeat offenders. The filter has 0.01% false positive rate but zero false negatives - we never accidentally allow known fraudsters.
L2 (Behavioral Heuristics): Requests that pass L1 enter behavioral scoring with <5ms overhead. This synchronous layer uses simple rules to score suspicious patterns without ML complexity.
L3 (ML Model): After serving the ad (to maintain low latency), an async ML model analyzes the request with 50+ features. High fraud scores (>0.8) trigger post-facto blocking and advertiser refunds.
Offline Analysis: Logs from all tiers feed back into model retraining and rule updates, creating a continuous improvement loop.
L1: Bloom Filter for IP Blocklist
Why Bloom filters are ideal for fraud IP blocking:
At 1M+ QPS, we need to check every incoming request against a blocklist of known fraudulent IPs. A naive hash table would require significant memory (several MB per server) and introduce cache misses.
Bloom filters solve this perfectly:
Space efficiency: 10⁷ bits (only 1.25 MB) can represent 100K IPs with 0.01% false positive rate. That’s 100× more space-efficient than a hash table storing full IP addresses. The entire data structure fits in L2 cache, enabling sub-microsecond lookups.
Zero false negatives: If a Bloom filter says an IP is NOT in the blocklist, it’s guaranteed correct. We never accidentally allow known fraudsters through. The only risk is false positives (0.01% chance of blocking a legitimate IP), which is acceptable - we have L2/L3 filters to catch legitimate users.
Constant-time lookups: O(k) hash operations regardless of set size. With k=7 hash functions, we can check membership in <1 microsecond - adding virtually zero latency to the request path.
Lock-free reads: Multiple threads can query simultaneously without contention. Critical for handling 1M+ QPS across hundreds of cores.
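For illustration, a self-contained Bloom-filter sketch in Python: it mirrors the 10⁷-bit / k=7 figures above and derives the k positions from a single MD5 digest via double hashing. Everything else (class name, example IPs) is an assumption, not production code.

```python
# Illustrative Bloom filter for an IP blocklist: 10^7 bits, k = 7 hash positions
# derived from one MD5 digest via double hashing. Everything here is a sketch.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 10_000_000, k: int = 7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8 + 1)   # ~1.25 MB bit array

    def _positions(self, item: str):
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1   # odd stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

blocklist = BloomFilter()
blocklist.add("203.0.113.7")
assert "203.0.113.7" in blocklist      # known fraudster: never missed
print("198.51.100.1" in blocklist)     # unseen IP: almost always False
```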
L2: Behavioral Heuristics (<5ms overhead)
Scoring (0-1 scale, >0.6 triggers blocking):
- Click rate anomaly (40% weight): >30 clicks/min = bot behavior
- User agent switching (20% weight): >5 different agents from same IP = suspicious
- Geographic inconsistency (15% weight): US IP + Chinese browser language = potential fraud
- Abnormal CTR (25% weight): >50% CTR (clicking every ad) = clearly fake
Why not ML here? L2 must be synchronous (runs before ad serving). ML models (20ms) run async post-serving to avoid adding latency.
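A hedged sketch of how the weighted L2 score above could be computed synchronously; the request fields are assumptions, not an actual schema.

```python
# Hedged sketch of the synchronous L2 score; the request fields (clicks_per_min,
# distinct_user_agents, geo_country, browser_lang, session_ctr) are assumptions.
def behavioral_score(req) -> float:
    score = 0.0
    if req.clicks_per_min > 30:                  # click-rate anomaly (40%)
        score += 0.40
    if req.distinct_user_agents > 5:             # UA switching from one IP (20%)
        score += 0.20
    if req.geo_country == "US" and req.browser_lang.startswith("zh"):
        score += 0.15                            # geographic inconsistency (15%)
    if req.session_ctr > 0.50:                   # clicking (nearly) every ad (25%)
        score += 0.25
    return score

def should_block(req) -> bool:
    return behavioral_score(req) > 0.6           # synchronous, well under 5ms
```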
L3: ML Fraud Score (Async)
GBDT model with 50+ features:
- Device fingerprint entropy: Low entropy = scripted behavior
- Click timestamp distribution: Uniform intervals → bot, bursty → human
- Network ASN: Hosting provider vs ISP (data centers are suspicious)
- Session depth: Fraud typically has shallow sessions
Handling class imbalance (1% fraud):
- Adjust class weights: fraud=99, legitimate=1 in loss function
- SMOTE oversampling: generate synthetic fraud examples to balance training data
Economic trade-off: $$\text{Cost} = C_{fp} \times FP + C_{fn} \times FN$$
Because (C_{fn} \gg C_{fp}) (an advertiser refund costs far more than the revenue lost by blocking one legitimate click), we optimize for low false negatives: catch fraud even if we occasionally block legitimate traffic.
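One way to act on that asymmetry is to choose the blocking threshold by minimizing expected cost on a labeled validation set. A small numpy sketch, with illustrative per-event costs:

```python
# Sketch: choose the fraud-score threshold that minimizes expected cost
# Cost = C_fp * FP + C_fn * FN on a labeled validation set (costs illustrative).
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray,
                   c_fp: float = 0.05, c_fn: float = 2.00) -> float:
    """scores: model fraud scores in [0, 1]; labels: 1 = fraud, 0 = legitimate."""
    candidates = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in candidates:
        pred = scores >= t
        fp = np.sum(pred & (labels == 0))    # legitimate traffic we would block
        fn = np.sum(~pred & (labels == 1))   # fraud we let through (refund later)
        costs.append(c_fp * fp + c_fn * fn)
    return float(candidates[int(np.argmin(costs))])
```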
Multi-Region Failover
Timeline when US-East (60% traffic) fails:
- T+10s: Load balancer detects failure
- T+30s: US-West sees 3× traffic, CPU 40%→85%
- T+60s: Auto-scaling triggered
- T+90s: Cache hit rate drops 60%→45% (different user distribution)
- T+150s: New instances online, capacity restored
Queuing Theory Insight:
Server utilization: (\rho = \lambda / (c \mu))
When (\rho > 0.8), queue length grows rapidly as (\rho \to 1). When (\rho \geq 1.0), the system is unstable and the queue grows without bound.
Example:
- Normal: 10K QPS, 100 servers (150 QPS capacity) → (\rho = 0.67) ✓
- Failure: 30K QPS, 120 servers → (\rho = 1.67) ✗ Overload
Mitigation: Graceful Degradation
- Increase cache TTL: 30s → 300s
- Disable expensive features (skip ML, use rule-based targeting)
- Load shedding: Reject low-value traffic
Load Shedding Strategy:
When utilization >90%:
- Estimate revenue per request: (\text{CTR} \times \text{bid})
- Calculate P95 threshold
- Always accept high-value (>P95)
- Probabilistically reject 50% of low-value
Impact: Preserves ~97.5% of revenue while shedding 47.5% of traffic (this assumes request value is heavily skewed, so the rejected half of sub-P95 requests contributes only a small share of revenue).
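A minimal sketch of this revenue-aware admission check; the P95 value threshold would be refreshed from a sliding window of recent traffic, and all names and numbers are illustrative.

```python
# Sketch of revenue-aware load shedding; value_p95 would be refreshed from a
# sliding window of recent expected-revenue values (all names illustrative).
import random

class LoadShedder:
    def __init__(self, value_p95: float, reject_prob: float = 0.5):
        self.value_p95 = value_p95
        self.reject_prob = reject_prob

    def admit(self, ctr_estimate: float, bid: float, utilization: float) -> bool:
        if utilization <= 0.90:
            return True                          # normal operation: accept all
        expected_revenue = ctr_estimate * bid
        if expected_revenue >= self.value_p95:
            return True                          # always keep high-value requests
        return random.random() >= self.reject_prob   # shed ~50% of low-value traffic
```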
Standby capacity calculation: $$c_{standby} = c_{normal} \times \left(\frac{\lambda_{spike}}{\lambda_{normal}} - 1\right)$$
For 3× spike: 100 servers × (3-1) = 200 servers (200% over-provisioning)
Cost: Expensive. Alternative: accept degraded performance during rare failures.
Part 8: Observability and Operations
Service Level Indicators and Objectives
Key SLIs:
Service | SLI | Target | Why |
---|---|---|---|
Ad API | Availability | 99.9% | Revenue tied to successful serves |
Ad API | Latency | p95 <100ms, p99 <150ms | Mobile timeouts above 150ms |
ML | Accuracy | AUC >0.78 | Below 0.75 = 15%+ revenue drop |
RTB | Response Rate | >80% DSPs within 30ms | <80% = remove from rotation |
Budget | Consistency | Over-delivery <1% | >5% = refunds + complaints |
Error Budget Policy (99.9% = 43 min/month):
When budget exhausted:
- Freeze feature launches (critical fixes only)
- Focus on reliability work
- Mandatory root cause analysis
- Next month: 99.95% target to rebuild trust
Incident Response Dashboard
When ML latency alert fires at 2 AM, show:
Top-line (what’s broken):
- Current p99 latency vs SLO (145ms vs <80ms)
- Error rate vs SLO (2.3% vs <0.1%)
Resource utilization (service health):
- GPU utilization (65% = normal)
- Model version loaded (verify correct version)
Dependency latencies (find bottleneck):
- Feature fetch: 95ms p99 ← SPIKE from 5ms baseline
- Model inference: 40ms (normal)
Downstream health:
- Cassandra: disk_io_wait, pending_compactions, read_latency
Historical context:
- Related incidents: INC-1247 (2 weeks ago - Cassandra compaction)
Auto-suggested runbook:
- Check `nodetool cfstats` for compactions
- If storm detected: `nodetool stop compaction`
- Tune `compaction_throughput_mb_per_sec`
Distributed Tracing
Single user reports “ad not loading” among 1M+ req/sec:
Request ID: 7f3a8b2c...
Total: 287ms (VIOLATED SLO)
├─ API Gateway: 2ms
├─ User Profile: 45ms
│ └─ Redis: 43ms (normally 5ms)
│ └─ TCP timeout: 38ms
│ └─ Cause: Node failure, awaiting replica
├─ ML Inference: 156ms
│ └─ Batch incomplete: 8/32
│ └─ Cause: Low traffic (Redis failure reduced QPS)
└─ RTB: 84ms
Root cause: Redis node failure → cascading slowdown. Trace shows exactly why.
Security and Compliance
Service-to-Service Auth (mTLS):
# Istio PeerAuthentication: require mTLS mesh-wide
mtls:
  mode: STRICT                  # Reject non-mTLS traffic
---
# Istio AuthorizationPolicy (fragment): Ad Server may only call ML /predict, never Budget DB
rules:
- from:
  - source:
      principals: ["cluster.local/ns/prod/sa/ad-server"]
  to:
  - operation:
      paths: ["/predict"]       # ML inference only
PII Protection:
- Encryption at rest: KMS-encrypted Cassandra columns
- Column-level encryption: Only ML pipeline has decrypt permission
- Data minimization: Hashed user IDs, no email/name in ad requests
- Log scrubbing: `user_id=[REDACTED]`
Secrets: Vault with Dynamic Credentials
- Lease credentials auto-rotated every 24h
- Audit log: which service accessed what when
- Revoke access instantly if compromised
ML Data Poisoning Protection:
graph TD
  RAW[Raw Events] --> VALIDATE{Validation}
  VALIDATE --> CHECK1{CTR >3σ?}
  VALIDATE --> CHECK2{Low IP entropy?}
  VALIDATE --> CHECK3{Uniform timing?}
  CHECK1 -->|Yes| QUARANTINE[Quarantine]
  CHECK2 -->|Yes| QUARANTINE
  CHECK3 -->|Yes| QUARANTINE
  CHECK1 -->|No| CLEAN[Training Data]
  CLEAN --> TRAIN[Training] --> SIGN[GPG Sign]
  SIGN --> REGISTRY[Model Registry]
  REGISTRY --> INFERENCE[Inference]
  INFERENCE --> VERIFY{Verify GPG}
  VERIFY -->|Valid| LOAD[Load Model]
  VERIFY -->|Invalid| REJECT[Reject + Alert]
Three validation checks:
- CTR anomaly: >3σ spike (e.g., 2%→8%)
- IP entropy: <2.0 (clicks from narrow range = botnet)
- Temporal pattern: Uniform intervals (human=bursty, bot=mechanical)
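The three checks reduce to a few lines of scoring logic. A hedged Python sketch; the thresholds mirror the list above, while the `batch` fields are assumptions rather than the production pipeline.

```python
# Hedged sketch of the three checks; thresholds and the `batch` fields are
# illustrative assumptions, not the production pipeline.
import math
from collections import Counter

def ctr_anomaly(ctr_today: float, ctr_mean: float, ctr_std: float) -> bool:
    return abs(ctr_today - ctr_mean) > 3 * ctr_std        # >3σ spike

def ip_entropy(click_ips: list[str]) -> float:
    counts = Counter(click_ips)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def uniform_timing(timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return False
    mean = sum(gaps) / len(gaps)
    if mean == 0:
        return True                                        # identical timestamps
    std = (sum((g - mean) ** 2 for g in gaps) / len(gaps)) ** 0.5
    return std / mean < cv_threshold                       # near-constant gaps → bot

def quarantine(batch) -> bool:
    return (ctr_anomaly(batch.ctr, batch.ctr_mean, batch.ctr_std)
            or ip_entropy(batch.click_ips) < 2.0
            or uniform_timing(batch.click_timestamps))
```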
Model integrity: a GPG signature prevents loading tampered models even if storage is compromised.
Data Lifecycle and GDPR
Retention policies:
Data | Retention | Rationale |
---|---|---|
Raw events | 7 days | Real-time only; archive to S3 |
Aggregated metrics | 90 days | Dashboard queries |
Model training data | 30 days | Older data less predictive |
User profiles | 365 days | GDPR; inactive purged |
Audit logs | 7 years | Legal compliance |
GDPR “Right to be Forgotten”:
Deletion across 10+ systems in parallel:
- Cassandra: DELETE user_profiles
- Redis: FLUSH user keys
- Kafka: Publish tombstone (log compaction)
- ML training: Mark deleted
- Backups: Crypto erasure (delete encryption key)
Verification: All systems confirm → send deletion certificate to user within 48h.
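A sketch of the parallel fan-out and confirmation step using asyncio; each per-system delete callable is an assumed interface, not real client code.

```python
# Sketch of the parallel deletion fan-out; each entry in `systems` is an assumed
# async callable delete(user_id) -> bool that returns True once purged.
import asyncio

async def delete_everywhere(user_id: str, systems: dict) -> bool:
    tasks = {name: asyncio.create_task(fn(user_id)) for name, fn in systems.items()}
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    failures = [name for name, res in zip(tasks, results)
                if isinstance(res, Exception) or res is not True]
    if failures:
        # Retry/escalate well before the 48h SLA; no certificate yet.
        raise RuntimeError(f"Deletion incomplete in: {failures}")
    return True   # all systems confirmed → issue deletion certificate
```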
Part 9: Production Operations at Scale
Progressive Rollout Strategy
graph TD
  CODE[Main Branch] --> BUILD[CI Build]
  BUILD --> CANARY{Canary<br/>1%, 15min}
  CANARY -->|OK| STAGE1[10%<br/>US-East, 30min]
  CANARY -->|Fail| ROLLBACK1[Auto-rollback]
  STAGE1 -->|OK| STAGE2[50%<br/>US-East+West, 1hr]
  STAGE1 -->|Fail| ROLLBACK2[Auto-rollback]
  STAGE2 -->|OK| STAGE3[100%<br/>All regions, 24hr]
  STAGE2 -->|Fail| ROLLBACK3[Auto-rollback]
  STAGE3 --> COMPLETE[Complete]
Automated gates:
- Error rate: <0.1% increase
- Latency: p95 <110ms, p99 <150ms
- Revenue drop: <2%
- Circuit breakers: no new opens
- Database health: latency within range
Feature flags for blast radius control:
ml_inference_v2:
enabled: true
rollout: 5%
targeting: ["internal_testing"]
kill_switch: true
Benefits:
- Deploy code dark, enable gradually
- Instant rollback via flag (no redeploy)
- A/B testing built-in
Deep Observability: Error Budgets in Practice
Error budget example (99.9% = 43 min/month):
Burn rate alert: if the budget is being consumed at more than 10× the sustainable rate, the entire monthly budget is gone in roughly three days - page on-call immediately rather than waiting for the monthly review.
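A small sketch of a burn-rate check in the spirit of the SRE workbook's multi-window alerts; the 10× factor and window sizes are the tunable parts, and nothing here is taken from a specific monitoring product.

```python
# Sketch of a multi-window burn-rate check for a 99.9% availability SLO;
# the 10x factor and window sizes are tunables, not values from this post.
SLO = 0.999
BUDGET = 1 - SLO                       # 0.1% of requests may fail

def burn_rate(bad: int, total: int) -> float:
    """1.0 means we are consuming the budget exactly at the sustainable rate."""
    if total == 0:
        return 0.0
    return (bad / total) / BUDGET

def should_page(bad_1h: int, total_1h: int, bad_6h: int, total_6h: int) -> bool:
    # Require both a short and a long window to burn fast, which avoids paging
    # on brief blips while still catching sustained incidents early.
    return burn_rate(bad_1h, total_1h) > 10 and burn_rate(bad_6h, total_6h) > 10
```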
Why 99.9% not 99.99%?
- 99.99% requires multi-region active-active (2× infrastructure cost)
- Business impact analysis:
- 99.9% downtime: ~43 min/month
- 99.99% downtime: ~4 min/month
- ROI calculation: 2× infrastructure cost vs 10× reduction in downtime
- At current scale, incremental revenue gain doesn’t justify doubling infrastructure spend
Cost Management at Scale
Resource attribution by team:
Resource | Owner | Charge-back Model |
---|---|---|
Redis (1000 nodes) | Platform | GB-hours per service |
Cassandra (500 nodes) | Data | Storage + IOPS per table |
ML GPUs (100× A100) | ML | GPU-hours (training vs inference) |
Kafka (200 brokers) | Data | GB ingested + stored per topic |
K8s (5000 pods) | Compute | vCPU-hours + RAM-hours per namespace |
Cost optimization:
- Right-sizing: CPU usage should average 70% (not 30%)
- Spot instances for ML training: 70% cheaper, tolerate interruptions
- Tiered storage: Hot (7d) → Warm (30d) → Cold (365d) = 67% savings
- Reserved instances: 60% baseline reserved = 0.6× rate, 40% on-demand
Relative cost relationships:
- Memory-optimized: 1.5-2× baseline
- GPU instances: 10-30× baseline
- NVMe storage: 3-5× standard disk
- Network egress: 10-100× ingress
- Spot: 0.3× on-demand
- Reserved: 0.6× on-demand
Key metrics:
- vCPU-ms per request (efficiency trend)
- Month-over-month consumption change (>15% = investigate)
- Anomaly detection: >2σ spike → alert
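The >2σ rule is easy to encode; a toy sketch over a daily series of vCPU-ms per request, with illustrative numbers.

```python
# Toy sketch of the >2σ cost-anomaly rule over daily vCPU-ms per request.
import statistics

def cost_anomaly(history: list[float], today: float, sigmas: float = 2.0) -> bool:
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return today != mean
    return abs(today - mean) > sigmas * std

# Steady ~1.20 vCPU-ms/request, today jumps to 1.55 → alert (values illustrative)
print(cost_anomaly([1.18, 1.22, 1.19, 1.21, 1.20, 1.23, 1.19], 1.55))   # True
```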
Part 10: Strategic Considerations
Architecture to Business Value
Every technical decision maps to business KPIs:
Architecture | Technical Metric | Business KPI | Impact |
---|---|---|---|
Sub-100ms latency | p95 <100ms | CTR | +100ms = -1-2% CTR |
Multi-tier caching | 95% hit rate | Infrastructure cost | 20× database load reduction |
ML CTR (AUC >0.78) | Model accuracy | Revenue/impression | AUC 0.75→0.80 = +10-15% revenue |
Budget pacing | Over-delivery <1% | Advertiser retention | >5% causes churn |
Fraud detection | False positive <0.1% | Platform reputation | 1% invalid traffic = complaints |
Progressive rollouts | Deployment failure <1% | Velocity | Saves 500 engineer-hours/year |
Error budgets | 99.9% SLO | SLA compliance | Breach = 10-30% penalty |
GDPR deletion | <48h SLA | Legal compliance | Violation = 4% revenue fine |
Key insights:
- Latency is revenue: Sub-100ms is competitive advantage
- ML quality compounds: 0.05 AUC improvement = 10-15% revenue
- Operational excellence enables velocity: Ship 3-5× faster with 10× lower risk
- Cost efficiency unlocks scale: Optimizing vCPU-ms per request compounds significantly at 1M+ QPS
- Compliance is existential: GDPR fines can be 4% of global revenue
Evolutionary Architecture
Scenario 1: Video Ads
New requirements: 500KB-2MB files, multi-bitrate streaming, VAST 4.0
Changes:
- Add video CDN (separate from image CDN)
- Video transcoding service
- Client-side viewability SDK
- New metrics: video completion rate
Stays same:
- ML inference (same pipeline, different features)
- Budget pacing (same algorithm)
- Fraud detection (same behavioral analysis)
- Observability (same SLIs/SLOs)
Migration: 6-month phased rollout, 1%→50%→100% traffic
Cost impact: Video CDN 10-20× more expensive. At 10M impressions/day × 1MB = 10TB/day CDN egress.
Scenario 2: Transformer Models
New requirements: GBDT → Transformer for 15-20% CTR improvement
Changes:
- 50-100 A100 GPUs (significant infrastructure investment)
- User history service (last 100 page views)
- Embedding cache (512-dim vectors, 10s TTL)
- Triton Inference Server
- Batch 16-32 requests (+3ms latency)
Stays same:
- Feature store architecture
- Training pipeline (different model type)
- Deployment process (same progressive rollout)
A/B test requirement:
Metric | Control (GBDT) | Treatment (Transformer) | Decision |
---|---|---|---|
Revenue/impression | Baseline | 1.143× baseline | +14.3% (good) |
P99 latency | 87ms | 103ms | +18.4% (concerning) |
Don’t ship yet. Options:
- Optimize inference to <95ms
- Find 10ms savings elsewhere
- Negotiate new SLO: “103ms acceptable for 14% revenue lift”
Balancing User Trust vs Revenue
Core principle: Optimize for long-term user trust, which is the foundation for sustainable revenue.
Guardrail metrics (all experiments must pass):
Guardrail | Threshold | Why |
---|---|---|
Session duration | Must not decrease >2% | Ads driving users away |
Latency p99 | <100ms | Slow = reduced CTR + satisfaction |
Ad block rate | Must not increase >0.5% | Poor quality signal |
Error rate | <0.1% | Technical failures erode trust |
Example: Interstitial Ads Experiment
Metric | Control | Treatment | Delta | Decision |
---|---|---|---|---|
Revenue/user | Baseline | 1.47× baseline | +47% | Good |
Session duration | 12.3 min | 9.8 min | -20% | Violated |
Ad block rate | 2.1% | 3.9% | +86% | Violated |
Do not ship. Short-term +47% revenue, but -20% session duration and +86% ad blocking signals unacceptable user experience. Invest in better ML models or native formats instead.
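A hedged sketch of how guardrail evaluation could be automated for such experiments; the metric names, thresholds, and the latency/error values not shown in the table above are assumptions.

```python
# Hedged sketch of guardrail evaluation; metric names, thresholds, and the
# latency/error values (not shown in the table above) are assumptions.
def relative_delta(control: float, treatment: float) -> float:
    return (treatment - control) / control

def passes_guardrails(control: dict, treatment: dict) -> bool:
    return all([
        relative_delta(control["session_min"], treatment["session_min"]) >= -0.02,
        relative_delta(control["ad_block_rate"], treatment["ad_block_rate"]) <= 0.005,
        treatment["latency_p99_ms"] < 100,
        treatment["error_rate"] < 0.001,
    ])

# Interstitial-ads example: -20% session time and +86% ad blocking fail the check
control = {"session_min": 12.3, "ad_block_rate": 0.021,
           "latency_p99_ms": 90, "error_rate": 0.0005}
treatment = {"session_min": 9.8, "ad_block_rate": 0.039,
             "latency_p99_ms": 92, "error_rate": 0.0005}
print(passes_guardrails(control, treatment))   # False → do not ship
```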
Designing for Failure: Chaos Engineering
SRE team mandate: Regularly inject failures into production during business hours.
Experiments:
Weekly: Pod termination
- Randomly kill 10% of pods during peak
- Expected: K8s reschedules in 30s, p99 <100ms
- Success: Zero user-facing errors
Bi-weekly: Latency injection
- Inject +50ms between ad server and Redis
- Expected: Circuit breakers trip, fallback to stale cache
- Success: Error rate <0.1%
Monthly: Regional failover
- Simulate datacenter outage (mark us-east-1 unhealthy)
- Expected: GLB redirects to us-west-2 in 60s
- Success: Revenue loss <2%, zero data loss
Quarterly: Database failure
- Take down 2/5 Cassandra nodes
- Expected: RF=3, CL=QUORUM continues serving
- Success: No user impact, repairs within 1h
Culture:
- Announce 24h advance (but not exact time)
- On-call monitors, ready to halt if thresholds breached
- Post-mortem documents gaps, file fixes
Blast radius limits:
- Affect ≤25% traffic initially
- Auto-rollback if error >0.5% or latency >150ms
- Avoid peak revenue hours
Managing Technical Debt
20% of engineering capacity per cycle allocated to debt paydown.
Why 20%?
- Too low (5-10%): debt compounds → system becomes brittle
- Too high (40-50%): over-investing in perfection vs learning from real usage
Prioritization framework:
Impact | Velocity | Risk | Priority | Example |
---|---|---|---|---|
Critical | High | High | P0 | DB migration with no rollback |
High | High | Low | P1 | Monolithic service blocking 3 teams |
Medium | Low | High | P1 | Flaky test (10% failure) |
Low | Low | Low | P2 | Code style inconsistency |
Cultural norm: “Leave it better than you found it”
- Add test if you find poor coverage
- Improve logging if you debug an issue
- Update runbook if you find it outdated
Part 11: Resilience and Failure Scenarios
A robust architecture must survive catastrophic failures, security breaches, and business model pivots. This section addresses three critical scenarios:
Catastrophic Regional Failure: When an entire AWS region fails, our semi-automatic failover mechanism combines Route53 health checks (2-minute detection) with manual runbook execution to promote secondary regions. The critical challenge is budget counter consistency—asynchronous Redis replication creates potential overspend windows during failover. We mitigate this through pre-allocation patterns that limit blast radius to allocated quotas per ad server, bounded by replication lag multiplied by allocation size.
Malicious Insider Attack: Defense-in-depth through zero-trust architecture (SPIFFE/SPIRE for workload identity), mutual TLS between all services, and behavioral anomaly detection on budget operations. Critical financial operations like budget allocations require cryptographic signing with Kafka message authentication, creating an immutable audit trail. Lateral movement is constrained through Istio authorization policies enforcing least-privilege service mesh access.
Business Model Pivot to Guaranteed Inventory: Transitioning from auction-based to guaranteed delivery requires strong consistency for impression quotas. Rather than replacing our eventual-consistent stack, we extend the existing pre-allocation pattern—PostgreSQL maintains source-of-truth counters while Redis provides fast-path allocation with periodic reconciliation. This hybrid approach adds only 10ms to the critical path for guaranteed campaigns while preserving sub-millisecond performance for auction traffic. The 12-month evolution path reuses 80% of existing infrastructure (ML pipeline, feature store, Kafka) while adding campaign management and SLA tracking layers.
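To make the hybrid pattern concrete, here is a hedged sketch of fast-path Redis decrements backed by a PostgreSQL source-of-truth quota table. The schema, key names, block size, and client objects are assumptions, and retry, concurrency, and reconciliation logic is deliberately omitted.

```python
# Hedged sketch of the hybrid pre-allocation pattern: Redis fast path backed by a
# PostgreSQL source-of-truth quota table. Schema, key names, block size, and the
# DB/Redis client objects are assumptions; races and reconciliation are ignored.
BLOCK = 1000   # impressions moved from PostgreSQL to Redis per refill

def reserve_block(pg, redis, campaign_id: str) -> bool:
    """Slow path: atomically move a quota block from the source of truth to Redis."""
    with pg:                                   # transaction (DB-API connection)
        cur = pg.cursor()
        cur.execute(
            "UPDATE quotas SET remaining = remaining - %s "
            "WHERE campaign_id = %s AND remaining >= %s",
            (BLOCK, campaign_id, BLOCK))
        if cur.rowcount == 0:
            return False                       # guaranteed quota exhausted
    redis.incrby(f"quota:{campaign_id}", BLOCK)
    return True

def try_serve(pg, redis, campaign_id: str) -> bool:
    """Fast path on the serving hot path: a single sub-millisecond Redis decrement."""
    if redis.decrby(f"quota:{campaign_id}", 1) >= 0:
        return True
    redis.incrby(f"quota:{campaign_id}", 1)    # undo the over-decrement
    return reserve_block(pg, redis, campaign_id) and try_serve(pg, redis, campaign_id)
```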
These scenarios validate that the architecture is not merely elegant on paper, but battle-hardened for production realities: regional disasters, adversarial threats, and fundamental business transformations.
Part 12: Organizational Design
Team Structure
1. Ad Platform Team
- Ad serving API, auction, budget pacing, fraud
- RTB integration (OpenRTB)
- SLA: 99.9% uptime, p99 <100ms
2. ML Platform Team
- Model training/serving
- Feature stores
- Experimentation platform
- SLA: AUC >0.78, p99 <20ms inference
3. Data Platform Team
- Kafka infrastructure
- Data warehouse
- GDPR compliance tooling
- SLA: 10-min freshness, 48h deletion
4. SRE/Infrastructure Team
- Kubernetes (200+ nodes)
- Redis/Cassandra operations
- Observability infrastructure
- Cost optimization
Cross-Team Collaboration
API contracts as team boundaries:
ML Team provides:
POST /ml/predict/ctr
SLA: p99 <20ms, 99.95% availability
Ad Team consumes without knowing:
- Model type (GBDT vs Transformer)
- Feature computation
- GPU cluster location
Shared observability: All teams contribute to shared dashboards showing business, ML, and infrastructure metrics.
Quarterly roadmap alignment:
Team | Q3 Goals | Dependencies |
---|---|---|
Ad Platform | Launch video ads | Data: Video CDN ML: Engagement model |
ML Platform | Improve AUC 0.78→0.82 | Data: Browsing history SRE: 50 GPUs |
Data Platform | GDPR deletion 48h→24h | SRE: Cassandra upgrade |
SRE | Reduce cost 10% | All: Adopt spot instances |
Tiered on-call:
- L1 (Ad Platform): First responder
- L2 (ML/Data): Backend escalation
- L3 (SRE): Infrastructure escalation
Example incident flow:
12:00 AM: Latency alert (250ms)
12:01 AM: L1 paged
12:02 AM: L1 sees ML latency spike → escalate L2
12:05 AM: L2 finds GPU throttling → escalate L3
12:10 AM: L3 identifies HVAC failure → failover
12:15 AM: Latency restored
Conclusion
This design exercise explored both Day 1 architecture and Day 2 operational realities for a large-scale ads platform:
Architecture:
- Requirements: 400M DAU, 1.5M QPS, sub-100ms
- Multi-tier caching achieving 95% hit rates
- RTB integration via OpenRTB with 30ms auctions
- Real-time ML inference within 40ms budget
- GSP auction mechanisms with game-theoretic properties
- Budget pacing, fraud detection, multi-region failover
Operations:
- Progressive rollouts with automated gates
- SLIs, SLOs, error budgets
- mTLS, PII encryption, secrets management
- Schema evolution, GDPR compliance
- Cost management and attribution
Key Takeaways:
Latency Budget Discipline: Every millisecond counts. Decompose the 100ms budget and optimize critical paths.
Caching is Critical: 95%+ hit rates achievable with three-tier architecture (L1 in-process + L2 Redis + L3 database).
Mathematical Foundations: From queueing theory to game theory to probability, rigorous analysis prevents costly production issues.
Distributed Systems Trade-Offs: Choose consistency level based on data type: strong for financial, eventual for preferences.
Graceful Degradation: Circuit breakers, timeouts, fallbacks prevent cascading failures. Load shedding protects during overload.
Operations Are Not an Afterthought: Progressive rollouts, observability, security, and cost governance are essential for running at scale.
Security and Compliance Are Design Constraints: PII protection and GDPR require architectural decisions from day one. Retrofitting is orders of magnitude harder.
Cost Scales Faster Than Revenue: Without active management, infrastructure spending spirals. Treat cost per request as a first-class metric.
While this remains a design exercise, the patterns explored represent well-established distributed systems approaches. The value was applying principles - queueing theory, game theory, consistency models, resilience patterns - to real-time ad serving constraints.
For anyone actually building such a system:
- Start simpler - most don’t need 1M QPS from day one
- Evolve based on real traffic rather than speculative requirements
- Invest in observability early - you can’t optimize what you can’t measure
- Build security/compliance into foundation - nearly impossible to retrofit
- Track costs from the beginning - they compound quickly at scale
The complexity documented here represents years of evolution. Understanding it helps make better early decisions that won’t need unwinding later.
References
Distributed Systems:
1. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly.
2. Dean, J., & Barroso, L. A. (2013). The Tail at Scale. CACM, 56(2).
Caching:
3. Fitzpatrick, B. (2004). Distributed Caching with Memcached. Linux Journal.
4. Nishtala, R., et al. (2013). Scaling Memcache at Facebook. NSDI.
Auction Theory:
5. Vickrey, W. (1961). Counterspeculation, Auctions, and Competitive Sealed Tenders. Journal of Finance.
6. Edelman, B., et al. (2007). Internet Advertising and the Generalized Second-Price Auction. AER.
7. Myerson, R. (1981). Optimal Auction Design. Mathematics of Operations Research.
Machine Learning:
8. McMahan, H. B., et al. (2013). Ad Click Prediction: A View from the Trenches. KDD.
9. He, X., et al. (2014). Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD.
Real-Time Bidding:
10. IAB Technology Laboratory (2016). OpenRTB API Specification Version 2.5.
11. Yuan, S., et al. (2013). Real-Time Bidding for Online Advertising: Measurement and Analysis. ADKDD.
Fraud Detection:
12. Pearce, P., et al. (2014). Characterizing Large-Scale Click Fraud in ZeroAccess. ACM CCS.
Observability:
13. Beyer, B., et al. (2016). Site Reliability Engineering. O'Reilly.
14. Beyer, B., et al. (2018). The Site Reliability Workbook. O'Reilly.
15. Sigelman, B. H., et al. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Technical Report.
Security & Compliance:
16. GDPR Regulation (EU) 2016/679. General Data Protection Regulation.
Cost Optimization:
17. AWS Well-Architected Framework (2024). Cost Optimization Pillar.
18. Cloud FinOps Foundation (2024). FinOps Framework.