Architecting Real-Time Ads Platforms: A Distributed Systems Engineer's Design Exercise
Author: Yuriy Polyulya
Introduction
Full disclosure: I haven’t built a production ads platform. As a distributed systems engineer, this problem represents a compelling intersection of my interests: real-time processing, distributed coordination, strict latency requirements, and financial accuracy at scale.
What draws me to this challenge is the complexity beneath a deceptively simple facade. A user opens an app, sees a relevant ad in under 100ms, and the advertiser gets billed accurately. Straightforward? Not when you’re coordinating auctions across dozens of bidding partners, running ML predictions in real-time, maintaining budget consistency across regions, and handling over a million queries per second.
Here’s the scale I’m designing for:
- 400M+ daily active users generating continuous ad requests
- 1M+ queries per second during peak traffic
- Sub-100ms p95 latency for the entire request lifecycle
- Real-time ML inference for click-through rate prediction
- Distributed auction mechanisms coordinating with 50+ external bidding partners
- Multi-region deployment with eventual consistency challenges
I’ll walk through my thought process on:
- Requirements modeling and performance constraints
- High-level architecture and latency budgets
- Real-Time Bidding (RTB) integration via OpenRTB
- ML inference pipelines constrained to <50ms
- Distributed caching strategies achieving 95%+ hit rates
- Auction mechanisms and game theory
- Advanced topics: budget pacing, fraud detection, multi-region failover
- What breaks and why - failure modes matter
Cost Pricing Caveat: Throughout this post, I’ll discuss technology cost comparisons. Please note that pricing varies significantly by cloud provider, region, contract negotiations, and changes frequently. The figures I mention are rough approximations to illustrate relative differences for decision-making, not exact pricing. Always verify current pricing for your specific use case.
Part 1: Requirements and Scale Analysis
Functional Requirements
Multi-Format Ad Delivery: Support story, video, and carousel ads across iOS, Android, and web. Creative assets served from CDN ensuring sub-100ms first-byte time.
Real-Time Bidding Integration: Implement OpenRTB 2.5+ to communicate with 50+ DSPs simultaneously. I enforce a 30ms auction timeout (the OpenRTB tmax field) - challenging when network RTTs to some DSPs approach 150ms. Handle both programmatic and guaranteed inventory with distinct SLAs.
ML-Powered Targeting:
- Real-time CTR prediction (serving random ads leaves money on the table)
- Conversion rate optimization
- Dynamic creative optimization
- Budget pacing algorithms (prevent advertisers from exhausting budgets in the first hour)
Campaign Management: Real-time performance metrics, A/B testing framework, frequency capping, quality scoring, and policy compliance.
Non-Functional Requirements
Latency Distribution: $$P(\text{Latency} \leq 100\text{ms}) \geq 0.95$$
Total latency is the sum across the request path: $$T_{total} = \sum_{i=1}^{n} T_i$$
With 5 services each taking 20ms, you’re already at 100ms with zero margin. This is why ruthless latency budgeting is critical.
Throughput: $$Q_{peak} \geq 1.5 \times 10^6 \text{ QPS}$$
Little’s Law gives us server requirements. With service time (S) and (N) servers: $$N = \frac{Q_{peak} \times S}{U}$$
I use (U = 0.7) (70% utilization) for headroom. Running at 90%+ utilization invites disaster.
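To make that concrete, here is a throwaway Python sketch of the calculation; the 50ms mean service time and 200-requests-per-server concurrency are assumptions I'm adding, not numbers from the design:

```python
import math

def capacity_units(peak_qps: float, service_time_s: float, utilization: float) -> int:
    """Little's Law: N = Q_peak * S / U, i.e. concurrent request slots needed."""
    return math.ceil(peak_qps * service_time_s / utilization)

# Assumed inputs: 1.5M peak QPS, 50ms mean service time, 70% target utilization.
slots = capacity_units(1.5e6, 0.050, 0.70)   # ~107,143 in-flight requests
servers = math.ceil(slots / 200)             # assuming ~200 concurrent requests per server
print(slots, servers)                        # 107143, 536
```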
Availability: Targeting 99.95% uptime: $$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \geq 0.9995$$
This allows roughly 22 minutes of downtime per month. A bad deploy can consume that in seconds.
Consistency Requirements:
Not all data needs identical guarantees:
- Financial data (billing): Strong consistency - no exceptions $$\forall t_1 < t_2: \text{Read}(t_2) \text{ observes } \text{Write}(t_1)$$
If an advertiser pays for 1000 impressions, they get exactly 1000. Not 999, not 1001.
- User preferences: Eventual consistency acceptable $$\lim_{t \to \infty} P(\text{AllReplicas consistent}) = 1$$
If a user updates interests and sees old targeting for 10-20 seconds during propagation, it’s not critical.
- Ad inventory: Bounded staleness (~30 seconds) $$|\text{Read}(t) - \text{TrueValue}(t)| \leq \epsilon \text{ for } t - t_{write} \leq 30s$$
Slightly stale counts are acceptable. Over-serving a campaign by 50 impressions out of 100K is negligible and can be reconciled later.
Scale Estimation
Data Volume:
- Daily requests: 400M users × 20 requests/user = 8B requests/day
- Daily logs (1KB each): 8TB/day
Storage:
- User profiles (10KB each): 4TB
- 30-day performance history (100B/impression): ~24TB
Cache Sizing: For 95% hit rate with Zipfian distribution (α = 1.0), cache ~20% of dataset:
- Required cache: 400M users × 20% × 10KB ≈ 800GB
- Distributed across 100 nodes: 8GB/node
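The same estimates as explicit arithmetic, in a disposable sketch (per-item sizes are the assumptions stated above):

```python
DAU = 400e6                     # daily active users
REQ_PER_USER = 20               # ad requests per user per day
LOG_BYTES = 1_000               # ~1KB per request log
PROFILE_BYTES = 10_000          # ~10KB per user profile

daily_requests = DAU * REQ_PER_USER                  # 8e9 requests/day
daily_log_tb = daily_requests * LOG_BYTES / 1e12     # ~8 TB/day of logs
profile_tb = DAU * PROFILE_BYTES / 1e12              # ~4 TB of user profiles
history_tb = daily_requests * 100 * 30 / 1e12        # 100B/impression * 30 days ≈ 24 TB
cache_gb = DAU * 0.20 * PROFILE_BYTES / 1e9          # ~800 GB cached for ~95% hit rate
print(daily_requests, daily_log_tb, profile_tb, history_tb, cache_gb, cache_gb / 100)
```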
Part 2: System Architecture and Technology Stack
High-Level Architecture
```mermaid
graph TD
    %% Client Layer
    CLIENT[Mobile/Web Client<br/>iOS, Android, Browser]
    %% Edge Layer
    CLIENT -->|1. Creative Assets| CDN[CDN<br/>CloudFront/Fastly]
    CLIENT -->|2. Ad Request| GLB[Global Load Balancer<br/>GeoDNS + Health]
    %% Service Layer Entry
    GLB -->|3. Route to Region| GW[API Gateway<br/>Kong: 3-5ms<br/>Auth + Rate Limit]
    GW -->|4. Auth Complete| AS[Ad Server Orchestrator<br/>100ms latency budget]
    %% Parallel Service Fork
    AS -->|5a. Fetch User| UP[User Profile Service<br/>Target: 10ms]
    AS -->|5b. Retrieve Ads| AD_SEL[Ad Selection Service<br/>Target: 15ms]
    AS -->|5c. Predict CTR| ML[ML Inference Service<br/>Target: 40ms]
    AS -->|5d. Run Auction| RTB[RTB Auction Service<br/>Target: 30ms]
    %% External DSPs
    RTB -->|6d. Bid Request| EXTERNAL[50+ DSP Partners<br/>External Bidders]
    %% Data Layer - Redis
    UP -->|6a. Cache Lookup| REDIS[(Redis Cluster<br/>1000 nodes)]
    AD_SEL -->|6b. Cache Lookup| REDIS
    %% Data Layer - Cassandra
    UP -->|7a. On Cache Miss| CASS[(Cassandra<br/>RF=3, CL=QUORUM)]
    AD_SEL -->|7b. On Cache Miss| CASS
    %% Data Layer - Feature Store
    ML -->|6c. Feature Fetch| FEATURE[(Feature Store<br/>Tecton)]
    %% Event Streaming Pipeline
    AS -->|8. Log Events| KAFKA[Kafka<br/>100K events/sec]
    KAFKA -->|9a. Stream Events| FLINK[Flink<br/>Stream Processing]
    FLINK -->|10a. Update Cache| REDIS
    FLINK -->|10b. Archive| S3[(S3/HDFS<br/>Data Lake)]
    %% Batch Processing Pipeline
    S3 -->|9b. Daily Batch| SPARK[Spark<br/>Batch Processing]
    SPARK -->|10c. Update Features| FEATURE
    %% ML Training Pipeline
    S3 --> AIRFLOW[Airflow<br/>Orchestration]
    AIRFLOW -->|11. Schedule| TRAIN[Training Cluster<br/>Daily CTR Model]
    TRAIN -->|12. Store Model| REGISTRY[Model Registry<br/>Versioning]
    REGISTRY -->|13. Deploy| ML
    %% Observability
    AS -.->|Metrics| PROM[Prometheus<br/>Metrics]
    AS -.->|Traces| JAEGER[Jaeger<br/>Tracing]
    PROM -->|Visualize| GRAF[Grafana<br/>Dashboards]
    classDef client fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef edge fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef service fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef data fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stream fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px
    class CLIENT client
    class CDN,GLB edge
    class GW,AS,UP,AD_SEL,ML,RTB service
    class REDIS,CASS,FEATURE data
    class KAFKA,FLINK,SPARK,S3,AIRFLOW,TRAIN,REGISTRY stream
    class EXTERNAL external
    class PROM,JAEGER,GRAF stream
```
Diagram explanation - Request Flow:
Steps 1-4 (Client → Service Layer):
- Creative Assets: Client fetches ad images/videos from CDN (separate from ad selection)
- Ad Request: Client sends request to Global Load Balancer
- Route to Region: GLB uses GeoDNS to route to nearest healthy region
- Authentication: Kong gateway authenticates (JWT), rate limits, and forwards to Ad Server
Steps 5a-5d (Parallel Service Calls): The Ad Server Orchestrator makes 4 parallel calls with independent latency budgets:
- 5a. User Profile (10ms): Fetch user demographics, interests, browsing history
- 5b. Ad Selection (15ms): Retrieve candidate ads matching user’s profile
- 5c. ML Inference (40ms): Predict CTR for each candidate ad (critical path bottleneck)
- 5d. RTB Auction (30ms): Send bid requests to 50+ external DSPs via OpenRTB protocol
Steps 6a-6d (Data Access - Cache First): Each service first checks cache before hitting persistent storage:
- 6a-6b. Redis Cache: User Profile and Ad Selection check Redis (5ms p99)
- 6c. Feature Store: ML fetches pre-computed features from Tecton (10ms p99)
- 6d. External DSPs: RTB waits max 30ms for external bidder responses
Steps 7a-7b (Cache Miss → Database): On cache miss, services read from Cassandra (20ms p99). This happens for ~5% of requests with 95% cache hit rate.
Steps 8-10 (Data Pipeline - Post-Request): After serving the ad (asynchronously):
- 8. Log Events: Ad Server publishes impression/click events to Kafka
- 9a. Stream Processing: Flink computes real-time aggregations (last hour CTR)
- 9b. Batch Processing: Spark processes historical data from S3 daily
- 10a-10c. Feature Updates: Both pipelines update feature stores (Redis + Tecton)
Steps 11-13 (ML Training Loop - Daily):
- 11. Schedule: Airflow triggers daily model retraining
- 12. Store Model: Trained models versioned in registry
- 13. Deploy: New models gradually deployed to inference service (canary rollout)
Observability (Continuous):
- Ad Server emits metrics (latency, error rate) to Prometheus
- Distributed traces sent to Jaeger for debugging
- Grafana visualizes both for dashboards and alerts
Latency Budget Decomposition
For 100ms total budget:
$$T_{total} = T_{network} + T_{gateway} + T_{services} + T_{serialization}$$
Network (10ms):
- Client to edge: 5ms
- Edge to service: 5ms
API Gateway (5ms):
- Authentication: 2ms
- Rate limiting: 1ms
- Request enrichment: 2ms
Service Layer (75ms): Parallel execution bounded by slowest service:
- User Profile: 10ms
- Ad Selection: 15ms
- ML Inference: 40ms ← bottleneck
- RTB Auction: 30ms
With parallelization: 40ms (ML inference)
Remaining:
- Auction logic: 5ms
- Serialization: 5ms
Total: 65ms with 35ms headroom for variance.
Technology Stack Decisions
API Gateway Selection
For a 5ms latency budget, the API gateway choice is critical. Here’s my detailed analysis:
Technology | Latency | Throughput | Rate Limiting | Auth Methods | Ops Complexity |
---|---|---|---|---|---|
Kong | 3-5ms | 100K/node | Plugin-based | JWT, OAuth2, LDAP | Medium |
AWS API Gateway | 5-10ms | 10K/endpoint | Built-in | IAM, Cognito, Lambda | Low (managed) |
NGINX Plus | 1-3ms | 200K/node | Lua scripting | Custom modules | High |
Envoy | 2-4ms | 150K/node | Extension filters | External auth | High |
Decision: Kong
Why Kong won:
- Plugin ecosystem: Rate limiting, auth, transformations work out of the box
- Latency overhead: 3-5ms fits comfortably within 5ms budget
- Cost efficiency: Self-hosted avoids per-request AWS charges
- Developer experience: Declarative config and OpenAPI integration
The breakdown within my 5ms budget:
- TLS termination: ~1ms
- Authentication (JWT verify): ~2ms
- Rate limiting (Redis lookup): ~1ms
- Request routing: ~1ms
- Total: 5ms - tight but workable
Why I was tempted by NGINX Plus:
- That 1-3ms latency is genuinely attractive - best in class
- 200K RPS/node is impressive
- But writing custom Lua for complex auth would add weeks to development
- Finding engineers who know both Lua and auth patterns is harder than it should be
Why I ruled out AWS API Gateway:
At 1M QPS (~86B requests/day), AWS’s per-request pricing becomes problematic:
- Rough cost: At high sustained throughput, can reach hundreds of thousands per month
- The 5-10ms latency overhead is also problematic when I need 2ms for auth
- Great for serverless/event-driven workloads, but not for sustained high throughput
Cost comparison (approximate, varies by region/contract):
Kong self-hosted:
- ~10 nodes with appropriate sizing
- Enterprise license if needed
- Ballpark: ~$5-15K/month depending on configuration
AWS API Gateway:
- Per-request pricing at billions of requests/month
- Ballpark: Could be 10-30× more expensive than self-hosted at this scale
The cost difference at high sustained throughput is significant - potentially enough to fund multiple engineer salaries.
Event Streaming Platform Selection
Before choosing stream processing frameworks, I need the right event streaming backbone. This is a foundational decision:
Technology | Throughput/Partition | Latency (p99) | Durability | Ordering | Scalability |
---|---|---|---|---|---|
Kafka | 100MB/sec | 5-15ms | Disk-based replication | Per-partition | Horizontal (add brokers/partitions) |
Pulsar | 80MB/sec | 10-20ms | BookKeeper (distributed log) | Per-partition | Horizontal (separate compute/storage) |
RabbitMQ | 20MB/sec | 5-10ms | Optional persistence | Per-queue | Vertical (limited) |
AWS Kinesis | 1MB/sec/shard | 200-500ms | S3-backed | Per-shard | Manual shard management |
Decision: Kafka
Here’s my reasoning:
- Throughput: 100MB/sec per partition handles peak load (100K events/sec × 1KB/event = 100MB/sec)
- Latency: 5-15ms p99 leaves headroom in my 100ms feature freshness budget
- Durability: Disk-based replication (RF=3) means no data loss when brokers fail
- Ecosystem: Everything works with Kafka - Flink, Spark, Kafka Connect. “Just works” is generous (you’ll tune GC and debug consumer rebalancing), but compared to alternatives, it’s solid
- Ordering: Per-partition ordering is critical for event causality (can’t process a click before the impression)
Partitioning strategy:
For 100K events/sec across 100 partitions: 1,000 events/sec per partition
Partition key: `user_id % 100` ensures:
- All events for a user go to same partition (maintains ordering)
- Balanced distribution (assuming uniform user distribution)
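As a sketch of that keying strategy (Python with kafka-python; the broker address and the `ad-events` topic name are placeholders):

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",            # placeholder broker address
    key_serializer=str.encode,                 # key bytes drive the partitioner
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",                                # wait for in-sync replicas (durability)
)

def log_event(user_id: int, event: dict) -> None:
    # Same key -> same partition -> a user's impression and click stay ordered.
    producer.send("ad-events", key=str(user_id), value=event)

log_event(12345, {"type": "impression", "ad_id": "ad_987", "ts": 1712345678})
log_event(12345, {"type": "click", "ad_id": "ad_987", "ts": 1712345681})
producer.flush()
```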
Pulsar consideration:
Pulsar’s architecture is genuinely elegant with storage/compute separation, but:
- The ecosystem maturity gap is real
- Separate BookKeeper layer adds operational complexity
- For my 7-day retention requirement, advanced storage architecture seems unnecessary
Why I didn’t pick RabbitMQ:
I like RabbitMQ for certain use cases, but:
- 20MB/sec throughput ceiling is way below what I need (100MB/sec)
- Scaling is painful - mostly vertical, horizontal gets messy
- Built for task queues, not event streaming
Cost comparison (approximate, varies by region/contract):
Kafka self-hosted:
- Several brokers with appropriate compute/storage
- NVMe SSDs for performance
- ZooKeeper cluster for coordination
- Ballpark: ~$3-5K/month for this workload
AWS Kinesis alternative:
- Per-shard hourly fees
- Per-PUT request fees at billions of PUTs per month
- Ballpark: Could be 20-50× more expensive than self-hosted Kafka
The cost difference is substantial at high sustained throughput. Kinesis might make sense for bursty or low-volume workloads where operational simplicity matters more than cost.
Stream Processing: Flink
- <100ms latency for windowed aggregations
- True streaming (not micro-batches)
- Exactly-once semantics critical for financial data
- Distributed snapshots for stateful processing
Batch Processing: Spark
- In-memory processing for feature engineering
- MLlib for statistical aggregations
- SQL interface for data scientists
Feature Store: Tecton
- <10ms p99 serving latency
- 100ms feature freshness
- Managed service reduces operational burden
- Trade-off: Vendor lock-in vs engineering time saved
L2 Cache: Redis Cluster
- 5ms p99 latency
- Atomic operations for budget counters
- Complex data structures (sorted sets, hashes)
- Persistence for crash recovery
- Trade-off: 30% higher memory vs Memcached for operational features
Persistent Store: Cassandra
- Linear scalability to 4TB+ user profiles
- Multi-DC replication built-in
- Tunable consistency (CL=QUORUM for balance)
- At 8B requests/day, self-hosted is 6× cheaper than DynamoDB
Container Orchestration Selection
Container orchestration is critical for handling GPU scheduling and auto-scaling. Here’s my comparison:
Technology | Learning Curve | Ecosystem | Auto-scaling | Multi-cloud | Networking |
---|---|---|---|---|---|
Kubernetes | Steep | Massive (CNCF) | HPA, VPA, Cluster Autoscaler | Yes (portable) | Advanced (CNI, Service Mesh) |
AWS ECS | Medium | AWS-native | Target tracking, step scaling | No (AWS-only) | AWS VPC |
Docker Swarm | Easy | Limited | Basic (replicas) | Yes (portable) | Overlay networking |
Nomad | Medium | HashiCorp ecosystem | Auto-scaling plugins | Yes (portable) | Consul integration |
Decision: Kubernetes
Look, Kubernetes is the obvious choice, but let me explain why this isn’t just cargo-culting:
- Ecosystem: 78% adoption means you can hire people who know it. Try hiring for Docker Swarm or Nomad
- Auto-scaling: HPA based on custom metrics (like inference queue depth) is exactly what I need
- GPU support: Native GPU scheduling with node affinity - critical for ML workloads
- Service mesh: Istio/Linkerd integration gives circuit breaking and traffic splitting
- Portability: Deploy to AWS, GCP, Azure, or on-prem without rewriting
Unpopular opinion: Kubernetes is overly complex for 80% of use cases, but if you need GPU orchestration and multi-cloud portability, you’re kind of stuck with it.
Kubernetes-specific features critical for ads platform:
1. Horizontal Pod Autoscaler (HPA):
- Scale ML inference pods based on custom metrics (inference queue depth)
- Formula: (\text{desired\_replicas} = \lceil \text{current} \times \frac{\text{current\_metric}}{\text{target\_metric}} \rceil)
- Example: Target 80% CPU, current 90% across 10 pods → (\lceil 10 \times \frac{90}{80} \rceil = 12) pods
2. GPU Node Affinity:
- Schedule ML inference pods only on GPU nodes using node selectors
- Prevents GPU resource waste by isolating GPU workloads
3. Service Mesh with Sidecar Pattern (Istio/Envoy):
The sidecar pattern deploys a lightweight proxy (Envoy) alongside each service pod, intercepting all network traffic for observability, security, and resilience—without modifying application code.
```mermaid
graph TD
    subgraph POD1["Pod: Ad Server (Zone A)"]
        APP1[Ad Server<br/>Application]
        SIDECAR1[Envoy Sidecar<br/>50-100MB]
        APP1 -->|"1. localhost:15001<br/>0.1ms"| SIDECAR1
    end
    subgraph POD2["Pod: ML Inference (Zone A)"]
        SIDECAR2[Envoy Sidecar<br/>Connection Pool]
        APP2[ML Service<br/>Application]
        SIDECAR2 -->|"5. localhost<br/>0.1ms"| APP2
    end
    subgraph POD3["Pod: ML Inference (Zone B)"]
        SIDECAR3[Envoy Sidecar<br/>Connection Pool]
        APP3[ML Service<br/>Application]
        SIDECAR3 -->|localhost| APP3
    end
    subgraph POD4["Pod: User Profile (Zone A)"]
        SIDECAR4[Envoy Sidecar]
        APP4[User Profile<br/>Application]
        SIDECAR4 -->|localhost| APP4
    end
    SIDECAR1 -->|"2. HTTP/2 pooled<br/>same-zone: 1ms"| SIDECAR2
    SIDECAR1 -.->|"cross-zone: 3ms<br/>(avoided by locality)"| SIDECAR3
    SIDECAR1 -->|"3. HTTP/2 pooled<br/>1ms"| SIDECAR4
    SIDECAR2 -->|"4. Response<br/>aggregated"| SIDECAR1
    CONTROL[Istio Control Plane<br/>istiod]
    CONTROL -.->|"Config: routes,<br/>policies, certs"| SIDECAR1
    CONTROL -.->|Config| SIDECAR2
    CONTROL -.->|Config| SIDECAR3
    CONTROL -.->|Config| SIDECAR4
    SIDECAR1 -.->|"Telemetry:<br/>traces, metrics"| OBSERV[Prometheus<br/>Jaeger]
    classDef pod fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef sidecar fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef app fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef control fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class POD1,POD2,POD3,POD4 pod
    class SIDECAR1,SIDECAR2,SIDECAR3,SIDECAR4 sidecar
    class APP1,APP2,APP3,APP4 app
    class CONTROL,OBSERV control
```
Diagram explanation - Sidecar Network Flow:
- Application to Local Sidecar (0.1ms): Ad Server makes request to `localhost:15001`, hitting the local Envoy sidecar via the loopback interface with no network latency
- Sidecar to Sidecar (1ms same-zone): Envoy uses a persistent HTTP/2 connection pool to the downstream service's sidecar. Connection already established, no TCP handshake.
- Parallel Requests: Single HTTP/2 connection multiplexes multiple requests (ML inference + User Profile fetch) simultaneously
- Response Path: Responses return through same pooled connections
- Sidecar to Application (0.1ms): Downstream sidecar forwards to local application via localhost
Without Sidecar (Direct Service Calls):
- Each request = new TCP connection (3-way handshake: 1-3ms)
- No locality awareness = random cross-zone calls (3ms vs 1ms)
- Manual retry logic in every service
- Total: ~5-8ms per request
With Sidecar:
- Connection pooling eliminates handshake overhead
- Locality-aware routing prefers same-zone (1ms)
- Automatic retries with circuit breaking
- Total: ~2-4ms per request (40-50% latency reduction)
Network Optimization Benefits:
- Connection pooling: Sidecars maintain persistent connections to downstream services, avoiding TCP handshake overhead (3-way handshake = 1-3ms per request). Application makes local calls to sidecar (0.1ms), sidecar reuses pooled connections.
- Protocol optimization: HTTP/2 multiplexing allows 100+ concurrent requests over single TCP connection, reducing connection overhead from 50+ connections to 2-3 per pod.
- Intelligent load balancing: Client-side load balancing with locality-aware routing (prefer same-zone pods, saving 1-2ms cross-AZ latency). Sidecar tracks real-time endpoint health, removing failed pods from rotation in <1s vs DNS TTL delays (30-60s).
- Retry budgets: Automatic retries with exponential backoff, but capped at 10% additional load to prevent cascading failures. Configurable per-route retry policies.
Operational Benefits:
- Traffic splitting: A/B test new ML model versions (90% traffic to v1, 10% to v2) with header-based routing
- Circuit breaking: Automatic failure detection with configurable thresholds (5xx rate >5% → open circuit, fast-fail requests)
- Observability: Automatic request tracing, RED metrics (Rate, Errors, Duration) per service without instrumentation
- Security: Mutual TLS between all services with automatic certificate rotation (SPIFFE/SPIRE identity)
Latency overhead: Sidecar adds 1-2ms per hop (network stack traversal: app → sidecar → network), but connection pooling and retry logic often save 3-5ms, resulting in net latency improvement of 1-3ms for typical request patterns.
Resource cost: Each Envoy sidecar consumes 50-100MB memory + 0.1-0.2 CPU cores. For 500 pods, this is 25-50GB RAM + 50-100 cores—significant but justified by operational benefits (automatic mTLS, retries, observability) that would otherwise require custom code in every service.
Why not AWS ECS?
ECS is tempting because it’s managed and cheaper:
- Vendor lock-in is real - can’t migrate to GCP/Azure without rewriting
- Auto-scaling limited to CPU/memory target tracking - no custom metrics
- GPU support is janky compared to Kubernetes
- Works great for simple microservices, but for complex ML infrastructure? Not so much
Why not Docker Swarm?
I wish Swarm had succeeded. It’s so much simpler than Kubernetes. But:
- Ecosystem is basically dead - 5% market share, stagnant development
- No GPU scheduling, basic auto-scaling, no service mesh
- Finding engineers who know Swarm is nearly impossible
- Docker Inc. has basically abandoned it in favor of Kubernetes
Cost trade-off (rough comparison for ~100 nodes):
Kubernetes (managed EKS):
- Control plane fees (managed)
- Worker node infrastructure
- Operational overhead (engineering time)
- Rough total: Varies widely depending on configuration
AWS ECS (Fargate):
- Per-vCPU and per-GB-memory pricing
- No control plane fees
- Lower operational overhead
- Generally 10-20% cheaper than Kubernetes for basic workloads
So why choose Kubernetes despite slightly higher costs?
GPU support and multi-cloud portability matter for this use case. ECS Fargate has limited GPU support, and I prefer not being locked into AWS. The premium (perhaps 10-20% higher monthly costs) acts as insurance against vendor lock-in and provides proper GPU scheduling for ML workloads.
Model Serving: TensorFlow Serving
- 30-40ms p99 latency
- Auto-batching for GPU efficiency
- gRPC support (5ms vs 15ms for REST)
- Model versioning for A/B testing
Observability Stack:
- Metrics: Prometheus + Grafana (self-hosted 5-10× cheaper than managed at scale)
- Tracing: Jaeger with Cassandra backend (adaptive sampling)
- Logs: ELK Stack for full-text search (Loki for high-volume app logs)
Deployment Strategy: Kubernetes with Autoscaling
- GPU instances: minimum 3, maximum 10 (roughly 50% of the cost of always-on capacity)
- 90-second warmup acceptable for burst traffic
Critical Path Analysis
```mermaid
graph TD
    A[Request Arrives<br/>t=0ms]
    A -->|5ms: Auth + Rate Limit| B[Gateway<br/>t=5ms]
    B --> FORK{Parallel Fork}
    FORK -->|10ms budget| C[User Profile<br/>Cache: 5ms<br/>DB miss: 20ms]
    FORK -->|15ms budget| D[Ad Selection<br/>Filter 1M ads → 100]
    FORK -->|40ms budget| E[ML Inference<br/>BOTTLENECK<br/>Feature: 10ms<br/>Model: 30ms]
    FORK -->|30ms budget| F[RTB Auction<br/>50 DSPs parallel]
    C -->|Completes t=15ms| G[Join Point<br/>Wait for all<br/>t=45ms]
    D -->|Completes t=20ms| G
    E -->|Completes t=45ms| G
    F -->|Completes t=35ms| G
    G -->|5ms: Rank by eCPM<br/>Select winner| H[Auction Logic<br/>t=50ms]
    H -->|5ms: Serialize JSON<br/>Return to client| I[Response Sent<br/>t=55ms]
    classDef normal fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef bottleneck fill:#ffebee,stroke:#d32f2f,stroke-width:3px
    classDef gateway fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef join fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef fork fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    class A,B,H,I gateway
    class C,D,F normal
    class E bottleneck
    class G join
    class FORK fork
```
Diagram explanation - Critical Path Timing:
Request Arrives (t=0ms): User opens app, triggers ad request to Global Load Balancer
Gateway (t=0→5ms):
- JWT token validation: 2ms
- Rate limiting check (Redis lookup): 1ms
- Request routing to Ad Server: 2ms
- Total: 5ms
Parallel Service Calls (t=5ms → t=45ms): Ad Server forks 4 parallel requests. Total latency bounded by slowest service (not sum):
- User Profile (10ms target, finishes t=15ms):
  - Redis cache hit (95%): 5ms
  - Cassandra on miss (5%): 20ms
  - Weighted average: 0.95×5 + 0.05×20 = 5.75ms
- Ad Selection (15ms target, finishes t=20ms):
  - Retrieve ~1M candidate ads from Cassandra
  - Apply filters (user interests, ad categories): 5ms
  - Reduce to top 100 candidates: 10ms
  - Total: 15ms
- ML Inference (40ms target, finishes t=45ms) - CRITICAL PATH:
  - Feature fetching: 10ms (from Tecton + Redis)
    - Real-time features (time, device): 0ms
    - Near-RT features (last hour CTR): Redis 5ms
    - Batch features (user segments): Tecton 5ms
  - Model inference: 30ms (TensorFlow Serving on GPU)
    - Load 150-dim feature vector
    - GBDT forward pass through 100 trees
    - Return CTR prediction for each of 100 ad candidates
  - Total: 40ms ← This determines overall latency
- RTB Auction (30ms target, finishes t=35ms):
  - Send OpenRTB requests to 50 DSPs in parallel
  - Wait for responses (hard 30ms timeout)
  - Collect ~40 valid bids (10 DSPs timeout)
  - Total: 30ms
Join Point (t=45ms): Ad Server waits for all 4 parallel calls to complete. Since ML Inference takes longest (40ms), join completes at t=45ms. The other services finish earlier but must wait.
Auction Logic (t=45ms → t=50ms):
- Compute eCPM for all candidates: bid × CTR × 1000
- Sort by eCPM descending
- Select highest eCPM as winner
- Calculate second-price: 2nd highest eCPM / (winner CTR × 1000)
- Total: 5ms
Response (t=50ms → t=55ms):
- Serialize winner ad to JSON
- Send HTTP response to client
- Total: 5ms
Total end-to-end latency: 55ms (45ms headroom within 100ms SLA)
Why ML Inference is the bottleneck: Even though ML has the largest budget (40ms), it’s still the critical path because:
- Feature fetching has inherent network latency (10ms)
- Model inference requires GPU computation (30ms)
- Other services finish faster: User Profile (10ms), Ad Selection (15ms), RTB (30ms)
Optimization priority:
- Reduce ML feature fetch: Pre-warm cache, optimize feature store queries
- Accelerate model inference: Model quantization (FP32→INT8), batch optimization
- Parallelize within ML: Fetch features while loading model weights
Any improvement to ML Inference directly improves overall latency. Optimizing other services (e.g., User Profile 10ms→5ms) has no impact since they’re not on critical path.
Circuit Breaker Pattern
Prevent cascading failures with three states:
- CLOSED: Normal operation, monitor error rate
- OPEN: Error rate >10% for 1min → block requests, serve from cache
- HALF-OPEN: After exponential backoff, send 1% test traffic
State transitions: $$\text{If } E(t) > 0.10 \text{ for } \Delta t \geq 60s, \text{ trip circuit}$$
$$T_{backoff} = T_{base} \times 2^{n}, \quad T_{base} = 30s$$
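A compact sketch of those three states and transitions (the thresholds match the ones above; everything else, including the sampling window, is illustrative):

```python
import random
import time

class CircuitBreaker:
    """CLOSED -> OPEN when error rate >10% over the last minute; HALF-OPEN probes with 1% of traffic."""

    def __init__(self, error_threshold=0.10, window_s=60, base_backoff_s=30):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.base_backoff_s = base_backoff_s
        self.state = "CLOSED"
        self.trips = 0                  # n in T_backoff = T_base * 2^n
        self.opened_at = 0.0
        self.events = []                # (timestamp, ok) samples

    def record(self, ok: bool) -> None:
        now = time.time()
        if self.state == "HALF_OPEN":
            # A good probe closes the circuit; a failed probe re-opens it with doubled backoff.
            if ok:
                self.state, self.events = "CLOSED", []
            else:
                self.state, self.opened_at, self.trips = "OPEN", now, self.trips + 1
            return
        self.events.append((now, ok))
        self.events = [(t, o) for t, o in self.events if now - t <= self.window_s]
        errors = sum(1 for _, o in self.events if not o)
        if self.state == "CLOSED" and errors / len(self.events) > self.error_threshold:
            self.state, self.opened_at, self.trips = "OPEN", now, self.trips + 1

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        backoff = self.base_backoff_s * (2 ** self.trips)
        if self.state == "OPEN" and time.time() - self.opened_at >= backoff:
            self.state = "HALF_OPEN"
        if self.state == "HALF_OPEN":
            return random.random() < 0.01   # 1% test traffic
        return False                        # OPEN: fast-fail and serve from cache instead
```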
Part 3: Real-Time Bidding Integration
OpenRTB Protocol
```mermaid
sequenceDiagram
    participant AS as Ad Server
    participant DSP1 as DSP #1
    participant DSP2 as DSP #2-50
    participant Auction as Auction Logic
    Note over AS,Auction: 100ms Total Budget
    AS->>AS: 1. Construct BidRequest<br/>OpenRTB 2.x format<br/>User context + Ad slot
    par Parallel DSP Calls (30ms timeout each)
        AS->>DSP1: 2a. HTTP POST /bid<br/>OpenRTB BidRequest
        activate DSP1
        Note over DSP1: DSP evaluates user<br/>Checks budget<br/>Runs own auction
        DSP1-->>AS: 3a. BidResponse<br/>Price: $5.50<br/>Creative ID
        deactivate DSP1
    and
        AS->>DSP2: 2b. Broadcast to 50 DSPs<br/>Parallel HTTP/2 connections
        activate DSP2
        Note over DSP2: Each DSP responds<br/>independently<br/>Some may timeout
        DSP2-->>AS: 3b. Multiple BidResponses<br/>[$3.20, $4.80, ...]
        deactivate DSP2
    end
    Note over AS: 4. Timeout enforcement (30ms):<br/>Discard late responses<br/>Collected ~40 bids from 50 DSPs
    AS->>Auction: 5. Collected bids +<br/>ML CTR predictions
    activate Auction
    Note over Auction: 6. Compute eCPM for each bid:<br/>eCPM = bid × CTR × 1000<br/>Sort by eCPM descending
    Auction->>Auction: 7. Run GSP Auction<br/>Winner pays 2nd price
    Note over Auction: 8. Calculate price:<br/>p = eCPM₂ / (CTR_winner × 1000)
    Auction-->>AS: 9. Winner + Price<br/>DSP #1, $3.33
    deactivate Auction
    AS-->>DSP1: 10. Win notification<br/>(async, best-effort)<br/>Creative URL
    Note over AS,Auction: Total elapsed: ~35ms<br/>Within 100ms budget
    rect rgb(255, 240, 240)
        Note over AS,DSP2: Failure handling:<br/>- 10 DSPs timeout (discarded)<br/>- Circuit breaker trips if >50% fail<br/>- Fallback to guaranteed inventory
    end
```
Diagram explanation - RTB Auction Flow:
Step 1 (Construct BidRequest): Ad Server builds OpenRTB 2.x compliant BidRequest containing:
- User demographics (age, gender, interests) - privacy-compliant hashed IDs
- Device info (iOS/Android, screen size)
- Ad slot specifications (320×50 banner, minimum bid floor $2.50)
- Maximum response time (tmax=30ms)
Steps 2a-2b (Parallel Broadcast to DSPs): Ad Server simultaneously sends BidRequests to 50+ DSPs using:
- HTTP/2 multiplexing: Single persistent connection per DSP, multiple concurrent requests
- Connection pooling: Reuse TCP connections to avoid handshake overhead (saves ~50ms per cold start)
- Geographic distribution: Regional RTB services in US-East, EU, APAC to minimize network latency
Steps 3a-3b (DSP Responses): Each DSP independently:
- Evaluates user value based on their own ML models
- Checks advertiser budgets and campaign constraints
- Runs internal auction across multiple advertisers
- Returns highest bid + creative ID (or no-bid if user doesn’t match)
Typical response rate: 80% of DSPs respond within 30ms, 20% timeout
Step 4 (Timeout Enforcement): After exactly 30ms, Ad Server:
- Closes bidding window (hard timeout)
- Discards any late responses (prevents auction manipulation)
- Typically collects 40-45 valid bids from 50 DSPs
Step 5 (Combine with ML CTR): Ad Server merges:
- External bids from DSPs (price offered)
- Internal ML CTR predictions (likelihood of click)
- This combination enables revenue optimization: bid × CTR = expected revenue
Steps 6-8 (GSP Auction): Auction logic executes Generalized Second-Price auction:
- Compute eCPM: For each bid, calculate effective CPM = bid × CTR × 1000
- Sort: Rank all bids by eCPM descending
- Select winner: Highest eCPM wins the impression
- Calculate price: Winner pays minimum needed to beat 2nd place (second-price auction)
- Example: Winner bid $5.50 with CTR 0.15 (eCPM $825)
- Second place eCPM $600
- Winner pays: $600 / (0.15 × 1000) = $4.00 (less than their $5.50 bid)
Step 9 (Return Winner): Auction returns winning DSP ID and computed price to Ad Server
Step 10 (Win Notification): Ad Server sends async win notification to winning DSP (best-effort delivery):
- Creative URL to render
- Final price charged
- Impression ID for tracking
Failure Handling (bottom note):
- Timeouts (20%): 10 DSPs out of 50 don’t respond in time → proceed with 40 bids
- Circuit breaker: If >50% of DSPs fail for 1 minute, trip circuit and fallback to guaranteed inventory
- Fallback: Serve house ads or guaranteed campaigns to avoid blank space
BidRequest (simplified):
```json
{
  "id": "req_12345",
  "imp": [{
    "id": "1",
    "banner": {"w": 320, "h": 50},
    "bidfloor": 2.50
  }],
  "user": {
    "id": "hashed_user_id",
    "geo": {"country": "USA"}
  },
  "tmax": 30
}
```
Timeout Handling: Adaptive Strategy
Maintain per-DSP latency histograms (H_{dsp}). Set per-DSP timeout: $$T_{dsp} = \text{min}\left(P_{95}(H_{dsp}), 30ms\right)$$
Progressive Auction:
- Preliminary auction at 20ms with available bids
- Update with late arrivals up to 30ms if they beat current winner
- Advantage: Low latency for fast DSPs, opportunity for high-value slow bids
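A sketch of the per-DSP timeout rule above (the 30ms cap is from the text; the rolling-sample "histogram" is a simplification):

```python
from collections import defaultdict, deque

MAX_TIMEOUT_MS = 30.0

class AdaptiveTimeouts:
    """Track recent response latencies per DSP and set timeout = min(p95, 30ms)."""

    def __init__(self, window: int = 1000):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def observe(self, dsp_id: str, latency_ms: float) -> None:
        self.samples[dsp_id].append(latency_ms)

    def timeout_for(self, dsp_id: str) -> float:
        hist = sorted(self.samples[dsp_id])
        if not hist:
            return MAX_TIMEOUT_MS                    # no data yet: fall back to the hard cap
        p95 = hist[int(0.95 * (len(hist) - 1))]
        return min(p95, MAX_TIMEOUT_MS)

timeouts = AdaptiveTimeouts()
for ms in (8, 9, 11, 12, 14, 40):                    # one slow outlier
    timeouts.observe("dsp_42", ms)
print(timeouts.timeout_for("dsp_42"))                # p95 of recent samples, capped at 30ms
```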
HTTP/2 Connection Pooling
Connection pool sizing: $$P = \frac{Q \times L}{N}$$
Example: 1000 QPS, 30ms latency, 10 servers → 3 connections/server
HTTP/2 benefits:
- Single connection, multiple concurrent requests
- HPACK compression reduces overhead ~70%
- Persistent connections avoid ~50ms of TCP/TLS setup on cold starts
Geographic Distribution
Network latency bounded by speed of light: $$T_{propagation} \geq \frac{d}{c \times 0.67}$$
Example: NY to London (5,585 km) ≈ 28ms propagation - nearly entire RTB budget!
Solution: Regional deployment in US-East, US-West, EU, APAC reduces max distance from 10,000km to ~1,000km: $$T_{propagation} \approx 5ms$$
Savings: 23ms enables including 10+ additional DSPs within budget.
Part 4: ML Inference Pipeline
Feature Engineering Architecture
Features fall into three categories with different freshness requirements:
Real-Time (computed per request):
- User context: time, location, device
- Session features: current browsing, last N actions
Near-Real-Time (pre-computed, 10s TTL):
- User interests: aggregated from last 24h activity
- Ad performance: click rates (last hour)
Batch (pre-computed daily):
- User segments: demographic clusters
- Long-term CTR: 30-day aggregated performance
The feature engineering architecture supports three distinct freshness tiers, each optimized for different latency and accuracy requirements:
graph TB subgraph "Real-Time Feature Pipeline" REQ[Ad Request] --> PARSE[Request Parser] PARSE --> CONTEXT[Context Features
time, location, device
Latency: 5ms] PARSE --> SESSION[Session Features
user actions
Latency: 10ms] end subgraph "Feature Store" CONTEXT --> MERGE[Feature Vector Assembly] SESSION --> MERGE REDIS_RT[(Redis
Near-RT Features
TTL: 10s)] --> MERGE REDIS_BATCH[(Redis
Batch Features
TTL: 24h)] --> MERGE end subgraph "Stream Processing Pipeline" EVENTS[User Events
clicks, views] --> KAFKA[Kafka
100K events/sec] KAFKA --> FLINK[Flink
Windowed Aggregation
last hour CTR] FLINK --> REDIS_RT end subgraph "Batch Processing Pipeline" S3[S3 Data Lake
Historical Data] --> SPARK[Spark Jobs
Daily] SPARK --> FEATURE_GEN[Feature Generation
User segments, 30-day CTR] FEATURE_GEN --> REDIS_BATCH end MERGE --> INFERENCE[ML Inference
TensorFlow Serving
Latency: 40ms] INFERENCE --> PREDICTION[CTR Prediction
0.0 - 1.0] classDef realtime fill:#e3f2fd,stroke:#1976d2,stroke-width:2px classDef store fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px classDef stream fill:#ffe0b2,stroke:#e65100,stroke-width:2px classDef batch fill:#e8f5e9,stroke:#4caf50,stroke-width:2px classDef inference fill:#fff9c4,stroke:#fbc02d,stroke-width:3px class REQ,PARSE,CONTEXT,SESSION realtime class REDIS_RT,REDIS_BATCH,MERGE store class EVENTS,KAFKA,FLINK stream class S3,SPARK,FEATURE_GEN batch class INFERENCE,PREDICTION inference
Diagram explanation:
Real-Time Feature Pipeline (top-left): When an ad request arrives, the Request Parser extracts immediate context (time of day, device type, location) in ~5ms and session features (recent browsing history) in ~10ms. These features have zero staleness but limited richness.
Stream Processing Pipeline (bottom-left): User events (clicks, views) flow through Kafka at 100K events/sec to Flink, which computes windowed aggregations like “user’s CTR in the last hour.” These features update every 10 seconds (TTL) in Redis, balancing freshness with computational cost.
Batch Processing Pipeline (bottom-center): Daily Spark jobs process historical data from S3 to generate complex features like user demographic segments and 30-day CTR averages. These features update daily (TTL: 24h) but provide deep insights not possible in real-time.
Feature Store (center): The Feature Vector Assembly component merges all three tiers - real-time context, near-real-time behavioral signals, and batch-computed segments - into a single 150-dimensional vector. This assembled vector feeds ML Inference.
ML Inference (right): TensorFlow Serving receives the 150-dim feature vector and predicts CTR in <40ms, the critical path bottleneck in our system.
Feature Freshness Guarantees:
- Batch features: (t_{fresh} \leq 24h)
- Stream features: (t_{fresh} \leq 100ms)
- Real-time features: (t_{fresh} = 0) (computed per request)
Latency SLA: $$P(\text{FeatureLookup} \leq 10ms) \geq 0.99$$
Achieved with:
- Redis p99 latency: 5ms
- Feature vector assembly: 2ms
Feature Vector Construction
$$x = [x_{user}, x_{ad}, x_{context}, x_{cross}]$$
- User features (50-dim): demographics, interests, historical CTR
- Ad features (30-dim): creative type, category, global CTR
- Context features (20-dim): time, device, placement
- Cross features (50-dim): user-ad interactions, interest alignment
Total: 150 features
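Assembling that vector is mostly concatenation plus shape checks. A sketch, assuming upstream helpers already return fixed-width numpy blocks:

```python
import numpy as np

USER_DIM, AD_DIM, CTX_DIM, CROSS_DIM = 50, 30, 20, 50   # 150 total, as above

def assemble_feature_vector(user_f: np.ndarray, ad_f: np.ndarray,
                            ctx_f: np.ndarray, cross_f: np.ndarray) -> np.ndarray:
    """x = [x_user, x_ad, x_context, x_cross]; shape checks catch upstream schema drift."""
    for block, dim in ((user_f, USER_DIM), (ad_f, AD_DIM), (ctx_f, CTX_DIM), (cross_f, CROSS_DIM)):
        assert block.shape == (dim,), f"unexpected block shape {block.shape}"
    return np.concatenate([user_f, ad_f, ctx_f, cross_f]).astype(np.float32)

# One row per ad candidate: scoring 100 candidates means a (100, 150) matrix for the model.
x = assemble_feature_vector(np.zeros(50), np.zeros(30), np.zeros(20), np.zeros(50))
print(x.shape)  # (150,)
```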
Model Architecture Decision
Technology Selection: ML Model Architecture
For CTR prediction within a 40ms latency budget, the model family choice balances accuracy, inference speed, and operational complexity:
Model Family | Inference Latency | Infrastructure | Accuracy | Operational Complexity |
---|---|---|---|---|
Tree-based (GBDT) | 5-15ms (CPU) | Standard compute | High (0.78-0.82 AUC) | Low - feature importance aids debugging |
Deep Learning (DNN) | 20-40ms (GPU) | GPU clusters | Highest (0.80-0.84 AUC) | High - black box, slow iteration |
Linear (FM/FFM) | 3-8ms (CPU) | Standard compute | Moderate (0.75-0.78 AUC) | Low - interpretable weights |
Selection rationale: Tree-based models (LightGBM/XGBoost) offer the best trade-off—meeting latency requirements on CPU infrastructure while delivering strong accuracy. Deep learning provides 2-3% AUC gains but requires GPU infrastructure (50-100× cost increase at 1M+ QPS) and exceeds our latency budget. Linear models are fastest but sacrifice too much accuracy for the marginal latency improvement.
Framework Comparison for Production Serving:
Framework | Latency (p99) | Throughput | GPU Support | Deployment | Best For |
---|---|---|---|---|---|
TensorFlow Serving | 15-25ms | High (batching) | Excellent | gRPC/REST | DNN models, GPU inference |
ONNX Runtime | 8-15ms | Very High | Good | Embedded/REST | Cross-framework, CPU optimization |
Triton Inference Server | 10-20ms | Highest | Excellent | Multi-model batching | Heterogeneous GPU workloads |
Native (LightGBM/XGBoost) | 5-12ms | High | Limited | Custom service | Tree models, lowest latency |
Deployment approach: For GBDT models, native LightGBM serving wrapped in a lightweight gRPC service provides optimal latency. For future DNN models, TensorFlow Serving offers production-grade features (model versioning, A/B testing, request batching) with acceptable overhead. ONNX Runtime serves as a middle ground, offering hardware acceleration across CPU/GPU with framework portability.
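A minimal sketch of the native-serving path (LightGBM's Python API; the model file name is hypothetical and the gRPC wrapper is omitted):

```python
import numpy as np
import lightgbm as lgb

# Hypothetical model artifact produced by the daily training job.
booster = lgb.Booster(model_file="ctr_gbdt_v42.txt")

def score_candidates(features: np.ndarray) -> np.ndarray:
    """features: (n_candidates, 150) float32 matrix; returns predicted CTR per candidate."""
    return booster.predict(features)

# Score the ~100 candidates for one request in a single batched call.
ctr = score_candidates(np.random.rand(100, 150).astype(np.float32))
print(ctr.shape, float(ctr.max()))
```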
Scaling Strategy:
Horizontal autoscaling based on request throughput using Kubernetes HPA. Target utilization: 70-80% to handle traffic spikes. GPU workloads require careful node selection to avoid scheduling on CPU-only nodes. Request batching (accumulate 5-10ms, batch size 32-64) amortizes overhead and improves GPU utilization by 3-5×.
Optimization techniques: Model quantization (FP32 → INT8) reduces memory footprint by 4× and improves throughput by 2-3× with minimal accuracy loss (<1% AUC degradation). Critical for cost efficiency at scale.
Part 5: Distributed Caching
Multi-Tier Cache Technology Selection
To achieve 95%+ cache hit rate with sub-10ms latency, I need three cache tiers with different characteristics:
L1 Cache (In-Process) Options:
Technology | Latency | Throughput | Memory | Pros | Cons |
---|---|---|---|---|---|
Caffeine (JVM) | ~1μs | 10M ops/sec | In-heap | Window TinyLFU eviction, lock-free reads | JVM-only, GC pressure |
Guava Cache | ~2μs | 5M ops/sec | In-heap | Simple API, widely used | LRU only, lower hit rate |
Ehcache | ~1.5μs | 8M ops/sec | In/off-heap | Off-heap option reduces GC | More complex configuration |
Decision: Caffeine - Superior eviction algorithm (Window TinyLFU) yields 10-15% higher hit rates than LRU-based alternatives. Benchmarks show 2× throughput vs Guava.
L2 Cache (Distributed) Options:
Technology | Latency (p99) | Throughput | Clustering | Data Structures | Pros | Cons |
---|---|---|---|---|---|---|
Redis Cluster | 5ms | 100K ops/sec/node | Native sharding | Rich (lists, sets, sorted sets) | Lua scripting, atomic ops | More memory than Memcached |
Memcached | 3ms | 150K ops/sec/node | Client-side sharding | Key-value only | Lower memory, simpler | No atomic ops, no persistence |
Hazelcast | 8ms | 50K ops/sec/node | Native clustering | Rich data structures | Java integration | Higher latency, less mature |
Decision: Redis Cluster
- Need atomic operations for budget counters (DECRBY, INCRBY)
- Complex data structures for ad metadata (sorted sets for recency, hashes for attributes)
- Persistence for crash recovery (avoid cold cache startup)
- Trade-off accepted: 30% higher memory usage vs Memcached for operational simplicity
Memcached typically costs ~30% less than Redis for equivalent capacity, but Redis offers atomic operations and richer data structures that justify the premium for this use case.
L3 Persistent Store Options:
Technology | Read Latency (p99) | Write Throughput | Scalability | Consistency | Pros | Cons |
---|---|---|---|---|---|---|
Cassandra | 20ms | 500K writes/sec | Linear (peer-to-peer) | Tunable (CL=QUORUM) | Multi-DC, no SPOF | No JOINs, eventual consistency |
PostgreSQL | 15ms | 50K writes/sec | Vertical + sharding | ACID transactions | SQL, JOINs, strong consistency | Manual sharding complex |
MongoDB | 18ms | 200K writes/sec | Horizontal sharding | Tunable | Flexible schema | Less mature than Cassandra |
DynamoDB | 10ms | 1M writes/sec | Fully managed | Eventual/strong | Auto-scaling, no ops | Vendor lock-in, cost at scale |
Decision: Cassandra
- Scale requirement: 400M users → 4TB+ user profiles
- Write-heavy: User profile updates, ad impression logs
- Multi-region: Built-in multi-datacenter replication (RF=3)
- Linear scalability: Add nodes without rebalancing entire cluster
PostgreSQL limitation: Vertical scaling hits ceiling around 50-100TB. Sharding (e.g., Citus) adds operational complexity comparable to Cassandra, but without peer-to-peer resilience.
DynamoDB cost analysis: At 8B requests/day, with write units priced around \$1.25 per million: $$\text{Cost}_{DynamoDB} = 8 \times 10^9 \times \$1.25 \times 10^{-6} \approx \$10K/day = \$300K/month$$
Cassandra on 100 nodes @ $500/node/month: $50K/month (6× cheaper at this scale).
Three-Tier Cache Hierarchy
```mermaid
graph TB
    REQ[Cache Request<br/>user_id: 12345]
    REQ --> L1[L1: Caffeine JVM Cache<br/>10-second TTL<br/>~1μs lookup<br/>100MB per server]
    L1 --> L1_HIT{Hit?}
    L1_HIT -->|60% Hit| RESP1[Response ~1μs]
    L1_HIT -->|40% Miss| L2[L2: Redis Cluster<br/>30-second TTL<br/>~5ms lookup<br/>1000 nodes × 16GB]
    L2 --> L2_HIT{Hit?}
    L2_HIT -->|35% Hit| POPULATE_L1[Populate L1]
    POPULATE_L1 --> RESP2[Response ~5ms]
    L2_HIT -->|5% Miss| L3[L3: Cassandra Ring<br/>Multi-DC Replication<br/>~20ms read<br/>Petabyte scale]
    L3 --> POPULATE_L2[Populate L2 + L1]
    POPULATE_L2 --> RESP3[Response ~20ms]
    MONITOR[Stream Processor<br/>Flink/Spark<br/>Count-Min Sketch] -.->|0.1% sampling| L2
    MONITOR -.->|Detect hot keys| REPLICATE[Dynamic Replication<br/>3x copies for hot keys]
    REPLICATE -.->|Replicate to nodes| L2
    PERF[Total Hit Rate: 95%<br/>Average Latency: 2.75ms<br/>p99 Latency: 25ms]
    classDef l1cache fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef l2cache fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef l3cache fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    classDef response fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef monitor fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef stats fill:#e0f7fa,stroke:#00838f,stroke-width:3px
    class REQ,RESP1,RESP2,RESP3 response
    class L1,L1_HIT l1cache
    class L2,L2_HIT l2cache
    class L3 l3cache
    class MONITOR,REPLICATE monitor
    class PERF stats
    class POPULATE_L1,POPULATE_L2 response
```
Diagram explanation:
L1 (In-Process Cache): Caffeine cache stores the hottest 10% of data (100MB per server) with 10-second TTL. Sub-microsecond lookups serve 60% of requests. When L1 misses, the request cascades to L2.
L2 (Distributed Cache): Redis Cluster with 1000 nodes (16GB each) stores 20% of total data with 30-second TTL. Serves 35% of total requests (87.5% of L1 misses) in ~5ms. Consistent hashing distributes keys across nodes.
L3 (Persistent Store): Cassandra provides the source of truth with multi-datacenter replication (RF=3). Handles remaining 5% of requests in ~20ms. On cache miss, data populates both L2 and L1 for subsequent requests.
Hot Key Detection: Stream processor samples 0.1% of traffic using Count-Min Sketch to detect celebrity users generating 100× normal load. Hot keys are dynamically replicated 3× across Redis nodes to prevent single-node saturation.
Overall Performance: 95% total hit rate with 2.75ms average latency, meeting our <10ms SLA.
Cache Hit Rate Calculation (the L2 rate is conditional on an L1 miss, i.e. 87.5%): $$H_{cache} = H_1 + (1-H_1)H_2 = 0.60 + 0.40 \times 0.875 = 0.95$$
Average Latency: $$\mathbb{E}[L] = 0.60 \times 0.001 + 0.35 \times 5 + 0.05 \times 20 = 2.75ms$$
Redis Cluster Sharding
Hash slot assignment: $$\text{slot}(k) = \text{CRC16}(k) \mod 16384$$
1000 nodes, 16,384 slots → ~16 slots per node
Load variance with virtual nodes: $$\text{Var}[\text{load}] = \frac{\mu}{n \times v}$$
1000 QPS, 1000 nodes, 16 virtual nodes → standard deviation ≈ 25% of mean
Hot Partition Mitigation: Count-Min Sketch
Why Count-Min Sketch?
Track frequency of millions of user IDs without gigabytes of memory.
Approach | Memory | Accuracy | Use Case |
---|---|---|---|
Exact HashMap | O(n) = GBs | 100% | Too expensive |
Sampling | Minimal | Misses rare keys | Ineffective |
Count-Min Sketch | 5.4 KB | 99% ± 1% | Perfect fit |
Configuration: width=272, depth=5
Error bound: $$\text{estimate}(k) \leq \text{true}(k) + \frac{e \times N}{w}$$
For 100M requests/hour: error <30%, sufficient for detecting >1000 req/sec hot keys.
Dynamic Replication: When frequency >100 req/sec:
- Replicate key to (R=3) nodes
- Update routing table
- Client load-balances reads across replicas
Per-replica load: (\lambda_{replica} = \lambda / R = 1000 / 3 ≈ 333) req/sec
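A toy Count-Min Sketch with the width and depth configured above (the hashing scheme and the example threshold are illustrative):

```python
import hashlib

class CountMinSketch:
    """w=272, d=5 as configured above: ~5.4KB of counters; point queries can only overestimate."""

    def __init__(self, width: int = 272, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1500):
    cms.add("user:celebrity")          # hot key
cms.add("user:normal")
print(cms.estimate("user:celebrity"))  # >= 1500; flag for replication once above the threshold
```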
Workload Isolation
Prevent batch workloads from interfering with serving:
Cassandra data-center aware replication:
```sql
CREATE KEYSPACE user_profiles
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us_east': 2,     // Serving replicas
  'eu_central': 1   // Batch replica
};
```
- Batch writes: `CONSISTENCY LOCAL_ONE` (EU-Central)
- Serving reads: `CONSISTENCY LOCAL_QUORUM` (US-East)
Cost: Dedicate 33% of replicas to batch, but serving latency unaffected.
Cache Invalidation
Hybrid Strategy:
- Normal updates: TTL=30s (passive)
- Critical updates (GDPR opt-out): Active invalidation via Kafka (15ms propagation)
- Version metadata: For tracking update history
Part 6: Auction Mechanisms
Generalized Second-Price (GSP)
Effective bid (eCPM): $$\text{eCPM}_i = b_i \times \text{CTR}_i \times 1000$$
Winner selection: $$w = \arg\max_{i} \text{eCPM}_i$$
Price (second-price): $$p_w = \frac{\text{eCPM}_{2nd}}{\text{CTR}_w \times 1000} + \epsilon$$
Example:
Advertiser | Bid | CTR | eCPM | Rank |
---|---|---|---|---|
A | 5.00 | 0.10 | 500 | 2 |
B | 4.00 | 0.15 | 600 | 1 ← Winner |
C | 6.00 | 0.05 | 300 | 3 |
Winner B pays: (\frac{500}{0.15 \times 1000} = 3.33) (bid 4.00 but only pays enough to beat A)
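The same auction as a small function (a sketch; bids are (advertiser, bid, CTR) tuples and the ε increment is dropped for readability):

```python
def run_gsp(bids):
    """bids: list of (advertiser, bid_dollars, predicted_ctr). Returns (winner, price_paid)."""
    ranked = sorted(bids, key=lambda b: b[1] * b[2] * 1000, reverse=True)  # sort by eCPM
    winner_id, winner_bid, winner_ctr = ranked[0]
    if len(ranked) == 1:
        return winner_id, winner_bid                   # no competition: pay own bid (or reserve)
    runner_up_ecpm = ranked[1][1] * ranked[1][2] * 1000
    price = runner_up_ecpm / (winner_ctr * 1000)       # just enough to beat 2nd place
    return winner_id, round(price, 2)

print(run_gsp([("A", 5.00, 0.10), ("B", 4.00, 0.15), ("C", 6.00, 0.05)]))  # ('B', 3.33)
```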
Game Theory: GSP vs VCG
GSP is not truthful (proven by counterexample in auction theory literature). Advertisers can increase profit by bidding below true value. However:
Why use GSP?
- Higher revenue than VCG in practice
- Simpler to explain
- Industry standard (Google Ads)
- Nash equilibrium exists
Computational Complexity:
- GSP: (O(N \log N)) (sort + select)
- VCG: (O(N^2 \log N)) (counterfactual allocations)
For 50 DSPs: GSP=282 ops, VCG=14,100 ops
Recommendation: GSP for real-time, VCG for offline optimization.
Reserve Price Optimization
Optimal reserve price (Myerson 1981): $$r^* = \arg\max_r \left[ r \times \left(1 - F(r)\right) \right]$$
Balance: high reserve → more revenue/ad but fewer filled; low reserve → more filled but lower revenue/ad.
For uniform distribution (F(v) = v/v_{max}): $$r^* = \frac{v_{max}}{2}$$
Empirical approach: Use 50th percentile (median) of historical bids as reserve.
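The empirical version is essentially one line (the bid history here is made up):

```python
import numpy as np

historical_bids = np.array([1.2, 2.5, 3.1, 4.0, 2.2, 5.5, 3.3, 2.9])  # illustrative CPM bids
reserve_price = float(np.percentile(historical_bids, 50))              # median bid as reserve
print(reserve_price)
```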
Part 7: Advanced Topics
Budget Pacing: Distributed Spend Control
Problem: Advertisers set daily budgets (e.g., $10,000/day). In a distributed system serving 1M QPS, how do we prevent over-delivery without centralizing every spend decision?
Challenge:
Centralized approach (single database tracks spend):
- Latency: ~10ms per spend check
- Throughput bottleneck: ~100K QPS max
- Single point of failure
Solution: Pre-Allocation with Periodic Reconciliation
```mermaid
graph TD
    ADV[Advertiser X<br/>Daily Budget: $10,000]
    ADV --> BUDGET[Budget Controller]
    BUDGET --> REDIS[(Redis<br/>Atomic Counters)]
    BUDGET --> POSTGRES[(PostgreSQL<br/>Audit Log)]
    BUDGET -->|Allocate $50| AS1[Ad Server 1]
    BUDGET -->|Allocate $75| AS2[Ad Server 2]
    BUDGET -->|Allocate $100| AS3[Ad Server 3]
    AS1 -->|Spent: $42<br/>Return: $8| BUDGET
    AS2 -->|Spent: $68<br/>Return: $7| BUDGET
    AS3 -->|Spent: $95<br/>Return: $5| BUDGET
    BUDGET -->|Hourly reconciliation| POSTGRES
    TIMEOUT[Timeout Monitor<br/>5min intervals] -.->|Release stale<br/>allocations| REDIS
    REDIS -->|Budget < 10%| THROTTLE[Dynamic Throttle]
    THROTTLE -.->|Reduce allocation<br/>size $100→$10| BUDGET
    classDef advertiser fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef controller fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef server fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef storage fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef monitor fill:#ffe0b2,stroke:#e65100,stroke-width:2px
    class ADV advertiser
    class BUDGET,THROTTLE controller
    class AS1,AS2,AS3 server
    class REDIS,POSTGRES storage
    class TIMEOUT monitor
```
Diagram explanation:
Budget Controller (center): Manages the advertiser’s $10,000 daily budget across hundreds of ad servers. Uses Redis for fast atomic operations and PostgreSQL for durable audit logs.
Pre-Allocation (Budget → Ad Servers): Instead of checking budget on every impression, ad servers request budget chunks upfront (e.g., $50-$100). The Budget Controller atomically decrements the remaining budget using Redis `DECRBY`, preventing race conditions across distributed servers.
Return Unused Budget (Ad Servers → Budget): After serving ads, servers report actual spend. If they allocated $100 but only spent $95, they return $5 using Redis `INCRBY`. This prevents over-locking budget.
Timeout Monitor: Every 5 minutes, scans for allocations older than 10 minutes with no spend report. These likely represent crashed servers holding budget hostage. Automatically returns their allocations to the pool.
Dynamic Throttle: When budget drops below 10%, reduces allocation chunk size from $100 → $10. This prevents over-delivery when approaching budget exhaustion.
PostgreSQL Audit Log: All allocations and spends are logged to PostgreSQL for billing reconciliation and debugging, providing a durable audit trail.
Budget Allocation Algorithm:
The core algorithm has three operations:
1. Request Allocation:
- Ad server requests budget chunk (e.g., $100) from Budget Controller
- Controller atomically decrements remaining budget using Redis `DECRBY` (prevents race conditions)
- If budget is low (<10% remaining), reduce allocation size to prevent over-delivery
- Log allocation to PostgreSQL for audit trail
- Return allocated amount (or 0 if budget exhausted)
2. Report Spend:
- Ad server reports actual spend after serving ads
- If `actual < allocated`, return unused portion via Redis `INCRBY`
- Log actual spend to PostgreSQL for billing reconciliation
- Example: Allocated $100, spent $87 → return $13 to budget pool
3. Timeout Monitor (Background):
- Every 5 minutes, scan for allocations older than 10 minutes with no spend report
- These likely represent crashed servers holding budget hostage
- Automatically return their allocations to the budget pool via `INCRBY`
- Prevents budget being permanently locked by failed servers
Key design decisions:
- Redis for speed: Atomic counters provide strong consistency with <1ms latency
- PostgreSQL for durability: Audit log enables billing reconciliation and debugging
- Pre-allocation strategy: Servers get budget chunks upfront, avoiding per-request latency
- Dynamic sizing: Reduce allocation chunks when budget is low to minimize over-delivery risk
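A sketch of the allocate/report pair against Redis using redis-py; the key naming, the chunk-shrinking rule, and the stubbed PostgreSQL audit writes are illustrative, not the production protocol:

```python
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)   # placeholder connection

def allocate(campaign_id: str, chunk_cents: int, daily_budget_cents: int) -> int:
    """Atomically reserve a budget chunk; returns cents granted (0 if exhausted)."""
    key = f"budget:remaining:{campaign_id}"
    remaining = int(r.get(key) or 0)
    if remaining < daily_budget_cents * 0.10:          # low budget: shrink chunks 10x
        chunk_cents = max(chunk_cents // 10, 1)
    after = r.decrby(key, chunk_cents)                 # atomic reservation
    if after < 0:                                      # overshot: give the chunk back
        r.incrby(key, chunk_cents)
        return 0
    # audit_log.insert(campaign_id, "allocate", chunk_cents)  # PostgreSQL write, omitted here
    return chunk_cents

def report_spend(campaign_id: str, allocated_cents: int, spent_cents: int) -> None:
    """Return the unspent part of an allocation to the shared pool."""
    unused = allocated_cents - spent_cents
    if unused > 0:
        r.incrby(f"budget:remaining:{campaign_id}", unused)
    # audit_log.insert(campaign_id, "spend", spent_cents)     # PostgreSQL write, omitted here
```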
Over-delivery bound: $$\text{OverDelivery}_{max} = S \times A$$
Where (S) = number of servers, (A) = allocation per server
Example: 100 servers × 1% daily budget allocation = 100% max over-delivery (entire daily budget)
Mitigation: When budget <10%, reduce allocation: $$A_{new} = \frac{B_r}{S \times 10}$$
Reduces max over-delivery to ~1%.
Fraud Detection: Multi-Tier Filtering
Problem: Detect and block fraudulent ad clicks in real-time without adding significant latency.
Fraud Types:
- Click Farms: Bots or paid humans generating fake clicks
- SDK Spoofing: Fake app installations reporting ad clicks
- Domain Spoofing: Fraudulent publishers misrepresenting site content
- Ad Stacking: Multiple ads layered, only top visible but all “viewed”
Detection Strategy: Multi-Tier Filtering
```mermaid
graph TB
    REQ[Ad Request/Click] --> L1{L1: Bloom Filter<br/>Simple Rules<br/>0ms overhead}
    L1 -->|Known bad IP| BLOCK1[Block<br/>Bloom Filter]
    L1 -->|Pass| L2{L2: Behavioral<br/>5ms latency}
    L2 -->|Suspicious pattern| PROB[Probabilistic Block<br/>50% traffic]
    L2 -->|Pass| L3{L3: ML Model<br/>Async, 20ms}
    L3 -->|Fraud score > 0.8| BLOCK3[Post-Facto Block<br/>Refund advertiser]
    L3 -->|Pass| SERVE[Serve Ad]
    PROB --> SERVE
    SERVE -.->|Log for analysis| OFFLINE[Offline Analysis<br/>Update models]
    OFFLINE -.->|Update rules| L1
    OFFLINE -.->|Retrain model| L3
    classDef request fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef filter fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
    classDef block fill:#ffebee,stroke:#d32f2f,stroke-width:2px
    classDef warning fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef success fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef async fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    class REQ request
    class L1,L2,L3 filter
    class BLOCK1,BLOCK3 block
    class PROB warning
    class SERVE success
    class OFFLINE async
```
Diagram explanation:
L1 (Bloom Filter): Every request first checks against a Bloom filter containing 100K known fraudulent IPs. This adds virtually zero latency (~1 microsecond) and immediately blocks obvious repeat offenders. The filter has 0.01% false positive rate but zero false negatives - we never accidentally allow known fraudsters.
L2 (Behavioral Heuristics): Requests that pass L1 enter behavioral scoring with <5ms overhead. This synchronous layer uses simple rules to score suspicious patterns without ML complexity.
L3 (ML Model): After serving the ad (to maintain low latency), an async ML model analyzes the request with 50+ features. High fraud scores (>0.8) trigger post-facto blocking and advertiser refunds.
Offline Analysis: Logs from all tiers feed back into model retraining and rule updates, creating a continuous improvement loop.
L1: Bloom Filter for IP Blocklist
Why Bloom filters are ideal for fraud IP blocking:
At 1M+ QPS, we need to check every incoming request against a blocklist of known fraudulent IPs. A naive hash table would require significant memory (several MB per server) and introduce cache misses.
Bloom filters solve this perfectly:
Space efficiency: 10⁷ bits (only 1.25 MB) can represent 100K IPs with 0.01% false positive rate. That’s 100× more space-efficient than a hash table storing full IP addresses. The entire data structure fits in L2 cache, enabling sub-microsecond lookups.
Zero false negatives: If a Bloom filter says an IP is NOT in the blocklist, it’s guaranteed correct. We never accidentally allow known fraudsters through. The only risk is false positives (0.01% chance of blocking a legitimate IP), which is acceptable - we have L2/L3 filters to catch legitimate users.
Constant-time lookups: O(k) hash operations regardless of set size. With k=7 hash functions, we can check membership in <1 microsecond - adding virtually zero latency to the request path.
Lock-free reads: Multiple threads can query simultaneously without contention. Critical for handling 1M+ QPS across hundreds of cores.
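For illustration, a self-contained Bloom-filter sketch in Python: it mirrors the 10⁷-bit / k=7 figures above and derives the k positions from a single MD5 digest via double hashing. Everything else (class name, example IPs) is an assumption, not production code.

```python
# Illustrative Bloom filter for an IP blocklist: 10^7 bits, k = 7 hash positions
# derived from one MD5 digest via double hashing. Everything here is a sketch.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 10_000_000, k: int = 7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8 + 1)   # ~1.25 MB bit array

    def _positions(self, item: str):
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1   # odd stride
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

blocklist = BloomFilter()
blocklist.add("203.0.113.7")
assert "203.0.113.7" in blocklist      # known fraudster: never missed
print("198.51.100.1" in blocklist)     # unseen IP: almost always False
```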
L2: Behavioral Heuristics (<5ms overhead)
Scoring (0-1 scale, >0.6 triggers blocking):
- Click rate anomaly (40% weight): >30 clicks/min = bot behavior
- User agent switching (20% weight): >5 different agents from same IP = suspicious
- Geographic inconsistency (15% weight): US IP + Chinese browser language = potential fraud
- Abnormal CTR (25% weight): >50% CTR (clicking every ad) = clearly fake
Why not ML here? L2 must be synchronous (runs before ad serving). ML models (20ms) run async post-serving to avoid adding latency.
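A hedged sketch of how the weighted L2 score above could be computed synchronously; the request fields are assumptions, not an actual schema.

```python
# Hedged sketch of the synchronous L2 score; the request fields (clicks_per_min,
# distinct_user_agents, geo_country, browser_lang, session_ctr) are assumptions.
def behavioral_score(req) -> float:
    score = 0.0
    if req.clicks_per_min > 30:                  # click-rate anomaly (40%)
        score += 0.40
    if req.distinct_user_agents > 5:             # UA switching from one IP (20%)
        score += 0.20
    if req.geo_country == "US" and req.browser_lang.startswith("zh"):
        score += 0.15                            # geographic inconsistency (15%)
    if req.session_ctr > 0.50:                   # clicking (nearly) every ad (25%)
        score += 0.25
    return score

def should_block(req) -> bool:
    return behavioral_score(req) > 0.6           # synchronous, well under 5ms
```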
L3: ML Fraud Score (Async)
GBDT model with 50+ features:
- Device fingerprint entropy: Low entropy = scripted behavior
- Click timestamp distribution: Uniform intervals → bot, bursty → human
- Network ASN: Hosting provider vs ISP (data centers are suspicious)
- Session depth: Fraud typically has shallow sessions
Handling class imbalance (1% fraud):
- Adjust class weights: fraud=99, legitimate=1 in loss function
- SMOTE oversampling: generate synthetic fraud examples to balance training data
Economic trade-off: $$\text{Cost} = C_{fp} \times FP + C_{fn} \times FN$$
Because (C_{fn} \gg C_{fp}) (an advertiser refund costs far more than the revenue lost by blocking one legitimate click), we optimize for low false negatives: catch fraud even if we occasionally block legitimate traffic.
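One way to act on that asymmetry is to choose the blocking threshold by minimizing expected cost on a labeled validation set. A small numpy sketch, with illustrative per-event costs:

```python
# Sketch: choose the fraud-score threshold that minimizes expected cost
# Cost = C_fp * FP + C_fn * FN on a labeled validation set (costs illustrative).
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray,
                   c_fp: float = 0.05, c_fn: float = 2.00) -> float:
    """scores: model fraud scores in [0, 1]; labels: 1 = fraud, 0 = legitimate."""
    candidates = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in candidates:
        pred = scores >= t
        fp = np.sum(pred & (labels == 0))    # legitimate traffic we would block
        fn = np.sum(~pred & (labels == 1))   # fraud we let through (refund later)
        costs.append(c_fp * fp + c_fn * fn)
    return float(candidates[int(np.argmin(costs))])
```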
Multi-Region Failover
Timeline when US-East (60% traffic) fails:
- T+10s: Load balancer detects failure
- T+30s: US-West sees 3× traffic, CPU 40%→85%
- T+60s: Auto-scaling triggered
- T+90s: Cache hit rate drops 60%→45% (different user distribution)
- T+150s: New instances online, capacity restored
Queuing Theory Insight:
Server utilization: (\rho = \lambda / (c \mu))
When (\rho > 0.8), queue length grows rapidly as (\rho \to 1). When (\rho \geq 1.0), the system is unstable and the queue grows without bound.
Example:
- Normal: 10K QPS, 100 servers (150 QPS capacity) → (\rho = 0.67) ✓
- Failure: 30K QPS, 120 servers → (\rho = 1.67) ✗ Overload
Mitigation: Graceful Degradation
- Increase cache TTL: 30s → 300s
- Disable expensive features (skip ML, use rule-based targeting)
- Load shedding: Reject low-value traffic
Load Shedding Strategy:
When utilization >90%:
- Estimate revenue per request: (\text{CTR} \times \text{bid})
- Calculate P95 threshold
- Always accept high-value (>P95)
- Probabilistically reject 50% of low-value
Impact: Preserves ~97.5% of revenue while shedding 47.5% of traffic (this assumes request value is heavily skewed, so the rejected half of sub-P95 requests contributes only a small share of revenue).
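A minimal sketch of this revenue-aware admission check; the P95 value threshold would be refreshed from a sliding window of recent traffic, and all names and numbers are illustrative.

```python
# Sketch of revenue-aware load shedding; value_p95 would be refreshed from a
# sliding window of recent expected-revenue values (all names illustrative).
import random

class LoadShedder:
    def __init__(self, value_p95: float, reject_prob: float = 0.5):
        self.value_p95 = value_p95
        self.reject_prob = reject_prob

    def admit(self, ctr_estimate: float, bid: float, utilization: float) -> bool:
        if utilization <= 0.90:
            return True                          # normal operation: accept all
        expected_revenue = ctr_estimate * bid
        if expected_revenue >= self.value_p95:
            return True                          # always keep high-value requests
        return random.random() >= self.reject_prob   # shed ~50% of low-value traffic
```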
Standby capacity calculation: $$c_{standby} = c_{normal} \times \left(\frac{\lambda_{spike}}{\lambda_{normal}} - 1\right)$$
For 3× spike: 100 servers × (3-1) = 200 servers (200% over-provisioning)
Cost: Expensive. Alternative: accept degraded performance during rare failures.
Part 8: Observability and Operations
Service Level Indicators and Objectives
Key SLIs:
Service | SLI | Target | Why |
---|---|---|---|
Ad API | Availability | 99.9% | Revenue tied to successful serves |
Ad API | Latency | p95 <100ms, p99 <150ms | Mobile timeouts above 150ms |
ML | Accuracy | AUC >0.78 | Below 0.75 = 15%+ revenue drop |
RTB | Response Rate | >80% DSPs within 30ms | <80% = remove from rotation |
Budget | Consistency | Over-delivery <1% | >5% = refunds + complaints |
Error Budget Policy (99.9% = 43 min/month):
When budget exhausted:
- Freeze feature launches (critical fixes only)
- Focus on reliability work
- Mandatory root cause analysis
- Next month: 99.95% target to rebuild trust
Incident Response Dashboard
When ML latency alert fires at 2 AM, show:
Top-line (what’s broken):
- Current p99 latency vs SLO (145ms vs <80ms)
- Error rate vs SLO (2.3% vs <0.1%)
Resource utilization (service health):
- GPU utilization (65% = normal)
- Model version loaded (verify correct version)
Dependency latencies (find bottleneck):
- Feature fetch: 95ms p99 ← SPIKE from 5ms baseline
- Model inference: 40ms (normal)
Downstream health:
- Cassandra: disk_io_wait, pending_compactions, read_latency
Historical context:
- Related incidents: INC-1247 (2 weeks ago - Cassandra compaction)
Auto-suggested runbook:
- Check `nodetool cfstats` for compactions
- If storm detected: `nodetool stop compaction`
- Tune `compaction_throughput_mb_per_sec`
Distributed Tracing
Single user reports “ad not loading” among 1M+ req/sec:
Request ID: 7f3a8b2c...
Total: 287ms (VIOLATED SLO)
├─ API Gateway: 2ms
├─ User Profile: 45ms
│ └─ Redis: 43ms (normally 5ms)
│ └─ TCP timeout: 38ms
│ └─ Cause: Node failure, awaiting replica
├─ ML Inference: 156ms
│ └─ Batch incomplete: 8/32
│ └─ Cause: Low traffic (Redis failure reduced QPS)
└─ RTB: 84ms
Root cause: Redis node failure → cascading slowdown. Trace shows exactly why.
Security and Compliance
Service-to-Service Auth (mTLS):
# Istio PeerAuthentication: require mTLS mesh-wide
mtls:
  mode: STRICT                  # Reject non-mTLS traffic
---
# Istio AuthorizationPolicy (fragment): Ad Server may only call ML /predict, never Budget DB
rules:
- from:
  - source:
      principals: ["cluster.local/ns/prod/sa/ad-server"]
  to:
  - operation:
      paths: ["/predict"]       # ML inference only
PII Protection:
- Encryption at rest: KMS-encrypted Cassandra columns
- Column-level encryption: Only ML pipeline has decrypt permission
- Data minimization: Hashed user IDs, no email/name in ad requests
- Log scrubbing: `user_id=[REDACTED]`
Secrets: Vault with Dynamic Credentials
- Lease credentials auto-rotated every 24h
- Audit log: which service accessed what when
- Revoke access instantly if compromised
ML Data Poisoning Protection:
graph TD
  RAW[Raw Events] --> VALIDATE{Validation}
  VALIDATE --> CHECK1{CTR >3σ?}
  VALIDATE --> CHECK2{Low IP entropy?}
  VALIDATE --> CHECK3{Uniform timing?}
  CHECK1 -->|Yes| QUARANTINE[Quarantine]
  CHECK2 -->|Yes| QUARANTINE
  CHECK3 -->|Yes| QUARANTINE
  CHECK1 -->|No| CLEAN[Training Data]
  CLEAN --> TRAIN[Training] --> SIGN[GPG Sign]
  SIGN --> REGISTRY[Model Registry]
  REGISTRY --> INFERENCE[Inference]
  INFERENCE --> VERIFY{Verify GPG}
  VERIFY -->|Valid| LOAD[Load Model]
  VERIFY -->|Invalid| REJECT[Reject + Alert]
Three validation checks:
- CTR anomaly: >3σ spike (e.g., 2%→8%)
- IP entropy: <2.0 (clicks from narrow range = botnet)
- Temporal pattern: Uniform intervals (human=bursty, bot=mechanical)
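The three checks reduce to a few lines of scoring logic. A hedged Python sketch; the thresholds mirror the list above, while the `batch` fields are assumptions rather than the production pipeline.

```python
# Hedged sketch of the three checks; thresholds and the `batch` fields are
# illustrative assumptions, not the production pipeline.
import math
from collections import Counter

def ctr_anomaly(ctr_today: float, ctr_mean: float, ctr_std: float) -> bool:
    return abs(ctr_today - ctr_mean) > 3 * ctr_std        # >3σ spike

def ip_entropy(click_ips: list[str]) -> float:
    counts = Counter(click_ips)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def uniform_timing(timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return False
    mean = sum(gaps) / len(gaps)
    if mean == 0:
        return True                                        # identical timestamps
    std = (sum((g - mean) ** 2 for g in gaps) / len(gaps)) ** 0.5
    return std / mean < cv_threshold                       # near-constant gaps → bot

def quarantine(batch) -> bool:
    return (ctr_anomaly(batch.ctr, batch.ctr_mean, batch.ctr_std)
            or ip_entropy(batch.click_ips) < 2.0
            or uniform_timing(batch.click_timestamps))
```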
Model integrity: a GPG signature prevents loading tampered models even if storage is compromised.
Data Lifecycle and GDPR
Retention policies:
Data | Retention | Rationale |
---|---|---|
Raw events | 7 days | Real-time only; archive to S3 |
Aggregated metrics | 90 days | Dashboard queries |
Model training data | 30 days | Older data less predictive |
User profiles | 365 days | GDPR; inactive purged |
Audit logs | 7 years | Legal compliance |
GDPR “Right to be Forgotten”:
Deletion across 10+ systems in parallel:
- Cassandra: DELETE user_profiles
- Redis: FLUSH user keys
- Kafka: Publish tombstone (log compaction)
- ML training: Mark deleted
- Backups: Crypto erasure (delete encryption key)
Verification: All systems confirm → send deletion certificate to user within 48h.
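A sketch of the parallel fan-out and confirmation step using asyncio; each per-system delete callable is an assumed interface, not real client code.

```python
# Sketch of the parallel deletion fan-out; each entry in `systems` is an assumed
# async callable delete(user_id) -> bool that returns True once purged.
import asyncio

async def delete_everywhere(user_id: str, systems: dict) -> bool:
    tasks = {name: asyncio.create_task(fn(user_id)) for name, fn in systems.items()}
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    failures = [name for name, res in zip(tasks, results)
                if isinstance(res, Exception) or res is not True]
    if failures:
        # Retry/escalate well before the 48h SLA; no certificate yet.
        raise RuntimeError(f"Deletion incomplete in: {failures}")
    return True   # all systems confirmed → issue deletion certificate
```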
Part 9: Production Operations at Scale
Progressive Rollout Strategy
graph TD
  CODE[Main Branch] --> BUILD[CI Build]
  BUILD --> CANARY{Canary<br/>1%, 15min}
  CANARY -->|OK| STAGE1[10%<br/>US-East, 30min]
  CANARY -->|Fail| ROLLBACK1[Auto-rollback]
  STAGE1 -->|OK| STAGE2[50%<br/>US-East+West, 1hr]
  STAGE1 -->|Fail| ROLLBACK2[Auto-rollback]
  STAGE2 -->|OK| STAGE3[100%<br/>All regions, 24hr]
  STAGE2 -->|Fail| ROLLBACK3[Auto-rollback]
  STAGE3 --> COMPLETE[Complete]
Automated gates:
- Error rate: <0.1% increase
- Latency: p95 <110ms, p99 <150ms
- Revenue drop: <2%
- Circuit breakers: no new opens
- Database health: latency within range
Feature flags for blast radius control:
ml_inference_v2:
enabled: true
rollout: 5%
targeting: ["internal_testing"]
kill_switch: true
Benefits:
- Deploy code dark, enable gradually
- Instant rollback via flag (no redeploy)
- A/B testing built-in
Deep Observability: Error Budgets in Practice
Error budget example (99.9% = 43 min/month):
Burn rate alert: if the budget is being consumed at more than 10× the sustainable rate, the entire monthly budget is gone in roughly three days - page on-call immediately rather than waiting for the monthly review.
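A small sketch of a burn-rate check in the spirit of the SRE workbook's multi-window alerts; the 10× factor and window sizes are the tunable parts, and nothing here is taken from a specific monitoring product.

```python
# Sketch of a multi-window burn-rate check for a 99.9% availability SLO;
# the 10x factor and window sizes are tunables, not values from this post.
SLO = 0.999
BUDGET = 1 - SLO                       # 0.1% of requests may fail

def burn_rate(bad: int, total: int) -> float:
    """1.0 means we are consuming the budget exactly at the sustainable rate."""
    if total == 0:
        return 0.0
    return (bad / total) / BUDGET

def should_page(bad_1h: int, total_1h: int, bad_6h: int, total_6h: int) -> bool:
    # Require both a short and a long window to burn fast, which avoids paging
    # on brief blips while still catching sustained incidents early.
    return burn_rate(bad_1h, total_1h) > 10 and burn_rate(bad_6h, total_6h) > 10
```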
Why 99.9% not 99.99%?
- 99.99% requires multi-region active-active (2× infrastructure cost)
- Business impact analysis:
- 99.9% downtime: ~43 min/month
- 99.99% downtime: ~4 min/month
- ROI calculation: 2× infrastructure cost vs 10× reduction in downtime
- At current scale, incremental revenue gain doesn’t justify doubling infrastructure spend
Cost Management at Scale
Resource attribution by team:
Resource | Owner | Charge-back Model |
---|---|---|
Redis (1000 nodes) | Platform | GB-hours per service |
Cassandra (500 nodes) | Data | Storage + IOPS per table |
ML GPUs (100× A100) | ML | GPU-hours (training vs inference) |
Kafka (200 brokers) | Data | GB ingested + stored per topic |
K8s (5000 pods) | Compute | vCPU-hours + RAM-hours per namespace |
Cost optimization:
- Right-sizing: CPU usage should average 70% (not 30%)
- Spot instances for ML training: 70% cheaper, tolerate interruptions
- Tiered storage: Hot (7d) → Warm (30d) → Cold (365d) = 67% savings
- Reserved instances: 60% baseline reserved = 0.6× rate, 40% on-demand
Relative cost relationships:
- Memory-optimized: 1.5-2× baseline
- GPU instances: 10-30× baseline
- NVMe storage: 3-5× standard disk
- Network egress: 10-100× ingress
- Spot: 0.3× on-demand
- Reserved: 0.6× on-demand
Key metrics:
- vCPU-ms per request (efficiency trend)
- Month-over-month consumption change (>15% = investigate)
- Anomaly detection: >2σ spike → alert
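The >2σ rule is easy to encode; a toy sketch over a daily series of vCPU-ms per request, with illustrative numbers.

```python
# Toy sketch of the >2σ cost-anomaly rule over daily vCPU-ms per request.
import statistics

def cost_anomaly(history: list[float], today: float, sigmas: float = 2.0) -> bool:
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return today != mean
    return abs(today - mean) > sigmas * std

# Steady ~1.20 vCPU-ms/request, today jumps to 1.55 → alert (values illustrative)
print(cost_anomaly([1.18, 1.22, 1.19, 1.21, 1.20, 1.23, 1.19], 1.55))   # True
```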
Part 10: Strategic Considerations
Architecture to Business Value
Every technical decision maps to business KPIs:
Architecture | Technical Metric | Business KPI | Impact |
---|---|---|---|
Sub-100ms latency | p95 <100ms | CTR | +100ms = -1-2% CTR |
Multi-tier caching | 95% hit rate | Infrastructure cost | 20× database load reduction |
ML CTR (AUC >0.78) | Model accuracy | Revenue/impression | AUC 0.75→0.80 = +10-15% revenue |
Budget pacing | Over-delivery <1% | Advertiser retention | >5% causes churn |
Fraud detection | False positive <0.1% | Platform reputation | 1% invalid traffic = complaints |
Progressive rollouts | Deployment failure <1% | Velocity | Saves 500 engineer-hours/year |
Error budgets | 99.9% SLO | SLA compliance | Breach = 10-30% penalty |
GDPR deletion | <48h SLA | Legal compliance | Violation = 4% revenue fine |
Key insights:
- Latency is revenue: Sub-100ms is competitive advantage
- ML quality compounds: 0.05 AUC improvement = 10-15% revenue
- Operational excellence enables velocity: Ship 3-5× faster with 10× lower risk
- Cost efficiency unlocks scale: Optimizing vCPU-ms per request compounds significantly at 1M+ QPS
- Compliance is existential: GDPR fines can be 4% of global revenue
Evolutionary Architecture
Scenario 1: Video Ads
New requirements: 500KB-2MB files, multi-bitrate streaming, VAST 4.0
Changes:
- Add video CDN (separate from image CDN)
- Video transcoding service
- Client-side viewability SDK
- New metrics: video completion rate
Stays same:
- ML inference (same pipeline, different features)
- Budget pacing (same algorithm)
- Fraud detection (same behavioral analysis)
- Observability (same SLIs/SLOs)
Migration: 6-month phased rollout, 1%→50%→100% traffic
Cost impact: Video CDN 10-20× more expensive. At 10M impressions/day × 1MB = 10TB/day CDN egress.
Scenario 2: Transformer Models
New requirements: GBDT → Transformer for 15-20% CTR improvement
Changes:
- 50-100 A100 GPUs (significant infrastructure investment)
- User history service (last 100 page views)
- Embedding cache (512-dim vectors, 10s TTL)
- Triton Inference Server
- Batch 16-32 requests (+3ms latency)
Stays same:
- Feature store architecture
- Training pipeline (different model type)
- Deployment process (same progressive rollout)
A/B test requirement:
Metric | Control (GBDT) | Treatment (Transformer) | Decision |
---|---|---|---|
Revenue/impression | Baseline | 1.143× baseline | +14.3% (good) |
P99 latency | 87ms | 103ms | +18.4% (concerning) |
Don’t ship yet. Options:
- Optimize inference to <95ms
- Find 10ms savings elsewhere
- Negotiate new SLO: “103ms acceptable for 14% revenue lift”
Balancing User Trust vs Revenue
Core principle: Optimize for long-term user trust, which is the foundation for sustainable revenue.
Guardrail metrics (all experiments must pass):
Guardrail | Threshold | Why |
---|---|---|
Session duration | Must not decrease >2% | Ads driving users away |
Latency p99 | <100ms | Slow = reduced CTR + satisfaction |
Ad block rate | Must not increase >0.5% | Poor quality signal |
Error rate | <0.1% | Technical failures erode trust |
Example: Interstitial Ads Experiment
Metric | Control | Treatment | Delta | Decision |
---|---|---|---|---|
Revenue/user | Baseline | 1.47× baseline | +47% | Good |
Session duration | 12.3 min | 9.8 min | -20% | Violated |
Ad block rate | 2.1% | 3.9% | +86% | Violated |
Do not ship. Short-term +47% revenue, but -20% session duration and +86% ad blocking signals unacceptable user experience. Invest in better ML models or native formats instead.
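A hedged sketch of how guardrail evaluation could be automated for such experiments; the metric names, thresholds, and the latency/error values not shown in the table above are assumptions.

```python
# Hedged sketch of guardrail evaluation; metric names, thresholds, and the
# latency/error values (not shown in the table above) are assumptions.
def relative_delta(control: float, treatment: float) -> float:
    return (treatment - control) / control

def passes_guardrails(control: dict, treatment: dict) -> bool:
    return all([
        relative_delta(control["session_min"], treatment["session_min"]) >= -0.02,
        relative_delta(control["ad_block_rate"], treatment["ad_block_rate"]) <= 0.005,
        treatment["latency_p99_ms"] < 100,
        treatment["error_rate"] < 0.001,
    ])

# Interstitial-ads example: -20% session time and +86% ad blocking fail the check
control = {"session_min": 12.3, "ad_block_rate": 0.021,
           "latency_p99_ms": 90, "error_rate": 0.0005}
treatment = {"session_min": 9.8, "ad_block_rate": 0.039,
             "latency_p99_ms": 92, "error_rate": 0.0005}
print(passes_guardrails(control, treatment))   # False → do not ship
```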
Designing for Failure: Chaos Engineering
SRE team mandate: Regularly inject failures into production during business hours.
Experiments:
Weekly: Pod termination
- Randomly kill 10% of pods during peak
- Expected: K8s reschedules in 30s, p99 <100ms
- Success: Zero user-facing errors
Bi-weekly: Latency injection
- Inject +50ms between ad server and Redis
- Expected: Circuit breakers trip, fallback to stale cache
- Success: Error rate <0.1%
Monthly: Regional failover
- Simulate datacenter outage (mark us-east-1 unhealthy)
- Expected: GLB redirects to us-west-2 in 60s
- Success: Revenue loss <2%, zero data loss
Quarterly: Database failure
- Take down 2/5 Cassandra nodes
- Expected: RF=3, CL=QUORUM continues serving
- Success: No user impact, repairs within 1h
Culture:
- Announce 24h advance (but not exact time)
- On-call monitors, ready to halt if thresholds breached
- Post-mortem documents gaps, file fixes
Blast radius limits:
- Affect ≤25% traffic initially
- Auto-rollback if error >0.5% or latency >150ms
- Avoid peak revenue hours
Managing Technical Debt
20% of engineering capacity per cycle allocated to debt paydown.
Why 20%?
- Too low (5-10%): debt compounds → system becomes brittle
- Too high (40-50%): over-investing in perfection vs learning from real usage
Prioritization framework:
Impact | Velocity | Risk | Priority | Example |
---|---|---|---|---|
Critical | High | High | P0 | DB migration with no rollback |
High | High | Low | P1 | Monolithic service blocking 3 teams |
Medium | Low | High | P1 | Flaky test (10% failure) |
Low | Low | Low | P2 | Code style inconsistency |
Cultural norm: “Leave it better than you found it”
- Add test if you find poor coverage
- Improve logging if you debug an issue
- Update runbook if you find it outdated
Part 11: Resilience and Failure Scenarios
A robust architecture must survive catastrophic failures, security breaches, and business model pivots. This section addresses three critical scenarios:
Catastrophic Regional Failure: When an entire AWS region fails, our semi-automatic failover mechanism combines Route53 health checks (2-minute detection) with manual runbook execution to promote secondary regions. The critical challenge is budget counter consistency—asynchronous Redis replication creates potential overspend windows during failover. We mitigate this through pre-allocation patterns that limit blast radius to allocated quotas per ad server, bounded by replication lag multiplied by allocation size.
Malicious Insider Attack: Defense-in-depth through zero-trust architecture (SPIFFE/SPIRE for workload identity), mutual TLS between all services, and behavioral anomaly detection on budget operations. Critical financial operations like budget allocations require cryptographic signing with Kafka message authentication, creating an immutable audit trail. Lateral movement is constrained through Istio authorization policies enforcing least-privilege service mesh access.
Business Model Pivot to Guaranteed Inventory: Transitioning from auction-based to guaranteed delivery requires strong consistency for impression quotas. Rather than replacing our eventual-consistent stack, we extend the existing pre-allocation pattern—PostgreSQL maintains source-of-truth counters while Redis provides fast-path allocation with periodic reconciliation. This hybrid approach adds only 10ms to the critical path for guaranteed campaigns while preserving sub-millisecond performance for auction traffic. The 12-month evolution path reuses 80% of existing infrastructure (ML pipeline, feature store, Kafka) while adding campaign management and SLA tracking layers.
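To make the hybrid pattern concrete, here is a hedged sketch of fast-path Redis decrements backed by a PostgreSQL source-of-truth quota table. The schema, key names, block size, and client objects are assumptions, and retry, concurrency, and reconciliation logic is deliberately omitted.

```python
# Hedged sketch of the hybrid pre-allocation pattern: Redis fast path backed by a
# PostgreSQL source-of-truth quota table. Schema, key names, block size, and the
# DB/Redis client objects are assumptions; races and reconciliation are ignored.
BLOCK = 1000   # impressions moved from PostgreSQL to Redis per refill

def reserve_block(pg, redis, campaign_id: str) -> bool:
    """Slow path: atomically move a quota block from the source of truth to Redis."""
    with pg:                                   # transaction (DB-API connection)
        cur = pg.cursor()
        cur.execute(
            "UPDATE quotas SET remaining = remaining - %s "
            "WHERE campaign_id = %s AND remaining >= %s",
            (BLOCK, campaign_id, BLOCK))
        if cur.rowcount == 0:
            return False                       # guaranteed quota exhausted
    redis.incrby(f"quota:{campaign_id}", BLOCK)
    return True

def try_serve(pg, redis, campaign_id: str) -> bool:
    """Fast path on the serving hot path: a single sub-millisecond Redis decrement."""
    if redis.decrby(f"quota:{campaign_id}", 1) >= 0:
        return True
    redis.incrby(f"quota:{campaign_id}", 1)    # undo the over-decrement
    return reserve_block(pg, redis, campaign_id) and try_serve(pg, redis, campaign_id)
```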
These scenarios validate that the architecture is not merely elegant on paper, but battle-hardened for production realities: regional disasters, adversarial threats, and fundamental business transformations.
Part 12: Organizational Design
Team Structure
1. Ad Platform Team
- Ad serving API, auction, budget pacing, fraud
- RTB integration (OpenRTB)
- SLA: 99.9% uptime, p99 <100ms
2. ML Platform Team
- Model training/serving
- Feature stores
- Experimentation platform
- SLA: AUC >0.78, p99 <20ms inference
3. Data Platform Team
- Kafka infrastructure
- Data warehouse
- GDPR compliance tooling
- SLA: 10-min freshness, 48h deletion
4. SRE/Infrastructure Team
- Kubernetes (200+ nodes)
- Redis/Cassandra operations
- Observability infrastructure
- Cost optimization
Cross-Team Collaboration
API contracts as team boundaries:
ML Team provides:
POST /ml/predict/ctr
SLA: p99 <20ms, 99.95% availability
Ad Team consumes without knowing:
- Model type (GBDT vs Transformer)
- Feature computation
- GPU cluster location
Shared observability: All teams contribute to shared dashboards showing business, ML, and infrastructure metrics.
Quarterly roadmap alignment:
Team | Q3 Goals | Dependencies |
---|---|---|
Ad Platform | Launch video ads | Data: Video CDN ML: Engagement model |
ML Platform | Improve AUC 0.78→0.82 | Data: Browsing history SRE: 50 GPUs |
Data Platform | GDPR deletion 48h→24h | SRE: Cassandra upgrade |
SRE | Reduce cost 10% | All: Adopt spot instances |
Tiered on-call:
- L1 (Ad Platform): First responder
- L2 (ML/Data): Backend escalation
- L3 (SRE): Infrastructure escalation
Example incident flow:
12:00 AM: Latency alert (250ms)
12:01 AM: L1 paged
12:02 AM: L1 sees ML latency spike → escalate L2
12:05 AM: L2 finds GPU throttling → escalate L3
12:10 AM: L3 identifies HVAC failure → failover
12:15 AM: Latency restored
Conclusion
This design exercise explored both Day 1 architecture and Day 2 operational realities for a large-scale ads platform:
Architecture:
- Requirements: 400M DAU, 1.5M QPS, sub-100ms
- Multi-tier caching achieving 95% hit rates
- RTB integration via OpenRTB with 30ms auctions
- Real-time ML inference within 40ms budget
- GSP auction mechanisms with game-theoretic properties
- Budget pacing, fraud detection, multi-region failover
Operations:
- Progressive rollouts with automated gates
- SLIs, SLOs, error budgets
- mTLS, PII encryption, secrets management
- Schema evolution, GDPR compliance
- Cost management and attribution
Key Takeaways:
Latency Budget Discipline: Every millisecond counts. Decompose the 100ms budget and optimize critical paths.
Caching is Critical: 95%+ hit rates achievable with three-tier architecture (L1 in-process + L2 Redis + L3 database).
Mathematical Foundations: From queueing theory to game theory to probability, rigorous analysis prevents costly production issues.
Distributed Systems Trade-Offs: Choose consistency level based on data type: strong for financial, eventual for preferences.
Graceful Degradation: Circuit breakers, timeouts, fallbacks prevent cascading failures. Load shedding protects during overload.
Operations Are Not an Afterthought: Progressive rollouts, observability, security, and cost governance are essential for running at scale.
Security and Compliance Are Design Constraints: PII protection and GDPR require architectural decisions from day one. Retrofitting is orders of magnitude harder.
Cost Scales Faster Than Revenue: Without active management, infrastructure spending spirals. Treat cost per request as a first-class metric.
While this remains a design exercise, the patterns explored represent well-established distributed systems approaches. The value was applying principles - queueing theory, game theory, consistency models, resilience patterns - to real-time ad serving constraints.
For anyone actually building such a system:
- Start simpler - most don’t need 1M QPS from day one
- Evolve based on real traffic rather than speculative requirements
- Invest in observability early - you can’t optimize what you can’t measure
- Build security/compliance into foundation - nearly impossible to retrofit
- Track costs from the beginning - they compound quickly at scale
The complexity documented here represents years of evolution. Understanding it helps make better early decisions that won’t need unwinding later.
References
Distributed Systems:
1. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly.
2. Dean, J., & Barroso, L. A. (2013). The Tail at Scale. CACM, 56(2).
Caching:
3. Fitzpatrick, B. (2004). Distributed Caching with Memcached. Linux Journal.
4. Nishtala, R., et al. (2013). Scaling Memcache at Facebook. NSDI.
Auction Theory:
5. Vickrey, W. (1961). Counterspeculation, Auctions, and Competitive Sealed Tenders. Journal of Finance.
6. Edelman, B., et al. (2007). Internet Advertising and the Generalized Second-Price Auction. AER.
7. Myerson, R. (1981). Optimal Auction Design. Mathematics of Operations Research.
Machine Learning:
8. McMahan, H. B., et al. (2013). Ad Click Prediction: A View from the Trenches. KDD.
9. He, X., et al. (2014). Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD.
Real-Time Bidding:
10. IAB Technology Laboratory (2016). OpenRTB API Specification Version 2.5.
11. Yuan, S., et al. (2013). Real-Time Bidding for Online Advertising: Measurement and Analysis. ADKDD.
Fraud Detection:
12. Pearce, P., et al. (2014). Characterizing Large-Scale Click Fraud in ZeroAccess. ACM CCS.
Observability:
13. Beyer, B., et al. (2016). Site Reliability Engineering. O'Reilly.
14. Beyer, B., et al. (2018). The Site Reliability Workbook. O'Reilly.
15. Sigelman, B. H., et al. (2010). Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Technical Report.
Security & Compliance:
16. GDPR Regulation (EU) 2016/679. General Data Protection Regulation.
Cost Optimization:
17. AWS Well-Architected Framework (2024). Cost Optimization Pillar.
18. Cloud FinOps Foundation (2024). FinOps Framework.