Real-Time Ads Platform: System Foundation & Latency Engineering

Introduction: The Challenge of Real-Time Ad Serving at Scale

Full disclosure: I’ve never built an ads platform before. This is a design exercise - a cognitive workout to keep engineering thinking sharp.

Why Real-Time Ads?

I chose this domain as a deliberate cognitive workout - a concept from Psychology Today about training engineering thinking as AI tools get more powerful. Real-time ads forces specific mental disciplines: 150ms latency budgets train decomposition skills (you can’t handwave “make it fast” when RTB takes 100ms alone), financial accuracy demands consistency modeling (which data needs strong consistency vs eventual), and 1M QPS coordination tests failure handling (when cache servers die, does the database melt down?). These aren’t abstract exercises - they’re the foundation for effective engineering decisions regardless of tooling.

What makes ad platforms compelling: every click has measurable value, every millisecond of latency has quantifiable revenue impact. A user opens an app, sees a relevant ad in under 150ms, clicks it, and the advertiser gets billed. Simple? Not when you’re coordinating real-time auctions across 50+ bidding partners with 100ms timeouts, running ML predictions in <40ms, and handling 1M+ queries per second.

Target scale:

    • 400M daily active users, averaging ~20 ad requests per user per day
    • 1M QPS sustained, 1.5M QPS peak capacity
    • 150ms p95 end-to-end latency budget
    • 50+ external DSP bidding partners per auction

What this post covers:

Building the architectural foundation requires making high-stakes decisions that cascade through every component. This post establishes the critical foundation:

Why this foundation is critical:

Every architectural decision made here creates constraints and opportunities for the entire system:

Get these wrong and you’re building the wrong system. Underestimate latency budgets and you violate SLOs, losing revenue. Misunderstand resilience needs and peak traffic brings cascading failures.

The ad tech industry uses specialized terminology. Let’s establish a common vocabulary before diving into the architecture.

Glossary - Ad Industry Terms

Programmatic Advertising: Automated buying and selling of ad inventory through real-time auctions. Contrasts with direct sales (guaranteed deals with fixed pricing).

SSP (Supply-Side Platform): Platform that publishers use to sell ad inventory. Runs auctions and connects to multiple DSPs to maximize revenue.

DSP (Demand-Side Platform): Platform that advertisers/agencies use to buy ad inventory across multiple publishers. Examples: Google DV360, The Trade Desk, Amazon DSP.

RTB (Real-Time Bidding): Programmatic auction protocol where ad impressions are auctioned in real-time (~100ms) as users load pages/apps. Each impression triggers a bid request to multiple DSPs.

OpenRTB: Industry standard protocol (maintained by IAB Tech Lab) defining the format for RTB communication. Current version: 2.6. Specifies JSON/HTTP format for bid requests and responses.

IAB (Interactive Advertising Bureau): Industry trade organization that develops technical standards (OpenRTB, VAST, VPAID) and provides viewability guidelines for digital advertising.

Pricing Models:

    • CPM (Cost Per Mille): advertiser pays per 1,000 impressions served
    • CPC (Cost Per Click): advertiser pays only when the ad is clicked
    • CPA (Cost Per Action): advertiser pays only when a conversion (install, purchase, signup) occurs

eCPM (Effective Cost Per Mille): Metric that normalizes different pricing models (CPM/CPC/CPA) to “revenue per 1000 impressions” for comparison. Formula: \(eCPM = \frac{\text{Total Earnings}}{\text{Total Impressions}} \times 1000\). Used to rank ads fairly in auctions.

CTR (Click-Through Rate): Percentage of ad impressions that result in clicks. Formula: \(CTR = \frac{\text{Clicks}}{\text{Impressions}} \times 100\). Typical range: 0.5-2% for display ads. Critical for converting CPC bids to eCPM.

With this terminology established, we can now define the system requirements that will drive our architectural decisions.

Requirements and Constraints

Functional Requirements

The system must deliver four core capabilities:

1. Multi-Format Ad Delivery

The platform needs to support all standard ad formats: story ads, video ads, carousel ads, and AR-enabled ads across iOS, Android, and web. Creative assets are served from a CDN targeting sub-100ms first-byte time.

2. Real-Time Bidding (RTB) Integration

The platform implements OpenRTB 2.5+ to coordinate with 50+ demand-side platforms (DSPs) simultaneously. Industry standard RTB timeouts range from 100-200ms, with most platforms targeting 100ms to balance revenue and user experience.

This creates an interesting challenge: executing 50+ parallel network calls within 100ms when some DSPs are geographically distant (NY-Asia RTT: 200-300ms). The system must handle both programmatic and guaranteed inventory with different SLAs and business logic.

3. ML-Powered Targeting and Optimization

Machine learning drives revenue optimization through:

    • CTR prediction - estimating the probability that a given user clicks a given ad
    • Bid normalization - converting CPM/CPC/CPA bids into comparable eCPM scores for auction ranking
    • Targeting - matching ads to identity-based or contextual signals, depending on what is available

4. Campaign Management

The system provides real-time performance metrics, A/B testing frameworks, frequency capping (limiting ad repetition), quality scoring, and policy compliance.

Architectural Drivers: The Four Non-Negotiables

Before diving into non-functional requirements, we need to establish the four immutable constraints that guide every design decision. Understanding these upfront helps explain the architectural choices throughout this post.

Driver 1: Latency (150ms p95 end-to-end)

Why this matters: Mobile apps timeout after 150-200ms. Users expect ads to load instantly - if your ad is still loading when the page renders, you show a blank space and earn no revenue.

Amazon’s 2006 study found that every 100ms of added latency costs ~1% of sales (this widely-cited metric originates from Amazon’s internal A/B testing, first publicly mentioned by Greg Linden and later referenced by Marissa Mayer at Google; see Kohavi & Longbotham 2007, “Online Controlled Experiments at Large Scale”). In advertising, this translates directly: slower ads = fewer impressions = less revenue.

At our target scale of 1M queries per second, breaching the 150ms timeout threshold means mobile apps give up waiting, resulting in blank ad slots and complete revenue loss on those requests.

The constraint: Maintain 150ms p95 end-to-end latency for the complete request lifecycle - from when the user opens the app to when the ad displays.

Driver 2: Financial Accuracy (Zero Tolerance)

Why this matters: Advertising is a financial transaction. When an advertiser sets a campaign budget, they expect to spend exactly that amount - not 5% more or 5% less.

Billing discrepancies above 2-5% are considered material in industry practice and can trigger lawsuits. Even 1% errors generate complaints and credit demands. Beyond legal risk, billing errors destroy advertiser trust.

The specific billing accuracy thresholds (≤1% target, <2% acceptable, >5% problematic) come from industry best practices and contractual SLAs rather than explicit regulations, though regulatory frameworks (FTC, EU Digital Services Act) do mandate transparent billing.

The constraint: Achieve ≤1% billing accuracy for all advertiser spend. Under-delivery (spending less than budget) costs revenue; over-delivery (spending more than budget) causes legal and trust issues.

Driver 3: Availability (99.9%+ Uptime)

Why this matters: Unlike many services where downtime is annoying but tolerable, ad platforms lose revenue for every second they’re unavailable. No availability = no ads = no money.

A 99.9% uptime target means 43 minutes of allowed downtime per month. This error budget must cover all sources of unavailability. However, through zero-downtime deployment and migration practices (detailed later in Part 4), we can eliminate planned downtime entirely, reserving the full 43 minutes for unplanned failures.

The constraint: Maintain 99.9%+ availability with the system remaining operational even when individual components fail. All planned operations (deployments, schema changes, configuration updates) must be zero-downtime.

Driver 4: Signal Availability (Privacy-First Reality)

Why this matters: AdTech in 2024/2025 is defined by signal loss. Third-party cookies are dying (Chrome Privacy Sandbox), mobile identifiers are restricted (iOS ATT), and privacy regulations (GDPR, CCPA) limit data collection. The assumption that rich “User Profiles” are always available via stable user_id is increasingly false.

The traditional ad tech stack assumed: request arrives → look up user → personalize ad. This breaks when:

    • iOS ATT opt-outs remove the device identifier
    • Safari, Firefox, and increasingly Chrome block third-party cookies
    • the user is simply new, with no history to look up

The constraint: Design for graceful signal degradation. The system must serve relevant, revenue-generating ads across the full spectrum: from rich identity (logged-in users with full history) to zero identity (anonymous first-visit). This isn’t an edge case - it’s 40-60% of traffic on mobile inventory.

Impact on architecture: The User Profile Service becomes a dual-mode system - identity-based enrichment when available, contextual-only targeting as the primary fallback. ML models must be trained on contextual features (page content, device type, time of day, geo) as first-class signals, not afterthoughts. Revenue expectations must account for lower CPMs on contextual-only inventory (typically 30-50% lower than behaviorally-targeted inventory, though conversion efficiency can be comparable).

When These Constraints Conflict:

These four drivers sometimes conflict with each other. For example, ensuring financial accuracy may require additional verification steps that add latency. Maximizing availability might mean accepting some data staleness that could affect billing precision. Signal availability constraints may force simpler models that reduce revenue optimization.

When trade-offs are necessary, we prioritize:

Financial Accuracy > Availability > Signal Availability > Latency

Rationale: Legal and trust issues from billing errors have longer-lasting impact than temporary downtime; downtime has more severe consequences than privacy-compliant degradation; serving a slightly less personalized ad is better than timing out. Throughout this post, when you see architectural decisions that seem to sacrifice latency or personalization, they’re usually protecting financial accuracy or privacy compliance.

Non-Functional Requirements: Performance Modeling

Formalizing the performance constraints:

Latency Distribution Constraint: $$P(\text{Latency} \leq 150\text{ms}) \geq 0.95$$

This constraint requires 95% of requests to complete within 150ms. Total latency is the sum of all services in the request path:

$$T_{total} = \sum_{i=1}^{n} T_i$$

where \(T_i\) is the latency of each service. With Real-Time Bidding (RTB) requiring 100-120ms for external DSP responses, plus internal services (ML inference, user profile, ad selection), the 150ms budget requires careful allocation.

Strict latency budgets are critical: incremental service calls (“only 10ms each”) compound quickly. The 150ms SLO aligns with industry standard RTB timeout (100-120ms) while maintaining responsive user experience.

Latency Budget Breakdown:

The 150ms total accommodates industry-standard RTB timeout (100ms) while maintaining responsive user experience. Internal services are optimized for <50ms to leave budget for external DSP calls.

RTB Latency Reality Check: The 100ms RTB budget is aggressive given global network physics (NY-London: 60-80ms RTT, NY-Asia: 200-300ms RTT). Understanding RTB timeouts requires distinguishing between specification and operational practice:

Achieving practical 50-70ms operational targets while maintaining 100ms as fallback requires three optimizations:

  1. Geographic sharding - Regional ad server clusters call geographically-local DSPs only (15-25ms RTT)
  2. Dynamic bidder health scoring - De-prioritize or skip consistently slow/low-value DSPs
  3. Adaptive early termination - Progressive auction at 50ms, 70ms, 80ms cutoffs capturing 95-97% revenue

Without these optimizations, global DSP calls would routinely exceed 100ms. Geographic sharding and adaptive timeout strategies are covered in detail in Part 2’s RTB integration section.

Throughput Requirements:

Target peak load: $$Q_{peak} \geq 1.5 \times 10^6 \text{ QPS}$$

Using Little’s Law to relate throughput, latency, and concurrency. With service time \(S\) and \(N\) servers: $$N = \frac{Q_{peak} \times S}{U_{target}}$$

where \(U_{target}\) is target utilization. This fundamental queueing theory relationship helps us understand the capacity needed to handle peak traffic while maintaining acceptable response times.
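As a rough worked example (the 50ms in-flight service time, 70% utilization target, and per-instance concurrency are assumptions for illustration, not measured values):

$$N = \frac{1.5 \times 10^6 \text{ QPS} \times 0.05\text{ s}}{0.7} \approx 1.07 \times 10^5 \text{ concurrent requests}$$

At roughly 1,000 concurrent requests per instance, that implies on the order of 100+ instances for the request-handling tier before adding headroom for failover.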

Availability Constraint:

Target “three nines” (99.9% uptime): $$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \geq 0.999$$

where MTBF = Mean Time Between Failures, MTTR = Mean Time To Recovery.

This translates to 43 minutes of allowed downtime per month. Through zero-downtime deployments (detailed in Part 4), we eliminate planned downtime entirely, reserving the full error budget for unplanned failures.

Consistency Requirements:

Different data types require different consistency guarantees. Treating everything as strongly consistent degrades performance, while treating everything as eventually consistent creates financial and correctness issues.

Key insight: The challenge is reconciling strong consistency requirements for financial data with the latency constraints. Without proper atomic enforcement, race conditions could cause severe over-budget scenarios (e.g., multiple servers simultaneously allocating from the same budget). This is addressed through distributed budget pacing with atomic counters, covered in Part 3.

Scale Analysis

Data Volume Estimation:

With 400M Daily Active Users (DAU), averaging 20 ad requests/user/day:
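Working through the arithmetic (the peak-to-average ratio is an assumption chosen to be consistent with the 1M-1.5M QPS peak target):

$$400\text{M} \times 20 = 8 \times 10^9 \text{ requests/day} \;\Rightarrow\; \frac{8 \times 10^9}{86{,}400 \text{ s}} \approx 93\text{K QPS average}$$

A diurnal peak-to-average ratio of roughly 10-16x then lands in the 1M-1.5M QPS peak range used throughout this post.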

Storage Requirements:

Cache Requirements:

To achieve acceptable response times, frequently accessed data needs to be cached. User access patterns follow a power law distribution where a small fraction of users generate the majority of traffic.

Estimated cache needs: ~800GB of hot data to serve most requests from memory.

Note: Detailed analysis of cache sizing, hit rate optimization, and distribution strategies is covered in Part 3.


System Architecture Overview

Before diving into detailed diagrams and flows, let’s establish the fundamental architectural principles and component structure that shapes this platform.

Service Architecture and Component Boundaries

Before diving into individual components, let’s establish the logical view of the system. The diagram below shows component boundaries and their relationships - this is a conceptual overview to build intuition. Detailed request flows, protocols, and integration patterns follow in subsequent sections.

    
    graph TB
        subgraph "Client Layer"
            CLIENT[Publishers & Users<br/>Mobile Apps, Websites]
        end
        subgraph "API Gateway Layer"
            GW[API Gateway<br/>Auth, Rate Limiting, Routing]
        end
        subgraph "Core Request Processing"
            ORCH[Ad Server Orchestrator<br/>Request Coordination & Auction]
        end
        subgraph "Profile & Security Services"
            PROFILE[User Profile Service<br/>Identity + Contextual Dual-Mode]
            INTEGRITY[Integrity Check Service<br/>Fraud Detection, Validation]
        end
        subgraph "Revenue Engine Services"
            FEATURE[Feature Store<br/>ML Features Cache]
            ML[ML Inference Service<br/>CTR Prediction, eCPM Scoring]
            RTB[RTB Gateway<br/>External DSP Coordination]
        end
        subgraph "Financial & Auction Services"
            AUCTION[Auction Service<br/>Unified eCPM Ranking]
            BUDGET[Budget Service<br/>Spend Control, Atomic Ops]
        end
        subgraph "Storage Layer"
            CACHE[(L1/L2 Cache<br/>Caffeine + Valkey)]
            DB[(Database<br/>Transactional Storage)]
            DATALAKE[(Data Lake<br/>Analytics & ML Training)]
        end
        CLIENT --> GW
        GW --> ORCH
        ORCH --> PROFILE
        ORCH --> INTEGRITY
        ORCH --> ML
        ORCH --> RTB
        ORCH --> AUCTION
        ORCH --> BUDGET
        PROFILE --> CACHE
        ML --> FEATURE
        FEATURE --> CACHE
        BUDGET --> CACHE
        PROFILE --> DB
        BUDGET --> DB
        AUCTION --> DB
        ML --> DATALAKE
        style ORCH fill:#e1f5ff
        style GW fill:#fff4e1
        style CACHE fill:#f0f0f0
        style DB fill:#f0f0f0
        style DATALAKE fill:#f0f0f0

Note: This diagram represents logical component boundaries, not physical deployment topology. In production, services are distributed across multiple regions with complex networking, service mesh, and data replication - those details are covered in Part 4 and Part 5.

Component Overview

The platform decomposes into focused, independently scalable services. Each service owns a specific domain with clear responsibilities:

Ad Server Orchestrator - The central coordinator that orchestrates the entire ad request lifecycle. Receives requests, coordinates parallel calls to all downstream services (User Profile, Integrity Check, ML Inference, RTB Gateway), manages timeouts, runs the unified auction, and returns the winning ad. Stateless and horizontally scaled to handle 1M+ QPS.

User Profile Service - Manages user targeting data through a dual-mode architecture designed for signal loss reality. When identity is available (stable user_id via login or device ID), enriches requests with demographics, interests, and behavioral history. When identity is unavailable (ATT opt-out, cookie-blocked browsers, new users), falls back to contextual-only mode using request-time signals: page URL/content, device type, geo-IP, time of day, and Topics API categories. Optimized for read-heavy workloads with aggressive caching (95%+ cache hit rate). Tolerates eventual consistency - profile updates can lag by seconds without business impact. The dual-mode design ensures 100% of requests receive targeting signals regardless of identity availability.

Integrity Check Service - Validates request authenticity, detects fraud patterns, enforces rate limits. First line of defense against bot traffic and malicious requests. Must be fast (5ms budget) to stay off critical path.

Feature Store - Serves pre-computed ML features for CTR prediction. Fed by batch and streaming pipelines that aggregate user engagement history, contextual signals, and temporal patterns. Caches features aggressively to meet 10ms latency budget.

ML Inference Service - Runs gradient boosted decision trees (GBDT) for click-through rate prediction. Converts advertiser bids (CPM/CPC/CPA) into comparable eCPM scores for fair auction ranking. CPU-based inference for cost efficiency at 1M QPS scale.

RTB Gateway - Broadcasts bid requests to 50+ external demand-side platforms (DSPs) via OpenRTB protocol. Handles connection pooling, timeout management, partial auction logic. Geographically distributed to minimize latency to DSP data centers.

Auction Service - Executes the unified auction that ranks all bids (internal ML-scored + external RTB) by eCPM. Applies quality scores, reserve prices, and selects the winner. Stateless computation - no data persistence.

Budget Service - Enforces advertiser campaign budgets through distributed atomic operations. Requires strong consistency - cannot tolerate budget overspend. Uses distributed cache with atomic compare-and-swap operations and pre-allocation pattern to achieve 3ms latency.
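A minimal sketch of that pre-allocation pattern, assuming a hypothetical AtomicBudgetStore abstraction over the distributed cache (names, chunk size, and error handling are illustrative, not the actual implementation): each server instance atomically reserves a small chunk of campaign budget and spends it locally, so most impressions avoid a remote round-trip while the shared counter still prevents overspend.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical abstraction over the distributed cache's atomic operations
    // (e.g. an atomic decrement-if-sufficient implemented as a server-side script).
    interface AtomicBudgetStore {
        /** Atomically reserves up to requestedMicros from the campaign's remaining
         *  budget and returns the amount actually granted (0 if exhausted). */
        long tryReserve(String campaignId, long requestedMicros);
    }

    /** Per-instance budget pacer: spends from a locally reserved chunk and only
     *  goes back to the shared store when the chunk runs out. */
    public class BudgetPreAllocator {
        private static final long CHUNK_MICROS = 5_000_000; // $5.00 chunks (assumption)

        private final AtomicBudgetStore store;
        private final Map<String, AtomicLong> localChunks = new ConcurrentHashMap<>();

        public BudgetPreAllocator(AtomicBudgetStore store) {
            this.store = store;
        }

        /** Returns true if costMicros could be charged against the campaign budget. */
        public boolean trySpend(String campaignId, long costMicros) {
            AtomicLong chunk = localChunks.computeIfAbsent(campaignId, id -> new AtomicLong(0));
            while (true) {
                long available = chunk.get();
                if (available >= costMicros) {
                    // Spend locally; CAS guards against concurrent spenders on this instance.
                    if (chunk.compareAndSet(available, available - costMicros)) {
                        return true;
                    }
                    continue; // lost the race, retry
                }
                // Local chunk exhausted: reserve another chunk from the shared store.
                long granted = store.tryReserve(campaignId, CHUNK_MICROS);
                if (granted == 0) {
                    return false; // campaign budget exhausted globally
                }
                chunk.addAndGet(granted);
            }
        }
    }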

Why these boundaries:

Service boundaries align with data access patterns, consistency requirements, and scaling characteristics:

Stateless Design Philosophy

All request-handling services (Ad Server, Auction, ML Inference, RTB Gateway) are stateless - they hold no session state between requests. This enables:

State lives in dedicated storage layers (multi-tier cache hierarchy and strongly-consistent databases) accessed by stateless services. This separation of compute and storage is fundamental to the architecture.

Service Independence and Failure Isolation

Services communicate synchronously (gRPC) but are designed to fail independently:

This failure isolation is critical at 1M QPS - any service failure must degrade gracefully rather than propagate.

Detailed implementation of RTB Gateway (OpenRTB protocol, DSP coordination, timeout handling) and ML Inference pipeline (Feature Store architecture, GBDT model serving, feature engineering) are covered in Part 2.

Data Architecture

State management drives many architectural decisions. The platform requires three distinct storage patterns, each with different consistency, latency, and access characteristics.

Storage Pattern Requirements

Pattern 1: Strongly Consistent Transactional Data

Pattern 2: High-Throughput Atomic Operations

Pattern 3: Read-Heavy Profile Data

Consistency Requirements by Data Type

Different data has different correctness requirements:

| Data Type | Consistency Need | Storage Pattern | Rationale |
|---|---|---|---|
| Advertiser budgets | Strong (≤1% variance) | Pattern 2 + Pattern 1 ledger | Financial accuracy non-negotiable |
| User profiles | Eventual (seconds lag OK) | Pattern 3 | Profile updates don't need instant visibility |
| Campaign configs | Strong (immediate visibility) | Pattern 1 | Advertiser changes must take effect immediately |
| ML features | Eventual (minutes lag OK) | Pattern 2 cache | Stale features have minimal impact on CTR prediction |
| Billing events | Strong (linearizable) | Pattern 1 with ordering guarantees | Financial audit trails require total ordering |

This tiered approach optimizes for both performance (eventual consistency where acceptable) and correctness (strong consistency where required).

Caching Strategy

To meet the 10ms latency budget for user profile and feature lookups at 1M+ QPS, aggressive caching is mandatory. A multi-tier cache hierarchy reduces database load by 95%: an in-process L1 cache (Caffeine) for sub-millisecond hits, a distributed L2 cache (Valkey) shared across instances, and the database as the L3 source of truth.
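A minimal read-through sketch of that hierarchy, assuming a Caffeine in-process L1 and a thin RemoteCache wrapper over the distributed L2 (the interface, TTLs, and sizes are illustrative assumptions):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import java.time.Duration;
    import java.util.Optional;
    import java.util.function.Function;

    // Hypothetical wrapper over the distributed cache (e.g. a Valkey client).
    interface RemoteCache {
        Optional<String> get(String key);
        void put(String key, String value, Duration ttl);
    }

    public class TieredProfileCache {
        private final Cache<String, String> l1 = Caffeine.newBuilder()
                .maximumSize(1_000_000)                       // bounded in-process cache
                .expireAfterWrite(Duration.ofSeconds(60))     // 60s max staleness
                .build();
        private final RemoteCache l2;
        private final Function<String, String> l3Loader;      // database fallback

        public TieredProfileCache(RemoteCache l2, Function<String, String> l3Loader) {
            this.l2 = l2;
            this.l3Loader = l3Loader;
        }

        public String getProfile(String userId) {
            // L1: in-process, sub-millisecond
            String cached = l1.getIfPresent(userId);
            if (cached != null) return cached;

            // L2: distributed cache, ~1-2ms
            Optional<String> remote = l2.get(userId);
            if (remote.isPresent()) {
                l1.put(userId, remote.get());
                return remote.get();
            }

            // L3: database; populate both cache tiers on the way back
            String fromDb = l3Loader.apply(userId);
            l2.put(userId, fromDb, Duration.ofMinutes(10));
            l1.put(userId, fromDb);
            return fromDb;
        }
    }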

Part 3 covers the complete data layer: specific technology selection for strongly-consistent transactional storage, distributed caching, and user profile storage, plus cache architecture implementation, hit rate optimization, invalidation strategies, and clustering patterns.

Communication Architecture

Services communicate synchronously using a binary RPC protocol for internal calls and REST for external integrations. This section explains why these choices align with latency requirements and operational constraints.

Internal Service Communication: Binary RPC

All internal service-to-service calls (Ad Server → User Profile, Ad Server → ML Service, etc.) use a binary RPC protocol over HTTP/2.

Why binary RPC:

At 1M QPS scale, JSON serialization would add 2-5ms per request - consuming 40-50% of the latency budget. Binary protocols keep serialization overhead under 1ms.

External Communication: REST/JSON

External integrations (RTB DSPs, client apps) use REST with JSON over HTTP/1.1 or HTTP/2.

Why REST for external:

Trade-off accepted: External REST calls (RTB) have higher serialization overhead, but they’re already consuming 100ms for network RTT - the 2-5ms JSON overhead is negligible compared to network latency.

Why Not Asynchronous Messaging?

The architecture is synchronous request/response rather than event-driven/async messaging.

Why synchronous:

Async messaging exists for non-critical-path workflows (billing events, analytics pipelines, ML feature computation), but the ad serving critical path is fully synchronous.

Service Discovery

Services discover each other via DNS-based service discovery within the container orchestration platform.

Part 5 (Final Architecture) covers complete technology selection and configuration: gRPC setup, container orchestration architecture, connection pooling strategies, and service mesh implementation.

Deployment Architecture

The platform deploys as a distributed system across multiple regions. This section establishes the deployment model and scaling principles - specific instance counts, cluster sizing, and resource allocation are covered in Part 5’s implementation blueprint.

Horizontal Scaling Model

All request-handling services are stateless and scale horizontally by adding instances. This architectural choice enables:

Elastic capacity management:

Fault tolerance:

Zero-downtime deployments:

Scaling characteristics by service type:

Why stateless matters: At 1M+ QPS, stateful services create operational nightmares - instance failures require state migration, deploys need session draining, and horizontal scaling requires data sharding. Stateless design eliminates these concerns by pushing state to dedicated storage layers (distributed cache, database) that are designed for consistency and durability.

Multi-Region Deployment

The platform deploys across multiple geographic regions to satisfy availability, latency, and data sovereignty requirements.

Why multi-region is mandatory:

Regional deployment model:

Data layer considerations:

Failover behavior: When a region fails health checks, GeoDNS redirects traffic to next-nearest healthy region within 2-5 minutes. The surviving regions absorb the additional load without user-visible degradation due to over-provisioned capacity.

Operational details of multi-region failover (GeoDNS health checks, split-brain prevention, regional budget pacing, RTO/RPO targets) are covered in Part 4. Specific regional sizing, instance counts, and cluster configurations are detailed in Part 5.

Financial Integrity: Immutable Audit Log

Compliance Requirement:

The operational ledger (CockroachDB) is mutable by design - rows can be updated for budget corrections, deleted during cleanup, or modified by database administrators. This violates SOX (Sarbanes-Oxley) and tax compliance requirements for non-repudiable financial records. Regulators and auditors require immutable, cryptographically verifiable transaction history that cannot be tampered with after the fact.

Architectural Solution:

Implement dual-ledger architecture separating concerns:

    • Operational ledger (CockroachDB) - mutable, serves live budget checks, campaign state, and corrections
    • Immutable audit log (ClickHouse) - append-only record of every financial event, retained for compliance and dispute resolution

Every financial operation publishes an event to Kafka financial-events topic, which ClickHouse consumes into append-only MergeTree tables. ClickHouse retains records for 7 years (tax compliance requirement) with hash-based integrity verification preventing undetected tampering. Daily reconciliation job compares both systems to detect discrepancies.
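A minimal sketch of the hash-based integrity idea: each audit record carries the hash of its predecessor, so any after-the-fact modification breaks the chain during reconciliation. The record fields and chaining scheme here are illustrative assumptions, not the production schema.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    /** Append-only audit record whose hash chains to the previous record. */
    public record AuditRecord(String campaignId, long spendMicros, long timestampMs,
                              String prevHash, String hash) {

        static AuditRecord append(AuditRecord prev, String campaignId,
                                  long spendMicros, long timestampMs) {
            String prevHash = (prev == null) ? "GENESIS" : prev.hash();
            String payload = campaignId + "|" + spendMicros + "|" + timestampMs + "|" + prevHash;
            return new AuditRecord(campaignId, spendMicros, timestampMs, prevHash, sha256(payload));
        }

        /** Reconciliation check: recompute the hash and compare. */
        boolean verify() {
            String payload = campaignId + "|" + spendMicros + "|" + timestampMs + "|" + prevHash;
            return hash.equals(sha256(payload));
        }

        private static String sha256(String input) {
            try {
                MessageDigest digest = MessageDigest.getInstance("SHA-256");
                return HexFormat.of().formatHex(digest.digest(input.getBytes(StandardCharsets.UTF_8)));
            } catch (Exception e) {
                throw new IllegalStateException("SHA-256 unavailable", e);
            }
        }
    }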

Trade-off: Additional infrastructure complexity (Kafka cluster + ClickHouse deployment) and operational overhead (reconciliation monitoring) for regulatory compliance and audit confidence. Cost increase approximately 15-20% of database infrastructure budget, but eliminates compliance risk and enables advertiser dispute resolution with verifiable records.

Detailed architecture covered in Part 3’s Immutable Audit Log section, implementation details in Part 5.

Load Balancing and Traffic Distribution

Traffic flows through multiple load balancing layers, each serving a distinct purpose:

1. GeoDNS (Global Traffic Distribution)

2. Regional Load Balancer (Availability Zone Distribution)

3. Service Mesh (Service Instance Distribution)

4. Client-Side Load Balancing (RPC-Level Distribution)

Why multi-tier load balancing: Each layer optimizes for different failure domains and timescales. GeoDNS handles region failures (minutes), regional LB handles zone failures (seconds), service mesh handles instance failures (sub-second), and client-side LB handles request-level distribution (milliseconds).

This layered approach ensures traffic always reaches healthy capacity at every level of the infrastructure stack.

High-Level Architecture

System Components and Request Flow

    
    graph TB
        subgraph "Client Layer"
            CLIENT[Mobile/Web Client<br/>iOS, Android, Browser]
        end
        subgraph "Edge Layer"
            CDN[Content Delivery Network<br/>Global PoPs<br/>Static assets]
            GLB[Global Load Balancer<br/>GeoDNS + Health Checks]
        end
        subgraph "Regional Service Layer - Primary Region"
            GW[API Gateway<br/>Rate Limiting: 1M QPS<br/>Auth: JWT/OAuth<br/>Service Mesh Integration]
            AS[Ad Server Orchestrator<br/>Stateless, Horizontally Scaled<br/>150ms latency budget]
            subgraph "Core Services"
                UP[User Profile Service<br/>Identity + Contextual<br/>Target: 10ms]
                INTEGRITY[Integrity Check Service<br/>Lightweight Fraud Filter<br/>Target: <5ms]
                AD_SEL[Ad Selection Service<br/>Candidate Retrieval<br/>Target: 15ms]
                ML[ML Inference Service<br/>CTR Prediction<br/>Target: 40ms]
                RTB[RTB Auction Service<br/>OpenRTB Protocol<br/>Target: 100ms]
                BUDGET[Atomic Pacing Service<br/>Pre-Allocation<br/>Strong Consistency]
                AUCTION[Auction Logic<br/>Combine Internal + RTB<br/>First-Price Auction]
            end
            subgraph "Data Layer"
                DISTRIBUTED_CACHE[(Distributed Cache<br/>Atomic Operations<br/>Budget Enforcement)]
                TRANSACTIONAL_DB[(Strongly Consistent DB<br/>Billing Ledger + User Profiles<br/>Logical Timestamps<br/>Multi-Region ACID)]
                FEATURE_STORE[(Feature Store<br/>ML Features<br/>Sub-10ms p99)]
            end
        end
        subgraph "Data Processing Pipeline - Background"
            EVENT_STREAM[Event Streaming<br/>100K events/sec]
            STREAM_PROC[Stream Processing<br/>Real-time Aggregation]
            BATCH_PROC[Batch Processing<br/>Feature Engineering]
            DATA_LAKE[(Object Storage<br/>Data Lake + Cold Archive<br/>500TB+ daily + 7-year retention)]
        end
        subgraph "ML Training Pipeline - Offline"
            WORKFLOW[Workflow Orchestration]
            TRAIN[Training Cluster<br/>Daily CTR Model<br/>Retraining]
            REGISTRY[Model Registry<br/>Versioning<br/>A/B Testing]
        end
        subgraph "Observability"
            METRICS[Metrics Collection<br/>Time-series DB]
            TRACING[Distributed Tracing<br/>Span Collection]
            DASHBOARDS[Visualization<br/>Dashboards & Alerts]
        end
        CLIENT --> CDN
        CLIENT --> GLB
        GLB --> GW
        GW --> AS
        AS -->|Fetch User| UP
        AS -->|Check Fraud| INTEGRITY
        AS -->|Get Ads| AD_SEL
        AS -->|RTB Parallel| RTB
        AS -->|Score Ads| ML
        AS -->|Run Auction| AUCTION
        AS -->|Check Budget| BUDGET
        UP -->|Read| DISTRIBUTED_CACHE
        UP -->|Read| TRANSACTIONAL_DB
        INTEGRITY -->|Read Bloom Filter| DISTRIBUTED_CACHE
        INTEGRITY -->|Read Reputation| DISTRIBUTED_CACHE
        AD_SEL -->|Read| DISTRIBUTED_CACHE
        AD_SEL -->|Read| TRANSACTIONAL_DB
        ML -->|Read Features| FEATURE_STORE
        RTB -->|OpenRTB 2.x| EXTERNAL[50+ DSP Partners]
        BUDGET -->|Atomic Ops| DISTRIBUTED_CACHE
        BUDGET -->|Audit Trail| TRANSACTIONAL_DB
        AS -.->|Async Events| EVENT_STREAM
        EVENT_STREAM --> STREAM_PROC
        STREAM_PROC --> DISTRIBUTED_CACHE
        STREAM_PROC --> DATA_LAKE
        BATCH_PROC --> DATA_LAKE
        BATCH_PROC --> FEATURE_STORE
        TRANSACTIONAL_DB -.->|Nightly Archive<br/>90-day-old records| DATA_LAKE
        WORKFLOW --> TRAIN
        TRAIN --> REGISTRY
        REGISTRY --> ML
        AS -.-> METRICS
        AS -.-> TRACING
        classDef client fill:#e1f5ff,stroke:#0066cc
        classDef edge fill:#fff4e1,stroke:#ff9900
        classDef service fill:#e8f5e9,stroke:#4caf50
        classDef data fill:#f3e5f5,stroke:#9c27b0
        classDef stream fill:#ffe0b2,stroke:#e65100
        class CLIENT client
        class CDN,GLB edge
        class GW,AS,UP,AD_SEL,ML,RTB,BUDGET,AUCTION service
        class DISTRIBUTED_CACHE,TRANSACTIONAL_DB,FEATURE_STORE,DATA_LAKE data
        class EVENT_STREAM,STREAM_PROC,BATCH_PROC stream

Request Flow Sequence:

The diagram above shows both the critical request path (solid lines) and background processing (dotted lines). Here’s what happens during a single ad request:

1. Request Ingress (15ms total)

2. Identity & Fraud Verification (15ms sequential)

3. Parallel Path Split (ML + RTB run simultaneously after fraud check)

Path A: Internal ML Path (65ms after split)

Path B: External RTB Auction (100ms after split - CRITICAL PATH)

4. Unified Auction and Response (13ms avg, 15ms p99)

Total: 143ms avg (145ms p99) (15ms ingress + 10ms User Profile + 5ms Integrity Check + 100ms RTB + 13ms auction/budget/response, with ML path completing in parallel at 65ms after split)

Background Processing (Asynchronous):

Data Dependencies:

Latency Budget Decomposition

For a 150ms total latency budget, we decompose the request path:

$$T_{total} = T_{network} + T_{gateway} + T_{services} + T_{serialization}$$

Network Overhead (Target: 10ms)

API Gateway (Target: 5ms)

Technology Selection: API Gateway

API Gateway Requirements:

Architectural Driver: Latency - The API gateway must operate within a 5ms latency budget while providing authentication, rate limiting, and traffic routing at 1M+ QPS scale.

Key requirements:

Latency budget breakdown:

Specific technology selection (gateway products, configuration, and deployment patterns) is covered in Part 5.

Service-Level SLA Summary

Consolidated latency targets driving technology selection, deployment architecture, and monitoring:

| Service | Target Latency | Percentile | Critical Path | Notes |
|---|---|---|---|---|
| Overall Orchestrator | 150ms | P99 | Yes | End-to-end SLO (143ms avg, 145ms p99 actual) |
| Network Overhead | 10ms | Average | Yes | Client→Edge (5ms) + Edge→Service (5ms) |
| API Gateway | 5ms | Average | Yes | Auth (2ms) + Rate Limit (1ms) + Routing (2ms) |
| User Profile Service | 10ms | Target | Yes | Identity + contextual data retrieval |
| Integrity Check | <5ms | Target | Yes | Fraud prevention (first defense layer) |
| Ad Selection Service | 15ms | Target | Parallel | Candidate retrieval from storage |
| Feature Store | 10ms | P99 | Parallel | ML feature lookup (degrades at >15ms) |
| ML Inference Service | 40ms | Budget | Parallel | CTR prediction for auction ranking (~20ms actual GBDT inference, 40ms budget includes overhead) |
| RTB Auction Service | 50-70ms | Operational | Yes | External DSP coordination (100ms p95, 120ms p99 hard) |
| Auction Logic | 3ms | Average | Yes | eCPM ranking + winner selection |
| Budget Check | 3ms (5ms p99) | Average | Yes | Atomic spend control with strong consistency |
| Response Serialization | 5ms | Average | Yes | Ad response formatting |

Critical path: Network (10ms) → Gateway (5ms) → User Profile (10ms) + Integrity (5ms) → RTB dominates at 100ms (ML completes at 65ms in parallel) → Auction (3ms) + Budget (3ms) + coordination overhead (2ms) + Serialization (5ms) = 143ms average, 145ms p99

Rate Limiting: Volume-Based Traffic Control

Rate limiting protects infrastructure from overload while ensuring fair resource allocation across clients. This section covers the architectural pattern for distributed rate limiting at 1M+ QPS scale.

Why Rate Limiting:

  1. Infrastructure protection: Prevents single client from overwhelming 1.5M QPS platform capacity
  2. Cost control: Limits outbound calls to external DSPs (50+ partners × 1M QPS = massive API costs without controls)
  3. Fair allocation: Ensures large advertisers don’t starve smaller ones
  4. SLA enforcement: API contracts specify tiered rate limits per advertiser

Rate Limiting vs Fraud Detection:

These are complementary mechanisms:

Pattern-based fraud detection (device fingerprinting, behavioral analysis, bot detection) is covered in Part 4.

Multi-Tier Architecture:

| Tier | Scope | Limit | Purpose |
|---|---|---|---|
| Global | Entire platform | 1.5M QPS | Protect total capacity |
| Per-IP | Client IP | 10K QPS | Prevent single-source abuse |
| Per-Advertiser | API key | 1K-100K QPS (tiered) | SLA enforcement + fairness |
| DSP outbound | External calls | 50K QPS total | Control API costs |

Distributed Rate Limiting Pattern:

The core architectural challenge: enforcing global rate limits across 100+ distributed gateway nodes without centralizing every request.

Approach: Token bucket algorithm with distributed cache-backed state
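A minimal sketch of the token bucket piece, assuming the common approximation of splitting the global limit across gateway nodes rather than hitting the shared cache on every request (the numbers and the sharing scheme are illustrative):

    /**
     * Per-gateway-node token bucket. To approximate a global limit without a remote
     * call per request, each of N gateway nodes enforces (globalLimit / N) locally;
     * a background job can rebalance shares as nodes join or leave. The numbers and
     * the sharing scheme are illustrative assumptions, not the production design.
     */
    public class TokenBucket {
        private final double capacity;       // maximum burst, in tokens
        private final double refillPerSecond;
        private double tokens;
        private long lastRefillNanos;

        public TokenBucket(double capacity, double refillPerSecond) {
            this.capacity = capacity;
            this.refillPerSecond = refillPerSecond;
            this.tokens = capacity;
            this.lastRefillNanos = System.nanoTime();
        }

        /** Returns true if the request is admitted, false if it should be rejected (HTTP 429). */
        public synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
            tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

    // Example: a Gold-tier publisher (10K QPS) across 100 gateway nodes -> 100 QPS per node.
    // TokenBucket perNode = new TokenBucket(200, 100);  // 2s of burst headroom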

Key trade-off:

Latency Budget:

Complete Request Latency:

Critical Path and Dual-Source Architecture

The platform serves ads from two independent inventory sources that compete in a unified auction:

    • Source 1 - Internal inventory: direct deals and guaranteed campaigns stored in our own ad database, scored by the ML pipeline
    • Source 2 - External inventory: real-time bids from 50+ DSPs collected via OpenRTB

Both sources compete in final auction. Highest eCPM wins (internal or external). This dual-source model enables parallel execution: ML scores internal inventory while RTB collects external bids simultaneously.

Architectural Driver: Revenue Optimization - Unified auction maximizes revenue per impression by ensuring best ad wins regardless of source. Industry standard: Google Ad Manager, Amazon Publisher Services, Prebid.js.

Why parallel execution works: ML and RTB operate on independent ad inventories. ML doesn’t need RTB results (scoring internal ads from our database). RTB doesn’t need ML results (DSPs bid independently). Only synchronize at final auction when both paths complete.

For detailed business model, revenue optimization, and economic rationale, see the “Ad Inventory Model and Monetization Strategy” section in the RTB integration post of this series.

Request Flow and Timing

The critical path is determined by RTB Auction (100ms), which dominates the latency budget. Internal ML processing runs in parallel and completes faster at 65ms:

    
    graph TB
        A[Request Arrives] -->|5ms| B[Gateway Auth]
        B --> C[User Profile<br/>10ms<br/>Cache hierarchy]
        C --> IC[Integrity Check<br/>5ms CRITICAL<br/>Lightweight fraud filter<br/>Bloom filter + basic rules<br/>BLOCKS fraudulent requests]
        IC -->|PASS| FS[Feature Store Lookup<br/>10ms<br/>Behavioral features]
        IC -->|PASS| F[RTB Auction<br/>100ms CRITICAL PATH<br/>OpenRTB to 50+ external DSPs<br/>Source 2: External inventory]
        IC -->|BLOCK| REJECT[Reject Request<br/>Return house ad or error<br/>No RTB call made]
        FS --> D[Ad Selection<br/>15ms<br/>Query internal ad DB<br/>Direct deals + guaranteed<br/>Source 1: Internal inventory]
        D --> E[ML Inference<br/>40ms<br/>CTR prediction on internal ads<br/>Output: eCPM-scored ads]
        E --> G[Synchronization<br/>Wait for both sources<br/>Internal: ready at 85ms<br/>External RTB: at 120ms]
        F --> G
        G -->|5ms| H[Unified Auction<br/>Combine Source 1 + Source 2<br/>Select highest eCPM<br/>Winner: internal OR external]
        H -->|5ms| I[Response]
        style F fill:#ffcccc
        style IC fill:#ffdddd
        style C fill:#ffe6e6
        style FS fill:#e6f3ff
        style G fill:#fff4cc
        style H fill:#e6ffe6
        style REJECT fill:#ff9999

Critical Path (from diagram): Gateway (5ms) → User Profile (10ms) → Integrity Check (5ms) → RTB Auction (100ms) → Sync → Final Auction (8ms avg, 10ms p99) → Response (5ms) = 133ms avg service layer (135ms p99)

Parallel path (Internal ML): Gateway (5ms) → User Profile (10ms) → Integrity Check (5ms) → Feature Store (10ms) → Ad Selection (15ms) → ML Inference (40ms) → Sync (waiting) = 85ms

Note: Diagram shows service layer only. Add 10ms network overhead at the start for 143ms avg total request latency (145ms p99) with 5ms buffer to 150ms SLO.

Critical Design Decision: Integrity Check Placement - The 5ms Integrity Check Service runs BEFORE the RTB fan-out to 50+ DSPs. This prevents wasting bandwidth and DSP processing time on fraudulent traffic. Cost impact: blocking 20-30% bot traffic before RTB eliminates massive egress bandwidth costs (RTB requests to external DSPs incur data transfer charges). At scale (1M QPS, 50+ DSPs, 2-4KB payloads), early fraud filtering saves thousands of times more in annual bandwidth costs than the 5ms latency investment costs in lost impressions.
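A rough sense of the savings, using assumed figures (25% bot traffic, 3KB average bid request, all 50 DSPs called per request):

$$10^6 \text{ QPS} \times 0.25 \times 50 \times 3\text{KB} \approx 37.5 \text{ GB/s} \approx 3.2 \text{ PB/day of avoided egress}$$

Even at modest egress pricing, that volume dwarfs the revenue forgone by spending 5ms on the fraud check.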

Component explanations (referencing dual-source architecture above):

Parallel Execution and Unified Auction

Why parallel execution works: ML and RTB operate on completely independent ad inventories with no data dependency. ML scores internal inventory (direct deals in our database), while RTB collects bids from external DSPs (advertiser networks). They only merge at the final auction.

Synchronization Point timing:

  1. ML path completes at t=85ms: Internal ads scored and cached
  2. ML thread waits idle from t=85ms to t=120ms (35ms idle time)
  3. RTB path completes at t=120ms: External DSP bids arrive
  4. Both results available → proceed to Final Auction at t=120ms

Unified auction logic (8ms avg, 10ms p99: 3ms auction + 3ms avg budget check [5ms p99] + 2ms coordination overhead) proceeds in three steps (a minimal code sketch follows the list):

  1. Calculate eCPM for internal ads:

    • eCPM = predicted_CTR × CPC_bid × 1000
    • Example: 0.05 CTR × $3.00 CPC bid × 1000 = $150 eCPM
  2. Use eCPM from RTB bids:

    • DSP bids are already in eCPM format
    • No conversion needed
  3. Select winner:

    • Choose candidate with highest eCPM across all sources
    • Winner can be internal ad OR external RTB bid
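A minimal sketch of that ranking step (the CPC-to-eCPM conversion mirrors the formula above; the record types are illustrative, and quality scores and reserve prices are omitted for brevity):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    record InternalCandidate(String adId, double predictedCtr, double cpcBid) {}
    record RtbBid(String dspId, String adId, double ecpm) {}
    record AuctionWinner(String source, String adId, double ecpm) {}

    public class UnifiedAuction {
        /** Ranks internal ML-scored candidates and external RTB bids on a common eCPM scale. */
        public static Optional<AuctionWinner> run(List<InternalCandidate> internal, List<RtbBid> rtb) {
            List<AuctionWinner> candidates = new ArrayList<>();
            for (InternalCandidate c : internal) {
                // eCPM = predicted CTR x CPC bid x 1000 (revenue per 1,000 impressions)
                double ecpm = c.predictedCtr() * c.cpcBid() * 1000;
                candidates.add(new AuctionWinner("internal", c.adId(), ecpm));
            }
            for (RtbBid b : rtb) {
                // DSP bids already arrive as eCPM; no conversion needed
                candidates.add(new AuctionWinner("rtb:" + b.dspId(), b.adId(), b.ecpm()));
            }
            return candidates.stream().max(Comparator.comparingDouble(AuctionWinner::ecpm));
        }
    }

    // Example: 0.05 CTR x $3 CPC x 1000 = $150 eCPM internal vs. a $140 RTB bid -> internal wins.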

Example outcome:

Publisher earns highest bid for this impression. If an internal ad scored eCPM of 190 (highly personalized match), it would beat RTB - ensuring maximum revenue regardless of source.

Latency comparison:

Why we can’t start auction earlier: We need BOTH ML-scored ads AND RTB bids for complete auction. Starting before RTB completes excludes external bidders, losing potential revenue.

Resilience: Graceful Degradation and Circuit Breaking

The critical path analysis above assumes all services operate within their latency budgets. But what happens when they don't? The 150ms SLO leaves only a 5ms buffer at p99 - if any critical service exceeds its budget, the entire request fails.

Architectural Driver: Availability - Serving a less-optimal ad quickly beats serving no ad at all. When services breach latency budgets, degrade gracefully through fallback layers rather than timing out.

Example scenario: ML inference allocated 40ms, but CPU load spikes push p99 latency to 80ms. Options:

The answer: graceful degradation. Better to serve a less-optimal ad quickly than perfect ad slowly (or no ad at all).

Degradation Hierarchy: Per-Service Fallback Layers

Each critical-path service has a latency budget and a degradation ladder defining fallback behavior when budgets are exceeded. The table below shows all degradation levels across the three most critical services:

| Level | ML Inference (40ms budget) | User Profile (10ms budget) | RTB Auction (100ms budget) |
|---|---|---|---|
| Level 0 (Normal) | GBDT on CPU; latency 20ms; revenue 100%; trigger: p99 < 40ms | Transactional DB + distributed cache; latency 8ms; accuracy 100%; trigger: p99 < 10ms | Query all 50 DSPs; latency 85ms; revenue 100%; trigger: p95 < 100ms |
| Level 1 (Light Degradation) | Cached CTR predictions; latency 5ms; revenue 92% (-8%); trigger: p99 > 40ms for 60s | Stale cache (extended TTL); latency 2ms; accuracy 95% (-5%); trigger: p99 > 10ms for 60s | Top 30 highest-value DSPs only; latency 80ms; revenue 95% (-5%); trigger: p95 > 100ms for 60s |
| Level 2 (Moderate Degradation) | Heuristic model (rule-based CTR); latency 2ms; revenue 85% (-15%); trigger: cache miss > 30% | Segment defaults (demographic averages); latency 1ms; accuracy 70% (-30%); trigger: DB unavailable | Top 10 ultra-high-value DSPs only; latency 75ms; revenue 88% (-12%); trigger: p95 > 110ms for 60s |
| Level 3 (Severe Degradation) | Global average (category avg CTR); latency 1ms; revenue 75% (-25%); trigger: still breaching SLA | N/A | Skip RTB entirely (direct inventory only); latency 0ms; revenue 65% (-35%); trigger: all DSPs timeout |

Key observations:

Mathematical Model of Degradation Impact:

Total revenue under degradation:

$$R_{degraded} = R_{baseline} \times (1 - \alpha) \times (1 + \beta \times \Delta L)$$

where \(\alpha\) is the fractional revenue loss from degraded targeting quality, \(\beta\) is the revenue sensitivity to latency (relative revenue gained per millisecond saved), and \(\Delta L\) is the latency improvement delivered by the fallback.

Example: Level 1 degradation (cached predictions):

But compare to the alternative:

Circuit Breakers: Automated Degradation Triggers

Degradation shouldn’t require manual intervention. Implement circuit breakers that automatically detect when services exceed latency budgets and switch to fallback layers.

Circuit breaker pattern: Monitor service latency continuously. When a service consistently breaches its budget, “trip” the circuit and route traffic to the next degradation level until the service recovers.

Three-state circuit breaker:

Goal: Automatically detect service degradation and route around it, then carefully test recovery before fully restoring traffic.

CLOSED (normal operation):

OPEN (degraded mode):

HALF-OPEN (testing recovery):

Configuration approach:

Per-service circuit breaker thresholds:

| Service | Budget | Trip Threshold | Fallback | Revenue Impact |
|---|---|---|---|---|
| ML Inference | 40ms | p99 > 45ms for 60s | Cached CTR predictions | -8% |
| User Profile | 10ms | p99 > 15ms for 60s | Stale cache (5min TTL) | -5% |
| RTB Auction | 100ms | p95 > 105ms for 60s | Top 20 DSPs only (p99 protected by 120ms absolute cutoff*) | -6% |
| Ad Selection | 15ms | p99 > 20ms for 60s | Skip personalization, use category matching | -12% |

*RTB p99 protection: The 120ms absolute cutoff forces immediate fallback to internal inventory or House Ad when RTB exceeds the hard timeout, preventing P99 tail requests (10,000 req/sec at 1M QPS) from timing out at the mobile client. See P99 Tail Latency Defense for complete strategy.
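A minimal sketch tying the three states and the per-service thresholds together (thresholds, windows, and the single-sample p99 input are illustrative; a production breaker would consume rolling histogram percentiles):

    import java.time.Duration;

    /** Three-state latency circuit breaker: CLOSED -> OPEN when the service breaches its
     *  budget for a sustained window, HALF_OPEN after a cool-down to probe recovery. */
    public class LatencyCircuitBreaker {
        enum State { CLOSED, OPEN, HALF_OPEN }

        private final double tripThresholdMs;     // e.g. 45ms for the 40ms ML budget
        private final double recoverThresholdMs;  // e.g. 35ms (hysteresis below budget)
        private final Duration tripWindow = Duration.ofSeconds(60);
        private final Duration recoverWindow = Duration.ofSeconds(300);

        private State state = State.CLOSED;
        private long breachStartNanos = -1;
        private long healthyStartNanos = -1;

        public LatencyCircuitBreaker(double tripThresholdMs, double recoverThresholdMs) {
            this.tripThresholdMs = tripThresholdMs;
            this.recoverThresholdMs = recoverThresholdMs;
        }

        /** Feed the current rolling p99; returns true if callers should use the fallback path. */
        public synchronized boolean recordP99AndCheckFallback(double p99Ms) {
            long now = System.nanoTime();
            switch (state) {
                case CLOSED -> {
                    if (p99Ms > tripThresholdMs) {
                        if (breachStartNanos < 0) breachStartNanos = now;
                        if (now - breachStartNanos >= tripWindow.toNanos()) state = State.OPEN;
                    } else {
                        breachStartNanos = -1;
                    }
                }
                case OPEN -> {
                    // After sustained healthy readings, start probing with live traffic.
                    if (p99Ms < recoverThresholdMs) {
                        if (healthyStartNanos < 0) healthyStartNanos = now;
                        if (now - healthyStartNanos >= recoverWindow.toNanos()) state = State.HALF_OPEN;
                    } else {
                        healthyStartNanos = -1;
                    }
                }
                case HALF_OPEN -> {
                    // Probe succeeded -> close; probe failed -> back to OPEN.
                    state = (p99Ms < recoverThresholdMs) ? State.CLOSED : State.OPEN;
                    breachStartNanos = -1;
                    healthyStartNanos = -1;
                }
            }
            return state != State.CLOSED;
        }
    }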

Composite Degradation Impact:

If all services degrade simultaneously (worst case, e.g., during regional failover):

$$R_{total} = R_{baseline} \times (1 - 0.08) \times (1 - 0.05) \times (1 - 0.06) \times (1 - 0.12)$$ $$R_{total} \approx 0.92 \times 0.95 \times 0.94 \times 0.88 \approx 0.723 \, R_{baseline}$$

Result: ~28% revenue loss under full degradation, but the system stays online. Compare to the outage scenario: 100% revenue loss.

Recovery Strategy:

Hysteresis prevents flapping:

$$ \begin{aligned} \text{Degrade if: } & L_{p99} > L_{budget} + 5ms \text{ for } 60s \\ \text{Recover if: } & L_{p99} < L_{budget} - 5ms \text{ for } 300s \end{aligned} $$

The hysteresis band around the budget and the asymmetric durations (60s to degrade vs 300s to recover) prevent oscillation between states. Example: a CPU latency spike trips the circuit at t=60s and switches to cached predictions; after 5 minutes of healthy p99 < 35ms, the circuit closes and normal GBDT inference resumes.

Monitoring Degradation State:

Track composite degradation score: \(Score = \sum_{i \in \text{services}} w_i \times \text{Level}_i\) where \(w_i\) reflects revenue impact (ML=0.4, RTB=0.3, Profile=0.2, AdSelection=0.1). Alert on: any service at Level 2+ for >10min (P2), composite score >4 (P1 - cascading failure risk), revenue <85% forecast (P1), circuit flapping >3 transitions/5min.

Testing Degradation Strategy:

Validate via chaos engineering: (1) Inject 50ms latency to 10% ML requests, verify circuit trips and -8% revenue impact matches prediction; (2) Terminate 50% ML inference pods, confirm graceful degradation within 60s; (3) Quarterly regional failover drills validating <30% revenue loss and measuring recovery time.

Trade-off Articulation:

Why degrade rather than scale?

You might ask: “Why not just auto-scale ML inference pods when latency spikes?”

Problem: Provisioning new CPU pods takes 15-30 seconds with modern tooling (pre-warmed container images, model pre-loading) - instance boot + model loading into memory + JVM warmup. During traffic spikes, you’ll still breach SLAs for 15-30 seconds before new capacity comes online.

Note: Without optimization (cold container pulls, full model loading from object storage, cold JVM), cold start can take 60-90 seconds. The 15-30s baseline assumes modern best practices: pre-warmed images, model streaming, and container image caching.

Cost-benefit comparison:

| Strategy | Latency Impact | Revenue Impact | Cost |
|---|---|---|---|
| Wait for CPU (no degradation) | 150ms total → timeout | -100% on timed-out requests | None |
| Scale CPU instances | 30s of 80ms latency → partial timeouts | -15% during scale-up window | +20-30% CPU baseline for burst capacity |
| Degrade to cached predictions | 5ms immediate | -8% targeting accuracy | None |

Decision: Degradation costs less (-8% vs -15%) and reacts faster (immediate vs 30s).

But we still auto-scale! Degradation buys time for auto-scaling to provision capacity. Once new CPU pods are healthy (30s later), circuit closes and we return to normal operation.

Degradation is a bridge, not a destination.

P99 Tail Latency Defense: The Unacceptable Tail

At 1 million QPS, the P99 tail represents 10,000 requests per second - a volume too large to ignore. Without P99 protection, these requests risk timeout, resulting in blank ads and complete revenue loss on the tail.

Architectural Driver: Revenue Protection - The P99 tail is dominated by garbage collection pauses and the slowest RTB bidder. Protecting these 10,000 req/sec requires infrastructure choices (low-pause GC) and operational discipline (hard timeouts with forced failure).

Two Primary P99 Contributors:

  1. Garbage Collection Pauses: Traditional garbage collectors can produce 10-50ms stop-the-world pauses, consuming 7-33% of the 150ms latency budget
  2. Slowest RTB Bidder: With 25-30 DSPs per auction, a single slow bidder (110-120ms) can push total latency over the SLO

Defense Strategy 1: Low-Pause GC Technology

Requirement: Sub-2ms GC pause times at P99.9

At 1M QPS, with each instance handling a large share of that traffic, managed runtime garbage collection becomes a critical latency contributor. Traditional stop-the-world collectors can pause application threads for 10-50ms, directly violating latency budgets.

Why it matters: Without low-pause GC, traditional collectors can add 41-55ms to P99.9 latency, violating the 150ms SLO and causing mobile client timeouts.

Technology options:

Part 5 (Final Architecture) covers complete GC technology selection: specific collectors (low-pause concurrent GC, incremental GC), runtime comparisons (JVM vs Go vs Rust), configuration details, and performance validation.

Defense Strategy 2: RTB 120ms Absolute Cutoff

Hard timeout at 120ms forces the Ad Server to cancel all pending RTB requests and fail over to fallback inventory:

Why 120ms? This ensures total latency stays within 153ms even at P99 (Gateway 5ms + User Profile 10ms + Integrity Check 5ms + RTB 120ms + Auction 8ms + Response 5ms = 153ms). A 3ms SLO violation is acceptable; a mobile timeout (>200ms) is not.
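A minimal sketch of that hard-cutoff behavior using CompletableFuture, with a hypothetical async DSP client interface: fire all bid requests, wait at most 120ms, keep whatever completed, and cancel the rest.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    record Bid(String dspId, double ecpm) {}

    // Hypothetical async DSP client.
    interface DspClient {
        CompletableFuture<Bid> requestBid(String dspId, byte[] openRtbRequest);
    }

    public class RtbFanOut {
        private final DspClient client;

        public RtbFanOut(DspClient client) {
            this.client = client;
        }

        /** Returns whatever bids arrived within the 120ms absolute cutoff; cancels the rest. */
        public List<Bid> collectBids(List<String> dspIds, byte[] request) {
            List<CompletableFuture<Bid>> pending = new ArrayList<>();
            for (String dspId : dspIds) {
                pending.add(client.requestBid(dspId, request));
            }
            try {
                // Wait for all bidders, but never longer than the hard cutoff.
                CompletableFuture.allOf(pending.toArray(CompletableFuture[]::new))
                        .get(120, TimeUnit.MILLISECONDS);
            } catch (TimeoutException slowTail) {
                // Expected on the P99 tail: proceed with partial results.
            } catch (Exception ignored) {
                // Individual bidder failures are handled below.
            }
            List<Bid> bids = new ArrayList<>();
            for (CompletableFuture<Bid> f : pending) {
                if (f.isDone() && !f.isCompletedExceptionally()) {
                    bids.add(f.join());
                } else {
                    f.cancel(true); // stop waiting on slow or failed DSPs
                }
            }
            return bids;
        }
    }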

Trade-off Analysis:

Better to serve a guaranteed ad at 120ms than wait for a perfect RTB bid that might never arrive. The P99 tail (1% of traffic) sacrifices 40-60% of optimal revenue to prevent 100% loss from timeouts and the compounding UX damage of blank ads (which reduces CTR across ALL traffic by 0.5-1%).

Part 4 covers implementation details: request cancellation patterns, fallback logic, monitoring strategies, and chaos testing for P99 defense.

Defense Strategy 3: Hedge Requests for Read Paths

While ZGC eliminates GC pauses and hard timeouts handle slow RTB bidders, neither addresses application logic stalls or network jitter on internal read paths. A single slow User Profile or Feature Store lookup can push P99 over budget despite all other optimizations.

The pattern: Hedge requests, introduced by Dean and Barroso in “The Tail at Scale” (2013), send the same read request to two replicas, taking the first response and discarding the second. Google demonstrated this reduces 99.9th percentile latency from 1,800ms to 74ms with only 2% additional load.

Where to apply hedge requests:

Where NOT to apply:

CRITICAL: Never hedge write operations or non-idempotent methods

Hedging executes requests multiple times on the server. gRPC documentation explicitly states: “Hedged RPCs may execute more than once on a server so only idempotent methods should be hedged.”

Implementation safety: Use explicit service allowlist in gRPC configuration to prevent accidental hedging. Only enable for services explicitly designed as read-only and idempotent (UserProfileService, FeatureStoreService).

Trade-off analysis:

Implementation approach:

The pattern uses asynchronous request handling with timeout-based triggers. The primary request starts immediately to the first replica. If it doesn’t complete within the P95 latency threshold, a secondary request fires to a different replica. Whichever response arrives first wins; the slower response is discarded.

Client-side configuration (Ad Server → User Profile gRPC):
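gRPC's service config supports a hedgingPolicy for exactly this. A minimal grpc-java sketch might look like the following, where the service name, the 3ms hedging delay, and the status codes are illustrative assumptions rather than the production configuration:

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import java.util.List;
    import java.util.Map;

    public class HedgingChannelFactory {
        /** Builds a channel with hedging enabled only for the read-only UserProfileService.
         *  Service/method names and the 3ms delay are illustrative assumptions. */
        public static ManagedChannel create(String target) {
            Map<String, Object> hedgingPolicy = Map.of(
                    "maxAttempts", 2.0,                 // primary + one hedge
                    "hedgingDelay", "0.003s",           // fire hedge at ~P95 of the primary
                    "nonFatalStatusCodes", List.of("UNAVAILABLE"));

            Map<String, Object> methodConfig = Map.of(
                    "name", List.of(Map.of("service", "ads.userprofile.UserProfileService")),
                    "hedgingPolicy", hedgingPolicy);

            return ManagedChannelBuilder.forTarget(target)
                    .defaultServiceConfig(Map.of("methodConfig", List.of(methodConfig)))
                    .enableRetry()   // hedging shares the retry machinery
                    .usePlaintext()  // TLS/mTLS assumed to be handled by the mesh in this sketch
                    .build();
        }
    }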

When to trigger hedge: Per the original paper, defer hedge requests until the primary has been outstanding longer than the 95th percentile latency for that service. For User Profile (P95 ~3ms), trigger hedge at 3ms. This limits additional load to ~5% while substantially shortening the tail - only requests in the slow tail trigger the hedge.

Monitoring: Track hedge_request_rate and hedge_win_rate. If hedge requests win >20% of the time, investigate why primary is consistently slow.

Advanced Optimizations and Safety Mechanisms:

The baseline hedge implementation adds ~5% load (requests in the slow tail). Two production-validated optimizations improve effectiveness while one critical safety mechanism prevents cascading failures:

1. Load-Aware Hedge Routing via Service Mesh

Leverage service mesh built-in load balancing rather than random replica selection:

2. Request Cancellation on First Response

Cancel the slower request immediately when first response arrives:

3. Circuit Breaker for Hedge Safety (Critical)

Prevent thundering herd during system degradation:

The problem adaptive thresholds tried to solve - and why they fail:

Initial intuition suggests: “During degradation, hedge more aggressively to maintain SLOs.” This leads to adaptive thresholds that lower the hedge trigger (P95 → P90) when P50 latency increases, raising hedge rate from 5% to 10%. This is backwards. When User Profile Service is degraded (e.g., Valkey partial outage slows L2 cache), ALL requests exceed the P95 threshold → hedge rate spikes to 100% → effective load doubles (2× every request) → replicas saturate → P50 increases further → more hedging → cascading failure.

No production systems use adaptive hedge thresholds. Instead, they use circuit breakers to disable hedging during overload.

The Netflix/Hystrix pattern:

Circuit breaker monitors hedge rate and throttles immediately rather than waiting for system to break:

Why 15-20% threshold: Baseline hedge rate should be ~5% (only slow tail requests). If rate climbs to 15-20%, it indicates widespread degradation where hedging adds load without benefit - primary and hedge requests are both slow.

Production precedent: Netflix Hystrix emphasizes that “concurrency limits and timeouts are the proactive portion that prevent anything from going beyond limits and throttle immediately, rather than waiting for statistics or for the system to break.” The circuit breaker is “icing on the cake” that provides the safety valve.

Combined impact:

Production implementation guidance:

Start with baseline (P95 threshold, no optimizations):

  1. Enable hedging for User Profile Service only via gRPC service configuration
  2. Configure service mesh for hedging-eligible methods (read-only, idempotent operations)
  3. Implement circuit breaker monitoring (track hedge rate, disable if >15% for 60s)
  4. Require server-side cancellation handling (check cancellation token, abort work)

gRPC native hedging configuration specifies maximum attempts (primary plus one hedge), hedging delay (P95 latency threshold), and which error codes should trigger hedging versus failing fast. The client automatically cancels slower requests when first response arrives, but servers must cooperatively check cancellation status and stop processing.

Trade-offs to accept:

This approach adds three types of complexity worth the 30-40% P99.9 latency benefit:

Only implement after validating baseline hedge requests prove effective in production.

Cache Coherence Trade-off:

Hedging requests to different replicas with L1 in-process caches introduces data consistency challenges:

The scenario:

Impact:

Why no simple fix exists:

Two standard approaches, both with drawbacks:

  1. Reduce L1 TTL (60s → 10s): Increases L2 Valkey load 6× (60% of requests now miss L1 instead of hitting it)
  2. Active invalidation (publish cache eviction events): Adds latency (15ms Kafka publish + propagation), adds complexity (event streaming infrastructure), still has eventual consistency window (100ms instead of 60s)

Recommended approach:

Accept 60-second max staleness as trade-off for 30-40% P99.9 latency improvement. For critical updates requiring immediate consistency (GDPR opt-out, account suspension), implement active invalidation via L2 cache eviction events - trigger explicit Valkey DELETE when these updates occur, forcing all replicas to fetch fresh data from L3.

This is a fundamental distributed caching trade-off, not specific to hedging - any multi-tier cache with in-process L1 faces this challenge. Hedging simply makes the inconsistency more visible by potentially serving requests from replicas in different cache states within single user session.


External API Architecture

The platform exposes three distinct API surfaces for different user personas. Each API has different latency requirements, security models, and rate limiting strategies. Understanding these external interfaces is critical - they’re not implementation details but architectural concerns that shape request flow, authentication overhead, and operational complexity.

Why APIs matter architecturally: The API layer sits on the critical path (contributing 5ms to latency budget), enforces security boundaries (preventing unauthorized access to high-value revenue streams), and manages external load (rate limiting 1M+ QPS from thousands of publishers). Get API design wrong and you either violate latency SLOs, create security vulnerabilities, or waste engineering time debugging integration issues.

Three API types overview:

These APIs integrate with Part 1’s system architecture (API Gateway → Ad Server Orchestrator), Part 3’s cache invalidation patterns (budget updates propagate through L1/L2/L3), and Part 4’s security model (zero-trust, encryption at rest/transit).

Publisher Ad Request API - Critical Path

Purpose and Requirements

This API serves the core ad request flow: mobile apps and websites request ads in real-time. It’s the highest-traffic, most latency-sensitive endpoint in the entire platform.

Latency constraint: P95 < 150ms (matches internal SLO from Part 1's latency budget decomposition)
Throughput: 1M QPS baseline, 1.5M QPS burst capacity (from Part 1's scale requirements)
Availability: 99.9% uptime (43 min/month error budget - same as overall platform SLA)

Why this is critical path: Every millisecond counts. Mobile apps timeout after 150-200ms. If this API breaches budget, users see blank ad slots and we earn zero revenue on those requests.

Endpoint Design

HTTP Method: POST
Path: /v1/ad/request
Authentication: API key via X-Publisher-ID header

Why API key instead of OAuth: Latency. OAuth token validation requires JWT signature verification (RSA-2048: 2-3ms) plus potential token introspection calls (5-10ms if not cached). API keys validate via a simple distributed cache lookup (0.5ms). At 1M QPS, that ~2ms difference would consume 40% of the gateway's 5ms latency budget.

Rate Limiting: 10K QPS per publisher (tied to SLA tier)

Publishers are tiered (Bronze: 1K QPS, Silver: 5K, Gold: 10K, Platinum: 50K+). Rate limits enforce commercial agreements and prevent single publisher from overwhelming platform capacity.

Request Schema

The request payload contains four categories of data:

User Identity Section (Optional - Signal Loss Reality):

Why user_id is optional: Due to ATT (only ~50% opt-in on iOS, ~27% dual opt-in), cookie blocking (Safari, Firefox), and Privacy Sandbox (Chrome), stable user identity is unavailable for 40-60% of mobile traffic. The system must serve ads without it. When present, user_id enables frequency capping and sequential retargeting. When absent, the system falls back to contextual targeting.

Contextual Signals Section (Always Available):

Why contextual signals are first-class: These signals are always available regardless of identity. While contextual inventory commands lower CPMs than behaviorally-targeted inventory (typically 30-50% lower, though premium placements approach parity), contextual targeting delivers comparable conversion performance - a GumGum/Dentsu study found 48% lower cost-per-click and similar conversion rates. This makes contextual the economically viable fallback for the 40-60% of traffic without stable user_id.

Placement Section:

Device Section:

Why IP included: Essential for two critical functions: (1) Fraud detection (Part 4’s Integrity Check Service) - correlate IP with device fingerprint to detect bot farms, (2) Geo-targeting - advertisers pay premium for location-based campaigns (NYC restaurant targets Manhattan users).

Payload size constraint: < 4KB

Why limit size? At 1M QPS, 4KB requests = 4GB/sec network ingress = 32 Gbps. Keeping payloads compact reduces infrastructure costs and network latency (smaller payloads = faster transmission over TCP).

Response Schema

The response contains the winning ad plus tracking instrumentation:

Ad Metadata:

Tracking URLs:

Why pre-signed URLs: Prevents tracking pixel fraud. Without signatures, malicious publishers could forge impression events by repeatedly calling /v1/events/impression with fabricated data. Pre-signed URLs use HMAC-SHA256 with secret key and 5-minute expiry - only the Ad Server can generate valid tracking URLs.
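
A minimal signing/verification sketch under these assumptions (HMAC-SHA256 over sorted query parameters plus an expiry timestamp, with a secret shared only between the Ad Server and the Events API; parameter names are illustrative):

    import hashlib
    import hmac
    import time
    from urllib.parse import parse_qs, urlencode, urlparse

    SECRET = b"rotate-me-regularly"  # hypothetical shared secret

    def sign_impression_url(ad_id: int, request_id: str, ttl_seconds: int = 300) -> str:
        """Ad Server side: build a tracking URL that expires in 5 minutes by default."""
        params = {"ad_id": ad_id, "request_id": request_id, "exp": int(time.time()) + ttl_seconds}
        payload = urlencode(sorted(params.items()))
        sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        return f"/v1/events/impression?{payload}&sig={sig}"

    def verify_impression_url(url: str) -> bool:
        """Events API side: recompute the signature and check the expiry."""
        parsed = parse_qs(urlparse(url).query)
        sig = parsed.pop("sig", [""])[0]
        payload = urlencode(sorted((k, v[0]) for k, v in parsed.items()))
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        not_expired = int(parsed.get("exp", ["0"])[0]) > time.time()
        return hmac.compare_digest(sig, expected) and not_expired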

TTL (Time-To-Live): 300 seconds default

Advertisers want fresh targeting data (user’s interests from 5 minutes ago, not 24 hours ago), but excessive freshness increases server load. 300s (5min) balances these concerns - cache hit rate remains high (80%+) while targeting stays reasonably current.

Integration with System Architecture

Request flow: Client → API Gateway (5ms) → Ad Server Orchestrator → [User Profile, ML, RTB, Auction] → Response

Reference Part 1’s request flow diagram - the Publisher API is the entry point to the entire ad serving critical path. The 5ms gateway latency budget includes API key validation (0.5ms), rate limiting (1ms), and request enrichment (3.5ms for adding geo-location from IP, parsing headers, sanitizing inputs).

Why synchronous: Publishers need immediate responses to render ad content. Asynchronous processing (accept request, return job ID, poll for result) would require publishers to implement complex retry logic and would delay ad display by seconds - unacceptable for user experience.

Advertiser Campaign Management API

Purpose and Requirements

Advertisers use this API to create campaigns, adjust budgets, query real-time stats, and manage targeting parameters. Unlike the Publisher API (critical path), these are management operations where 500ms latency is acceptable.

Latency constraint: P95 < 500ms (non-critical path; acceptable to be slower than ad serving)
Throughput: 10K req/min (much lower than 1M QPS ad serving - advertisers make tens of API calls per campaign, not millions)
Use cases: Dashboard integrations, programmatic campaign optimization, bulk operations

Endpoint Catalog

POST /v1/campaigns - Create campaign

GET /v1/campaigns/{id}/stats - Query real-time performance

PATCH /v1/campaigns/{id}/budget - Adjust spending

DELETE /v1/campaigns/{id} - Pause/stop campaign

Authentication Model

OAuth 2.0 Authorization Code Flow

Why OAuth instead of API keys: Long-lived sessions. Advertisers log into web dashboards for 30-60 minute sessions. OAuth provides:

  1. Short-lived access tokens (15 min expiry) with silent refresh
  2. Scope-based permissions per endpoint
  3. Centralized revocation when a token is compromised

OAuth’s 2-3ms latency overhead is acceptable here because we have 500ms budget (vs 150ms for Publisher API).

Scope-based permissions:
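
The concrete scope names would come from the implementation, but a minimal enforcement sketch looks like this (scopes here are hypothetical, e.g. campaigns:read / campaigns:write / stats:read); the gateway or service checks the validated token’s scopes against the scope each endpoint requires:

    from functools import wraps

    # Hypothetical scope names for illustration only.
    SCOPE_CAMPAIGNS_READ = "campaigns:read"
    SCOPE_CAMPAIGNS_WRITE = "campaigns:write"
    SCOPE_STATS_READ = "stats:read"

    class ForbiddenError(Exception):
        pass

    def require_scope(required: str):
        """Decorator: reject the call unless the validated OAuth token carries the scope."""
        def decorator(handler):
            @wraps(handler)
            def wrapper(token: dict, *args, **kwargs):
                if required not in token.get("scope", "").split():
                    raise ForbiddenError(f"missing scope {required}")
                return handler(token, *args, **kwargs)
            return wrapper
        return decorator

    @require_scope(SCOPE_CAMPAIGNS_WRITE)
    def patch_campaign_budget(token: dict, campaign_id: str, new_budget: float) -> None:
        ...  # validated write path continues here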

Stats API Deep-Dive

The challenge: Advertisers expect stats within 5 seconds (not 30 seconds from batch processing), but querying billions of impression/click events in real-time would violate latency budget and overwhelm the transactional database.

Solution: Separate analytics path with pre-aggregated data

Introduce a columnar analytics database (ClickHouse or Apache Druid) optimized for time-series aggregations.

Trade-off: 10-20 seconds staleness (eventual consistency). Events flow: User clicks ad → Kafka → Stream Processor → Analytics DB → Hourly rollup job → Stats API cache. Total pipeline latency: 10-20 seconds.
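
A minimal sketch of the pre-aggregation step in plain Python (standing in for whatever stream processor actually runs it; field names are illustrative) - raw events are bucketed per campaign per hour so the Stats API reads one small row instead of scanning billions of raw impressions:

    from collections import defaultdict
    from datetime import datetime, timezone

    def hour_bucket(ts_epoch: float) -> str:
        """Truncate an epoch timestamp to its UTC hour."""
        return datetime.fromtimestamp(ts_epoch, tz=timezone.utc).strftime("%Y-%m-%dT%H:00Z")

    def rollup(events):
        """Aggregate raw events into (campaign_id, hour) rows for the analytics DB."""
        agg = defaultdict(lambda: {"impressions": 0, "clicks": 0, "spend": 0.0})
        for e in events:  # e.g. {"type": ..., "campaign_id": ..., "timestamp": ..., "price": ...}
            row = agg[(e["campaign_id"], hour_bucket(e["timestamp"]))]
            if e["type"] == "impression":
                row["impressions"] += 1
                row["spend"] += e.get("price", 0.0)
            elif e["type"] == "click":
                row["clicks"] += 1
        return agg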

Why acceptable: Advertisers checking campaign progress don’t need millisecond-accurate counts. Showing 99.6% budget utilization with 20-second lag is fine. Critical financial accuracy (budget enforcement) uses separate strongly-consistent path (Part 3’s atomic operations).

Budget Update Workflow

Advertiser updates budget via PATCH /v1/campaigns/{id}/budget:

  1. Request validated: Check authorization (OAuth scopes), validate new budget > current spend
  2. Database write: Update campaign budget in transactional database (strong consistency required)
  3. Cache invalidation cascade: Propagate change through L1/L2/L3 cache hierarchy

Cache invalidation mechanics (reference Part 3’s cache hierarchy):

Propagation time: 10-20 seconds for all instances to see new budget

Why this doesn’t violate financial accuracy: Budget enforcement uses pre-allocated windows (Part 3’s atomic pacing). Even if some servers see stale budget for 20 seconds, the atomic budget counter in distributed cache enforces spending limits with ≤1% variance. Worst case: slight over-delivery during propagation window, but bounded by pre-allocation limits.
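
A minimal sketch of that atomic counter, assuming a Redis-compatible cache and redis-py (key names are illustrative): the Lua script reserves spend against the campaign cap in a single round trip, so no two servers can both commit the last dollar:

    import redis

    cache = redis.Redis(host="budget-cache.internal", port=6379)  # hypothetical endpoint

    # Atomically add `amount` to spend only if it stays within `cap`;
    # returns 1 when the reservation succeeds, 0 when the budget is exhausted.
    RESERVE_LUA = """
    local spent = tonumber(redis.call('GET', KEYS[1]) or '0')
    local amount = tonumber(ARGV[1])
    local cap = tonumber(ARGV[2])
    if spent + amount > cap then return 0 end
    redis.call('INCRBYFLOAT', KEYS[1], ARGV[1])
    return 1
    """
    reserve = cache.register_script(RESERVE_LUA)

    def try_spend(campaign_id: str, amount: float, cap: float) -> bool:
        return reserve(keys=[f"budget:spent:{campaign_id}"], args=[amount, cap]) == 1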

Event Tracking API

Purpose and Requirements

Track impressions (ad displayed), clicks (user tapped ad), and conversions (user installed app or made purchase). This API handles the highest volume - 5× the ad request rate due to retries, duplicates, and background analytics beacons.

Volume: 5M events/sec (5× ad request rate)

Latency: Best-effort (async processing acceptable)

Unlike ad serving (must complete in 150ms), event tracking can tolerate seconds of delay. Analytics dashboards update with 10-30 second lag, and that’s fine.

Endpoint Design

POST /v1/events/impression - Ad displayed
POST /v1/events/click - Ad clicked
POST /v1/events/conversion - User converted (installed app, purchased product)

Authentication: Pre-signed URLs (embedded in ad response, no API key needed)

The ad response from Publisher API includes impression_url: "/v1/events/impression?ad_id=123&sig=HMAC(...)". The client fires this URL when displaying the ad. HMAC signature validates request authenticity - only the Ad Server could have generated this URL with correct signature.

Design Pattern

Client sends event → API Gateway → Kafka (async) → 200 OK immediately

The API Gateway doesn’t wait for Kafka acknowledgment or downstream processing. It accepts the event, publishes to Kafka, and returns success immediately. This non-blocking pattern achieves sub-10ms response times even at 5M events/sec.
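
A minimal sketch of that fire-and-forget handoff, assuming the kafka-python client and a hypothetical topic/cluster name; the handler enqueues the event and returns 200 without waiting for a broker acknowledgment:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.internal:9092",           # hypothetical cluster address
        value_serializer=lambda v: json.dumps(v).encode(),
        acks=0,        # fire-and-forget: don't block the gateway on broker acks
        linger_ms=5,   # micro-batch sends to amortize network round trips
    )

    def handle_event(event: dict):
        """Gateway handler: enqueue and return immediately; delivery happens in the background."""
        producer.send("ad-events", value=event)            # hypothetical topic name
        return 200, "ok"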

Idempotency via event_id:

Mobile networks are unreliable. Clients retry failed requests, causing duplicate events. Each event carries a client-generated event_id, and the ingestion path deduplicates on that ID so retries are not double-counted.
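
A minimal dedup sketch, assuming redis-py and an (assumed) 24-hour dedup window: SET ... NX records the event_id only if it has not been seen before, so retried deliveries are dropped:

    import redis

    dedup_cache = redis.Redis(host="dedup-cache.internal", port=6379)  # hypothetical endpoint

    def is_first_delivery(event_id: str, window_seconds: int = 86400) -> bool:
        """True the first time an event_id is seen; retries and duplicates return False."""
        # SET key 1 NX EX <window>: succeeds only if the key did not already exist.
        return bool(dedup_cache.set(f"event:{event_id}", 1, nx=True, ex=window_seconds))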

Batching support:

Mobile SDKs batch 10-50 events into single request to reduce network overhead:

    POST /v1/events/batch
    [
      {"type": "impression", "ad_id": 123, "timestamp": ...},
      {"type": "impression", "ad_id": 456, "timestamp": ...},
      {"type": "click", "ad_id": 123, "timestamp": ...}
    ]

Batching reduces request count by 10-50×, saving mobile battery and reducing server load.

Why Async is Acceptable

Events serve three purposes:

  1. Analytics dashboards: Advertisers see campaign performance (eventual consistency acceptable - 10-30 sec lag)
  2. Billing reconciliation: Monthly billing reports (eventual consistency acceptable - daily batch jobs)
  3. ML training data: Historical click patterns feed CTR models (eventual consistency acceptable - model retrain daily)

None of these require real-time processing. Accepting eventual consistency (10-30 sec lag) in exchange for lower client latency (10ms vs 50ms if we waited for a Kafka ack) is a clear win.

API Gateway Configuration

Technology Choice Rationale

Reference Part 5’s gateway selection (detailed implementation covered in final architecture post). Requirements for this workload:

Why these requirements matter: At 1M QPS, every millisecond of gateway overhead consumes 0.67% of the 150ms latency budget. Inefficient gateways (10-15ms overhead) would violate SLOs before requests even reach the Ad Server.

Per-API Configuration

Publisher API:

Advertiser API:

Events API:

Cross-Region Routing

Publisher API: Route to nearest region (GeoDNS - minimize latency)

Advertiser API: Route to campaign’s home region (data locality)

Events API: Route to nearest Kafka cluster (minimize network hops)

Rate Limiting Architecture

Multi-tier limits (from Part 1’s rate limiting section):

Distributed cache-backed token bucket:

Why distributed cache: Centralized truth prevents “split-brain” scenarios where different gateway instances enforce different limits. Trade-off: 1ms cache lookup latency (acceptable within 5ms budget) for accurate global limits.
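
A minimal token-bucket sketch backed by the same distributed cache (again via redis-py; key naming, refill math, and the client-supplied clock are illustrative). The Lua script refills and consumes atomically, so every gateway instance sees the same global count:

    import time
    import redis

    cache = redis.Redis(host="ratelimit-cache.internal", port=6379)  # hypothetical endpoint

    # Refill tokens based on elapsed time, then try to take one;
    # returns 1 if the request is allowed, 0 if it should be throttled.
    TOKEN_BUCKET_LUA = """
    local key, rate, burst, now = KEYS[1], tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3])
    local state = redis.call('HMGET', key, 'tokens', 'ts')
    local tokens = tonumber(state[1]) or burst
    local ts = tonumber(state[2]) or now
    tokens = math.min(burst, tokens + (now - ts) * rate)
    local allowed = 0
    if tokens >= 1 then tokens = tokens - 1; allowed = 1 end
    redis.call('HMSET', key, 'tokens', tostring(tokens), 'ts', tostring(now))
    redis.call('EXPIRE', key, 60)
    return allowed
    """
    take_token = cache.register_script(TOKEN_BUCKET_LUA)

    def allow_request(publisher_id: str, rate_per_sec: int, burst: int) -> bool:
        return take_token(keys=[f"ratelimit:{publisher_id}"],
                          args=[rate_per_sec, burst, time.time()]) == 1

    # Example: a Gold-tier publisher capped at 10K QPS with a modest burst allowance.
    # allow_request("pub-123", rate_per_sec=10_000, burst=12_000)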

API Versioning Strategy

Versioning Approach

URL-based versioning: /v1/, /v2/, /v3/

Why URL-based instead of header-based: the version is visible in every log line and dashboard, trivial to route at the gateway, and easy for clients to exercise with a browser or curl - no header parsing needed to know which contract applies.

Backward compatibility: 12 months support for deprecated versions

When releasing /v2/ad/request, we maintain /v1/ad/request for 12 months. Publishers have 1 year to migrate before forced cutoff.

Deprecation Workflow

  1. Announce 6 months in advance (blog post, email, dashboard banner)
  2. Response headers warn clients:
    • X-API-Deprecated: true
    • Sunset: Thu, 01 Jan 2026 00:00:00 GMT (RFC 8594 Sunset header)
  3. Migration tools for common patterns (SDK code generators, automated migration scripts)
  4. Forced cutoff after 12 months - /v1 returns HTTP 410 Gone (see the sketch below)
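
A minimal sketch of how a gateway filter could apply this policy per URL version (the registry, dates, and header values simply mirror the workflow above; the routing layer itself is hypothetical):

    from datetime import datetime, timezone

    # Hypothetical version registry mirroring the deprecation workflow above.
    VERSIONS = {
        "v1": {"deprecated": True, "sunset": datetime(2026, 1, 1, tzinfo=timezone.utc)},
        "v2": {"deprecated": False, "sunset": None},
    }

    def apply_version_policy(path: str, now=None):
        """Return (status, extra_headers) for the version prefix of a request path."""
        now = now or datetime.now(timezone.utc)
        version = path.lstrip("/").split("/", 1)[0]
        meta = VERSIONS.get(version)
        if meta is None:
            return 404, {}
        if meta["sunset"] and now >= meta["sunset"]:
            return 410, {}  # Gone: past the forced cutoff
        headers = {}
        if meta["deprecated"]:
            headers["X-API-Deprecated"] = "true"
            if meta["sunset"]:
                headers["Sunset"] = meta["sunset"].strftime("%a, %d %b %Y %H:%M:%S GMT")
        return 200, headers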

Breaking Change Examples

Requires new version:

No new version needed:

Why this matters: Breaking changes frustrate developers and damage platform adoption. A clear versioning strategy builds trust - developers know migrations are manageable (12-month window) and predictable (breaking changes only ship in a new URL version).

Security Model

Authentication Methods

Publisher API: API Keys

Key management: Publishers can create multiple keys (dev, staging, production) with independent rate limits. A compromised key can be revoked individually without disrupting the other environments.

Advertiser API: OAuth 2.0

Why 15 min expiry: Balances security (short window for stolen token abuse) vs user experience (refresh tokens silently renew access without re-login).

Events API: Pre-signed URLs

Authorization Granularity

Publisher: Domain whitelisting

Advertiser: Tenant isolation

Why tenant isolation matters: Shared infrastructure (multi-tenant platform) requires strict boundaries. Advertiser A must never see Advertiser B’s campaign data, even through API exploits or SQL injection attempts. Defense-in-depth: API layer enforces authorization, database layer enforces row-level security.
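
A minimal sketch of the API-layer half of that defense, with hypothetical table and column names: every campaign read is scoped to the advertiser ID carried by the validated OAuth token, never to an ID the caller supplies:

    import sqlite3  # stands in for the transactional database in this sketch

    class ForbiddenError(Exception):
        pass

    def get_campaign(conn: sqlite3.Connection, token: dict, campaign_id: str) -> dict:
        """Fetch a campaign only if it belongs to the tenant in the validated token."""
        row = conn.execute(
            # Parameterized query (no string concatenation) + mandatory tenant filter.
            "SELECT id, name, budget FROM campaigns WHERE id = ? AND advertiser_id = ?",
            (campaign_id, token["advertiser_id"]),
        ).fetchone()
        if row is None:
            # Same response whether the campaign does not exist or belongs to another tenant.
            raise ForbiddenError("campaign not found")
        return {"id": row[0], "name": row[1], "budget": row[2]}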

Threat Mitigation

API key leakage: Per-environment keys limit the blast radius - a leaked production key is revoked individually (see key management above) and rotated without touching dev or staging.

Token theft (OAuth): The 15-minute access-token expiry bounds the abuse window, and tokens can be revoked centrally to cut off a stolen session.

Replay attacks: Pre-signed event URLs expire after 5 minutes, and event_id deduplication drops repeated submissions of the same event.

API Architecture Diagrams

Diagram 1: API Request Flow

This diagram shows how the three client types (mobile apps, web dashboards, tracking SDKs) connect through the API Gateway to backend services, each with distinct authentication and latency requirements.

    
    graph TB
        subgraph "Client Applications"
            MOBILE[Mobile App<br/>Publisher API]
            WEB[Web Dashboard<br/>Advertiser API]
            SDK[Tracking SDK<br/>Events API]
        end
        subgraph "API Gateway Layer"
            GW[Envoy Gateway<br/>Auth + Rate Limiting<br/>2-4ms overhead]
        end
        subgraph "Backend Services"
            AS[Ad Server<br/>Critical Path<br/>150ms SLO]
            CAMPAIGN[Campaign Service<br/>Non-Critical<br/>500ms SLO]
            KAFKA[Kafka<br/>Event Streaming<br/>Async]
        end
        MOBILE -->|POST /v1/ad/request<br/>API Key| GW
        WEB -->|GET /v1/campaigns/stats<br/>OAuth 2.0| GW
        SDK -->|POST /v1/events/impression<br/>Pre-signed URL| GW
        GW -->|Sync| AS
        GW -->|Sync| CAMPAIGN
        GW -->|Async| KAFKA
        AS -->|Response<br/>ad_creative + tracking_urls| MOBILE
        CAMPAIGN -->|Response<br/>stats JSON| WEB
        KAFKA -->|200 OK<br/>Non-blocking| SDK
        classDef client fill:#e1f5ff,stroke:#0066cc
        classDef gateway fill:#fff4e1,stroke:#ff9900
        classDef service fill:#e8f5e9,stroke:#4caf50
        classDef async fill:#ffe0b2,stroke:#e65100
        class MOBILE,WEB,SDK client
        class GW gateway
        class AS,CAMPAIGN service
        class KAFKA async

Diagram 2: Authentication Flow Comparison

This diagram illustrates the three authentication methods and their latency trade-offs - API keys for low latency (Publisher), OAuth for security (Advertiser), and pre-signed URLs for volume (Events).

    
    %%{ init: { "flowchart": { "nodeSpacing": 50, "rankSpacing": 80, "curve": "basis", "useMaxWidth": true, "padding": 30 } } }%%
    graph LR
        subgraph PUBLISHER ["Publisher API<br/>Low Latency Priority (0.5ms total)"]
            direction LR
            P1[Client Request<br/>X-API-Key header] --> P2[Gateway:<br/>Cache lookup<br/>for API key]
            P2 --> P3[Validation<br/>Key exists<br/>Not revoked<br/>0.5ms]
            P3 --> P4[Forward to<br/>Ad Server]
        end
        subgraph ADVERTISER ["Advertiser API<br/>Security Priority (2-3ms total)"]
            direction LR
            A1[Client Request<br/>OAuth Bearer token] --> A2[Gateway:<br/>JWT signature<br/>verification]
            A2 --> A3[Validation<br/>RSA-2048 signature<br/>Token not expired<br/>Scopes match]
            A3 --> A4[2ms<br/>validation] --> A5[Forward to<br/>Campaign Service]
        end
        subgraph EVENTS ["Events API<br/>Volume Priority (0.3ms total)"]
            direction LR
            E1[Client Request<br/>Pre-signed URL<br/>with HMAC] --> E2[Gateway:<br/>HMAC-SHA256<br/>verification]
            E2 --> E3[Validation<br/>Signature valid<br/>Not expired<br/>0.3ms]
            E3 --> E4[Forward to<br/>Kafka async]
        end
        classDef fast fill:#e6ffe6,stroke:#4caf50,stroke-width:2px
        classDef medium fill:#fff4e6,stroke:#ff9900,stroke-width:2px
        classDef ultrafast fill:#ccffcc,stroke:#339933,stroke-width:2px
        class P1,P2,P3,P4 fast
        class A1,A2,A3,A4,A5 medium
        class E1,E2,E3,E4 ultrafast

Section Conclusion

The three API surfaces - Publisher (critical path, 150ms latency), Advertiser (management, 500ms latency), Events (high volume, async) - each have distinct requirements that shape authentication, rate limiting, and infrastructure choices.

Key insights:

Cross-references:

With these API foundations established, the platform has clear external interfaces for publishers (ad serving), advertisers (campaign management), and analytics (event tracking). Next, we’ll explore how the system maintains these SLOs under failure conditions.


Summary: Building a Solid Foundation

This post established the architectural foundation for a real-time ads platform serving 1M+ QPS with 150ms latency targets. The key principles and decisions made here will ripple through all subsequent design choices.

Core Requirements:

Architectural Decisions:

  1. Dual-Source Architecture: Internal ML inventory + External RTB inventory compete in unified auction

    • Parallel execution (ML: 65ms, RTB: 100ms) maximizes revenue within latency budget
    • 100% fill rate through fallback hierarchy
  2. Latency Budget Decomposition: Every millisecond allocated and defended

    • Network: 15ms | User Profile: 10ms | Integrity Check: 5ms
    • Critical path: RTB (100ms) | Auction + Budget: 13ms | Response: 5ms
    • Total: 143ms avg with 7ms safety margin
  3. Resilience Through Degradation: Multi-level fallback preserves availability

    • Circuit breakers detect service degradation (p99 breaches for 60s)
    • Graceful degradation ladder: cached predictions → heuristics → global averages
    • Trade modest revenue loss (8-25%) for 100% availability vs complete outages
  4. P99 Tail Latency Defense: Protecting 10,000 req/sec from timeouts

    • Infrastructure: Low-pause GC runtime (32GB heap, 200 threads per instance)
      • Eliminates GC pauses as P99 contributor (<1ms vs 41-55ms with traditional GC)
      • Calculated from actual workload: 250-400 MB/sec allocation, 5K QPS per instance
    • Operational: 120ms absolute RTB cutoff with forced failure
      • Prevents P99 tail from violating 150ms SLO (would reach 184-198ms)
      • Falls back to internal inventory (40% revenue) vs blank ads (0% revenue)
  5. Rate Limiting: Infrastructure protection + cost control

    • Distributed cache-backed token bucket (centralized truth)
    • Multi-tier limits: global (1.5M QPS), per-IP (10K), per-advertiser (1K-100K)
    • Prevents 20-30% infrastructure overprovisioning for attack scenarios

Why This Foundation Matters:

The architectural decisions made in this foundation phase create the constraints and opportunities that shape the entire system.

Core Insights from This Analysis:

  1. Quantify everything: Latency budgets, failure modes, and trade-offs must be measured, not assumed. Calculate actual GC pause times from allocation rates. Prove circuit breaker thresholds from P99 distributions.

  2. Design for degradation: Perfect availability is impossible at scale. Build graceful degradation paths that trade modest revenue loss (8-25%) for continued operation vs complete outages.

  3. Infrastructure drives SLOs: Language runtime choices (GC), heap sizing, and thread pool configuration aren’t implementation details - they determine whether you meet or violate latency SLOs at P99.

  4. Parallel execution is mandatory: With 150ms total budget and 100ms external dependencies, sequential operations violate SLOs. The dual-source architecture with parallel ML and RTB execution is a requirement, not an optimization.

  5. Financial accuracy shapes consistency models: Advertiser budgets demand strong consistency (≤1% variance), while user profiles tolerate eventual consistency. Choose the right model for each data type based on business impact.

