Dual-Source Revenue Engine: OpenRTB & ML Inference Pipeline

Introduction: The Revenue Engine

Ad platforms face a fundamental challenge: maximize revenue while meeting strict latency constraints. The naive approach - relying solely on external real-time bidding (RTB) or only internal inventory - leaves significant revenue on the table:

The solution is a dual-source architecture that parallelizes two independent revenue streams:

  1. Internal ML Path (65ms): Score direct-deal inventory using CTR prediction models
  2. External RTB Path (100ms): Broadcast to 50+ DSPs for programmatic bids

Both complete within the 150ms latency budget, then compete in a unified auction. This architecture generates 30-48% more revenue than single-source approaches.

What this post covers:

This post implements the revenue engine with concrete technical details:

The engineering challenge:

Execute 50+ parallel network calls (RTB) AND run ML inference within 100ms total budget. Handle inevitable timeouts gracefully (DSPs fail, network delays, geographic distance). Ensure both paths contribute fair bids to the unified auction. Do all of this at 1M+ queries per second with consistent P99 latency.

Broader applicability:

The patterns explored here - parallel execution with synchronization points, adaptive timeout handling, cost-efficient ML serving, unified decision logic - apply beyond ad tech to any revenue-optimization system with real-time requirements. This demonstrates extracting maximum value from independent data sources under strict latency constraints.

Let’s dive into how this works in practice.

Real-Time Bidding (RTB) Integration

Ad Inventory Model and Monetization Strategy

Before diving into OpenRTB protocol mechanics, understanding the business model is essential. Modern ad platforms monetize through two complementary inventory sources that serve different strategic purposes.

Architectural Driver: Revenue Maximization - Dual-source inventory (internal + external) maximizes fill rate, ensures guaranteed delivery, and captures market value through real-time competition. This model generates 30-48% more revenue than single-source approaches.

What is Internal Inventory?

Internal Inventory refers to ads from direct business relationships between the publisher and advertisers, stored in the publisher’s own database with pre-negotiated pricing. This contrasts with external RTB, where advertisers bid in real-time through programmatic marketplaces.

Four types of internal inventory:

  1. Direct Deals: Sales team negotiates directly with advertiser

    • Example: Nike pays negotiated CPM for 1M impressions on sports pages over 3 months
    • Revenue: Predictable, guaranteed income
    • Use case: Premium brand relationships, custom targeting
  2. Guaranteed Campaigns: Contractual commitment to deliver specific impressions

    • Example: “Deliver 500K impressions to males 18-34 at premium CPM”
    • Publisher must deliver or face penalties; gets priority in auction
    • Use case: Campaign-based advertising with volume commitments
  3. Programmatic Guaranteed: Automated direct deals with fixed price/volume

    • Same economics as direct deals but transacted via API
    • Use case: Automated campaign management at scale
  4. House Ads: Publisher’s own promotional content (NOT paid advertising inventory)

    • What they are: Publisher’s internal promotions like “Subscribe to newsletter”, “Download our app”, “Follow us on social media”, “Upgrade to premium”
    • Revenue: No advertising revenue - generates zero revenue because no external advertiser is paying
    • Value: Still beneficial for publisher (drives newsletter signups, app downloads, user engagement, brand building)
    • Use case: Last-resort fallback when:
      • RTB auction timed out (no external bids arrived), AND
      • All paid internal inventory is exhausted or budget-depleted
      • Better to show promotional content than blank ad space (blank ads damage user trust and long-term CTR)
    • Important distinction: House Ads are fundamentally different from paid internal inventory (direct deals, guaranteed campaigns) which generate actual advertising revenue

Storage: Internal ad database (CockroachDB) storing:

All internal inventory has base CPM pricing determined through negotiation, not real-time bidding.

Why ML Scoring on Internal Inventory?

The revenue optimization problem: Base pricing doesn’t reflect user-specific value. Two users seeing the same ad have vastly different engagement probabilities.

Example scenario:

Ads:

Users:

Without ML (naive ranking by base price):

With ML personalization:

Revenue formula: $$eCPM_{internal} = \text{predicted\_CTR} \times \text{base\_CPM} \times 1000$$

Impact: ML personalization increases internal inventory revenue by 15-40% over naive base-price ranking by matching ads to users most likely to engage.

ML model inputs:

Implementation: GBDT model (40ms latency) predicts CTR for 100 candidate ads, converts to eCPM, outputs ranked list.
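
A minimal sketch of that scoring-and-ranking step in Python (field names are hypothetical; a scikit-learn-style model with predict_proba is assumed):

    from dataclasses import dataclass

    @dataclass
    class InternalAd:
        ad_id: str
        base_cpm: float     # negotiated price, USD per 1000 impressions
        features: list      # assembled feature vector for this (user, ad) pair

    def rank_internal_inventory(ctr_model, candidates: list[InternalAd]) -> list[tuple[float, str]]:
        """Score ~100 internal candidates and rank by eCPM = predicted_CTR x base_CPM x 1000."""
        scored = []
        for ad in candidates:
            ctr = ctr_model.predict_proba([ad.features])[0][1]   # P(click) from the GBDT
            ecpm = ctr * ad.base_cpm * 1000                      # revenue formula above
            scored.append((ecpm, ad.ad_id))
        return sorted(scored, reverse=True)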

Why Both Internal AND External Sources?

Modern ad platforms require both inventory sources for economic viability.

Internal-only limitations:

External-only limitations:

Dual-source optimum:

| Source | % Impressions | Characteristics | Daily Revenue (100M impressions) |
|---|---|---|---|
| Guaranteed campaigns | 25% | Contractual, high priority | Baseline × 40% (2× avg eCPM) |
| Direct deals | 10% | Negotiated, premium pricing | Baseline × 12% (1.5× avg eCPM) |
| External RTB | 60% | Fills unsold inventory | Baseline × 48% (baseline eCPM) |
| House ads | 5% | Publisher’s own promos - fallback when paid inventory exhausted | No ad revenue (not paid advertising) |
| TOTAL | 100% | All slots filled | Baseline revenue |

Why dual-source matters: The single-source tradeoff

Each approach alone has critical weaknesses:

Internal-only (guaranteed + direct deals): High-value inventory but limited scale

RTB-only (external marketplace): High fill rate but misses premium pricing

Dual-source unified auction: Combines premium pricing with full coverage

The key insight: internal and external inventory compete in the same auction. Highest eCPM wins regardless of source, ensuring premium relationships stay profitable while RTB fills gaps.

External RTB: Industry-Standard Programmatic Marketplace

Protocol: OpenRTB 2.5 - industry standard for real-time bidding

How RTB works:

  1. Ad server broadcasts bid request to 50+ DSPs with user context
  2. DSPs run their own ML internally and respond with bids within 100ms
  3. Ad server collects responses: [(DSP_A, eCPM_high), (DSP_B, eCPM_mid), ...]
  4. DSP bids already represent eCPM (no additional scoring needed by publisher)

Why no ML re-scoring on RTB bids:

Latency: 100ms timeout (industry standard, critical path bottleneck)

Revenue implications: RTB provides market-driven pricing. When demand is high, bids increase automatically. When low, internal inventory fills gaps - ensuring revenue stability.

The sections below detail OpenRTB protocol implementation, timeout handling, and DSP integration mechanics.

OpenRTB Protocol Deep Dive

The OpenRTB 2.5 specification defines the standard protocol for programmatic advertising auctions.

Note on Header Bidding vs Server-Side RTB: This architecture focuses on server-side RTB where the ad server orchestrates auctions on the backend.

Header bidding (client-side auctions) now dominates programmatic advertising, accounting for ~70% of revenue for many publishers. It trades higher latency (adds 100-200ms client-side) for better auction competition by having browsers run parallel auctions before page load.

Strategic choice:

A typical server-side RTB request-response cycle:

    
    sequenceDiagram
    participant AdServer as Ad Server
    participant DSP1 as DSP #1
    participant DSP2 as DSP #2-50
    participant Auction as Auction Logic

    Note over AdServer,Auction: 150ms Total Budget

    AdServer->>AdServer: Construct BidRequest<br/>OpenRTB 2.x format
    par Parallel DSP Calls (100ms timeout each)
    AdServer->>DSP1: HTTP POST /bid<br/>OpenRTB BidRequest
    activate DSP1
    DSP1-->>AdServer: BidResponse<br/>Price: eCPM bid
    deactivate DSP1
    and
    AdServer->>DSP2: Broadcast to 50 DSPs<br/>Parallel connections
    activate DSP2
    DSP2-->>AdServer: Multiple BidResponses<br/>[eCPM_1, eCPM_2, ...]
    deactivate DSP2
    end
    Note over AdServer: Timeout enforcement:<br/>Discard late responses
    AdServer->>Auction: Collected bids +<br/>ML CTR predictions
    Auction->>Auction: Run First-Price Auction<br/>Highest eCPM wins
    Auction-->>AdServer: Winner + Price
    AdServer-->>DSP1: Win notification<br/>(async, best-effort)
    Note over AdServer,Auction: Total elapsed: ~135ms

OpenRTB BidRequest Structure:

The ad server sends a JSON request to DSPs (OpenRTB 2.5+):

{
  "id": "req_a3f8b291",
  "imp": [
    {
      "id": "1",
      "banner": {
        "w": 320,
        "h": 50,
        "pos": 1,
        "format": [
          {"w": 320, "h": 50},
          {"w": 300, "h": 250}
        ]
      },
      "bidfloor": 0.50,
      "bidfloorcur": "USD",
      "tagid": "mobile-banner-top"
    }
  ],
  "app": {
    "id": "app123",
    "bundle": "com.example.myapp",
    "name": "MyApp",
    "publisher": {
      "id": "pub-456"
    }
  },
  "device": {
    "ua": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0_1...)",
    "ip": "192.0.2.1",
    "devicetype": 1,
    "make": "Apple",
    "model": "iPhone15,2",
    "os": "iOS",
    "osv": "17.0.1"
  },
  "user": {
    "id": "sha256_hashed_device_id",
    "geo": {
      "country": "USA",
      "region": "CA",
      "city": "San Francisco",
      "lat": 37.7749,
      "lon": -122.4194
    }
  },
  "at": 2,
  "tmax": 100,
  "cur": ["USD"]
}

Key fields (per OpenRTB 2.5 spec):

OpenRTB BidResponse Structure:

DSPs respond with their bid (OpenRTB 2.5+):

{
  "id": "req_a3f8b291",
  "bidid": "bid-response-001",
  "seatbid": [
    {
      "seat": "dsp-seat-123",
      "bid": [
        {
          "id": "1",
          "impid": "1",
          "price": 2.50,
          "adid": "ad-789",
          "cid": "campaign-456",
          "crid": "creative-321",
          "adm": "<div><a href='https://example.com'><img src='https://cdn.example.com/ad.jpg'/></a></div>",
          "adomain": ["example.com"],
          "iurl": "https://dsp.example.com/creative-preview.jpg",
          "w": 320,
          "h": 50
        }
      ]
    }
  ],
  "cur": "USD"
}

Key fields (per OpenRTB 2.5 spec):

RTB Timeout Handling and Partial Auctions

With 50 DSPs and 100ms timeout, some responses inevitably arrive late. Three strategies handle partial auctions:

Strategy 1: Hard Timeout

Strategy 2: Adaptive Timeout

Track per-DSP latency histograms \(H_{dsp}\) and set individualized timeouts:

$$T_{dsp} = \text{min}\left(P_{95}(H_{dsp}), T_{global}\right)$$

where \(P_{95}(H_{dsp})\) is the 95th percentile latency for each DSP, capped at \(T_{global} = 100ms\).

Implementation Details:

Data Structure: HdrHistogram (High Dynamic Range Histogram)

Storage & Update:

Cold Start Handling:

Operational Flow:

Each Ad Server instance maintains an in-memory map of DSP identifiers to their latency histograms. When a DSP response arrives, the latency is recorded asynchronously into that DSP’s histogram without blocking the critical path. When initiating a new RTB request, the system queries the histogram for that DSP’s P95 latency - if the histogram exists and has sufficient samples (≥100), use the P95 value capped at 100ms; otherwise, use the global default of 100ms.
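
A minimal Python sketch of that lookup, using a plain list of samples in place of a real HdrHistogram (thresholds taken from the flow above):

    from collections import defaultdict

    GLOBAL_TIMEOUT_MS = 100
    MIN_SAMPLES = 100

    class DspLatencyTracker:
        """Per-DSP latency samples kept in memory; a stand-in for HdrHistogram."""
        def __init__(self):
            self._samples = defaultdict(list)   # dsp_id -> recorded latencies (ms)

        def record(self, dsp_id: str, latency_ms: float) -> None:
            # In production this is an asynchronous histogram update off the critical path.
            self._samples[dsp_id].append(latency_ms)

        def timeout_for(self, dsp_id: str) -> float:
            """T_dsp = min(P95(H_dsp), T_global), falling back to the global default on cold start."""
            samples = self._samples.get(dsp_id)
            if not samples or len(samples) < MIN_SAMPLES:
                return GLOBAL_TIMEOUT_MS
            p95 = sorted(samples)[int(0.95 * (len(samples) - 1))]
            return min(p95, GLOBAL_TIMEOUT_MS)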

Example Scenario:

This allows fast, reliable DSPs to contribute to lower overall latency (saving 20-30ms on the critical path) while protecting against slow DSPs that would violate the budget.

Trade-off Analysis:

Strategy 3: Progressive Auction

Mathematical Model:

Let \(B_i\) be the bid from DSP \(i\) with arrival time \(t_i\). The auction winner at time \(t\):

$$W(t) = \arg\max_{i: t_i \leq t} B_i \times \text{CTR}_i$$

Revenue optimization: $$\mathbb{E}[\text{Revenue}] = \sum_{i=1}^{N} P(t_i \leq T) \times B_i \times \text{CTR}_i$$

This shows the expected revenue decreases as timeout \(T\) decreases (fewer DSPs respond).

Connection Pooling and HTTP/2 Multiplexing

To minimize connection overhead for 50+ DSPs:

HTTP/1.1 Connection Pooling:

Example: 1000 QPS, 100ms latency, 10 servers → 10 connections per server
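
That figure follows from Little's law: concurrent in-flight requests equal arrival rate times latency, spread across the server pool:

$$N_{conn/server} = \frac{\lambda \times T}{N_{servers}} = \frac{1000\ \text{req/s} \times 0.1\ \text{s}}{10} = 10$$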

HTTP/2 Benefits:

What about gRPC?

gRPC is excellent for internal services but faces a key constraint: OpenRTB is a standardized JSON/HTTP protocol. External DSPs expect HTTP REST endpoints per IAB spec.

Hybrid approach:

Latency Improvement:

Connection setup time \(T_{conn}\):

Latency savings: ~50ms per cold start - important for minimizing tail latency in RTB auctions.

Geographic Distribution and Edge Deployment

Latency Impact of Distance:

Network latency is fundamentally bounded by the speed of light in fiber:

$$T_{propagation} \geq \frac{d}{c \times 0.67}$$

where \(d\) is distance, \(c\) is speed of light, 0.67 accounts for fiber optic refractive index[^fiber-refractive].

Example: New York to London (5,585 km): $$T_{propagation} \geq \frac{5,585,000m}{3 \times 10^8 m/s \times 0.67} \approx 28ms$$

Important: This 28ms is the theoretical minimum - the absolute best case if light could travel in a straight line through fiber with zero processing delays.

Real-world latency is 2.5-3× higher due to:

Measured latency NY-London in practice: 80-100ms round-trip (vs 28ms theoretical minimum).

This demonstrates why latency budgets must account for real-world networking overhead, not just theoretical limits. The 100ms RTB maximum timeout (industry standard fallback) is impossible to achieve for global DSPs without geographic sharding - regional deployment is mandatory, not optional, to minimize distance and achieve practical 50-70ms response times.

Optimal DSP Integration Points:

Deploy RTB auction services in:

  1. US East (Virginia): Proximity to major ad exchanges
  2. US West (California): West coast advertisers
  3. EU (Amsterdam/Frankfurt): GDPR-compliant EU auctions
  4. APAC (Singapore): Asia-Pacific market

Latency Reduction:

With regional deployment, max distance reduced from 10,000km to ~1,000km: $$T_{propagation} \approx \frac{1,000,000m}{3 \times 10^8 m/s \times 0.67} \approx 5ms$$

Again, this is theoretical minimum. Practical regional latency (within 1,000km): 15-25ms round-trip including routing overhead.

Savings: From 80-100ms (global) to 15-25ms (regional) = 55-75ms reduction, allowing significantly more regional DSPs to respond within practical 50-70ms operational timeouts while maintaining high response rates.

RTB Geographic Sharding and Timeout Strategy

Architectural Driver: Latency - Physics constraints make global DSP participation within 100ms impossible. Geographic sharding with aggressive early termination (50-70ms cutoff) captures 95%+ revenue while maintaining sub-150ms SLO.

The 100ms Timeout Reality:

While OpenRTB documentation cites 100ms tmax timeouts, production reality requires more aggressive cutoffs:

Why the discrepancy? The 100ms timeout is your failure deadline, not your target. High-performing platforms aim for 50-70ms p80 to maximize auction quality.

Geographic Sharding Architecture:

Regional clusters call only geographically proximate DSPs:

| Region | Calls DSPs in | Avg RTT | Response Rate (80ms cutoff) | DSPs Called |
|---|---|---|---|---|
| US-East | US + Canada | 15-30ms | 92-95% | 20-25 regional + 10 premium |
| EU-West | EU + EMEA | 10-25ms | 93-96% | 25-30 regional + 10 premium |
| APAC | Asia-Pacific | 15-35ms | 88-92% | 15-20 regional + 10 premium |

Premium Tier (10-15 DSPs): High-value DSPs (Google AdX, Magnite, PubMatic) called globally regardless of latency - their bid value justifies lower response rate (65-75%).

How Premium Tier DSPs Achieve Global Coverage Within Physics Constraints:

Major DSPs operate multi-region infrastructure with geographically-distributed endpoints, enabling “global” coverage without violating latency budgets:

Regional endpoint architecture:

Request routing per region:

What “called globally” means:

Smaller DSPs without multi-region infrastructure (most Tier 2/3 DSPs) operate single endpoints and are assigned to specific regions only. For example, “BidCo” with a single US datacenter is only called from US-East/West clusters, not from EU or APAC.

Configuration example:

Premium DSP configuration (e.g., Google AdX):

This architecture resolves the apparent contradiction: premium DSPs are “globally available” (all users can access them) while respecting the 50-70ms operational latency target (each region calls local endpoints only).

Dynamic Bidder Health Scoring:

Multi-dimensional scoring (updated hourly):

$$Score_{DSP} = 0.3 \times S_{\text{latency}} + 0.25 \times S_{\text{bid rate}} + 0.25 \times S_{\text{win rate}} + 0.2 \times S_{\text{value}}$$

Tier Assignment:

| Tier | Score Range | Treatment | Typical Count |
|---|---|---|---|
| Tier 1 (Premium) | >80 | Always call from all regions | 10-15 DSPs |
| Tier 2 (Regional) | 50-80 | Call if same region + healthy | 20-25 DSPs |
| Tier 3 (Opportunistic) | 30-50 | Call only for premium inventory | 10-15 DSPs |
| Tier 4 (Excluded) | <30 OR P95 >100ms | SKIP entirely (egress cost savings) | 5-10 DSPs |

Note: Tier assignment also incorporates P95 latency for cost optimization. See Egress Bandwidth Cost Optimization section below for detailed predictive timeout calculation and Tier 4 exclusion logic that achieves 45% egress cost reduction.
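
A small sketch of the scoring and tier assignment (component scores are assumed to be pre-normalized to a 0-100 scale):

    def health_score(s_latency: float, s_bid_rate: float, s_win_rate: float, s_value: float) -> float:
        """Weighted multi-dimensional DSP health score, per the formula above."""
        return 0.3 * s_latency + 0.25 * s_bid_rate + 0.25 * s_win_rate + 0.2 * s_value

    def assign_tier(score: float, p95_latency_ms: float) -> int:
        """Map a health score (and P95 latency) onto the tier table above."""
        if score < 30 or p95_latency_ms > 100:
            return 4        # excluded: skip entirely, save egress
        if score > 80:
            return 1        # premium: always call from all regions
        if score >= 50:
            return 2        # regional: call if same region and healthy
        return 3            # opportunistic: premium inventory only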

Early Termination Strategy:

Progressive timeout tiers:

Trade-off: Waiting 70ms→100ms (+30ms) yields only +1-2% revenue. Not worth the latency cost.

Revenue Impact Model:

$$\text{Revenue}(t) = \sum_{i=1}^{N} P(\text{DSP}_i \text{ responds by } t) \times E[\text{bid}_i] \times \text{CTR}_i$$

Empirical data:

| Timeout | DSPs Responding | Revenue (% of max) | Latency Impact |
|---|---|---|---|
| 50ms | 30-35 (70%) | 85-88% | Excellent (fast UX) |
| 70ms | 40-45 (85%) | 95-97% | Good (target) |
| 80ms | 45-48 (90%) | 97-98% | Acceptable |
| 100ms | 48-50 (95%) | 98-99% | Slow (diminishing returns) |

Monitoring:

Metrics tracked per DSP (hourly aggregation):

Alerts:

Implementation: DSP Selection and Request Cancellation

DSP Selection Logic (Pre-Request Filtering):

The bidder health scoring system actively skips slow DSPs before making requests, not just timing them out after sending:

DSP Selection Algorithm:

For each incoming ad request:

  1. Determine user region from IP address (US-East, EU-West, or APAC)

  2. Calculate health score for each DSP (based on latency, bid rate, win rate, value)

  3. Assign tier based on health score threshold

  4. Apply tier-specific selection logic:

    • Tier 1 (Premium): Always include, regardless of region - multi-region endpoints ensure low latency
    • Tier 2 (Regional): Include only if same region AND score > 50, else SKIP (avoids cross-region latency)
    • Tier 3 (Opportunistic): Include only for premium inventory AND score > 30, else SKIP (saves bandwidth)
  5. Result: ~25-30 selected DSPs (not all 50)

  6. Savings: ~40% fewer HTTP requests, reduced bandwidth and tail latency
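
The selection rules above, expressed as a per-request filter (a sketch; the dict keys tier, region, and score are illustrative):

    def select_dsps(dsps: list[dict], user_region: str, premium_inventory: bool) -> list[dict]:
        """Pre-request filtering: only DSPs that pass the tier rules receive a bid request."""
        selected = []
        for dsp in dsps:
            if dsp["tier"] == 1:
                selected.append(dsp)                                   # premium: always include
            elif dsp["tier"] == 2 and dsp["region"] == user_region and dsp["score"] > 50:
                selected.append(dsp)                                   # regional: same region only
            elif dsp["tier"] == 3 and premium_inventory and dsp["score"] > 30:
                selected.append(dsp)                                   # opportunistic: premium inventory only
            # everything else (including Tier 4) is skipped entirely
        return selected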

Request Cancellation Pattern:

Algorithm for parallel DSP requests with timeout:

    
    flowchart TD
    Start[Start RTB Auction] --> Context[Create 70ms timeout context]
    Context --> FanOut[Fan-out: Launch parallel HTTP requests<br/>to 25-30 selected DSPs]
    FanOut --> Fast[Fast DSPs 20-30ms]
    FanOut --> Medium[Medium DSPs 40-60ms]
    FanOut --> Slow[Slow DSPs 70ms+]
    Fast --> Collect[Progressive Collection:<br/>Stream bids as they arrive]
    Medium --> Collect
    Slow --> Timeout{70ms<br/>timeout?}
    Timeout -->|Before timeout| Collect
    Timeout -->|After timeout| Cancel[Cancel pending requests]
    Cancel --> RST[HTTP/2: Send RST_STREAM<br/>HTTP/1.1: Close connection]
    RST --> Record[Record timeout per DSP<br/>for health scores]
    Collect --> Check{Collected<br/>sufficient bids?}
    Record --> Check
    Check -->|Yes 95-97%| Auction[Proceed to auction with<br/>available responses]
    Check -->|No| Auction
    Auction --> End[Return winning bid]
    style Timeout fill:#ffa
    style Cancel fill:#f99
    style Auction fill:#9f9

Key behaviors:

Key Implementation Details:

  1. Pre-request filtering: Tier 3 DSPs don’t receive requests for normal inventory → saves ~20-25 HTTP requests per auction
  2. Progressive collection: Bids collected as they arrive (streaming), not blocking until timeout
  3. Graceful cancellation: HTTP/2 stream-level cancellation (RST_STREAM) preserves connection pool
  4. Monitoring integration: Record timeouts per DSP to update health scores hourly
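
A runnable asyncio sketch of the fan-out/cancellation pattern above, with simulated DSP calls standing in for real HTTP requests:

    import asyncio
    import random

    async def call_dsp(dsp_id: str) -> tuple[str, float]:
        """Stand-in for HTTP POST /bid; returns (dsp_id, eCPM bid)."""
        await asyncio.sleep(random.uniform(0.02, 0.09))      # simulated 20-90ms DSP latency
        return dsp_id, round(random.uniform(0.5, 4.0), 2)

    async def run_rtb_fanout(dsp_ids: list[str], timeout_s: float = 0.070) -> list[tuple[str, float]]:
        """Fan out to the selected DSPs, keep bids that arrive before the deadline,
        and cancel whatever is still pending (RST_STREAM on HTTP/2 in a real client)."""
        tasks = [asyncio.create_task(call_dsp(d)) for d in dsp_ids]
        done, pending = await asyncio.wait(tasks, timeout=timeout_s)
        for task in pending:
            task.cancel()              # late DSPs are recorded as timeouts for health scoring
        return [t.result() for t in done]

    if __name__ == "__main__":
        bids = asyncio.run(run_rtb_fanout([f"dsp-{i}" for i in range(30)]))
        print(f"{len(bids)} bids collected; best:", max(bids, key=lambda b: b[1], default=None))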

Statistical Clarification:

The 100ms timeout is a p95 target across all DSPs in a single auction, not per-DSP mean:

With 25-30 DSPs per auction, the probability that at least one times out increases. The 70ms target mitigates this tail latency risk.
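
As an illustration of that tail effect, if each DSP independently misses the deadline 5% of the time (i.e. the deadline sits at its individual p95), then with 30 DSPs in the auction:

$$P(\text{at least one timeout}) = 1 - (1 - 0.05)^{30} \approx 0.79$$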

The 100ms RTB Timeout: Why Multi-Tier Optimization is Mandatory

Industry Context: This architecture uses a 100ms timeout for DSP responses, aligning with industry standard OpenRTB implementations (IAB OpenRTB tmax field). However, as demonstrated in the physics analysis and geographic sharding section above, achieving this timeout with global DSP participation is impossible without aggressive optimization. This section explains the constraint and why the multi-tier approach (geographic sharding + bidder health scoring + early termination) is not optional - it’s mandatory.

The IAB OpenRTB specification defines a tmax field (maximum time in milliseconds) but does not mandate a specific value. Real-world implementations vary:

The Physics Reality:

Network latency is fundamentally bounded by the speed of light. For global DSP communication (showing theoretical minimums - real-world latency is 2-3× higher due to routing overhead):

| Route | Distance | Min Latency (one-way) | Round-trip (theoretical) | Practical Round-trip | Available time for DSP |
|---|---|---|---|---|---|
| US-East → US-West | 4,000 km | ~13ms | ~26ms | ~60-80ms | -30 to -50ms (impossible!) |
| US → Europe | 6,000 km | ~20ms | ~40ms | ~100-120ms | -70 to -90ms (impossible!) |
| US → Asia | 10,000 km | ~33ms | ~66ms | ~150-200ms | -120 to -170ms (impossible!) |
| Europe → Asia | 8,000 km | ~27ms | ~54ms | ~120-150ms | -90 to -120ms (impossible!) |

Mathematical reality:

$$T_{RTB} = T_{\text{network to DSP}} + T_{\text{DSP processing}} + T_{\text{network from DSP}}$$

For a DSP in Singapore processing a request from New York (using practical latency measurements):

Even the theoretical physics limit (66ms one-way, 132ms round-trip) would challenge a 100ms budget, and practical networking makes it far worse.

Why the 100ms timeout enables global DSP participation:

With regional deployment and intelligent DSP selection:

The 100ms budget accepts that some global DSPs will timeout, but captures enough responses to maximize auction competition while maintaining user experience (within 150ms total SLO).

Why we can’t just increase the timeout:

The 150ms total budget breaks down into three phases: sequential startup, parallel execution (where RTB is the bottleneck), and final sequential processing.

    
    gantt
    title Request Latency Breakdown (150ms Budget)
    dateFormat x
    axisFormat %L

    section Sequential 0-25ms
    Network overhead 10ms      :done, 0, 10
    Gateway 5ms                :done, 10, 15
    User Profile 10ms          :done, 15, 25

    section Parallel ML Path
    Feature Store 10ms         :active, 25, 35
    Ad Selection 15ms          :active, 35, 50
    ML Inference 40ms          :active, 50, 90
    Idle wait 35ms             :90, 125

    section Parallel RTB Path
    RTB Auction 100ms          :crit, 25, 125

    section Final 125-150ms
    Auction + Budget 8ms       :done, 125, 133
    Serialization 5ms          :done, 133, 138
    Buffer 12ms                :138, 150

Before parallel execution (30ms): Network overhead (10ms), gateway routing (5ms), user profile lookup (10ms), and integrity check (5ms) must complete sequentially before the parallel ML/RTB phase begins.

Parallel execution phase: Two independent paths start at 30ms (after User Profile + Integrity Check):

After synchronization (13ms avg, 15ms p99): Once RTB completes at 130ms, we run Auction Logic (3ms), Budget Check (3ms avg, 5ms p99) via Redis Lua script, add overhead (2ms), and serialize the response (5ms), reaching 143ms avg (145ms p99). The budget check uses Redis Lua script for atomic check-and-deduct (detailed in the budget pacing section of Part 3).

Buffer (5-7ms): Leaves 5-7ms headroom to reach the 150ms SLO, accounting for network variance and tail latencies. The 5ms Integrity Check investment is justified by massive annual savings in RTB bandwidth costs (eliminating 20-30% fraudulent traffic before DSP fan-out).

Key constraint: Increasing RTB timeout beyond 100ms directly increases total latency. A 150ms RTB timeout would push total latency to 185ms (150 RTB + 25 startup + 10 final), violating the 150ms SLO by 35ms.

Key architectural insight: RTB auction (100ms) is the critical path - it dominates the latency budget. The internal ML path (Feature Store 10ms + Ad Selection 15ms + ML Inference 40ms = 65ms) completes well before RTB responses arrive, so they run in parallel without blocking each other.

Why 100ms RTB timeout is the p95 target (with p99 protection at 120ms):

The 150ms SLO: The 150ms total latency provides good user experience (mobile apps timeout at 200-300ms) while accommodating industry-standard RTB mechanics. However, meeting this SLO requires the multi-tier optimization approach described earlier.

Why Regional Sharding + Bidder Health Scoring are Mandatory (not optional)

The physics constraints demonstrated above make it clear: regional sharding is not an optimization - it’s a mandatory requirement. Without geographic sharding, dynamic bidder selection, and early termination, the 100ms RTB budget is impossible to achieve:

    
    graph TB
    subgraph "User Request Flow"
        USER[User in New York]
    end

    subgraph "Regional DSP Sharding"
        ADV[Ad Server<br/>US-East-1]
        ADV -->|5ms RTT| US_DSPS[US DSP Pool<br/>25 partners<br/>Latency: 15ms avg]
        ADV -.->|40ms RTT| EU_DSPS[EU DSP Pool<br/>15 partners<br/>SKIPPED - too slow]
        ADV -.->|66ms RTT| ASIA_DSPS[Asia DSP Pool<br/>10 partners<br/>SKIPPED - too slow]
        US_DSPS -->|Response| ADV
    end

    subgraph "Smart DSP Selection"
        PROFILE[(DSP Performance Profile<br/>Cached in Redis)]
        PROFILE -->|Lookup| SELECTOR[DSP Selector Logic]
        SELECTOR --> DECISION{Distance vs<br/>Historical Bid Value}
        DECISION -->|High value,<br/>close proximity| INCLUDE[Include in auction]
        DECISION -->|Low value or<br/>distant| SKIP[Skip to meet latency]
    end

    USER --> ADV
    ADV --> PROFILE

    classDef active fill:#ccffcc,stroke:#00cc00,stroke-width:2px
    classDef inactive fill:#ffcccc,stroke:#cc0000,stroke-width:2px,stroke-dasharray: 5 5
    classDef logic fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    class US_DSPS,INCLUDE active
    class EU_DSPS,ASIA_DSPS,SKIP inactive
    class PROFILE,SELECTOR,DECISION logic

Regional Sharding Strategy:

DSP Selection Algorithm:

For each auction request, select DSPs based on multi-criteria optimization:

DSP Selection Criteria (include if any condition is met):

where:

Optimization objective:

$$\max \sum_{i \in \text{Selected}} P_i \times V_i \quad \text{subject to } \max(L_i) \leq 100ms$$

Maximize expected revenue while respecting latency constraint.

Impact of regional sharding:

Revenue trade-off:

Optimization 2: Selective DSP Participation

With a 100ms timeout budget, prioritize DSPs based on historical performance metrics rather than geography alone:

DSP Selection Criteria:

| DSP Characteristics | Strategy | Reasoning |
|---|---|---|
| High-value, responsive (avg bid >2× baseline, p95 latency <80ms) | Always include | Best revenue potential with reliable response |
| Medium-value, responsive (avg bid 0.75-2× baseline, p95 latency <80ms) | Include | Good balance of revenue and reliability |
| Low-value or slow (avg bid <0.75× baseline or p95 >90ms) | Evaluate ROI | May skip to reduce tail latency |
| Inconsistent bidders (bid rate <30%) | Consider removal | Unreliable participation wastes auction slots |

Performance-Based Routing:

For each auction, the system:

  1. Selects DSPs based on historical performance:
    • Historical p95 latency < 80ms
    • Bid rate > 50%
    • Average bid value justifies inclusion cost
  2. Sends bid requests to selected DSPs in parallel
  3. Waits up to 100ms for responses
  4. Proceeds with whatever bids have arrived by the deadline

Monitoring & Validation:

Monitor per-DSP metrics:

Automatically demote underperforming DSPs or increase timeout threshold for consistently slow but high-value partners (up to 120ms).

Theoretical impact:

Based on the physics constraints shown above, regional sharding should yield:

Conclusion:

The 100ms RTB timeout aligns with industry-standard practices, but achieving it requires mandatory multi-tier optimization (not optional enhancements). The three-layer defense is essential:

  1. Geographic sharding (mandatory): Regional ad server clusters call geographically-local DSPs only (15-25ms RTT vs 200-300ms global)
  2. Dynamic bidder health scoring (mandatory): De-prioritize/skip slow DSPs before making requests based on p50/p95/p99 latency tracking and revenue contribution
  3. Adaptive early termination (mandatory): 50-70ms operational target with progressive timeout ladder (not 100ms as primary goal)

Architectural Driver: Latency + Revenue - The 100ms RTB timeout is the absolute fallback deadline, not the operational target. The multi-tier optimization approach achieves 60-70ms typical latency while capturing 95-97% of revenue, making the 150ms total SLO achievable with real-world network physics.

Reality of this approach:

Cascading Timeout Strategy: Maximizing Revenue from Slow Bidders

Architectural Driver: Revenue Optimization - The traditional approach (wait 100ms for all DSP responses before running auction) leaves revenue on the table. A cascading auction mechanism harvests fast responses for low-latency users while still capturing late bids for revenue optimization.

The Problem with Single-Timeout Auctions:

Traditional RTB integration uses a single timeout: wait until 100ms deadline, collect all responses, run one unified auction. This creates a tradeoff:

The Cascading Solution: Staged Bid Harvesting

Instead of a binary timeout, implement a progressive auction ladder that runs multiple auctions at different thresholds:

Stage 1 - Fast Track Auction (50ms deadline):

Stage 2 - Revenue Maximization Auction (80-100ms deadline):

Stage 3 - Absolute Cutoff (120ms hard deadline):

Cascading Auction Flow:

    
    sequenceDiagram
    participant User
    participant AdServer
    participant DSPs as 50 DSPs
    participant Analytics

    Note over AdServer: t=0ms: Request arrives
    AdServer->>DSPs: Broadcast bid requests (parallel)

    Note over AdServer: t=50ms: Stage 1 Checkpoint
    DSPs-->>AdServer: Fast responses (70-80% of DSPs)
    AdServer->>AdServer: Run Stage 1 auction<br/>(ML ads + fast DSP bids)
    AdServer->>User: Deliver winning ad (Stage 1)
    AdServer->>Analytics: Log Stage 1 winner

    Note over AdServer: t=100ms: Stage 2 Checkpoint (async)
    DSPs-->>AdServer: Late responses (remaining 20-30%)
    AdServer->>AdServer: Run Stage 2 auction<br/>(all bids collected)
    AdServer->>Analytics: Log revenue differential<br/>(Stage2 eCPM - Stage1 eCPM)
    alt Stage 2 winner significantly better (>5% eCPM)
    AdServer->>AdServer: Upgrade billing to Stage 2 winner
    Note over AdServer: Publisher gets higher revenue<br/>User already saw Stage 1 ad
    else Stage 2 winner not materially better
    AdServer->>AdServer: Keep Stage 1 billing
    end

    Note over AdServer: t=120ms: Stage 3 Absolute Cutoff
    AdServer->>DSPs: Cancel remaining connections
    AdServer->>Analytics: Log P99 protection trigger

Operational Flow:

Phase 1 - Request Initiation (t=0ms):

Phase 2 - Fast Track Harvest (t=50ms):

Phase 3 - Revenue Optimization (t=100ms, async):

Phase 4 - Safety Cutoff (t=120ms, forced):

Revenue Impact Analysis:

Real-world latency distributions show diminishing returns beyond 50ms:

| Timeout | DSP Response Rate | Revenue Capture | Latency Impact |
|---|---|---|---|
| 30ms | 45-55% | 70-75% | Optimal UX, significant revenue loss |
| 50ms | 70-80% | 85-90% | Excellent UX, minor revenue loss |
| 80ms | 90-95% | 95-98% | Acceptable UX, minimal revenue loss |
| 100ms | 95-97% | 99-100% | Marginal UX, maximum revenue |
| 120ms+ | 98-100% | 100% | Poor UX, violates SLO |

Key insight: Going from 50ms to 100ms adds 50ms latency but only captures an extra 10-15% revenue. The cascading approach gets both - 50ms user experience AND 100% revenue capture.

Why This Works:

  1. User sees fast ad: Stage 1 delivers in 65ms total (50ms RTB + 15ms overhead)
  2. Publisher gets maximum revenue: Stage 2 billing uses highest bid from full auction
  3. DSP fairness: All DSPs get chance to participate (within physics constraints)
  4. P99 protection: 120ms absolute cutoff prevents tail latency violations
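
A sketch of the staged harvest, reusing the same simulated DSP tasks as the fan-out example earlier (deadlines follow the stages described above):

    import asyncio

    async def cascading_auction(tasks, stage1_deadline=0.050, stage2_deadline=0.100):
        """Two-stage bid harvest: serve a fast Stage 1 winner, keep collecting late
        bids until Stage 2, then recompute the billing winner."""
        done, pending = await asyncio.wait(tasks, timeout=stage1_deadline)
        stage1_bids = [t.result() for t in done]
        stage1_winner = max(stage1_bids, key=lambda b: b[1], default=None)
        # <- Stage 1: the winning ad is delivered to the user at this point.

        late_bids = []
        if pending:
            late_done, still_pending = await asyncio.wait(
                pending, timeout=stage2_deadline - stage1_deadline
            )
            for t in still_pending:
                t.cancel()             # Stage 3: absolute cutoff, cancel remaining connections
            late_bids = [t.result() for t in late_done]

        stage2_winner = max(stage1_bids + late_bids, key=lambda b: b[1], default=None)
        return stage1_winner, stage2_winner   # caller upgrades billing only if Stage 2 eCPM is >5% higher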

Analytics and Optimization:

Track Stage 1 vs Stage 2 revenue differential to optimize timeout thresholds. Daily analytics should measure:

Key metrics:

Data collection:

Typical findings:

When to Use Single-Stage vs Cascading:

Single-stage auction (80-100ms) makes sense when:

Cascading auction (50ms + 100ms) makes sense when:

Our choice: Cascading auctions for mobile inventory (70% of traffic), single-stage for desktop (30%).

Trade-off Articulation:

This cascading approach is not free - it adds operational complexity:

Complexity added:

Complexity justified by:

Implementation requirements:

Egress Bandwidth Cost Optimization: Predictive DSP Timeouts

Architectural Driver: Cost Efficiency - Egress bandwidth is the largest variable operational cost in RTB integration. At 1M QPS sending requests to 50+ DSPs, the platform pays for every byte sent to DSPs, regardless of whether they respond in time or win the auction. Optimizing which DSPs receive requests and with what timeouts directly impacts infrastructure costs.

The Egress Bandwidth Problem:

RTB integration involves sending HTTP POST requests (2-8KB each) to dozens of external DSPs for every ad request. At scale, this creates massive egress bandwidth costs:

Bandwidth Calculation at 1M QPS:

The Waste: DSPs that consistently respond slowly (>100ms) rarely win auctions due to the 150ms total SLO constraint. Yet the platform still pays full egress costs to send them bid requests.

Example of waste:

Solution: DSP Performance Tier Service with Predictive Timeouts

Instead of using a global 100ms timeout for all DSPs, dynamically adjust timeout per DSP based on historical performance, and skip DSPs that won’t respond in time.

DSP Performance Tier Service Architecture:

This is a dedicated microservice that:

  1. Tracks P50, P95, P99 latency for every DSP (hourly rolling window)
  2. Calculates predictive timeout for each DSP
  3. Assigns DSPs to performance tiers
  4. Provides real-time lookup for ad server (via Redis cache, <1ms lookup)

Latency Budget Impact:

The DSP performance lookup adds 1ms to the RTB auction phase and is accounted for within the existing 100ms RTB budget:

RTB Phase Breakdown (100ms total):

Key point: The 1ms lookup happens at the start of the RTB phase and reduces the effective fan-out budget from 100ms to 99ms. This is acceptable because:

Trade-off: Spend 1ms upfront to save 20-30ms on average through smarter DSP selection and dynamic timeouts. The ROI is 20:1 to 30:1 in latency savings.

Predictive Timeout Calculation:

For each DSP, calculate dynamic timeout based on historical latency:

$$T_{DSP} = \min(P95_{DSP} + \text{safety margin}, T_{max})$$

Where:

Example calculations:

| DSP | P95 Latency (1h) | Predictive Timeout | Action |
|---|---|---|---|
| Google AdX | 35ms | min(35+10, 100) = 45ms | Include with short timeout |
| Magnite | 55ms | min(55+10, 100) = 65ms | Include with medium timeout |
| Regional DSP A | 25ms | min(25+10, 100) = 35ms | Include with very short timeout |
| SlowBid Inc | 145ms | min(145+10, 100) = 100ms | Include but likely timeout |
| UnreliableDSP | 180ms | Exceeds 150ms | SKIP entirely (pre-filter) |

Enhanced Tier Assignment with Cost Optimization:

Extend the existing 3-tier system to incorporate egress cost optimization:

| Tier | Latency Profile | Predictive Timeout | Treatment | Egress Savings |
|---|---|---|---|---|
| Tier 1 (Premium) | P95 < 50ms | P95 + 10ms (dynamic) | Always call, optimized timeout | Minimal waste |
| Tier 2 (Regional) | P95 50-80ms | P95 + 10ms (dynamic) | Call if same region | 15-25% reduction |
| Tier 3 (Opportunistic) | P95 80-100ms | P95 + 10ms (capped at 100ms) | Call only premium inventory | 40-50% reduction |
| Tier 4 (Excluded) | P95 > 100ms | N/A | SKIP entirely | 100% saved |

DSP Selection Algorithm with Cost Optimization:

Enhanced algorithm that incorporates both latency AND cost:

Step 1: User Context Identification

Step 2: Fetch DSP Performance Data

Ad Server retrieves current performance data from Redis cache for all DSPs:

Step 3: Apply Tier-Based Filtering Rules

Tier 4 DSPs (P95 > 100ms): Skip entirely. These DSPs timeout too frequently to justify egress bandwidth cost. Result: 100% egress savings for excluded DSPs.

Tier 3 DSPs (P95 80-100ms): Include only for premium inventory. For standard or remnant inventory, the slow response time doesn’t justify waiting. Result: 40-50% of Tier 3 calls eliminated.

Tier 2 DSPs (P95 50-80ms): Include only if DSP region matches user region. Cross-region calls add 30-60ms network latency, making these DSPs non-competitive. Result: 15-25% of Tier 2 calls eliminated.

Tier 1 DSPs (P95 < 50ms): Always include with optimized timeout. Premium DSPs like Google AdX and Magnite have multi-region infrastructure, ensuring fast response regardless of user location.

Step 4: Assign Dynamic Timeouts

For each included DSP, set individualized timeout based on predictive timeout calculation. Fast DSPs get shorter timeouts (35-45ms), slower DSPs get longer timeouts (65-100ms), reducing average wait time.

Step 5: Outcome

Selected DSPs: 20-30 DSPs per request (down from 50 without optimization)

Timeout distribution:

Savings achieved:

Cost Impact Analysis:

Before optimization (baseline):

After optimization (with predictive timeouts):

Additional benefits:

    
    graph TB
    subgraph DSP_SERVICE["DSP Performance Tier Service"]
        METRICS[("Latency Metrics DB
P50/P95/P99 per DSP
Hourly rolling window")] CALC["Predictive Timeout Calculator
T = min P95 + 10ms, 100ms"] TIER["Tier Assignment Logic
Tier 1-4 based on P95"] CACHE[("Redis Cache
DSP performance data
1ms lookup latency")] METRICS --> CALC CALC --> TIER TIER --> CACHE end subgraph AD_FLOW["Ad Server Request Flow"] REQ["Ad Request
1M QPS"] LOOKUP["Lookup DSP Performance
from Redis cache"] FILTER["Filter DSPs
Apply tier rules"] FANOUT["Fan-out to Selected DSPs
With dynamic timeouts"] COLLECT["Collect Responses
Progressive auction"] REQ --> LOOKUP LOOKUP --> FILTER FILTER --> FANOUT FANOUT --> COLLECT end subgraph COST["Cost Impact"] BEFORE["Before: 50 DSPs
200KB egress per request
Baseline 100 percent"] AFTER["After: 27 DSPs
110KB egress per request
55 percent of baseline"] SAVINGS["Improvement:
45 percent egress reduction
25 ms latency improvement"] BEFORE -.-> AFTER AFTER -.-> SAVINGS end CACHE --> LOOKUP FANOUT --> METRICS style SAVINGS fill:#d4edda style FILTER fill:#fff3cd style TIER fill:#e1f5ff

Implementation Details:

1. DSP Performance Metrics Collection:

Track per-DSP metrics with hourly aggregation using time-series database (InfluxDB or Prometheus):

Latency Metrics:

Performance Metrics:

Each metric is tagged with DSP identifier and region for granular analysis and tier assignment.

2. Hourly Tier Recalculation:

Automated job runs every hour:

  1. Query last 1 hour of DSP latency data
  2. Calculate P95 for each DSP
  3. Compute predictive timeout: T = min(P95 + 10ms, 100ms)
  4. Assign tier based on P95:
    • Tier 1: P95 < 50ms
    • Tier 2: P95 50-80ms
    • Tier 3: P95 80-100ms
    • Tier 4: P95 > 100ms (exclude)
  5. Update Redis cache with new tier + timeout data
  6. Alert if Tier 1 DSP degrades to Tier 2/3
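
A sketch of that hourly job, assuming a redis-py client and an illustrative dsp:perf:<id> key layout:

    import json
    import redis   # redis-py client (assumed)

    TIER_LIMITS = [(50, 1), (80, 2), (100, 3)]    # (P95 upper bound in ms, tier); above 100ms -> Tier 4

    def recalculate_tiers(p95_by_dsp: dict[str, float], redis_url: str = "redis://localhost:6379") -> None:
        """Compute predictive timeout + tier per DSP and cache the result for the ad server."""
        r = redis.Redis.from_url(redis_url)
        for dsp_id, p95 in p95_by_dsp.items():
            tier = next((t for limit, t in TIER_LIMITS if p95 < limit), 4)
            timeout_ms = None if tier == 4 else min(p95 + 10, 100)   # T = min(P95 + 10ms, 100ms)
            r.set(f"dsp:perf:{dsp_id}",
                  json.dumps({"tier": tier, "timeout_ms": timeout_ms, "p95_ms": p95}))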

3. Ad Server Integration:

Ad Server fetches DSP performance data via REST API endpoint. For a request from US-East region, the service returns current performance data for all DSPs:

Example DSP Performance Data (US-East Region):

| DSP | Tier | Predictive Timeout | P95 Latency | Response Rate | Region | Include? |
|---|---|---|---|---|---|---|
| Google AdX | 1 | 45ms | 35ms | 95% | Global | Yes (Always) |
| Regional DSP A | 2 | 38ms | 28ms | 92% | US-East | Yes (Same region) |
| Regional DSP B | 2 | 42ms | 32ms | 88% | EU-West | No (Cross-region) |
| Slow DSP | 4 | N/A | 145ms | 15% | US-East | No (Excluded) |

Data Freshness: Performance data updated hourly, cached timestamp indicates last recalculation (e.g., 2025-11-19 14:00:00 UTC).

Ad Server Decision Logic:

4. Monitoring & Alerting:

Track cost optimization effectiveness:

Metrics:

Alerts:

5. A/B Testing Impact:

Validate cost savings without revenue loss:

Test setup:

Metrics tracked:

Expected results:

Trade-offs Accepted:

  1. Reduced DSP participation: 50 → 27 DSPs per request

    • Mitigation: Tier 1 premium DSPs (Google AdX, Magnite) always included
    • Impact: Only low-performing DSPs excluded
  2. Complexity: Additional service to maintain

    • Justification: 45% egress cost savings significantly exceeds incremental maintenance overhead
    • Operational overhead: Minimal (automated tier calculation, 1-2 days/month monitoring)
  3. False exclusions during DSP recovery: If DSP was slow for 1 hour but recovers, stays excluded until next hourly update

    • Mitigation: Consider 15-minute recalculation window for Tier 1 DSPs
    • Impact: Minimal (most DSP performance is stable hour-to-hour)

ROI Analysis:

Investment:

Benefits:

Conclusion:

Predictive DSP timeouts with tier-based filtering is a high-impact, low-risk optimization that:

This optimization transforms egress bandwidth from the largest variable operational cost to a manageable, optimized expense.


ML Inference Pipeline

Feature Engineering Architecture

Machine learning for CTR prediction requires real-time feature computation. Features fall into four categories, ordered by signal availability (most reliable first):

  1. Contextual features (always available): Page URL/content, device type, time of day, geo-IP location, referrer, session depth. These are the primary signals when user identity is unavailable (40-60% of mobile traffic due to ATT/Privacy Sandbox).
  2. Static features (pre-computed, stored in cache): User demographics, advertiser account info, historical campaign performance - requires stable user_id
  3. Real-time features (computed on request): Current session behavior, recently viewed categories, cart contents
  4. Aggregated features (streaming aggregations): User’s last 7-day engagement rate, advertiser’s hourly budget pace, category-level CTR trends

Why contextual features are first-class:

Traditional ML pipelines treat contextual signals as “fallback” features. This is backwards in 2024/2025:

Our feature pipeline computes contextual features first, then enriches with identity-based features when available.

The challenge is computing these features within our latency budget while maintaining consistency.

Technology Selection: Event Streaming Platform

Alright, before I even think about stream processing frameworks, I need to pick the event streaming backbone. This is one of those decisions where I went down a rabbit hole for days. Here’s what I looked at:

| Technology | Throughput/Partition | Latency (p99) | Durability | Ordering | Scalability |
|---|---|---|---|---|---|
| Kafka | 100MB/sec | 5-15ms | Disk-based replication | Per-partition | Horizontal (add brokers/partitions) |
| Pulsar | 80MB/sec | 10-20ms | BookKeeper (distributed log) | Per-partition | Horizontal (separate compute/storage) |
| RabbitMQ | 20MB/sec | 5-10ms | Optional persistence | Per-queue | Vertical (limited) |
| AWS Kinesis | 1MB/sec/shard | 200-500ms | S3-backed | Per-shard | Manual shard management |

Decision: Kafka

Rationale:

While Pulsar offers elegant storage/compute separation, Kafka’s ecosystem maturity and operational tooling provide better production support for this scale.

Partitioning strategy:

Partition count: 100 partitions = 1,000 events/sec per partition (100K total throughput)

Partition key: hash(user_id) % 100

Kafka guarantees ordering within a partition, not across partitions. User-keyed partitioning ensures causally-related events (same user’s journey) stay ordered.
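
A sketch of the producer side with the kafka-python client; keying by user_id lets the default partitioner apply hash(user_id) % partitions (broker address and topic name are placeholders):

    import json
    from kafka import KafkaProducer   # kafka-python client (assumed)

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                        # placeholder broker address
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_user_event(user_id: str, event: dict) -> None:
        """All events for one user land on the same partition, preserving their order."""
        producer.send("user-events", key=user_id, value=event)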

Cost comparison: Self-hosted Kafka (~1-2% of infrastructure baseline at scale) is significantly cheaper than AWS Kinesis at high sustained throughput (20-50× cost difference at billions of events/month). Managed services trade cost for operational simplicity.

Note: Kafka’s cost advantage scales with throughput volume - at lower volumes, managed streaming services may be more cost-effective when factoring in operational overhead.

Technology Selection: Stream Processing

Stream Processing Frameworks:

| Technology | Latency | Throughput | State Management | Exactly-Once | Deployment Model | Ops Complexity |
|---|---|---|---|---|---|---|
| Kafka Streams | <50ms | 800K events/sec | Local RocksDB | Yes (transactions) | Library (embedded) | Low |
| Flink | <100ms | 1M events/sec | Distributed snapshots | Yes (Chandy-Lamport) | Separate cluster | Medium |
| Spark Streaming | ~500ms | 500K events/sec | Micro-batching | Yes (WAL) | Separate cluster | Medium |
| Storm | <10ms | 300K events/sec | Manual | No (at-least-once) | Separate cluster | High |

Decision: Kafka Streams (for simple aggregations) + Flink (for complex CEP)

Initial recommendation: Kafka Streams for most use cases

For this architecture’s primary use case - windowed aggregations for feature engineering - Kafka Streams is simpler:

When to use Flink instead:

Mathematical justification:

For windowed aggregation with window size \(W\) and event rate \(\lambda\):

$$state\_size = \lambda \times W \times event\_size$$

Example: 100K events/sec, 60s window, 1KB/event → ~6GB state per operator.

Kafka Streams: 6GB state stored locally in RocksDB per instance. With 10 app instances partitioning load, that’s 600MB per instance - easily manageable.

Trade-off accepted: Start with Kafka Streams for operational simplicity. Migrate specific pipelines to Flink if/when complex CEP patterns needed (e.g., sophisticated fraud detection requiring temporal pattern matching).

Batch Processing Framework:

| Technology | Processing Speed | Fault Tolerance | Memory Usage | Ecosystem |
|---|---|---|---|---|
| Spark | Fast (in-memory) | Lineage-based | High (RAM-heavy) | Rich (MLlib, SQL) |
| MapReduce | Slow (disk I/O) | Task restart | Low | Legacy |
| Dask | Fast (lazy eval) | Task graph | Medium | Python-native |

Decision: Spark

Feature Store Technology:

| Technology | Serving Latency | Feature Freshness | Online/Offline | Vendor |
|---|---|---|---|---|
| Tecton | <10ms (p99) | 100ms | Both | SaaS |
| Feast | ~15ms | ~1s | Both | Open-source (no commercial backing since 2023) |
| Hopsworks | ~20ms | ~5s | Both | Open-source/managed |
| Custom (Redis) | ~5ms | Manual | Online only | Self-built |

Note on Latency Comparisons: Serving latencies vary significantly by configuration (online store choice, feature complexity, deployment architecture). The figures shown represent typical ranges observed in production deployments, but actual performance depends on workload characteristics and infrastructure choices.

Decision: Tecton (with fallback to custom Redis)

Cost analysis:

Custom solution:

Managed feature store (Tecton/Databricks): SaaS fee ≈ 10-15% of one engineer FTE/year (consumption-based pricing varies by usage, contract, and scale)

Decision: Managed feature store is 5-8× cheaper in year one (avoids engineering cost), plus faster time-to-market (weeks vs months). Custom solution only makes sense at massive scale or with unique requirements managed solutions can’t support. Note that Tecton uses consumption-based pricing (platform fee + per-credit costs), so actual costs scale with usage.

1. Real-Time Features (computed per request):

2. Near-Real-Time Features (pre-computed, cache TTL ~10s):

3. Batch Features (pre-computed daily):

    
    graph TB
    subgraph "Real-Time Feature Pipeline"
        REQ[Ad Request] --> PARSE[Request Parser]
        PARSE --> CONTEXT[Context Features<br/>time, location, device<br/>Latency: 5ms]
        PARSE --> SESSION[Session Features<br/>user actions<br/>Latency: 10ms]
    end

    subgraph "Feature Store"
        CONTEXT --> MERGE[Feature Vector Assembly]
        SESSION --> MERGE
        REDIS_RT[(Redis<br/>Near-RT Features<br/>TTL: 10s)] --> MERGE
        REDIS_BATCH[(Redis<br/>Batch Features<br/>TTL: 24h)] --> MERGE
    end

    subgraph "Stream Processing"
        EVENTS[User Events<br/>clicks, views] --> KAFKA[Kafka]
        KAFKA --> FLINK[Kafka Streams<br/>Windowed Aggregation]
        FLINK --> REDIS_RT
    end

    subgraph "Batch Processing"
        S3[S3 Data Lake] --> SPARK[Spark Jobs<br/>Daily]
        SPARK --> FEATURE_GEN[Feature Generation]
        FEATURE_GEN --> REDIS_BATCH
    end

    MERGE --> INFERENCE[ML Inference<br/>TensorFlow Serving<br/>Latency: 40ms]
    INFERENCE --> PREDICTION[CTR Prediction<br/>0.0 - 1.0]

    classDef rt fill:#ffe0e0,stroke:#cc0000
    classDef batch fill:#e0e0ff,stroke:#0000cc
    classDef store fill:#e0ffe0,stroke:#00cc00
    class REQ,PARSE,CONTEXT,SESSION rt
    class S3,SPARK,FEATURE_GEN,REDIS_BATCH batch
    class REDIS_RT,MERGE,INFERENCE store

Feature Vector Construction

For each ad impression, construct feature vector \(\mathbf{x} \in \mathbb{R}^n\):

$$\mathbf{x} = [\mathbf{x}_{user}, \mathbf{x}_{ad}, \mathbf{x}_{context}, \mathbf{x}_{cross}]$$

User Features \(\mathbf{x}_{user} \in \mathbb{R}^{50}\):

Ad Features \(\mathbf{x}_{ad} \in \mathbb{R}^{30}\):

Context Features \(\mathbf{x}_{context} \in \mathbb{R}^{20}\):

Cross Features \(\mathbf{x}_{cross} \in \mathbb{R}^{50}\):

Total dimensionality: 150 features.
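
Assembly is a straightforward concatenation of the four groups; a sketch with NumPy:

    import numpy as np

    def assemble_feature_vector(user_feats, ad_feats, context_feats, cross_feats) -> np.ndarray:
        """Concatenate the four feature groups into x in R^150 (50 + 30 + 20 + 50)."""
        x = np.concatenate([
            np.asarray(user_feats, dtype=np.float32),      # 50 user features
            np.asarray(ad_feats, dtype=np.float32),        # 30 ad features
            np.asarray(context_feats, dtype=np.float32),   # 20 context features
            np.asarray(cross_feats, dtype=np.float32),     # 50 cross features
        ])
        assert x.shape == (150,), f"unexpected dimensionality: {x.shape}"
        return x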

Model Architecture: Gradient Boosted Trees vs. Neural Networks

Technology Selection: ML Model Architecture

Comparative Analysis:

| Criterion | GBDT (LightGBM/XGBoost) | Deep Neural Network | Factorization Machines |
|---|---|---|---|
| Inference Latency | 5-10ms (CPU) | 20-40ms (GPU required) | 3-5ms (CPU) |
| Training Time | 1-2 hours (daily) | 6-12 hours (daily) | 30min-1hour |
| Data Efficiency | Good (100K+ samples) | Requires 10M+ samples | Good (100K+ samples) |
| Feature Engineering | Manual required | Automatic interactions | Automatic 2nd-order |
| Interpretability | High (feature importance) | Low (black box) | Medium (learned weights) |
| Memory Footprint | 100-500MB | 1-5GB | 50-200MB |
| Categorical Features | Native support | Embedding layers needed | Native support |

Latency Budget Analysis:

Recall: ML inference budget = 40ms (out of 150ms total)

$$T_{ml} = T_{feature} + T_{inference} + T_{overhead}$$

Accuracy Comparison:

CTR prediction is fundamentally constrained by signal sparsity - user click rates are 0.1-2% in ads (industry benchmark: display 0.5%, video 1.8%), creating severe class imbalance. Model performance expectations:

AUC improvements translate directly to revenue: at 100M daily impressions, a 1% AUC improvement (~0.5-1% CTR lift) generates significant monthly revenue gain proportional to baseline CPM and monthly volume.

Decision Matrix (Infrastructure Costs Only):

$$Value_{infra} = \alpha \times Accuracy - \beta \times Latency - \gamma_{infra} \times OpsCost$$

With \(\alpha = 100\) (revenue impact), \(\beta = 50\) (user experience), \(\gamma_{infra} = 10\) (infrastructure only):

FM has the highest infrastructure value, but this analysis omits operational complexity.

Production Decision: GBDT

Operational factors favor GBDT despite FM’s infrastructure advantage:

  1. Ecosystem maturity: LightGBM/XGBoost have 10× more production deployments - easier hiring, better tooling, more community support
  2. Debuggability: SHAP values enable root cause analysis when CTR drops unexpectedly - FM provides limited interpretability
  3. Incremental learning: GBDT supports online learning - FM requires full retraining
  4. Production risk: Deploying less-common FM technology introduces operational burden that outweighs the 16-point mathematical advantage

Trade-off: Accept 5ms extra latency and 2-3% AUC gap for operational simplicity and team velocity.

Architectural Driver: Latency - GBDT’s 20ms total inference time (including feature lookup) fits within our 40ms ML budget. We rejected DNNs despite their 2-3% accuracy advantage because their 45ms latency would push the ML path to 75ms, reducing our variance buffer significantly.

Trade-off accepted: 5ms extra latency (GBDT vs FM) for operational benefits.

Option 1: Gradient Boosted Decision Trees (GBDT)

Advantages:

Disadvantages:

Typical hyperparameters: 100 trees, depth 7, learning rate 0.05, with feature/data sampling for regularization. Inference latency scales linearly with tree count (~8ms for 100 trees).
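
Those hyperparameters map directly onto LightGBM's scikit-learn API; the subsampling ratios below are illustrative assumptions:

    import lightgbm as lgb

    ctr_model = lgb.LGBMClassifier(
        n_estimators=100,        # 100 trees
        max_depth=7,             # depth 7
        learning_rate=0.05,
        subsample=0.8,           # row sampling for regularization (assumed value)
        colsample_bytree=0.8,    # feature sampling for regularization (assumed value)
        objective="binary",
    )

    # ctr_model.fit(X_train, y_train)            # daily offline training job
    # ctr = ctr_model.predict_proba(X)[:, 1]     # P(click) at serving time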

Option 2: Deep Neural Network (DNN)

Advantages:

Disadvantages:

Typical architecture: Embedding layers for categoricals, followed by 3 dense layers (256→128→64 units with ReLU, 0.3 dropout), sigmoid output. Trained via binary cross-entropy with Adam optimizer. Inference latency ~20-40ms depending on batch size and hardware (GPU vs CPU).
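
A Keras sketch of that architecture, taking the already-assembled 150-dim dense vector as input (embedding layers for raw categorical IDs are omitted for brevity):

    import tensorflow as tf

    def build_ctr_dnn(num_features: int = 150) -> tf.keras.Model:
        """3 dense layers (256 -> 128 -> 64, ReLU, 0.3 dropout) with a sigmoid output."""
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(num_features,)),
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(1, activation="sigmoid"),    # P(click)
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
        return model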

2025 Reality Check: DL is Increasingly Viable

The “DNN is too slow” argument is increasingly outdated. Modern inference optimization techniques make deep learning viable even within strict latency budgets:

Evolution Path: Two-Pass Ranking

The industry standard at scale (Google, Meta, TikTok) is a two-stage ranking architecture:

  1. Stage 1 - Candidate Generation (GBDT, 5-10ms): Fast model reduces millions of ads → 50-200 candidates. This is where our GBDT excels.
  2. Stage 2 - Reranking (Lightweight DL, 10-15ms): More expressive model scores the small candidate set. Distilled neural network captures complex feature interactions.

Why start with GBDT-only:

Our Day-1 GBDT approach is pragmatic, not a permanent ceiling:

Planned evolution (6-12 months post-launch):

The Cold Start Problem: Serving Ads Without Historical Data

The Challenge:

Your CTR prediction models depend on historical user behavior, advertiser performance, and engagement patterns. But what happens when:

Serving random ads would devastate revenue and user experience. You need a multi-tier fallback strategy that gracefully degrades from personalized to increasingly generic predictions.

Multi-Tier Cold Start Strategy:

The key architectural principle: graceful degradation from personalized to generic predictions as data availability decreases. Each tier represents a fallback when insufficient data exists for the previous tier.

Quick Comparison:

| Tier | Data Threshold | Strategy | Relative Accuracy |
|---|---|---|---|
| 1 | >100 impressions | Personalized ML | Highest (baseline) |
| 2 | 10-100 impressions | Cohort-based | -10-15% vs Tier 1 |
| 3 | <10 impressions | Demographic avg | -15-25% vs Tier 1 |
| 4 | No data | Category priors | -20-30% vs Tier 1 |

Tier 1: Rich User History (>100 impressions)

Tier 2: User Cohort (10-100 impressions)

Tier 3: Broad Segment (<10 impressions)

Tier 4: Global Baseline (No user data)

Accuracy Trade-off Pattern:

Accuracy degrades as you move down tiers, but the relative pattern matters more than exact numbers:

$$Accuracy_{\text{(Tier N)}} < Accuracy_{\text{(Tier N-1)}}$$

Typical degradation observed in production CTR systems (based on industry reports from Meta, Google, Twitter ad platforms):

Total accuracy range: Tier 1 might achieve AUC 0.78-0.82, while Tier 4 drops to 0.60-0.68. Exact values depend heavily on:

Key insight: Even degraded predictions (Tier 3-4) significantly outperform random serving (AUC 0.50), which would be catastrophic for revenue.

Mathematical Model - ε-greedy Exploration:

For new users, balance exploitation (show known high-CTR ads) vs exploration (gather data for future personalization):

$$a_t = \begin{cases} \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$

where:

Adaptive exploration rate:

$$\epsilon(n) = \frac{\epsilon_0}{1 + \log(n + 1)}$$

where \(n\) is the number of impressions served to this user. New users get \(\epsilon = 0.10\) (10% random exploration), converging to \(\epsilon = 0.02\) after 1000 impressions.
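
A direct translation of the policy (base-10 log is assumed, which matches the stated convergence to roughly 0.02 after 1000 impressions):

    import math
    import random

    EPSILON_0 = 0.10   # exploration rate for brand-new users

    def epsilon(n_impressions: int) -> float:
        """Adaptive exploration rate: epsilon(n) = epsilon_0 / (1 + log10(n + 1))."""
        return EPSILON_0 / (1 + math.log10(n_impressions + 1))

    def choose_ad(q_values: dict[str, float], n_impressions: int) -> str:
        """Epsilon-greedy: exploit the best-known ad, explore a random one with probability epsilon."""
        if random.random() < epsilon(n_impressions):
            return random.choice(list(q_values))       # explore
        return max(q_values, key=q_values.get)         # exploit: argmax_a Q(a)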

Advertiser Bootstrapping:

New advertisers face similar challenges - their ads have no performance history. Strategy:

  1. Minimum spend requirement: Require minimum spend threshold before enabling full optimization
  2. Broad targeting phase: First 10K impressions use broad targeting to gather signal across demographics
  3. Thompson Sampling: Bayesian approach for bid optimization during bootstrap phase

$$P(\theta | D) \propto P(D | \theta) \times P(\theta)$$

where \(\theta\) = true CTR, \(D\) = observed clicks/impressions. Sample from posterior to balance exploration/exploitation.
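
A minimal Thompson Sampling sketch with a Beta posterior; the prior CTR and pseudo-count strength are illustrative assumptions, not values from this post:

    import numpy as np

    def thompson_sample_ctr(clicks: int, impressions: int,
                            prior_ctr: float = 0.02, prior_strength: float = 100.0) -> float:
        """Sample a plausible CTR from the Beta posterior P(theta | D) ~ P(D | theta) * P(theta)."""
        alpha = prior_ctr * prior_strength + clicks                         # prior + observed successes
        beta = (1 - prior_ctr) * prior_strength + (impressions - clicks)    # prior + observed failures
        return float(np.random.beta(alpha, beta))

    # Bootstrap bid optimization: score each new ad by a posterior sample and serve the
    # highest one, which naturally shifts from exploration to exploitation as data accrues.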

Platform Launch (Day 1) Scenario:

When launching the entire platform with zero historical data:

  1. Pre-seed with industry benchmarks: Use published CTR averages by vertical (e-commerce: 2%, finance: 0.5%, gaming: 5%)
  2. Synthetic data generation: Create simulated user profiles and engagement patterns for initial model training
  3. Rapid learning mode: First 48 hours run at \(\epsilon = 0.20\) (high exploration) to quickly gather training data
  4. Cohort velocity tracking: Monitor how quickly each cohort accumulates usable signal

$$T_{bootstrap} = \frac{N_{min}}{R_{impressions} \times P_{engagement}}$$

where:

Example: To gather 100 clicks at 2% CTR with 10 impressions/day per user: \(T = \frac{100}{10 \times 0.02} = 500\) days per user. Solution: aggregate across cohorts to reach critical mass faster.

Trade-off Analysis:

Cold start strategy impacts revenue during bootstrap period:

Launch decision: Accept 65% initial revenue rather than delaying for data that can only be gathered post-launch.

Signal Loss vs Cold Start: The Privacy-Era Challenge

Cold start (new users with no history) and signal loss (returning users we can’t identify) require different strategies. Signal loss is increasingly common due to privacy regulations:

| Scenario | Cause | Available Signals | Strategy |
|----------|-------|-------------------|----------|
| Cold Start | New user, first visit | Device, geo, time + page context | Exploration + cohort fallback |
| Signal Loss | ATT opt-out, cookie blocked | Device, geo, time + page context | Contextual-only bidding |
| Partial Signal | Cross-device, new browser | Some history, fragmented | Probabilistic identity matching |

Key difference: Cold start users will eventually accumulate history. Signal loss users never will - they remain anonymous indefinitely.

Bidding Strategy Without User Identity:

When user_id is unavailable (40-60% of mobile traffic), the bidding strategy shifts entirely to contextual signals:

1. Contextual Bid Adjustment:

$$eCPM_{contextual} = BaseCPM \times ContextMultiplier \times QualityScore$$

Where ContextMultiplier is derived from:

2. Publisher-Level Optimization:

Without user identity, optimize at publisher level instead:

3. Revenue Expectations:

Contextual-only inventory achieves:

Trade-off accepted: Lower revenue per impression is better than zero revenue from blocked/unavailable users. The 40-60% of traffic without identity still represents significant revenue at scale.
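
Putting the formula above into code, a minimal sketch of contextual-only pricing; the multiplier tables are illustrative assumptions, not learned values:

```python
# Contextual-only bid pricing: eCPM = BaseCPM x ContextMultiplier x QualityScore.
# Multiplier tables below are illustrative; a real system learns them from
# contextual performance data.
CATEGORY_MULTIPLIER = {"finance": 1.4, "sports": 1.0, "gaming": 0.9}
DEVICE_MULTIPLIER = {"ios": 1.2, "android": 1.0, "desktop": 1.1}

def contextual_ecpm(base_cpm: float, page_category: str, device: str,
                    quality_score: float) -> float:
    """Price an impression from page and device signals only (no user identity)."""
    context_multiplier = (CATEGORY_MULTIPLIER.get(page_category, 1.0)
                          * DEVICE_MULTIPLIER.get(device, 1.0))
    return base_cpm * context_multiplier * quality_score

print(contextual_ecpm(2.50, "finance", "ios", 0.85))  # ~3.57
```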

Model Serving Infrastructure

Technology Selection: Model Serving

Model Serving Platforms:

| Platform | Latency (p99) | Throughput | Batching | GPU Support | Ops Complexity |
|----------|---------------|------------|----------|-------------|----------------|
| TensorFlow Serving | 30-40ms | 1K req/sec | Auto | Excellent | Medium |
| TorchServe | 35-45ms | 800 req/sec | Auto | Good | Medium |
| NVIDIA Triton | 25-35ms | 1.5K req/sec | Auto | Excellent | High |
| Seldon Core | 40-50ms | 600 req/sec | Manual | Good | High (K8s) |
| Custom Flask/FastAPI | 50-100ms | 200 req/sec | Manual | Poor | Low |

Decision: TensorFlow Serving (primary) with NVIDIA Triton (evaluation)

Rationale:

NVIDIA Triton consideration: roughly 20% lower latency and support for heterogeneous model formats (TF, PyTorch, ONNX), but the added operational complexity is not justified unless a multi-framework requirement emerges.

Technology Selection: Container Orchestration

Container orchestration must handle GPU scheduling for ML workloads, scale appropriately, and avoid cloud vendor lock-in. Technology comparison:

| Technology | Learning Curve | Ecosystem | Auto-scaling | Multi-cloud | Networking |
|------------|----------------|-----------|--------------|-------------|------------|
| Kubernetes | Steep | Massive (CNCF) | HPA, VPA, Cluster Autoscaler | Yes (portable) | Advanced (CNI, Service Mesh) |
| AWS ECS | Medium | AWS-native | Target tracking, step scaling | No (AWS-only) | AWS VPC |
| Docker Swarm | Easy | Limited | Basic (replicas) | Yes (portable) | Overlay networking |
| Nomad | Medium | HashiCorp ecosystem | Auto-scaling plugins | Yes (portable) | Consul integration |

Decision: Kubernetes

Architectural Driver: Availability - Kubernetes auto-scaling (HPA) and self-healing prevent capacity exhaustion during traffic spikes. GPU node affinity ensures ML inference survives node failures by automatically rescheduling pods.

Rationale:

While Kubernetes introduces operational complexity, GPU orchestration and multi-cloud requirements justify the investment.

Kubernetes-specific features critical for ads platform:

  1. Horizontal Pod Autoscaler (HPA) with Custom Metrics:

    CPU/memory metrics are lagging indicators for this workload - ML inference is GPU-bound (CPU at 20% while GPU saturated), and CPU spikes occur after queue buildup. Use workload-specific metrics instead:

    Scaling formula: \(\text{desired replicas} = \lceil \text{current replicas} \times \frac{\text{current metric}}{\text{target metric}} \rceil\)

    Custom metrics:

    • Inference queue depth: Target 100 requests (current: 250 → scale 10 to 25 pods)
    • Request latency p99: Target 80ms within 100ms budget
    • Cache hit rate: Scale cache tier when <85%

    Accounting for provisioning delays:

    $$N_{buffer} = \frac{dQ}{dt} \times (T_{provision} + T_{warmup})$$

    where \(\frac{dQ}{dt}\) = traffic growth rate, \(T_{provision}\) = node startup (30-40s for modern GPU instances with pre-warmed images), \(T_{warmup}\) = model loading (10-15s with model streaming).

    Example: Traffic growing at 10K QPS/sec with 40s total startup needs roughly 400K QPS of spare headroom - about 400 pods at ~1K req/sec each - so the scale-out threshold must be lowered from the nominal 90% by the buffer's share of current capacity to avoid overload during provisioning (see the sketch after this list). Trade-off: GPU node startup latency forces earlier scaling at a higher idle-capacity cost.

  2. GPU Node Affinity:

    • Schedule ML inference pods only on GPU nodes using node selectors
    • Prevents GPU resource waste by isolating GPU workloads
  3. StatefulSets for Stateful Services:

    • Deploy CockroachDB, Redis clusters with stable network identities
    • Ordered pod creation/deletion (e.g., CockroachDB region placement first)
  4. Istio Service Mesh:

    • Traffic splitting: A/B test new model versions (90% traffic to v1, 10% to v2)
    • Circuit breaking: Automatic failure detection, failover to backup services
    • Observability: Automatic trace injection, latency histograms per service
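
As referenced in the HPA discussion above, a minimal sketch of the provisioning-buffer calculation; the ~1K req/sec per pod and the 4,000-pod current capacity are illustrative assumptions:

```python
import math

def buffer_pods(traffic_growth_qps_per_sec: float,
                provision_sec: float, warmup_sec: float,
                qps_per_pod: float) -> int:
    """N_buffer = dQ/dt * (T_provision + T_warmup), converted into pods.
    This capacity must already be spare when scale-out starts, because new
    GPU nodes arrive too late to absorb traffic that grows in the meantime."""
    extra_qps = traffic_growth_qps_per_sec * (provision_sec + warmup_sec)
    return math.ceil(extra_qps / qps_per_pod)

def scale_out_threshold(nominal_threshold: float, buffer: int, capacity_pods: int) -> float:
    """Lower the utilization trigger by the buffer's share of current capacity."""
    return nominal_threshold - buffer / capacity_pods

pods = buffer_pods(10_000, 30, 10, qps_per_pod=1_000)              # ~400 pods
print(pods, scale_out_threshold(0.90, pods, capacity_pods=4_000))  # 400, 0.80
```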

Why not AWS ECS?

ECS advantages (managed, lower cost) offset by:

Why not Docker Swarm:

The cost trade-off (rough comparison for ~100 nodes):

Kubernetes (managed service like EKS):

AWS ECS (Fargate):

So why might I still choose Kubernetes despite slightly higher costs?

The GPU support and multi-cloud portability matter for this use case. ECS Fargate has limited GPU support, and I prefer not being locked into AWS. The premium (perhaps 10-20% higher monthly costs) acts as insurance against vendor lock-in and provides proper GPU scheduling for ML workloads.

That said, your calculation might differ - ECS could make sense if you’re committed to AWS and don’t need GPU orchestration.

Deployment Strategy Comparison:

| Strategy | Cold Start | Auto-scaling | Cost | Reliability |
|----------|------------|--------------|------|-------------|
| Dedicated instances | 0ms (always warm) | Manual | High (24/7) | High |
| Kubernetes pods | 30-60s | Auto (HPA) | Medium | Medium |
| Serverless (Lambda) | 5-10s | Instant | Low (pay-per-use) | Low (cold starts) |

Decision: Dedicated GPU instances with Kubernetes orchestration

Cost-benefit calculation:

Option A: Dedicated T4 GPUs (always-on)

Option B: Kubernetes with auto-scaling (3 min, 10 max instances)

Option C: AWS Lambda with GPU

Winner: Option B (Kubernetes with auto-scaling) - balances cost and performance.

To meet sub-40ms latency requirements, use TensorFlow Serving with optimizations:

1. Request Batching

Goal: Maximize GPU utilization by processing multiple predictions simultaneously, trading a small amount of latency for significantly higher throughput.

Approach:

How to determine values:

  1. Measure single-request inference latency (baseline)
  2. Incrementally increase batch size and measure both throughput and total latency
  3. Stop when latency approaches your budget (e.g., if you have 40ms total budget and queuing adds 10ms, ensure inference completes in <30ms)
  4. Consider dynamic batching that adjusts based on queue depth
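
A minimal sketch of the dynamic batching behavior described above, assuming a hypothetical `model.predict_batch` call and a per-request reply callback; TensorFlow Serving implements this natively through its batching configuration, so this is purely illustrative:

```python
import queue
import time

MAX_BATCH_SIZE = 32       # tune upward until latency nears the budget
BATCH_TIMEOUT_MS = 5      # max time a request waits for batch-mates

def batching_loop(request_queue: "queue.Queue", model) -> None:
    """Collect requests until the batch is full or the timeout expires,
    then run one batched inference call on the accelerator."""
    while True:
        batch = [request_queue.get()]                 # block until 1 request arrives
        deadline = time.monotonic() + BATCH_TIMEOUT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        features = [req["features"] for req in batch]
        predictions = model.predict_batch(features)   # one call for the whole batch
        for req, pred in zip(batch, predictions):
            req["reply"](pred)                        # return each prediction to its caller
```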

2. Model Quantization

Convert FP32 → INT8:

Mathematical Transformation:

For weight matrix \(W \in \mathbb{R}^{m \times n}\) with FP32 precision:

$$W_{int8}[i,j] = \text{round}\left(\frac{W[i,j] - W_{min}}{W_{max} - W_{min}} \times 255\right)$$

Inference: $$y = W_{int8} \cdot x_{int8} \times scale + zero\_point$$
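
A minimal NumPy sketch of that affine transform; real deployments would use TensorFlow Lite or TensorRT post-training quantization, so this only illustrates the arithmetic:

```python
import numpy as np

def quantize(w: np.ndarray):
    """Min/max affine quantization of an FP32 weight matrix to 8-bit."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0
    w_q = np.round((w - w_min) / scale).astype(np.uint8)
    return w_q, scale, w_min                       # zero-point folded into w_min

def dequantize(w_q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    return w_q.astype(np.float32) * scale + w_min

w = np.random.randn(256, 64).astype(np.float32)
w_q, scale, w_min = quantize(w)
print(np.abs(w - dequantize(w_q, scale, w_min)).max())  # max error <= scale/2
```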

Benefits:

3. CPU-Based GBDT Inference: Architecture Decision

Why CPU-Only for Day 1 GBDT:

GBDT models (LightGBM/XGBoost) are CPU-optimized for inference workloads. External research confirms CPU achieves 10-20ms inference latency for GBDT models at production scale, well within our 40ms budget:

Throughput and Latency Analysis (GBDT-specific):

| Compute Type | Throughput (GBDT) | Latency | Infrastructure Cost | Operational Complexity |
|--------------|-------------------|---------|---------------------|------------------------|
| CPU inference (optimized) | 50-200 req/sec per core | 10-20ms | Baseline (1.0×) | Low (standard deployment) |
| GPU inference (T4) | 1,000-1,500 req/sec per GPU | 8-15ms | 1.5-2× CPU cost | Medium (GPU orchestration) |

Decision Rationale:

We chose CPU-only architecture for our Day 1 GBDT deployment:

Advantages (why CPU):

Trade-offs (what we give up):

Evolution Path: Adding DNN Reranking on CPU

Our Day-1 CPU architecture supports planned model evolution without infrastructure rebuild:

Phase 2 (6-12 months): Two-Stage Ranking on CPU

  1. Stage 1 - GBDT Candidate Generation (5-10ms):

    • Current architecture, reduce 10M ads → 200 top candidates
    • CPU-based, unchanged from Day 1
  2. Stage 2 - Distilled DNN Reranking (10-15ms):

Total two-stage latency: 5-10ms (GBDT) + 10-15ms (distilled DNN) = 15-25ms (within 40ms budget)
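
A minimal sketch of the two-stage flow, assuming a trained LightGBM booster and a small distilled reranker, both on CPU; the model objects and feature matrix are placeholders:

```python
import numpy as np
import lightgbm as lgb

def rank_ads(candidate_features: np.ndarray,
             gbdt: lgb.Booster, reranker, top_k: int = 200) -> np.ndarray:
    """Two-stage ranking: cheap GBDT pass over all candidates, then a
    distilled DNN rescoring of only the top-k survivors."""
    # Stage 1: GBDT candidate generation (5-10ms budget).
    coarse_scores = gbdt.predict(candidate_features)
    top_idx = np.argsort(coarse_scores)[::-1][:top_k]

    # Stage 2: distilled DNN reranking of the survivors (10-15ms budget).
    fine_scores = np.asarray(reranker.predict(candidate_features[top_idx])).ravel()

    # Candidate indices ordered by the reranker's score.
    return top_idx[np.argsort(fine_scores)[::-1]]
```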

Requirements for CPU-based DNN evolution:

What this evolution path gives up:

We are explicitly choosing to constrain model complexity to what runs efficiently on CPU. This means:

Why we accept these trade-offs:

At 1M QPS serving 400M DAU, our priorities are:

  1. Cost efficiency (CPU saves 30-40% infrastructure cost = millions annually)
  2. Operational stability (simpler infrastructure = fewer outages)
  3. Team velocity (standard deployment = faster iteration)

The 1-2% AUC ceiling we might hit in 12-18 months is an acceptable price for the operational and cost benefits we get today. We can revisit the GPU decision if/when model quality plateaus.

Alternative: When to choose GPU instead?

GPU makes sense for teams with different constraints:

For our use case (1M QPS, cost-sensitive, operationally focused), CPU is the pragmatic choice.

Feature Store: Tecton Architecture

Architectural Overview

Tecton implements a declarative feature platform with strict separation between definition (what features to compute) and execution (how to compute them). Critical for ads platforms: achieving sub-10ms p99 serving latency while maintaining 100ms feature freshness for streaming aggregations.

Key Architectural Decisions

1. Flink Integration Model

Critical distinction: Flink is external to Tecton, not a computation engine. Flink handles stateful stream preparation (deduplication, enrichment, cross-stream joins) upstream, publishing cleaned events to Kafka/Kinesis. Tecton’s engines (Spark Streaming or Rift) consume these pre-processed streams for feature computation.

Integration pattern:

    
    graph LR
        RAW["Raw Events<br/>clicks, impressions<br/>bid requests"]
        FLINK["Apache Flink<br/>Data Quality Layer<br/>Deduplication<br/>Enrichment<br/>Cross-stream joins"]
        KAFKA["Kafka/Kinesis<br/>Cleaned Events<br/>System Boundary"]
        STREAM["Tecton StreamSource<br/>Event Consumer"]
        COMPUTE["Feature Computation<br/>Rift or Spark Streaming<br/>Time windows<br/>Aggregations"]
        RAW --> FLINK
        FLINK --> KAFKA
        KAFKA --> STREAM
        STREAM --> COMPUTE
        style FLINK fill:#f0f0f0,stroke:#666,stroke-dasharray: 5 5
        style KAFKA fill:#fff3cd,stroke:#333,stroke-width:3px
        style STREAM fill:#e1f5ff
        style COMPUTE fill:#e1f5ff

This separation follows the “dbt for streams” pattern - Flink normalizes data infrastructure concerns (left of Kafka), Tecton handles ML-specific transformations (right of Kafka).

2. Computation Engine Selection

Tecton abstracts three engines behind a unified API:

| Engine | Throughput Threshold | Operational Complexity | Strategic Direction |
|--------|----------------------|------------------------|---------------------|
| Spark | Batch (TB-scale) | High (cluster management) | Mature, stable |
| Spark Streaming | >1K events/sec | High (Spark cluster + streaming semantics) | For high-throughput only |
| Rift | <1K events/sec | Low (managed, serverless) | Primary (GA 2025) |

Rift is Tecton’s strategic direction: purpose-built for feature engineering workloads, it eliminates Spark cluster overhead for the 80% use case. Most streaming features never reach the 1K events/sec threshold at which Spark Streaming’s complexity becomes justified.

3. Dual-Store Architecture

The offline/online store separation addresses fundamentally different access patterns:

Offline Store (S3 Parquet):

Online Store (Redis):

Why not a unified store? Columnar formats (Parquet) optimize analytical queries but introduce 100ms+ latency for point lookups. Key-value stores (Redis) can’t efficiently handle time-range scans. The dual-store pattern accepts storage duplication to optimize each access pattern independently.

4. Data Source Abstractions

Tecton’s source types map to different freshness/availability guarantees:

Architectural insight: RequestSource features bypass the online store entirely - computed per-request via Rift. This avoids cache invalidation complexity for contextual data (time-of-day, request headers) that changes per-request.

Feature Materialization Flow

For a streaming aggregation feature (e.g., “user’s 1-hour click rate”):

    
    graph TB
        KAFKA["Kafka Events<br/>user_id: 12345, event: click"]
        RIFT["Rift Engine<br/>Sliding Window Aggregation"]
        ONLINE[("Online Store<br/>Redis")]
        OFFLINE[("Offline Store<br/>S3 Parquet")]
        REQ_SERVE["Inference Request"]
        REQ_TRAIN["Training Query<br/>time range: 14 days"]
        RESP_SERVE["Response<br/>5ms p99"]
        RESP_TRAIN["Historical Data<br/>Point-in-time correct"]
        KAFKA -->|Stream Events| RIFT
        RIFT -->|OVERWRITE latest| ONLINE
        RIFT -->|APPEND timestamped| OFFLINE
        REQ_SERVE -->|Lookup user_id| ONLINE
        ONLINE -->|Return current features| RESP_SERVE
        REQ_TRAIN -->|Scan user_id + timestamps| OFFLINE
        OFFLINE -->|Return time-series| RESP_TRAIN
        style RIFT fill:#e1f5ff
        style ONLINE fill:#fff3cd
        style OFFLINE fill:#fff3cd
        style RESP_SERVE fill:#d4edda
        style RESP_TRAIN fill:#d4edda

Critical property: Both stores materialize from the same transformation definition (executed in Rift), guaranteeing training/serving consistency. The transformation runs once, writes to both stores atomically.

Performance Characteristics

Latency budget allocation (within 150ms total SLO):

Feature freshness guarantees:

Serving APIs: REST (HTTP/2), gRPC (lower protocol overhead), and SDK (testing/batch) all query the same online store - interface choice driven by client requirements, not architectural constraints.

Feature Classification and SLA:

Not all features are equal - different types have different freshness and failure characteristics:

| Feature Type | Examples | Freshness | Fallback on Failure |
|--------------|----------|-----------|---------------------|
| Stale (Pre-computed) | 7-day avg CTR, user segment | 1-5 min | Use 1-hour-old cache |
| Fresh (Contextual) | Time of day, device battery | Real-time | Compute locally (0ms) |
| Semi-Fresh | 1-hour CTR, session ad count | 30-60s | Use 24-hour avg |
| Static | Device model, OS version | Daily | Use defaults |

Distribution: 70% Stale, 20% Fresh (local), 8% Semi-Fresh, 2% Static

Feature Store SLA:

| Metric | Target | Rationale |
|--------|--------|-----------|
| Latency p99 | <10ms | Fits within 150ms total SLO |
| Availability | 99.9% | Matches platform SLA |
| Freshness | <60s for streaming | Balance accuracy vs ops complexity |
| Cache hit rate | >95% | Redis availability requirement |

Circuit Breaker Integration:

The Feature Store integrates with the circuit breaker system for graceful degradation:

| Service | Budget | Trip Threshold | Fallback | Revenue Impact |
|---------|--------|----------------|----------|----------------|
| Feature Store | 10ms | p99 > 15ms for 60s | Cold start features | -10% |

Cold Start Fallback Strategy:

When Feature Store fails/exceeds budget:

Normal features (35-50 from Redis):

Cold start features (8-12, local only):

Cold start ML model:

Failure Modes:

Mode 1: Individual cache misses (5-10%) - Use default values, -1-2% revenue

Mode 2: Partial Redis failure (30-50%) - Mixed normal + cold start, -4-6% revenue

Mode 3: Total Redis failure (100%) - All cold start, -10% revenue, P1 alert

Mode 4: Latency spike (p99 > 15ms) - Circuit trips, cold start, -10% revenue

Monitoring:

Metrics:

Alerts:

Build vs. Buy Economics

Custom implementation costs:

Managed Tecton:

Break-even: Year 1, managed is 5-8× cheaper (avoids engineering cost). Custom only justified at massive scale (>10B features/day) or unique requirements (specialized hardware, exotic data sources).

Integration Context

Feature Store sits on the critical path with strict latency requirements:

    
    graph LR
        AD_REQ["Ad Request<br/>100ms RTB timeout"]
        USER_PROF["User Profile Lookup<br/>10ms budget"]
        FEAT_STORE["Feature Store Lookup<br/>10ms budget<br/>Redis: 5ms read<br/>Assembly: 2ms<br/>Protocol: 3ms"]
        ML_INF["ML Inference<br/>40ms budget<br/>GBDT model"]
        AUCTION["Auction Logic<br/>10ms budget"]
        BID_RESP["Bid Response<br/>Total: 70ms<br/>Margin: 30ms"]
        AD_REQ --> USER_PROF
        USER_PROF --> FEAT_STORE
        FEAT_STORE --> ML_INF
        ML_INF --> AUCTION
        AUCTION --> BID_RESP
        style FEAT_STORE fill:#fff3cd
        style ML_INF fill:#e1f5ff
        style BID_RESP fill:#d4edda

Architectural constraint: Feature lookup must complete within 10ms to preserve 40ms ML inference budget. This eliminates database-backed stores (CockroachDB: 10-15ms p99) and necessitates in-memory key-value stores. Redis selected (5ms p99) over DynamoDB (8ms p99) for the tightest latency margin.
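
A minimal sketch of that bounded feature lookup, assuming features are stored as JSON blobs under hypothetical `features:{user_id}` keys; the hard socket timeout keeps the call inside the 10ms budget:

```python
import json
import redis

# 8ms socket timeout leaves headroom for assembly within the 10ms budget.
client = redis.Redis(host="feature-store.internal", port=6379,
                     socket_timeout=0.008)

def lookup_features(user_id: str, defaults: dict) -> dict:
    """Fetch pre-computed features; fall back to cold-start defaults on miss or failure."""
    try:
        raw = client.get(f"features:{user_id}")
        return json.loads(raw) if raw else dict(defaults)  # cache miss -> defaults
    except redis.RedisError:
        return dict(defaults)                              # timeout/failure -> cold start features
```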

The diagram below illustrates how features flow through Tecton’s architecture - from raw data ingestion through computation and storage, to serving ML inference. The system supports three parallel computation paths optimized for different data freshness requirements: batch (daily updates), streaming (sub-second updates), and real-time (computed per request).

    
    graph TB
        subgraph SOURCES["Data Sources"]
            S3[("S3/Snowflake<br/>Historical batch data")]
            KAFKA["Kafka/Kinesis<br/>Real-time event streams"]
            DB[("PostgreSQL/APIs<br/>Request-time data")]
        end
        subgraph COMPUTE["Feature Computation Paths"]
            BATCH["Path A: Batch Features<br/>Daily aggregations, user profiles<br/>Engine: Spark"]
            STREAM["Path B: Stream Features<br/>Time-window aggregations hourly<br/>Engine: Spark Streaming or Rift"]
            REALTIME["Path C: Real-Time Features<br/>Computed per request<br/>Engine: Rift"]
        end
        subgraph STORAGE["Feature Storage Layer"]
            OFFLINE[("Offline Store<br/>S3 Parquet<br/>For ML training")]
            ONLINE[("Online Store<br/>Redis 5ms p99<br/>For serving")]
        end
        subgraph SERVING["Serving APIs"]
            API["Tecton Feature Server<br/>REST API<br/>gRPC API<br/>Python/Java SDK"]
        end
        subgraph CONSUMERS["Consumers"]
            TRAIN["ML Training<br/>Batch jobs"]
            INFERENCE["ML Inference<br/>Real-time serving"]
        end
        S3 -->|Historical data| BATCH
        KAFKA -->|Event stream| STREAM
        DB -->|Request-time| REALTIME
        BATCH -->|Materialize| OFFLINE
        BATCH -->|Materialize| ONLINE
        STREAM -->|Materialize| ONLINE
        REALTIME -->|Compute on request| API
        OFFLINE -->|Training datasets| TRAIN
        ONLINE -->|Feature lookup| API
        API -->|Features| INFERENCE
        classDef source fill:#e1f5fe,stroke:#01579b,stroke-width:2px
        classDef compute fill:#fff3e0,stroke:#e65100,stroke-width:2px
        classDef storage fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
        classDef serving fill:#fce4ec,stroke:#880e4f,stroke-width:2px
        classDef consumer fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
        class S3,KAFKA,DB source
        class BATCH,STREAM,REALTIME compute
        class OFFLINE,ONLINE storage
        class API serving
        class TRAIN,INFERENCE consumer

Key architectural points:

  1. Three computation paths run independently based on data source characteristics:

    • Path A (Batch): Processes historical data daily for features like “user’s average CTR over 30 days”
    • Path B (Stream): Processes real-time events for features like “clicks in last 1 hour”
    • Path C (Real-Time): Computes features on-demand per request for context-specific features
  2. Engine alternatives (not separate systems):

    • Batch path uses Spark for distributed processing
    • Stream path uses Spark Streaming OR Rift (Tecton’s proprietary engine - choice depends on scale and latency requirements)
    • Real-time path uses Rift for sub-10ms computation
  3. Serving API consolidation: Single Feature Server exposes three API options (REST, gRPC, SDK) - these are different interfaces to the same service, not separate deployments

  4. Dual storage purpose:

    • Offline Store: Provides point-in-time consistent training datasets for ML model training
    • Online Store: Optimized for low-latency feature lookup during real-time inference (<10ms p99)

Feature Freshness Guarantees:

Latency SLA: $$P(\text{FeatureLookup} \leq 10ms) \geq 0.99$$

Achieved with Redis (selected):


ML Operations & Continuous Model Monitoring

Architectural Driver: Production ML Reliability - Deploying a CTR prediction model is the beginning, not the end. Production ML systems degrade over time as user behavior shifts, competitors change strategies, and seasonal patterns emerge. Without continuous monitoring and automated retraining, model accuracy drops 5-15% within weeks, directly impacting revenue.

The Hidden Challenge of Production ML:

Models trained on historical data assume the future resembles the past. This assumption breaks in real-world ad platforms:

Impact without MLOps:

Solution: Automated monitoring, drift detection, and retraining pipeline that maintains model performance within acceptable bounds (AUC ≥ 0.75) while minimizing operational overhead.

This section details the production ML infrastructure that keeps the CTR prediction model accurate and reliable at 1M+ QPS.

Model Quality Metrics: Offline vs Online

Production ML requires two complementary measurement systems: offline metrics (training/validation) and online metrics (production). Both are necessary because they measure different aspects of model health.

Offline Metrics (Training & Validation Phase):

These metrics are computed on held-out validation data before deployment:

AUC-ROC (Area Under Curve):

Calibration (Predicted CTR vs Actual CTR):

Log Loss (Cross-Entropy):

Online Metrics (Production Monitoring):

These metrics track real-world performance with live traffic:

Click-Through Rate (CTR):

Effective Cost Per Mille (eCPM):

P95 Inference Latency:

Prediction Error Rate:

Why Both Offline AND Online:

Offline metrics validate model quality before deployment (gate check), but cannot predict production behavior:

Concept Drift Detection: When Models Go Stale

What is Concept Drift:

Concept drift occurs when the statistical properties of the target variable change over time. In CTR prediction, this means the relationship between features and click probability shifts.

Real-World Examples:

  1. Seasonal drift: Holiday shopping season (Nov-Dec) sees 30-40% higher CTR than baseline due to increased purchase intent
  2. Competitive drift: New competitor launches aggressive campaign → user attention shifts → our CTR drops 5-10%
  3. Platform drift: Browser updates change rendering behavior → creative load times shift → CTR patterns change
  4. Economic drift: Recession reduces consumer spending → conversion rates drop → advertisers bid lower → auction dynamics shift

Impact Magnitude:

Without drift detection:

Detection Methods:

Population Stability Index (PSI):

PSI measures distribution shift between training and production data.

Formula: $$\text{PSI} = \sum_{i=1}^{n} (\text{actual}_i - \text{expected}_i) \times \ln\left(\frac{\text{actual}_i}{\text{expected}_i}\right)$$

where \(n\) = number of bins (typically 10).

Interpretation Thresholds:

Implementation:

Example Calculation:

Compare training data distribution vs production data distribution (10 bins):

| Bin | Training % | Production % | PSI Contribution |
|-----|------------|--------------|------------------|
| 1 | 10% | 8% | (0.08-0.10)×ln(0.08/0.10) = 0.0045 |
| 2 | 15% | 13% | (0.13-0.15)×ln(0.13/0.15) = 0.0029 |
| 3 | 20% | 22% | (0.22-0.20)×ln(0.22/0.20) = 0.0019 |
| … | … | … | … |
| 10 | 0.5% | 0.5% | (0.005-0.005)×ln(1) = 0 |

Total PSI = 0.12 (Moderate drift - monitor closely)
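
A minimal sketch of the PSI computation over one feature, with bin edges taken from the training distribution (a continuous feature is assumed so the quantile edges are distinct):

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI between a training-time sample (expected) and a production sample (actual)."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf             # cover out-of-range production values
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_frac = np.clip(exp_frac, 1e-6, None)          # avoid log(0) on empty bins
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

# e.g. PSI > 0.25 for three consecutive days triggers retraining (see triggers below).
```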

Kolmogorov-Smirnov (KS) Test:

KS test detects if feature distributions have shifted.

- What it measures: Maximum distance between cumulative distribution functions
- Threshold: KS statistic > 0.2 indicates significant distribution change
- Applied to: Top 20 features (by importance score from model)
- Frequency: Weekly check

Example:
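
A minimal sketch of this weekly check using `scipy.stats.ks_2samp`, assuming per-feature training and production samples are available as arrays:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(training_values, production_values, threshold: float = 0.2) -> bool:
    """Flag drift on a single feature when the max CDF gap exceeds the 0.2 threshold."""
    result = ks_2samp(training_values, production_values)
    return result.statistic > threshold

# Toy usage: a mean shift large enough to trip the 0.2 threshold.
rng = np.random.default_rng(0)
print(feature_drifted(rng.normal(0, 1, 5000), rng.normal(0.8, 1, 5000)))  # True
```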

Rolling AUC Monitoring:

Track model AUC on production data over time.

Method:

Thresholds:

Automated Alerting Strategy:

P1 Critical Alerts (Immediate Retraining):

P2 Warning Alerts (Schedule Retraining within 48 hours):

Why Multi-Signal Approach:

Automated Retraining Pipeline: Keeping Models Fresh

Retraining Triggers:

Three trigger conditions initiate automated retraining:

  1. Scheduled: Every Sunday at 2 AM UTC (weekly cadence, low-traffic window)
  2. Drift-Detected: PSI > 0.25 for 3 days OR AUC < 0.75 for 3 days
  3. Manual: Engineer-initiated via command-line tool (for major platform changes, new features)

7-Step Retraining Pipeline:

Step 1: Data Collection (30 minutes)

What happens:

Data volume:

Quality gates:

Step 2: Data Validation (10 minutes)

Validation Checks:

Null Detection:

Outlier Detection:

Distribution Validation:

Action on Validation Failure:

Step 3: Model Training (2-4 hours)

Algorithm: LightGBM (Gradient Boosted Decision Trees)

Already established choice (see Model Architecture section above for rationale).

Hyperparameter Grid Search:

Parameters to tune:

Search Strategy:

Hardware:

Training Output:

Step 4: Model Evaluation

Evaluation Criteria (All Must Pass):

Criterion 1: AUC Threshold

Criterion 2: Calibration Check

Criterion 3: Performance Improvement

Rejection Handling:

Step 5: Shadow Deployment (24 hours, 10% traffic)

What is Shadow Deployment:

Run new model in parallel with current model, but do NOT serve new model’s predictions to users. Log both models’ predictions for comparison.

Configuration:

Metrics Tracked:

Decision Criteria:

Action:

Step 6: Canary Deployment (48 hours, 10% production)

What is Canary:

Serve real traffic with new model (10%), monitor business metrics.

Configuration:

Metrics Monitored:

Business Metrics:

Technical Metrics:

Rollback Triggers (Automatic):

Success Criteria:

Step 7: Full Deployment (7-day ramp)

Gradual Rollout Schedule:

Why Gradual:

Monitoring at Each Stage:

Model Archival:

Pipeline Completion:

    
    graph TB
        TRIGGER["Retraining Trigger<br/>Weekly or drift detected"]
        DATA["Data Collection<br/>90 days, 10M samples<br/>30 min"]
        VALIDATE["Data Validation<br/>Nulls, outliers, drift<br/>10 min"]
        TRAIN["Model Training<br/>LightGBM + grid search<br/>2-4 hours"]
        EVAL["Model Evaluation<br/>AUC ≥ 0.78?<br/>Calibration OK?"]
        SHADOW["Shadow Deployment<br/>10% traffic, 24 hours<br/>Compare vs current"]
        CANARY["Canary Deployment<br/>10% production<br/>48 hours"]
        FULL["Full Deployment<br/>100% traffic<br/>7-day ramp"]
        FAIL["Reject Model<br/>Investigate + retry"]
        TRIGGER --> DATA
        DATA --> VALIDATE
        VALIDATE --> TRAIN
        TRAIN --> EVAL
        EVAL -->|Pass| SHADOW
        EVAL -->|Fail| FAIL
        SHADOW -->|Healthy| CANARY
        SHADOW -->|Issues| FAIL
        CANARY -->|Healthy| FULL
        CANARY -->|Issues| FAIL
        style EVAL fill:#ffffcc
        style FAIL fill:#ffe6e6
        style FULL fill:#e6ffe6

A/B Testing Framework: Statistical Rigor for Model Comparison

Purpose:

A/B testing validates that new model versions improve business outcomes with statistical confidence before full deployment.

Framework Design:

Traffic Splitting:

Metrics Tracked:

Primary Metric (Decision Criterion):

Secondary Metrics (Health Checks):

Statistical Significance:

Hypothesis Test:

Minimum Detectable Effect (MDE):

Winner Selection Criteria:

Model v1.3.0 wins if:

  1. Statistical significance: p-value < 0.05 (Treatment significantly better than Control)
  2. Practical significance: Treatment eCPM ≥ Control eCPM + 1% (minimum meaningful improvement)
  3. Safety checks: All secondary metrics within acceptable bounds
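
A minimal sketch of the winner check, assuming per-request eCPM samples for control and treatment collected over the test window; guardrail metrics would be evaluated separately:

```python
import numpy as np
from scipy.stats import ttest_ind

def treatment_wins(control_ecpm: np.ndarray, treatment_ecpm: np.ndarray,
                   min_lift: float = 0.01, alpha: float = 0.05) -> bool:
    """Require both statistical significance and a minimum practical eCPM lift."""
    # One-sided Welch t-test: is treatment mean eCPM greater than control?
    result = ttest_ind(treatment_ecpm, control_ecpm,
                       equal_var=False, alternative="greater")
    statistically_better = result.pvalue < alpha
    practically_better = treatment_ecpm.mean() >= control_ecpm.mean() * (1 + min_lift)
    return statistically_better and practically_better
```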

Example Result:

Guardrail Metrics:

Even if eCPM improves, reject model if:

Model Versioning & Rollback Strategy

Versioning Scheme:

Models use timestamp-based versioning (YYYY-MM-DD-HH) for chronological ordering without semantic version complexity. Each version includes the model binary, metadata (AUC, calibration metrics, hyperparameters), and feature list. Storage in S3 with 30-day retention balances rollback capability against storage costs, with last 3 production-stable models (deployed ≥7 days without incidents) retained indefinitely as ultimate fallback.

Fast Rollback Architecture:

Model servers poll configuration every 30 seconds, enabling sub-2-minute rollback when production metrics degrade. Configuration update triggers graceful model reload: in-flight requests complete with current model while new requests route to previous version loaded from S3 (10-second fetch). Total rollback time averages 70 seconds (30s config poll + 10s model load + 30s verification).
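
A minimal sketch of that 30-second configuration poll; `fetch_active_version`, `download_model`, and `swap_model` are hypothetical hooks into the config store, S3, and the serving process:

```python
import time

POLL_INTERVAL_S = 30   # matches the 30s config poll in the rollback budget

def rollback_watch_loop(fetch_active_version, download_model, swap_model):
    """Reload whichever model version the config store marks as active."""
    current_version = None
    while True:
        desired = fetch_active_version()      # e.g. "2025-11-12-08" from the config store
        if desired != current_version:
            model = download_model(desired)   # ~10s fetch from S3
            swap_model(model)                 # in-flight requests finish on the old model
            current_version = desired
        time.sleep(POLL_INTERVAL_S)
```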

Rollback Triggers:

    
    graph LR
        DEPLOY["New Model Deployed<br/>v2025-11-19-14"]
        MONITOR["Monitor Metrics<br/>Latency, Error Rate, Revenue"]
        DEGRADED{"Degradation<br/>Detected?"}
        ROLLBACK["Rollback Triggered<br/>Load v2025-11-12-08"]
        RELOAD["Servers Reload<br/>70 sec transition"]
        VERIFY["Verify Recovery<br/>Metrics normalized"]
        CONTINUE["Continue Monitoring<br/>Model stable"]
        DEPLOY --> MONITOR
        MONITOR --> DEGRADED
        DEGRADED -->|"Yes<br/>Threshold exceeded"| ROLLBACK
        DEGRADED -->|"No<br/>Within SLA"| CONTINUE
        ROLLBACK --> RELOAD
        RELOAD --> VERIFY
        VERIFY --> MONITOR
        CONTINUE --> MONITOR
        style DEPLOY fill:#e1f5ff
        style DEGRADED fill:#fff4e6
        style ROLLBACK fill:#ffe6e6
        style VERIFY fill:#e6ffe6
        style CONTINUE fill:#e6ffe6

Cross-References:

Production MLOps Summary:

This monitoring and retraining infrastructure ensures model quality remains high despite natural drift. The 7-step automated pipeline, combined with multi-signal drift detection, maintains AUC ≥ 0.75 with minimal manual intervention. A/B testing provides statistical rigor for model comparisons, while fast rollback (~70 seconds) protects against bad deployments.

Key Insight: Production ML is an ongoing engineering challenge, not a one-time deployment. Without continuous monitoring and automated retraining, model accuracy degradation costs 8-12% revenue within 12 weeks. The investment in MLOps infrastructure (1-2 engineers for 2-3 months + minimal ongoing infrastructure cost) pays for itself within 2-3 months through prevented revenue loss.


Summary: The Revenue Engine in Action

This post detailed the dual-source architecture combining real-time bidding with ML-powered internal inventory within 150ms latency.

Architecture:

Parallel paths (run simultaneously):

Total: 143ms average (7ms safety margin from 150ms SLO)

Business Impact:

| Approach | Revenue | Fill Rate | Problem |
|----------|---------|-----------|---------|
| RTB only | 70% baseline | 35% | Blank ads, poor UX |
| Internal only | 52% baseline | 100% | Misses market pricing |
| Dual-source | Baseline | 100% | 30-48% lift vs single-source |

Key Decisions:

  1. GBDT over neural nets: 20-40ms CPU inference vs 10-20ms GPU at 6-10× cost. Cost-efficiency wins at 1M QPS.

  2. Feature Store (Tecton): Pre-computed aggregations serve in 10ms p99 vs 50-100ms direct DB queries. Trades storage for latency.

  3. 100ms RTB timeout: Industry standard balances revenue (more DSPs) vs latency. Geographic sharding required (NY-Asia: 200-300ms RTT impossible otherwise).

Core Insights:

