Caching, Auctions & Budget Control: Revenue Optimization at Scale

Introduction: Where Data Meets Revenue

Real-time ad platforms operate under extreme constraints: serve 1M+ queries per second, respond in under 150ms, run ML inference and external auctions, and maintain perfect financial accuracy. The revenue engine (RTB + ML inference) generates the bids, but three critical data systems determine whether the platform succeeds or fails:

The three data challenges that make or break ad platforms:

  1. Cache performance: Can we serve 1M QPS without overwhelming the database?

    • Problem: Database reads take 40-60ms. At 1M QPS, that’s 40-60K concurrent DB connections.
    • Constraint: Only 10ms latency budget for user profile and feature lookups
    • Solution needed: Multi-tier caching with 85%+ cache hit rate (only 15% query database)
  2. Auction fairness: How do we fairly compare a CPM bid with a CPC bid - which is worth more?

    • Problem: Different pricing models (CPM/CPC/CPA) aren’t directly comparable
    • Constraint: Must rank all ads fairly to maximize revenue
    • Solution needed: eCPM normalization using predicted CTR
  3. Budget accuracy: How do we prevent overspend across 300 distributed ad servers?

    • Problem: Each server independently serves ads, but budgets must be enforced globally
    • Constraint: Can’t centralize every spend decision (creates bottleneck + latency)
    • Solution needed: Distributed atomic counters with proven accuracy bounds

Why these systems are interdependent:

Every ad request follows this critical path: a sub-10ms cache lookup for user profile and features, an eCPM auction to rank the bids, and a distributed budget check before the ad is served.

Miss any of these and revenue suffers: a cache miss blows the latency budget, an unfair auction leaves money on the table, and a budget race condition delivers impressions nobody can be billed for.

What this post covers:

This post builds the three data systems that enable revenue optimization: the multi-tier cache hierarchy, the eCPM auction mechanism, and distributed budget control.

Broader applicability:

These patterns - multi-tier caching, fair comparison across heterogeneous inputs, distributed atomic operations with bounded error - apply beyond ad tech. High-throughput systems with strict latency budgets and financial accuracy requirements face similar challenges.

The core insight is how these three systems integrate to deliver both speed (sub-10ms data access) and accuracy (≤1% financial variance) at massive scale (1M+ QPS).

Let’s explore how each system is designed and how they work together.

Distributed Caching Architecture

Multi-Tier Cache Hierarchy

To achieve high cache hit rates with sub-10ms latency, implement two cache tiers in front of the database (target: 85% combined cache hit rate so that only 15% of requests reach the database, with L2 serving ~25% of total traffic):

Technology Selection: Cache Tier Choices

L1 Cache Options:

| Technology | Latency | Throughput | Memory | Pros | Cons |
|---|---|---|---|---|---|
| Caffeine (JVM) | ~1μs | 10M ops/sec | In-heap | Window TinyLFU eviction, lock-free reads | JVM-only, GC pressure |
| Guava Cache | ~1.5μs | 5M ops/sec | In-heap | Simple API, widely used | LRU only, lower hit rate |
| Ehcache | ~1.5μs | 8M ops/sec | In/off-heap | Off-heap option reduces GC | More complex configuration |

Decision: Caffeine - Superior eviction algorithm (Window TinyLFU) yields 10-15% higher hit rates than LRU-based alternatives. Benchmarks show ~2x throughput vs. Guava.
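As an illustration, here is a minimal Caffeine configuration along the lines described above (the 10-second TTL and ~100MB budget come from the architecture diagram below; the profile type and size accounting are placeholders):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public final class L1CacheFactory {
    record UserProfile(String userId, byte[] payload) {
        int sizeBytes() { return payload.length; }
    }

    static Cache<String, UserProfile> build() {
        return Caffeine.newBuilder()
            .maximumWeight(100L * 1024 * 1024)            // ~100MB heap budget per server
            .weigher((String key, UserProfile p) -> p.sizeBytes())
            .expireAfterWrite(Duration.ofSeconds(10))     // 10-second TTL
            .recordStats()                                // expose hit-rate metrics
            .build();
    }
}
```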

L2 Cache: Redis vs Memcached

The L2 cache choice came down to one requirement: atomic operations for budget counters. Memcached is faster (3ms vs 5ms p99) and cheaper (~30% less memory), but its incr/decr commands work only on unsigned integers (decrements clamp at zero) and there is no Lua scripting for compound check-and-decrement logic. Without true atomic budget operations, counters would have race conditions - multiple servers could allocate from stale budget values, causing unbounded over-delivery.

Redis also gives us pub/sub (used below for L1 cache invalidation), Lua scripting for compound atomic operations, and richer data structures.

The 30% memory premium over Memcached is worth it to avoid budget race conditions. Hazelcast (8ms latency) was too slow to consider seriously.
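As a concrete illustration of why atomic counters matter, here is a minimal reserve-or-reject sketch using the Jedis client (key scheme and connection setup are illustrative; production code would typically fold the rollback into a single Lua script):

```java
import redis.clients.jedis.JedisPooled;

public class BudgetCounter {
    private final JedisPooled redis = new JedisPooled("redis-host", 6379); // placeholder host

    /** Attempt to atomically reserve costMicros from a campaign budget. */
    public boolean tryReserve(String campaignId, long costMicros) {
        String key = "budget:" + campaignId;            // hypothetical key scheme
        long remaining = redis.decrBy(key, costMicros); // DECRBY is atomic server-side
        if (remaining < 0) {
            redis.incrBy(key, costMicros);              // roll back the overdraft
            return false;                               // budget exhausted
        }
        return true;
    }
}
```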

Valkey Alternative (Redis Fork):

In 2024, Redis Labs changed licensing from BSD to dual-license (SSPL + proprietary), creating uncertainty for commercial users. The Linux Foundation forked Redis into Valkey under the permissive BSD-3 license.

Recommendation: Use Valkey for new deployments to avoid licensing ambiguity. Migration from Redis is trivial (same protocol, same commands, same performance).

L3 Persistent Store Options:

Note: Write throughput numbers reflect cluster-level performance at production scale (20-80 nodes for distributed databases). Single-node performance is 5-20K writes/sec (SSD RAID10, 32GB RAM) depending on workload characteristics.

| Technology | Read Latency (p99) | Write Throughput (cluster-level) | Scalability | Consistency | Cross-Region ACID | HLC Built-in | Pros | Cons |
|---|---|---|---|---|---|---|---|---|
| CockroachDB | 10-15ms | 400K writes/sec (60-80 nodes) | Horizontal (Raft) | Serializable | Yes | Yes | SQL, JOINs, multi-region transactions | Operational complexity (self-hosted) |
| YugabyteDB | 10-15ms | 400K writes/sec (60-80 nodes) | Horizontal (Raft) | Serializable | Yes | Yes | PostgreSQL-compatible | Smaller ecosystem |
| Cassandra | 20ms | 500K writes/sec (100+ nodes) | Linear (peer-to-peer) | Tunable (eventual) | No | No | Multi-DC, mature | No JOINs, eventual consistency |
| PostgreSQL | 15ms | 50K writes/sec (single node) | Vertical + sharding | ACID | No | No | SQL, JOINs, strong consistency | Manual sharding complex |
| DynamoDB | 10ms | 1M writes/sec (auto-scaled) | Fully managed | Strong per-region, MRSC (2024) | No | No | Auto-scaling, fully managed | No cross-region transactions, no JOINs, NoSQL limitations |

Why CockroachDB

The persistent store must handle 400M user profiles (4TB+) with strong consistency for billing data. While Cassandra offers higher write throughput (500K vs 400K writes/sec) and battle-tested scale, eventual consistency is problematic for financial data and would require custom HLC implementation, reconciliation logic, and auditor explanations.

CockroachDB advantages:

Why Not DynamoDB?

Despite being fully managed and highly scalable, DynamoDB lacks critical features for our financial accuracy requirements:

  1. No cross-region ACID transactions: DynamoDB’s 2024 MRSC feature provides strong consistency for reads/writes within each region, but transactions (TransactWriteItems) only work within a single region. Budget enforcement requires atomic operations across user profiles + campaign ledger + audit log - this cannot be guaranteed across regions.

  2. No HLC or causal ordering: DynamoDB uses “last writer wins” conflict resolution based on internal timestamps. Without HLC, we can’t guarantee causal ordering across regions for financial audit trails. Example failure: Budget update in us-east-1 and spend deduction in eu-west-1 arrive out-of-order, causing temporary overspend that violates financial accuracy constraints.

  3. NoSQL limitations: No SQL JOINs, no complex queries. Ad selection queries like “find all active campaigns for advertiser X targeting users in age group Y with budget remaining > Z” require multiple round-trips and application-level joins, adding latency and complexity.

  4. Schema evolution complexity: Requires dual-write patterns and application-level migration logic. CockroachDB supports online schema changes (ALTER TABLE without blocking).

DynamoDB is excellent for: simple key-value lookups at scale, single-region workloads, and teams that prioritize zero operational overhead over SQL features.

Cost comparison of the alternatives at 8B requests/day (Nov 2024 pricing):

| Database Option | Relative Cost | Operational Model | Trade-offs |
|---|---|---|---|
| DynamoDB | 100% (baseline) | Fully managed (AWS) | No cross-region transactions, NoSQL limitations, vendor lock-in |
| CockroachDB Serverless | 80-100% of DynamoDB | Fully managed (Cockroach Labs) | Pay-per-use, auto-scaling, same features as self-hosted |
| CockroachDB Dedicated | 60-80% of DynamoDB | Managed by Cockroach Labs | Reserved capacity, SLAs, predictable pricing |
| CockroachDB Self-Hosted | 40-50% of DynamoDB (infra only) | Self-managed | Lowest infra cost, requires dedicated ops team (cost varies by geography/expertise) |
| PostgreSQL (sharded) | 30-40% of DynamoDB (infra only) | Self-managed | No native multi-region, complex sharding, no HLC |

Note: AWS reduced DynamoDB on-demand pricing by 50% in November 2024, significantly improving its cost competitiveness. CockroachDB Dedicated still offers savings, but the gap narrowed considerably.

Key insight: CockroachDB Dedicated provides 20-40% cost savings over DynamoDB while maintaining full feature parity (cross-region transactions, HLC, SQL) without operational overhead. Serverless pricing is now comparable to DynamoDB due to recent AWS price reductions. Self-hosted CockroachDB provides 50-60% savings (2-2.5× cheaper) but requires operational expertise.

Decision Framework: Avoiding “Spreadsheet Engineering”

The comparison above shows infrastructure costs only. Here’s the complete decision framework:

For most teams (< 5B requests/day): Choose CockroachDB Dedicated or DynamoDB

Reasons: below the self-hosted break-even point (analyzed next), managed-service pricing is cheaper than hiring SREs, and operating a distributed SQL database is a non-trivial ongoing burden.

For high-scale teams: Self-Hosted Break-Even Analysis

Self-hosted becomes economically viable when infrastructure savings exceed operational costs. The break-even point varies significantly based on team structure and geography.

Break-even formula:

$$\text{Break-even requests/year} = \frac{\text{Annual SRE Cost}}{\text{Cost Savings per Request}}$$

Self-hosting pays off once annual request volume exceeds this threshold.

Operational cost scenarios and their break-even points:

Define SRE cost baseline as 1.0× = fully loaded senior SRE in high-cost region (California/NYC/Seattle).

| Team Structure | Annual SRE Cost (relative) | Break-Even Daily Requests | Notes |
|---|---|---|---|
| US Team: 3-5 SREs | 3.0-5.1× baseline | 20-30B req/day | High-cost regions: California, NYC, Seattle |
| Global Team: 2-3 SREs | 1.1-1.8× baseline | 8-12B req/day | Mixed US/Eastern Europe, leveraging time zones |
| Regional Team: 2 SREs | 0.5-0.9× baseline | 4-8B req/day | Eastern Europe/India/LatAm rates, experienced engineers |
| Existing Expertise: +1 SRE | 0.35-0.7× baseline | 2-5B req/day | Marginal cost when team already has database expertise |

Key variables affecting break-even:

  1. Geographic SRE costs: 0.18-0.55× baseline (non-US regions) vs 1.0× baseline (US high-cost)
  2. Team efficiency: 1-2 experienced SREs with automation vs 3-5 without
  3. Existing expertise: If team already operates databases, marginal cost is lower
  4. Tooling maturity: CockroachDB Dedicated (managed but self-deployed) vs full self-hosted

When self-hosted may make sense: sustained volume above the break-even range (15-25B+ requests/day), an existing database operations team, and cost-sensitive unit economics.

When managed options are preferred: volume below break-even, no in-house distributed-database expertise, or a preference for predictable pricing and vendor SLAs.

Why DynamoDB remains a valid choice despite limitations:

For workloads that don’t require cross-region ACID transactions, HLC-based causal ordering, or SQL JOINs, DynamoDB’s operational simplicity (zero management) may outweigh its feature limitations. Many ad tech companies successfully use DynamoDB by scoping financial transactions to a single region, denormalizing data to avoid JOINs, and handling cross-entity queries in the application layer.

Our choice: CockroachDB Serverless for Day 1, evaluate self-hosted only if we reach 15-25B+ requests/day with dedicated ops team.

    
```mermaid
graph TB
    subgraph "Request Flow"
        REQ["Cache Request<br/>user_id: 12345"]
    end
    subgraph "L1: In-Process Cache"
        L1["Caffeine JVM Cache<br/>10-second TTL<br/>1μs lookup<br/>100MB per server"]
        L1_HIT{Hit?}
        L1_STATS["L1 Statistics<br/>Hit Rate: 60%<br/>Avg Latency: 1μs"]
    end
    subgraph "L2: Distributed Cache"
        L2["Redis Cluster<br/>30-second TTL<br/>5ms lookup<br/>800GB usable capacity"]
        L2_HIT{Hit?}
        L2_STATS["L2 Statistics<br/>Hit Rate: 25% of total<br/>Avg Latency: 5ms"]
    end
    subgraph "L3: Persistent Store"
        L3["CockroachDB Cluster<br/>Multi-Region ACID<br/>10-15ms read<br/>Strong Consistency"]
        L3_STATS["L3 Statistics<br/>Serves: 15% of total<br/>Avg Latency: 12ms"]
    end
    subgraph "Hot Key Detection"
        MONITOR["Stream Processor<br/>Kafka Streams<br/>Count-Min Sketch"]
        REPLICATE["Dynamic Replication<br/>3x copies for hot keys"]
    end
    REQ --> L1
    L1 --> L1_HIT
    L1_HIT -->|60% Hit| RESP1["Response<br/>~1μs"]
    L1_HIT -->|40% Miss| L2
    L2 --> L2_HIT
    L2_HIT -->|25% Hit| POPULATE_L1["Populate L1"]
    POPULATE_L1 --> RESP2["Response<br/>~5ms"]
    L2_HIT -->|15% Miss| L3
    L3 --> POPULATE_L2["Populate L2 + L1"]
    POPULATE_L2 --> RESP3["Response<br/>~20ms"]
    L2 -.->|0.1% sampling| MONITOR
    MONITOR -.->|Detect hot keys| REPLICATE
    REPLICATE -.->|Replicate to nodes| L2
    subgraph "Overall Performance"
        PERF["Total Hit Rate: 85%<br/>Average Latency: 4.25ms<br/>p99 Latency: 25ms"]
    end
    classDef cache fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef source fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef monitor fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    class L1,L2 cache
    class L3 source
    class MONITOR,REPLICATE monitor
```

GDPR Right-to-Deletion Implementation

Architectural Driver: Legal Compliance - GDPR Article 17 mandates deletion within 30 days, but industry practice expects 7-14 days. With user data distributed across CockroachDB, Valkey, S3 Parquet files, and ML model weights, deletion requires a coordinated three-step workflow.

Regulatory context: GDPR Article 17 “Right to Erasure” requires organizations to delete personal data “without undue delay” - interpreted as 30 days maximum by regulators, but major platforms (Google, Meta) complete deletions in 7-14 days, setting user expectations higher than legal minimums.

Technical challenge: User data doesn’t live in one database - it’s distributed across operational stores, caches, cold storage, and ML models. Deleting from all locations requires coordinating multiple systems with different deletion mechanisms.

Data Distribution Challenge

Where User Data Lives:

a. Operational Databases (CockroachDB)

b. Cache Layers (Valkey + Caffeine)

c. Data Lake (S3 Parquet Files)

d. ML Training Data

Step 1: Real-Time Deletion (< 1 Hour)

Goal: Stop serving user data immediately after deletion request

a. Mark User as Deleted in CockroachDB

Deletion strategy: Tombstone approach - mark as deleted and nullify personal fields, keeping non-personal audit data.

Database operation (conceptual example - production tables may have different schemas):

```sql
-- Idea: Keep audit trail, nullify personal data
UPDATE user_profiles
SET deleted_at = NOW(),           -- Mark deletion timestamp
    demographics = NULL,          -- Remove personal field
    interests = NULL,             -- Remove personal field
    browsing_history = NULL       -- Remove personal field
    -- Keep: user_id (pseudonymous identifier)
    -- Keep: created_at, account_tier (non-personal audit fields)
WHERE user_id = 'xxx';
```

Why this approach: the tombstone preserves referential integrity and the audit trail (billing reconciliation still works), removes all personal data, and makes deletion idempotent and verifiable.

Real schema note: Actual production tables may have 50-100+ columns. The key principle: nullify all columns containing personal data (PII), keep system fields needed for audit, billing reconciliation, and referential integrity.

Latency: 10-15ms (single database write with strong consistency)

b. Invalidate All Cache Tiers

L1 Caffeine Cache Invalidation: publish the user ID on a Redis pub/sub channel; each of the ~300 ad servers subscribes and evicts the entry from its in-process cache.

L2 Valkey Cache Invalidation: a single direct DEL user:xxx:profile against the cluster removes the shared copy.

Why pub/sub for L1, direct DEL for L2: L1 is private in-process memory on 300 servers - no external system can reach into those heaps, so invalidation must be broadcast. L2 is one shared cluster, so a single DEL suffices.
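A minimal sketch of both paths using Jedis and the Caffeine cache from earlier (channel and key names are illustrative):

```java
import com.github.benmanes.caffeine.cache.Cache;
import redis.clients.jedis.JedisPooled;
import redis.clients.jedis.JedisPubSub;

public class L1Invalidator {
    private final Cache<String, byte[]> l1Cache;  // the in-process Caffeine cache
    private final JedisPooled redis;

    L1Invalidator(Cache<String, byte[]> l1Cache, JedisPooled redis) {
        this.l1Cache = l1Cache;
        this.redis = redis;
    }

    /** Each ad server runs this subscriber; the deletion service publishes once. */
    public void listen() {
        new Thread(() -> redis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String userId) {
                l1Cache.invalidate("user:" + userId + ":profile"); // evict local copy
            }
        }, "cache-invalidation")).start();  // hypothetical channel name
    }

    /** Called by the deletion workflow: direct L2 DEL plus L1 broadcast. */
    public void invalidateUser(String userId) {
        redis.del("user:" + userId + ":profile");     // L2: one shared cluster, one DEL
        redis.publish("cache-invalidation", userId);  // L1: fan out to all 300 servers
    }
}
```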

c. Add to Deletion Tombstone List

Bloom Filter Implementation: maintain a deleted_users Bloom filter that is consulted on every profile read; entries are added the moment a deletion request arrives.

Why Bloom filter: constant memory for millions of entries, O(1) lookups on the hot path, and no false negatives - a deleted user is never served. Rare false positives only trigger a confirming database check.
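A sketch using Guava's Bloom filter (capacity and false-positive rate are assumed parameters):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class DeletedUsersFilter {
    // Sized for 10M deletions at a 0.1% false-positive rate (illustrative values)
    private final BloomFilter<String> deleted = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

    public void markDeleted(String userId) { deleted.put(userId); }

    /** No false negatives: a deleted user is never served. A rare false
     *  positive only costs one confirming lookup against CockroachDB. */
    public boolean maybeDeleted(String userId) { return deleted.mightContain(userId); }
}
```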

Result: User data no longer served within 1 hour (Caffeine cache TTL = 10 seconds, but propagation across 300 instances takes up to 60 seconds)

GDPR compliance: “Without undue delay” satisfied (1 hour is acceptable, regulators expect days not hours)

Deletion Workflow Diagram:

    
```mermaid
graph TB
    REQUEST["User Deletion Request<br/>GDPR Article 17"]
    subgraph "Step 1: Real-Time (< 1 Hour)"
        DB["CockroachDB<br/>SET deleted_at=NOW, data=NULL"]
        L1["L1 Cache Invalidation<br/>Pub/sub to 300 instances"]
        L2["L2 Cache Invalidation<br/>DEL user:xxx:profile"]
        BLOOM["Add to Bloom Filter<br/>deleted_users"]
    end
    subgraph "Step 2: Batch Deletion (7-30 Days)"
        TIER1["Tier 1: 0-90 days<br/>Parquet rewrite<br/>True deletion"]
        TIER2["Tier 2: 90d-2yr<br/>Tombstone markers<br/>Pseudonymization"]
        TIER3["Tier 3: 2+ years<br/>S3 object delete<br/>Glacier cleanup"]
    end
    subgraph "Step 3: ML Training Data"
        AGGREGATE["Aggregate Defense<br/>Do NOT retrain<br/>Legal: < 0.0001% contribution"]
    end
    subgraph "Audit Trail"
        LOG["Immutable Deletion Log<br/>CockroachDB append-only<br/>7-year retention"]
    end
    REQUEST --> DB
    REQUEST --> L1
    REQUEST --> L2
    REQUEST --> BLOOM
    DB --> TIER1
    DB --> TIER2
    DB --> TIER3
    DB --> AGGREGATE
    REQUEST --> LOG
    style DB fill:#ffcccc
    style BLOOM fill:#ffdddd
    style AGGREGATE fill:#ffffcc
    style LOG fill:#e6ffe6
```

Step 2: Batch Deletion (7-30 Days)

Goal: Purge historical data from data lake

Challenge: Parquet Immutability

Parquet format characteristics: columnar, compressed, and immutable once written - removing a single user's rows requires rewriting the entire file.

Options: Rewrite vs Tombstone

Option A: Tombstone Markers (Preferred for Cost)

Concept: Instead of physically deleting data from immutable Parquet files, maintain a separate “deletion marker” table and filter deleted users at query time.

Implementation:

The pattern is straightforward: maintain a compact deleted_users table (in CockroachDB) that stores (user_id, deleted_at, deletion_request_id) tuples. When a deletion request arrives, insert a marker row. Historical Parquet files in S3 remain unchanged—no expensive rewrites needed.

Query-time filtering: Analytics queries join against the deletion marker table to exclude deleted users. For example, a LEFT OUTER JOIN with a WHERE deleted_users.user_id IS NULL clause filters out any user who has a deletion marker. Production pipelines encapsulate filtering in views/CTEs (best practice) so every query doesn’t repeat the JOIN logic:

This approach balances GDPR compliance (data becomes inaccessible in analytics) with cost efficiency (no Parquet rewrites).

The key principle: Query-time filtering via JOIN against deletion marker table, not physical deletion from Parquet.

Trade-offs: deleted rows physically persist in S3 until files age out (pseudonymization rather than true erasure), and every analytics query pays the JOIN overhead against the marker table.

Option B: Parquet Rewrite (True Deletion)

Implementation:

  1. Read Parquet file → filter out deleted user rows → write new file
  2. Replace old file with new file in S3
  3. Delete old file

Cost analysis: a rewrite reads and writes entire files, so cost scales with the data volume scanned rather than the rows deleted - expensive across a large data lake, which is why it is reserved for recent data in the tiered approach below.

Recommended Tiered Approach:

| Data Age | Method | Rationale |
|---|---|---|
| 0-90 days (Tier 1) | Parquet rewrite | Recent data = regulatory scrutiny, true deletion required |
| 90d-2yr (Tier 2) | Tombstone markers | Archived data, pseudonymization acceptable |
| 2+ years (Tier 3) | True deletion (S3 object delete) | Cold storage (Glacier), infrequently accessed; delete entire daily files older than 2 years |

Timeline: batch deletion jobs run on a 7-30 day cycle, comfortably inside GDPR's 30-day window.

Step 3: ML Training Data

Challenge: User data embedded in model weights

Problem: user interactions are baked into trained model weights; there is no per-user row that can be deleted from a CTR model.

Option A: Retrain Without User (Impractical)

Retraining production models from scratch for every deletion request costs GPU-hours per request and can never keep pace with deletion volume.

Option B: Model Unlearning (Research Area, Not Production-Ready)

Machine unlearning techniques can approximately remove a user's influence from weights, but tooling, guarantees, and legal precedent are all immature.

Option C: Aggregate Defense (Practical, Legally Defensible)

Rely on aggregation: any individual user contributes < 0.0001% to weights trained on hundreds of millions of users, so the model retains no retrievable personal data.

Legal Rationale:

GDPR Article 11 exempts processing where the controller can no longer identify the data subject; aggregated model weights fall under this infeasibility exception (see the legal disclaimer at the end of this section).

Implementation:

Exclude deleted users from all future training runs by filtering training data against the deleted_users Bloom filter and tombstone table; the regular retraining cadence then converges models away from deleted data.

Trade-off Disclosure:

A deleted user's historical influence persists in deployed weights until the next scheduled retrain; this is disclosed rather than hidden.

Recommendation:

Adopt Option C plus training-data exclusion, document the rationale, and review it annually with counsel.

Audit Trail

Requirement: Prove deletion occurred (for regulatory audits and advertiser disputes)

Implementation:

a. Immutable Deletion Log

Append-only table in CockroachDB recording, for every request: pseudonymous user identifier, request timestamp, completion timestamp, and the systems purged.

b. Retention Period

Deletion log entries are retained for 7 years to satisfy audit requirements.

c. Compliance Reporting

Monthly compliance reports summarize deletion volume and SLA adherence; an annual review is performed with external auditors.

Data Residency (EU Users)

GDPR Requirement: EU user data must stay in EU region (no cross-border transfer to US)

CockroachDB Implementation:

REGIONAL BY ROW Pattern:

CockroachDB’s REGIONAL BY ROW locality pattern enables GDPR-compliant data residency by pinning each row to its home region based on a column value.

Conceptual schema example (simplified for illustration - production schemas have 50-100+ columns):

```sql
-- Example: Configure table to use regional locality
ALTER TABLE user_profiles
SET LOCALITY REGIONAL BY ROW AS region;

-- The 'region' column determines physical storage location
-- CockroachDB automatically routes queries to correct region
```

Minimal example columns (real tables have many more fields): user_id (primary key), region (drives row placement), deleted_at, plus the profile payload columns.

Production schema note: Real user_profiles tables typically have 50-100+ columns including timestamps, account metadata, consent flags, privacy settings, feature flags, and audit fields. This example shows only the essential concept: the region column controls physical data placement.

How it works: the region column value ('eu', 'us') determines which nodes hold a row's replicas; CockroachDB places leaseholders in the matching region and routes queries there automatically.

Valkey (Redis) Partitioning:

Separate Clusters per Region: the EU Valkey cluster caches only EU user data and the US cluster only US data, with no cross-region replication between them.

Latency Impact of Data Residency:

Cross-Region Request Scenario: an EU user whose request lands on a US gateway (e.g., while traveling) forces a cross-region read back to EU nodes, adding a full round-trip of latency.

Mitigation: GeoDNS routes users to their home region, and gateways route by the user's stored region attribute rather than request origin, keeping cross-region reads rare.

S3 Data Lake Residency: regional buckets (EU data in an eu-west-1 bucket, US data in us-east-1) with cross-region replication disabled.

Data Residency Enforcement Diagram:

    
```mermaid
graph TB
    subgraph "EU Region (eu-west-1)"
        EU_USER[EU User Request]
        EU_GW[EU Gateway]
        EU_CRDB[("CockroachDB EU Nodes<br/>REGIONAL BY ROW: 'eu'")]
        EU_VALKEY[("Valkey EU Cluster<br/>EU cache only")]
        EU_S3[("S3 EU Bucket<br/>No cross-region replication")]
    end
    subgraph "US Region (us-east-1)"
        US_USER[US User Request]
        US_GW[US Gateway]
        US_CRDB[("CockroachDB US Nodes<br/>REGIONAL BY ROW: 'us'")]
        US_VALKEY[("Valkey US Cluster<br/>US cache only")]
        US_S3[("S3 US Bucket<br/>No cross-region replication")]
    end
    EU_USER -->|GeoDNS routes to EU| EU_GW
    EU_GW --> EU_CRDB
    EU_GW --> EU_VALKEY
    EU_CRDB -.-> EU_S3
    US_USER -->|GeoDNS routes to US| US_GW
    US_GW --> US_CRDB
    US_GW --> US_VALKEY
    US_CRDB -.-> US_S3
    EU_CRDB -.->|NO cross-region replication| US_CRDB
    EU_S3 -.->|NO cross-region replication| US_S3
    style EU_CRDB fill:#cce5ff
    style EU_VALKEY fill:#cce5ff
    style EU_S3 fill:#cce5ff
    style US_CRDB fill:#ffe5cc
    style US_VALKEY fill:#ffe5cc
    style US_S3 fill:#ffe5cc
```

Subsection Conclusion

GDPR right-to-deletion requires three-step workflow:

  1. Real-time (< 1 hour): CockroachDB nullification, cache invalidation (L1 pub/sub + L2 DEL), Bloom filter tombstone
  2. Batch deletion (7-30 days): Tiered approach (Parquet rewrite for recent data, tombstones for archives, full deletion for cold storage)
  3. ML training data: Aggregate defense (legally defensible, cost-efficient, individual contribution < 0.0001%)

Audit trail: Immutable deletion logs (7-year retention), monthly compliance reports, annual auditor review

Data residency: CockroachDB REGIONAL BY ROW + regional Valkey clusters enforce GDPR data locality (EU data stays in EU, US data stays in US)

Trade-offs acknowledged: deleted data persists in archived Parquet until age-out (tombstone tiers), ML weights retain influence until the next retrain, and L1 propagation means up to ~60 seconds of continued serving after a deletion request.

Legal disclaimer: This implementation reflects common industry practice and 2025 GDPR interpretation, but is not formal legal advice. The ML model “aggregate defense” approach (not retraining on deletion) is based on GDPR Article 11’s infeasibility exception, but has not been formally adjudicated by courts. Individual circumstances vary - organizations must consult qualified data privacy counsel for legal guidance specific to their jurisdiction and use case. The regulatory landscape continues to evolve, and annual legal review with external counsel is strongly recommended.

Cache Performance Analysis

Cache Architecture Clarification:

The system has two cache tiers plus the database: L1 (Caffeine, in-process, 10s TTL), L2 (Redis/Valkey distributed cluster, 30s TTL), and the L3 persistent store (CockroachDB) - the source of truth, not a cache.

Cache Hit Rate Calculation:

Let \(H_i\) be the conditional hit rate of cache tier \(i\):

$$H_{cache} = H_1 + (1 - H_1) \times H_2$$

Target configuration (L2 serves 25% of total traffic, matching the optimization table below): \(H_1 = 0.60\) (L1 answers 60% of all requests) and \(H_2 = 0.625\) (L2 answers 62.5% of L1 misses, i.e., 25% of total), giving \(H_{cache} = 0.60 + 0.40 \times 0.625 = 0.85\).

Data Availability:

Of the 15% of requests that miss both caches and query the database, ~99% find the profile (established users); the remainder are new or unknown users served with default targeting.
Effective data found rate: 99.85% (85% from cache + 14.85% from database)

Average Latency:

$$\mathbb{E}[L] = H_1 L_1 + (1-H_1)H_2 L_2 + (1-H_1)(1-H_2) L_{db}$$

With latencies \(L_1 = 0.001ms\), \(L_2 = 5ms\), \(L_{db} = 20ms\):

$$\mathbb{E}[L] = 0.60 \times 0.001 + 0.40 \times 0.625 \times 5 + 0.40 \times 0.375 \times 20 = 4.25ms$$

Key Insight: 85% cache hit rate means only 15% of requests query the database (20ms penalty). This is the critical metric - not whether data exists (which is ~100% for established users), but whether we can serve it from cache.

Cache Cost Optimization: The Economic Tradeoff

Architectural Driver: Financial Accuracy + Latency - Cache sizing is not just a performance problem but an economic optimization. At scale, every GB of Redis costs money, every cache miss hits the database (cost + latency), and every millisecond of added latency costs revenue. The optimal cache size balances these three factors.

The Fundamental Tradeoff:

At 1M QPS with 400M users, cache sizing decisions carry massive financial impact: more cache means higher memory spend; less cache means more database queries and more revenue-costing latency.

Cost Model:

The total cost function combines three components:

$$C_{total} = C_{cache}(S) + C_{db}(S) + C_{latency}(S)$$

where \(S\) = cache size (GB)

Component 1: Cache Memory Cost

$$C_{cache}(S) = S \times P_{memory} \times N_{nodes}$$

where \(P_{memory}\) is the price per GB-month and \(N_{nodes}\) is the number of cache nodes (\(S\) is interpreted per node, as in the example below).

Cache pricing note: Managed cache services (ElastiCache, Valkey) cost 10-12× per GB compared to self-hosted instances. Self-hosted Redis on standard instances is cheaper but adds operational overhead.

Example: 1000 nodes × 16GB/node × baseline GB-month rate = baseline cache cost

Component 2: Database Query Cost

Cache misses hit CockroachDB, which costs both compute and I/O:

$$C_{db}(S) = Q_{total} \times (1 - H(S)) \times C_{query}$$

where \(Q_{total}\) = monthly query volume, \(H(S)\) = cache hit rate at size \(S\), and \(C_{query}\) = per-query database cost (compute + I/O).

Example: 2.6B queries/month × 5% miss rate × baseline query cost = query cost component

Component 3: Revenue Loss from Latency

Every cache miss adds ~15ms latency (database read vs cache hit). As established in Part 1, Amazon’s study found 100ms latency = 1% revenue loss.

$$C_{latency}(S) = R_{monthly} \times (1 - H(S)) \times \frac{\Delta L}{100ms} \times 0.01$$

where \(R_{monthly}\) = monthly revenue, \(H(S)\) = cache hit rate, and \(\Delta L \approx 15ms\) = added latency per cache miss.

Example: Revenue baseline × 5% miss rate × (15ms/100ms) × 1% = latency cost component
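A minimal sketch of this three-component model; the constants are illustrative placeholders (real values would come from your cloud bill and the Part 1 revenue data), and the per-node memory term is folded into a single total-GB price for brevity:

```java
/** Sketch of C_total(S) = C_cache + C_db + C_latency, per the formulas above. */
public final class CacheCostModel {
    static final double PRICE_PER_GB_MONTH = 15.0;     // assumed $/GB-month, managed cache
    static final double COST_PER_DB_QUERY  = 0.000002; // assumed $/query against CockroachDB
    static final double MONTHLY_QUERIES    = 2.6e9;    // from the example above
    static final double MONTHLY_REVENUE    = 10_000_000; // assumed revenue baseline, $
    static final double EXTRA_LATENCY_MS   = 15.0;     // DB read vs cache hit

    /** hitRate would come from the Zipf analysis later in this section. */
    static double totalCost(double totalCacheGb, double hitRate) {
        double missRate = 1.0 - hitRate;
        double cCache   = totalCacheGb * PRICE_PER_GB_MONTH;
        double cDb      = MONTHLY_QUERIES * missRate * COST_PER_DB_QUERY;
        // Amazon heuristic from Part 1: 100ms of latency costs ~1% of revenue
        double cLatency = MONTHLY_REVENUE * missRate * (EXTRA_LATENCY_MS / 100.0) * 0.01;
        return cCache + cDb + cLatency;
    }
}
```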

Modeling User Access Patterns: Why Zipfian Distribution?

Real-world user access patterns in web systems follow a power law distribution, not a uniform distribution. A small fraction of users (or items) account for a disproportionately large fraction of traffic.

Zipfian distribution (named after linguist George Zipf) models this phenomenon: the access probability of the r-th most popular item is proportional to \(1/r^{\alpha}\), so a small head of items dominates traffic while a long tail is rarely touched.

Why Zipfian over alternatives:

| Distribution | When It Applies | Why NOT for Cache Sizing |
|---|---|---|
| Uniform | All items accessed equally | Unrealistic - power users exist; not all users access the platform equally |
| Normal (Gaussian) | Symmetric data around mean | User access has a long tail, not a bell curve: most users low-activity, a few very high-activity |
| Exponential | Time between events | Models timing/intervals, not popularity ranking |
| Zipfian (power law) | Popularity ranking | Matches empirical data (validated below) |

Empirical validation for ad platforms: measured access logs typically fit \(\alpha \approx 0.9\)-\(1.1\) (see the workload-variation note in the validation methodology below).

Parameter choice: \(\alpha = 1.0\) (classic Zipf’s law) is standard for web caching literature. Higher \(\alpha\) (e.g., 1.5) means more concentration at the top; lower \(\alpha\) (e.g., 0.7) means flatter distribution.

Hit Rate as Function of Cache Size (Zipfian Distribution):

User access follows Zipfian distribution with \(\alpha = 1.0\) (power law):

$$P(\text{rank } r) = \frac{1/r}{\sum_{i=1}^{N} 1/i} \approx \frac{1}{r \times \ln(N)}$$

Cache hit rate:

$$H(S) = \sum_{i=1}^{C(S)} P(i)$$

where \(C(S)\) is the number of items that fit in a cache of size \(S\), assuming the most popular items are the ones cached.

For Zipfian(\(\alpha=1.0\)) with realistic LRU cache behavior:

| Cache Coverage | L2-Only Hit Rate (Theoretical) | Cumulative L1+L2 (Realistic) | Cache Size |
|---|---|---|---|
| Top 1% | 40-45% | 55-60% | 40GB |
| Top 5% | 55-60% | 65-70% | 200GB |
| Top 10% | 65-70% | 75-80% | 400GB |
| Top 20% | 68-78% | 78-88% | 800GB (optimal) |
| Top 40% | 78-85% | 90-95% | 1.6TB |

Key insight: Zipfian distribution means diminishing returns after ~20% coverage.

Note: “Cumulative L1+L2” includes L1 in-process cache (60% hit rate on hot data) plus L2 distributed cache. L2-only rates assume LRU eviction (0.85× theoretical LFU performance). See detailed validation methodology below for calculation derivation.

Marginal Cost Analysis:

The optimal cache size occurs where marginal cost equals marginal benefit:

$$\frac{dC_{total}}{dS} = 0$$

Marginal cost (adding 1 GB of cache): $$MC_{cache} = 1GB \times P_{memory} \times N_{nodes}$$

Marginal benefit (hit rate improvement):

For Zipfian distribution, adding cache beyond 20% coverage yields <0.5% hit rate improvement:

$$MB = \Delta H \times (C_{db} + C_{latency})$$

Example: beyond 20% coverage, each additional GB still costs the full memory price but buys \(\Delta H < 0.5\%\), so marginal cost far outstrips marginal benefit.

Not worth it - marginal cost far exceeds marginal benefit beyond 20% coverage.

Optimal Cache Size Calculation:

Given our constraints: ≥80% combined cache hit rate, the ≤10ms data-access latency budget, and ~4TB of profile data for 400M users.

Optimize:

$$\min_{S} \left[ C_{cache}(S) + C_{db}(S) + C_{latency}(S) \right]$$

Subject to: \(H(S) \geq 0.80\) and the 10ms data-access latency budget.

Solution (relative costs as % of total caching infrastructure):

| L2 Cache Size (% of 4TB total) | Cumulative L1+L2 Hit Rate | Cost Breakdown (relative %) | Total Cost vs Baseline* | Analysis |
|---|---|---|---|---|
| 5% (200GB) | 65-70% | Cache: 15%, DB: 54%, Latency: 31% | 100% (baseline) | High DB+latency penalties |
| 10% (400GB) | 75-80% | Cache: 37%, DB: 40%, Latency: 23% | 81% | Better balance |
| 20% (800GB) | 78-88% | Cache: 74%, DB: 16%, Latency: 10% | 80% (optimal) | Best total cost |
| 40% (1.6TB) | 90-95% | Cache: 93%, DB: 5%, Latency: 2% | 128% | Expensive for marginal gain |

*Total cost relative to 5% coverage baseline (100%). Lower is better.

Optimal choice: 20% coverage (800GB L2 cache)

Hit Rate Validation Methodology

Why Zipf Distribution Applies:

User access patterns in digital systems follow power-law distributions (Zipf-like): a small fraction of users generate disproportionate traffic, a pattern consistently reported in web-caching research.

Zipf Distribution Definition:

For N total items (users), the probability of accessing item ranked i is:

$$P(i) = \frac{1/i^{\alpha}}{\sum_{j=1}^{N} 1/j^{\alpha}} = \frac{1/i^{\alpha}}{H(N, \alpha)}$$

where \(H(N, \alpha)\) is the generalized harmonic number (normalization constant).

Cache Hit Rate Calculation:

For a cache holding the top C most popular items (LFU/static caching):

$$\text{Hit Rate} = \frac{\sum_{i=1}^{C} P(i)}{\sum_{i=1}^{N} P(i)} = \frac{H(C, \alpha)}{H(N, \alpha)}$$

Step-by-Step for Our System:

Parameters: \(N = 400\)M users; the cache holds the top \(C = 80\)M users (20% by count); \(\alpha = 1.0\).

Step 1: Calculate harmonic numbers

For α=1.0, \(H(N, 1) \approx \ln(N) + \gamma\) where γ ≈ 0.5772 (Euler-Mascheroni constant), giving \(H(80\text{M}) \approx 18.8\) and \(H(400\text{M}) \approx 20.4\).

Step 2: Calculate base hit rate (L2 cache only)

$$\text{L2 Hit Rate} = \frac{18.8}{20.4} \approx 0.92 \;(92\%)$$

Wait, this seems too high! The issue: this assumes perfect LFU and independent requests.

Step 3: Apply real-world corrections

Real systems deviate from theoretical Zipf:

  1. Imperfect ranking: LRU (Least Recently Used) cache doesn’t perfectly track popularity

    • LRU hit rate ≈ 0.8-0.9 × LFU theoretical rate (Berger et al. 2015)
    • Correction factor: 0.85
  2. Temporal clustering: User sessions create bursts

    • Positive effect: L1 cache absorbs repeated requests within sessions
    • L1 adds +10-15% effective hit rate on top of L2
  3. Workload variation: α varies by vertical (e-commerce vs gaming)

    • α = 0.9-1.1 typical range
    • Lower α → flatter distribution → lower hit rate

Step 4: Combined L1 + L2 hit rate

L2 realistic hit rate: \(0.92 \times 0.85 \approx 0.78\) (78%)

L1 contribution: Caffeine in-process cache with 60% hit rate captures hot subset

Combined rate: \(H_{total} = H_{L1} + (1 - H_{L1}) \times H_{L2}\)

$$H_{total} = 0.60 + (1 - 0.60) \times 0.78 = 0.60 + 0.31 = 0.91 \;(91\%)$$

But: L1 size is tiny (2-4GB), only caches ~1M hottest users (0.25% coverage)

Recalculating with realistic L1: even crediting L1 only for the hottest ~1M users, the extreme head of the Zipf curve means those users absorb a large share of traffic, so the combined estimate barely drops from ~90%.

Wait, still too high compared to our 78-88% claim!

Step 5: Conservative adjustments

To get 78-88% range, we account for:

  1. Worst-case α = 0.9 (flatter distribution than α=1.0)

    • Recalculating with α=0.9: \(H(80M, 0.9) / H(400M, 0.9) \approx 0.88\)
    • With 0.85 LRU correction: \(0.88 \times 0.85 \approx 0.75\) (75%)
    • Plus L1 (60%): \(0.60 + 0.40 \times 0.75 = 0.90\) (still 90%!)
  2. Real issue: Our 20% L2 coverage doesn’t cache top 80M individual users

    • Reality: L2 caches ~800GB of serialized profile data
    • Average profile size: ~1-10KB depending on richness
    • Effective user coverage: 80M - 800M users depending on profile size
    • If profiles avg 4KB: 800GB / 4KB = 200M users (50% coverage, not 20%!)

Reconciliation: The “20% coverage” refers to storage capacity (800GB / 4TB), not user count!

With 50% user coverage (C = 200M): \(H(200\text{M})/H(400\text{M}) \approx 19.7/20.4 \approx 0.97\) theoretical; with the 0.85 LRU correction, ~0.82 for L2 alone.

Conservative range 78-88%: the lower bound assumes a flatter workload (α = 0.9) with pure LRU and small profiles; the upper bound assumes α = 1.0 with ~50% effective user coverage. The mid-range (≈83%) matches the optimization table.

Validation sources: Berger et al. (2015) for the LRU-vs-LFU correction; standard Zipf web-caching literature for the distribution itself.
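To make the arithmetic reproducible, here is a minimal sketch of Steps 1-4 (parameter values from the worked example; the 0.85 LRU correction and 60% L1 rate are the stated assumptions):

```java
/** Reproduces the Zipf hit-rate estimate: harmonic-number approximation,
 *  LRU correction, and L1+L2 combination. */
public final class ZipfHitRate {
    static final double GAMMA = 0.5772156649; // Euler-Mascheroni constant

    static double harmonic(double n) { return Math.log(n) + GAMMA; } // H(n, 1) approx

    public static void main(String[] args) {
        double totalUsers  = 400e6;
        double cachedUsers = 80e6;                                   // top 20% by count
        double l2Theoretical = harmonic(cachedUsers) / harmonic(totalUsers); // ≈ 0.92
        double l2Realistic   = l2Theoretical * 0.85;  // LRU ≈ 0.85 × LFU (Berger et al. 2015)
        double combined      = 0.60 + (1 - 0.60) * l2Realistic;      // L1 absorbs 60% first
        System.out.printf("L2 theoretical %.2f, realistic %.2f, L1+L2 %.2f%n",
            l2Theoretical, l2Realistic, combined);                   // ≈ 0.92, 0.78, 0.91
    }
}
```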

Trade-off accepted: We choose 20% coverage (800GB distributed across cluster) because:

  1. Lowest total cost: Optimal point on cost curve (80% of 5%-coverage baseline)
  2. 78-88% cache hit rate meets 80%+ requirement with safety margin (mid-range = 83%)
  3. Only 12-22% requests incur database query penalty (acceptable for 20ms budget)
  4. Latency cost minimized (reduces latency penalty 59% vs 10% coverage)
  5. Worth paying higher cache cost to save significantly on database and latency costs

TTL Optimization: Freshness vs Hit Rate Tradeoff

Time-to-live (TTL) settings create a second optimization problem: longer TTLs raise hit rates but serve staler data; shorter TTLs keep data fresh but increase misses.

Staleness Cost Model:

$$C_{staleness} = P(\text{stale}) \times C_{error}$$

For user profiles, a stale read means targeting against outdated preferences; \(C_{error}\) is the expected revenue lost per mistargeted impression.

Example: a 30s TTL bounds staleness at 30 seconds; a 300s TTL bounds it at 300 seconds - 10× the staleness exposure for a comparatively modest hit-rate gain.

Optimal TTL: 30-60 seconds

Balances freshness cost with reasonable hit rate. Longer TTLs increase staleness cost 10×.

Multi-Tier Architecture: Performance vs Complexity Trade-off

Question: Does adding L1 in-process cache (Caffeine) justify the added complexity?

L1 Cache Overhead: ~100MB of heap per server, ~2% CPU, plus GC pressure and the pub/sub invalidation complexity described earlier.

L1 Cache Benefits:

Performance gains: ~1μs lookups instead of 5ms Redis round-trips on 60% of requests, saving roughly 3ms of average latency.

At 150ms total latency budget, 3ms represents ~2% improvement - marginal performance benefit.

However: L1 cache provides critical resilience during L2 failures:

| Scenario | L1 Cache | Impact |
|---|---|---|
| Redis healthy | 60% L1 hit, 40% L2 hit | Optimal latency |
| Redis degraded (p99 >15ms) | 60% L1 hit, 40% cold start | -4-6% targeting accuracy, system stays online |
| Redis down | 60% L1 hit, 40% database | Database load manageable (40% instead of 100%) |
| No L1 cache | 100% cache miss on Redis failure | Database overload → cascading failure |

Decision: Keep L1 for resilience and fault tolerance, not performance optimization. The 2% CPU overhead is insurance against catastrophic L2 cache failures.

Cost Summary (relative to total caching infrastructure):

| Component | Relative Cost | Notes |
|---|---|---|
| L1 Cache (Caffeine) | ~0% | In-process, negligible memory |
| L2 Cache (Redis/Valkey) | 58% | 800GB at 20% coverage, 78-88% hit rate |
| L3 Database infrastructure (CockroachDB) | 22-29% | 60-80 nodes baseline |
| Database query cost (cache misses) | 13% | 12-22% miss rate × query volume |
| Cache miss latency cost | 8% | Revenue loss from slow queries |
| Total caching infrastructure | 100% | Optimized for 78-88% hit rate at 20% coverage |

Alternative (no caching): all 1M QPS hit CockroachDB directly, requiring a far larger cluster and adding the full 10-20ms read latency to every request.

Savings from caching: 73-75% cost reduction vs the no-cache alternative.

Redis Cluster: Consistent Hashing and Sharding

Cluster Configuration: 1000 Redis nodes sharing the protocol's fixed 16384 hash slots, ~16 slots per node.

Hash Slot Assignment:

For key \(k\), compute hash: $$\text{slot}(k) = \text{CRC16}(k) \mod 16384$$

Slot-to-node mapping maintained in cluster state.

Virtual Nodes:

Each physical node handles \(\frac{16384}{1000} \approx 16\) hash slots.

Load Distribution:

With a uniform hash function, per-node load imbalance is governed by the number of virtual nodes: $$\frac{\sigma_{\text{load}}}{\mu} \approx \frac{1}{\sqrt{v}}$$

where \(\mu\) = mean load per node and \(v\) = virtual nodes (hash slots) per physical node.

Example: 1M QPS across 1000 nodes (~1000 QPS per node) with 16 virtual nodes each → \(1/\sqrt{16} = 25\%\) standard deviation relative to mean load.

Hot Partition Problem and Mitigation

Problem Definition:

A “celebrity user” generates 100x normal traffic, all of it hashing to the single node that owns that key's slot.

A single Redis node cannot handle the spike → it becomes the bottleneck.

Detection: Count-Min Sketch

Count-Min Sketch is a probabilistic data structure that tracks key frequencies in constant memory (~5KB for millions of keys) with O(1) operations. It provides conservative frequency estimates (never under-counts, may over-estimate), making it ideal for detecting hot keys without storing exact counters. Trade-off: tunable accuracy vs memory footprint.
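A minimal Count-Min sketch to make the structure concrete (depth/width values are illustrative; a production system would typically use a library implementation inside the Kafka Streams processor):

```java
import java.util.Random;

/** Count-Min sketch: d hash rows, w counters each. Estimates never
 *  under-count, so a key flagged as hot really is frequent. */
public final class CountMinSketch {
    private final int depth = 4, width = 2048;   // ~64KB of longs; tune for accuracy
    private final long[][] counts = new long[depth][width];
    private final int[] seeds = new int[depth];

    public CountMinSketch() {
        Random r = new Random(42);
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
    }

    private int bucket(String key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);                          // mix the high bits
        return Math.floorMod(h, width);
    }

    public void add(String key) {
        for (int i = 0; i < depth; i++) counts[i][bucket(key, i)]++;
    }

    /** Conservative estimate: the minimum across rows bounds the over-count. */
    public long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) min = Math.min(min, counts[i][bucket(key, i)]);
        return min;
    }
}
```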

Dynamic Hot Key Replication:

Goal: Prevent hot keys (e.g., celebrity users, viral content) from overwhelming a single cache node and creating bottlenecks.

Approach:

  1. Detection threshold: Configure the request rate that triggers replication

    • Too low = unnecessary replication overhead (memory waste across multiple nodes)
    • Too high = hot keys cause bottlenecks before mitigation kicks in
    • Determine based on single-node capacity and typical access patterns
  2. Replication factor selection: Choose how many replicas to create

    • Calculate: \(\text{replicas\_needed} = \lceil \frac{\text{hot\_key\_traffic}}{\text{single\_node\_capacity}} \rceil\)
    • Trade-off: More replicas = better load distribution but higher memory overhead
    • Consider network topology (replicate across availability zones for resilience)
  3. Load distribution: Spread reads across replicas

    • Random selection = simple, uniform distribution
    • Locality-aware = lower latency but more complex routing

How to determine values: load-test a single node to find its sustainable QPS ceiling, set the detection threshold at a safety fraction (e.g., 50%) of that ceiling, and size replicas with the formula above, as in the sketch below.
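A toy calculation of the replica count from the formula above (the traffic and capacity numbers are purely illustrative):

```java
public final class HotKeyReplication {
    /** ceil(hot_key_traffic / single_node_capacity), per the formula above. */
    static int replicasNeeded(double hotKeyQps, double nodeQpsCapacity) {
        return (int) Math.ceil(hotKeyQps / nodeQpsCapacity);
    }

    public static void main(String[] args) {
        // Illustrative: a celebrity key at 150K QPS, nodes rated for 80K QPS
        System.out.println(replicasNeeded(150_000, 80_000)); // -> 2 replicas
    }
}
```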

Workload Isolation: Separating Batch from Serving Traffic

One critical lesson from large-scale systems: never let batch workloads interfere with serving traffic.

The Problem:

Hourly batch jobs updating user profiles in CockroachDB (millions of writes/hour) can interfere with serving-layer reads for ad personalization. Without isolation, batch writes can saturate disk I/O, build up compaction debt, and push serving-read p99 latencies past SLA.

Solution: Read/Write Replica Separation

Goal: Isolate batch write workloads from latency-sensitive serving reads to prevent I/O contention, queue buildup, and compaction-induced stalls.

Approach:

  1. Workload characterization: Measure your read/write ratio and latency requirements

    • Serving traffic: high-volume reads, strict latency SLAs (e.g., <20ms p99)
    • Batch jobs: bursty writes, throughput-focused, can tolerate higher latency
  2. Capacity allocation strategy: Dedicate infrastructure based on workload intensity

    • Calculate: \(\text{batch\_capacity} = \frac{\text{batch\_write\_throughput} \times \text{replication\_factor}}{\text{node\_write\_capacity}}\)
    • Calculate: \(\text{serving\_capacity} = \frac{\text{serving\_read\_throughput} \times \text{safety\_margin}}{\text{node\_read\_capacity}}\)
    • Trade-off: Over-provisioning batch capacity wastes resources; under-provisioning causes spillover that degrades serving latency
  3. Consistency vs staleness trade-off: Decide what staleness is acceptable for serving reads

    • Strong consistency = all reads hit the write leader (no isolation benefit, full contention)
    • Eventual consistency = reads from local replicas (isolation achieved, but data may be slightly stale)
    • Determine staleness tolerance based on business requirements (user profiles can tolerate seconds of lag, financial data may require strong consistency)
  4. Topology design: Pin workloads to specific regions/nodes

    • Use database-specific primitives (range leases, follower reads, read replicas)
    • Concentrate batch writes on dedicated infrastructure
    • Serve reads from separate replicas that aren’t absorbing write load

How to determine capacity split: measure batch's share of total database operations and provision isolation capacity in roughly that proportion plus replication overhead (the example below assumes a 30% batch share).

Cost of isolation: You’re essentially paying for separate infrastructure to prevent contention. The cost is proportional to your batch workload intensity. If batch jobs consume 30% of total database operations, expect to provision roughly 30-40% additional capacity for isolation (accounting for replication overhead).

Monitoring the gap:

Track replication lag between batch and serving replicas:

$$\text{Replication lag} = \text{Timestamp}_{\text{batch replica}} - \text{Timestamp}_{\text{serving replica}}$$

If lag exceeds 5 minutes, you might have a problem. Scale the batch replica or throttle batch writes.

Cache Invalidation Strategies

Problem: When user data updates (e.g., profile change), how to invalidate stale cache?

Strategy 1: TTL-Based (Passive)

Set time-to-live on cache entries: $$\text{Staleness} \leq \text{TTL}$$

Pros: dead simple, no invalidation infrastructure, staleness bounded by construction.

Cons: worst-case staleness equals the full TTL, and shortening the TTL to reduce staleness sacrifices hit rate.

Strategy 2: Active Invalidation (Event-Driven)

On data update:

  1. Publish invalidation event to Kafka topic
  2. All cache servers subscribe and evict key from L1/L2

Latency:

Kafka publish latency: ~5ms. Consumer processing: ~10ms. Total invalidation propagation: ~15ms.

Pros: staleness bounded by ~15ms propagation instead of the TTL.

Cons: requires Kafka plus a subscriber on every cache server, and a dropped event leaves a stale entry until the TTL backstop expires.

Strategy 3: Versioned Caching

Include the version in the cache key, e.g. user:12345:profile:v7 (key scheme illustrative):

On update:

  1. Increment version in metadata store
  2. New requests fetch new version
  3. Old version expires naturally via TTL

Pros: no invalidation race - readers never see a half-updated entry, and old versions expire naturally.

Cons: every read needs a version lookup first, and both versions occupy cache memory until the old TTL lapses.
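A sketch of the versioned read path, assuming the illustrative key scheme above and a Jedis client (two round-trips per read unless the version is also cached locally):

```java
import redis.clients.jedis.JedisPooled;

public final class VersionedCache {
    private final JedisPooled redis = new JedisPooled("redis-host", 6379); // placeholder

    public String get(String userId) {
        String version = redis.get("user:" + userId + ":version"); // e.g. "7"
        if (version == null) version = "0";
        return redis.get("user:" + userId + ":profile:v" + version);
    }

    /** On profile update: bump the version; old entries age out via TTL. */
    public void bumpVersion(String userId) {
        redis.incr("user:" + userId + ":version");
    }
}
```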

Hybrid Approach (Recommended):

Architectural Drivers: Latency vs Financial Accuracy - We use eventual consistency (30s TTL) for user preferences to meet latency targets, but strong consistency (active invalidation) for GDPR opt-outs where legal compliance is non-negotiable.


Privacy-Preserving Attribution: SKAdNetwork & Privacy Sandbox

Architectural Driver: Signal Availability - When 40-60% of traffic lacks stable user_id (ATT opt-out, Privacy Sandbox), traditional click-to-conversion attribution breaks. SKAdNetwork (iOS) and Attribution Reporting API (Chrome) provide privacy-preserving alternatives with delayed, aggregated conversion data.

The Attribution Challenge

Traditional attribution: User clicks ad → store user_id + click_id → user converts → match conversion to click via user_id → attribute revenue.

This fails when: iOS ATT opt-outs remove the IDFA, Chrome's third-party cookie deprecation removes cross-site identifiers, and privacy regulation restricts user_id storage - 40-60% of traffic in total.

Privacy frameworks provide attribution without persistent identifiers: SKAdNetwork on iOS and the Attribution Reporting API in Chrome, covered in turn below.

SKAdNetwork Postback Handling (iOS)

Apple’s SKAdNetwork provides conversion data for ATT opt-out users through delayed postbacks. When a user clicks an ad and installs an app, iOS starts a privacy timer (24-72 hours, randomized). If the user converts within the app during this window, the app signals the conversion to SKAdNetwork. After the timer expires, Apple sends an aggregated postback to the ad network containing campaign-level attribution data.

Critical architectural constraints:

The postback contains only campaign identifier and a 6-bit conversion value (0-63) - no user identity, device ID, or precise conversion details. This forces a fundamentally different attribution model:

Data pipeline integration:

    
```mermaid
graph TB
    SKAN["SKAdNetwork Postback<br/>HTTPS webhook"]
    KAFKA["Kafka Topic<br/>skan-postbacks"]
    FLINK["Flink Processor<br/>Aggregate by campaign"]
    CRDB["CockroachDB<br/>campaign_conversions table"]
    SKAN -->|Parse & validate| KAFKA
    KAFKA --> FLINK
    FLINK -->|campaign_id, conversion_value, count| CRDB
    style SKAN fill:#f9f,stroke:#333
    style KAFKA fill:#ff9,stroke:#333
    style FLINK fill:#9ff,stroke:#333
    style CRDB fill:#9f9,stroke:#333
```

Storage and aggregation:

Postbacks arrive as HTTPS webhooks, get queued in Kafka for reliability, then aggregated by Flink into campaign-level conversion metrics. The database stores daily aggregates partitioned by date: campaign identifier, conversion value, postback count, and revenue estimates.

Conversion value interpretation:

Advertisers map the 64-value (6-bit) conversion space to their business model. Common patterns include quartile-based revenue brackets (0-15 for trials/signups, 16-31 for small purchases, 32-47 for medium, 48-63 for high-value conversions) or subscription tier encoding. The mapping is a critical product decision since it defines what the ML models can optimize for.
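For illustration, one way to implement the quartile-style mapping just described (bracket boundaries are hypothetical product choices, not part of Apple's API):

```java
/** Maps purchase revenue onto SKAdNetwork's 6-bit conversion value (0-63). */
public final class ConversionValueEncoder {
    static int encode(double revenueUsd) {
        if (revenueUsd <= 0)  return 0;                              // trial/signup only
        if (revenueUsd < 5)   return 16 + scale(revenueUsd, 0, 5);   // small: 16-31
        if (revenueUsd < 50)  return 32 + scale(revenueUsd, 5, 50);  // medium: 32-47
        return Math.min(63, 48 + scale(revenueUsd, 50, 500));        // high-value: 48-63
    }

    /** Linearly maps [lo, hi) onto a 16-slot bracket (0-15). */
    private static int scale(double v, double lo, double hi) {
        return (int) Math.min(15, Math.floor((v - lo) / (hi - lo) * 16));
    }
}
```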

Trade-offs accepted: campaign-level rather than user-level optimization, 24-72 hour feedback delay, and only 64 distinguishable conversion outcomes per install.

Privacy Sandbox Attribution Reporting API (Chrome)

Chrome’s Attribution Reporting API offers two distinct privacy models: event-level reports that link individual clicks to conversions with heavy noise (only 3 bits of conversion data, delayed 2-30 days), and aggregate reports that provide detailed conversion statistics across many users protected by differential privacy. The browser mediates all attribution, storing click events locally and generating reports after random delays to prevent timing attacks.

Integration approach:

Reports arrive at a dedicated endpoint, flow through the same Kafka-Flink-CockroachDB pipeline as SKAdNetwork postbacks, and aggregate into unified campaign-level metrics. This allows treating iOS and Chrome privacy-preserving attribution as a single conceptual layer despite different underlying mechanisms.

Maturity considerations:

Privacy Sandbox is evolving through 2024/2025. Attribution Reporting API is in origin trials (pre-production testing), Topics API is already integrated for contextual interest signals (Part 1), and Protected Audience API (formerly FLEDGE) for on-device auctions remains on the roadmap. The architecture must accommodate API changes as specifications stabilize.

Operational impact:

| Attribution Method | Coverage | Latency | Granularity | Revenue Performance |
|---|---|---|---|---|
| Traditional (cookie/IDFA) | 40-60% (declining) | Real-time | User-level | 100% baseline |
| SKAdNetwork | iOS opt-out users | 24-72 hours | Campaign-level | 60-70% of baseline |
| Privacy Sandbox | Chrome users | 2-30 days | Event-level (noised) or aggregate | 50-80% of baseline (evolving) |
| Contextual-only | All users | Real-time | Request-level | 50-70% of baseline |

Our approach: ingest all attribution sources (traditional, SKAdNetwork, Privacy Sandbox) through the shared Kafka → Flink → CockroachDB pipeline into unified campaign-level metrics, and fall back to contextual signals where no identifier exists.


Immutable Financial Audit Log: Compliance Architecture

The Compliance Gap

The CockroachDB operational ledger is mutable by design - optimized for operational efficiency, but mutable tables cannot satisfy financial compliance: UPDATE and DELETE are possible, so records cannot be proven unaltered.

Regulatory violations: SOX-style requirements demand immutable, tamper-evident financial records with multi-year retention (7 years here) - none of which a mutable operational table provides on its own.

Solution: Dual-Ledger Architecture

Separate operational concerns (performance) from compliance concerns (immutability) using distinct systems:

Operational Ledger (CockroachDB): mutable, 90-day retention, on the synchronous 3-5ms write path for budget and billing decisions.

Immutable Audit Log (Kafka → ClickHouse): append-only MergeTree storage, 7-year retention, SHA-256 hash chaining, populated asynchronously so it adds no latency to the serving path.

    
```mermaid
graph TB
    subgraph OPERATIONAL["Operational Systems (Real-Time)"]
        BUDGET["Budget Service<br/>3ms latency"]
        BILLING["Billing Service<br/>Charges & Refunds"]
        CRDB[("CockroachDB<br/>Operational Ledger<br/>Mutable<br/>90-day retention")]
    end
    subgraph PIPELINE["Event Pipeline"]
        KAFKA["Kafka Topic<br/>financial-events<br/>30-day retention<br/>3x replication"]
    end
    subgraph AUDIT["Immutable Audit Log"]
        CH_KAFKA["ClickHouse<br/>Kafka Engine Table"]
        CH_MV["Materialized View<br/>Transform JSON"]
        CH_STORAGE[("ClickHouse<br/>MergeTree Storage<br/>Immutable<br/>7-year retention<br/>Hash chaining")]
    end
    subgraph QUERY["Query Interfaces"]
        RECON["Daily Reconciliation Job<br/>Automated 2AM UTC"]
        METABASE["Metabase Dashboard<br/>Finance Team"]
        SQL["SQL Client<br/>External Auditors"]
        EXPORT["Parquet Export<br/>Quarterly Audits"]
    end
    BUDGET -->|"Async publish<br/>non-blocking"| KAFKA
    BILLING -->|"Async publish<br/>non-blocking"| KAFKA
    BUDGET -->|"Sync write<br/>3ms"| CRDB
    BILLING -->|"Sync write<br/>5ms"| CRDB
    KAFKA -->|"Real-time consume<br/>5s lag"| CH_KAFKA
    CH_KAFKA --> CH_MV
    CH_MV --> CH_STORAGE
    RECON -.->|Query operational| CRDB
    RECON -.->|Query audit| CH_STORAGE
    METABASE -.->|Ad-hoc queries| CH_STORAGE
    SQL -.->|Read-only access| CH_STORAGE
    EXPORT -.->|Quarterly extract| CH_STORAGE
    style BUDGET fill:#e3f2fd
    style BILLING fill:#e3f2fd
    style CRDB fill:#fff3e0
    style KAFKA fill:#f3e5f5
    style CH_STORAGE fill:#e8f5e9
    style RECON fill:#ffebee
```

Event Pipeline Architecture

Event Flow: Budget Service and Billing Service emit structured financial events (budget deductions, impression charges, refunds, allocations) to Kafka financial-events topic asynchronously. Each event contains event type, campaign/advertiser IDs, amount, timestamp, and correlation IDs for traceability.

Kafka Buffer: Topic configured with 30-day retention (safety buffer during ClickHouse downtime), partitioned by campaignId for ordering guarantees, 3× replication for durability. Capacity: 100K events/sec (10% of platform QPS generating financial events).

ClickHouse Ingestion: Kafka Engine table consumes events directly, Materialized View transforms JSON into columnar schema optimized for analytics. MergeTree storage provides append-only immutability with automatic ZSTD compression (65% reduction). Ingestion lag: <5 seconds from event generation to queryable.

Audit Query Patterns

ClickHouse OLAP optimization enables sub-second queries for compliance scenarios:

Campaign Spend History (Tax Reporting): Aggregate all budget deductions for specific campaign over annual period. Common during tax filing season when advertisers request detailed spending breakdowns by campaign, geography, and time period. ClickHouse columnar storage and partition pruning enable sub-500ms queries across billions of events when filtering by campaign and time-range.

Dispute Investigation (Billing Accuracy): Trace complete event sequence for specific request ID when advertiser disputes charge. Requires chronological ordering of all events (budget deduction, impression charge, click attribution, refund if applicable) to reconstruct exact billing calculation. Bloom filter index on requestId enables <100ms single-request retrieval even across multi-year dataset.

Reconciliation Analysis (Data Integrity): Compare daily aggregate spend between operational ledger (CockroachDB) and audit log (ClickHouse) to detect discrepancies. Requires grouping by campaign with tolerance for rounding differences. ClickHouse materialized views pre-compute daily aggregates for instant reconciliation queries.

Compliance Audit Trail (SOX/Regulatory): External auditors query complete financial history for specific advertiser or time period. Requires filtering by advertiser ID, event type (budget allocations, deductions, refunds), and date range with multi-dimensional grouping. ClickHouse query performance remains sub-second for most audit scenarios due to partition pruning and columnar compression.

Query Access Control

Access Restriction Policy: Financial audit log is classified data with restricted access per Segregation of Duties (SOX compliance). Default access: NONE. Only designated roles below have explicit permissions:

Automated Systems: the daily reconciliation job (read-only aggregate queries at 2:00 AM UTC) and the quarterly Parquet export are the only non-human consumers.

Finance Team: Read-only Metabase access (SSO auth, 30s timeout, 100K row limit). Authorized queries: campaign spend trends, refund analysis, advertiser billing summaries, budget utilization reports. Handles all billing dispute investigations requiring financial data access.

External Auditors: Temporary credentials (expire post-audit) with pre-approved query templates for: annual tax reporting, SOX compliance verification, advertiser reconciliation. Complex queries scheduled off-peak. All auditor activity logged separately for compliance record.

Break-Glass Access: Emergency investigation (data corruption, critical billing bug) requires VP Finance + VP Engineering approval, limited to 1-hour window, full session recording, mandatory post-incident compliance review.

ClickHouse Storage Design

MergeTree Configuration: Ordering key (campaignId, timestamp) optimizes campaign history queries. Monthly partitioning by toYYYYMM(timestamp) enables efficient pruning for tax/annual reports. ZSTD compression achieves 65% reduction (200GB/day → 70GB/day). Bloom filter index on requestId enables <100ms dispute lookups.

Immutability Enforcement: MergeTree prohibits UPDATE/DELETE operations by design. Administrative changes require explicit ALTER TABLE DROP PARTITION (logged separately). Each row includes SHA-256 previousHash creating tamper-evident chain - modification breaks hash sequence, detected during quarterly verification.
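A sketch of the quarterly verification pass just described (the row model is a simplified stand-in for the real ClickHouse schema, and the empty-string genesis convention is an assumption):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;

/** Walks the hash chain: each row must carry the SHA-256 of the previous row,
 *  so any in-place modification breaks the sequence and is detected. */
public final class HashChainVerifier {
    record AuditRow(String payload, String previousHash) {}

    static boolean verify(List<AuditRow> rows) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        String expected = "";                         // assumed genesis: empty previousHash
        for (AuditRow row : rows) {
            if (!row.previousHash().equals(expected)) return false; // chain broken
            expected = HexFormat.of().formatHex(
                sha.digest((row.previousHash() + row.payload())
                    .getBytes(StandardCharsets.UTF_8)));
        }
        return true;
    }
}
```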

Performance & Cost: Asynchronous write path (1-2ms Kafka publish, <5s ingestion lag) has zero operational latency impact. Query performance: <500ms simple aggregations, 1-3s complex analytics, <100ms dispute lookups. Storage: 180TB for 7-year retention (70GB/day × 2,555 days), approximately 15-20% of database infrastructure cost. Sub-second queries over billions of rows via columnar OLAP optimization.

Daily Reconciliation Process

Automated verification ensuring operational and audit ledgers remain synchronized. This process validates data integrity and detects system issues before they compound into billing disputes.

Reconciliation Job (Airflow DAG, scheduled 2:00 AM UTC daily):

Step 1: Extract Daily Aggregates from Both Systems

Query operational ledger (CockroachDB) and audit log (ClickHouse) for previous 24 hours, aggregating spend per campaign. Operational ledger contains real-time mutable data (90-day retention), while audit log contains immutable append-only events (7-year retention). Aggregation groups by campaign ID, summing budget deductions and impression charges while excluding refunds (handled separately).

Step 2: Compare Aggregates with Tolerance

Per-campaign validation accepts minor differences due to rounding and microsecond-level timing variations. Match tolerance is 1 cent OR 0.001% of campaign total, whichever is greater. For example, a campaign with $10,000 spend allows up to 10 cents of variance, while a small campaign with $5 spend allows 1 cent. This tolerance accounts for floating-point rounding in budget calculations and clock skew between systems.
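The match rule is small enough to show in full (a direct transcription of the tolerance above):

```java
/** Per-campaign match rule: tolerate max(1 cent, 0.001% of campaign total). */
public final class ReconciliationCheck {
    static boolean matches(double operationalTotal, double auditTotal) {
        double tolerance = Math.max(0.01, operationalTotal * 0.00001); // 0.001%
        return Math.abs(operationalTotal - auditTotal) <= tolerance;
    }

    public static void main(String[] args) {
        System.out.println(matches(10_000.00, 10_000.08)); // true: within 10-cent band
        System.out.println(matches(5.00, 5.02));           // false: exceeds 1-cent band
    }
}
```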

Step 3: Alert on Significant Discrepancies

P1 PagerDuty alert triggered when campaign variance exceeds threshold. Alert includes: affected campaign IDs, operational vs audit totals, percentage variance, and trend analysis (has this campaign had previous mismatches?). Dashboard visualization shows aggregate delta across all campaigns, enabling quick identification of systemic issues (e.g., Kafka consumer lag affecting all campaigns vs isolated campaign-specific bug).

Step 4: Forensic Investigation

Drill-down analysis retrieves the complete event sequence for a mismatched campaign from both systems, correlating operational ledger entries with audit log events by request ID. Most common root causes: missing events (the operational write succeeded but the Kafka publish failed), duplicate events (a retry caused a double-write), and timing mismatches (the event arrived after the reconciliation window).

Step 5: Automated Resolution Tracking

Reconciliation job stores results in dedicated tracking table: campaign ID, discrepancy amount, detection timestamp, resolution status. Daily report summarizes: total campaigns reconciled, mismatch count, average variance, unresolved discrepancy age. Historical trend analysis detects degrading data quality (increasing mismatch rate signals systemic problem requiring investigation).

Historical Success Rate: 99.999%+ campaigns match daily (typically 0-3 discrepancies out of 10,000+ active campaigns). Most discrepancies resolve automatically within 24-48 hours as delayed Kafka events arrive. Only 1-2 cases per month require manual intervention (code bug fixes, schema corrections, or manual backfill with approval workflow).


Auction Mechanism Design

First-Price Auctions: Industry Standard for RTB

Since 2019, the programmatic advertising industry has standardized on first-price auctions for Real-Time Bidding (RTB) and display advertising. In a first-price auction, the winner pays their bid - not the second-highest bid.

Why First-Price Became Standard:

The industry shifted from second-price to first-price auctions to address transparency concerns and bid-landscape visibility. Key drivers: header bidding (running several exchanges' auctions in parallel broke second-price assumptions), hidden fees and soft floors eroding trust in the "second price", and advertisers wanting pay-what-you-bid predictability.

Auction Setup: N demand-side platforms (DSPs) each submit a bid; bids arrive under different pricing models and are normalized to eCPM before ranking.

Effective Bid (eCPM - effective Cost Per Mille):

Advertisers use different pricing models - some pay per impression (CPM), others per click (CPC), others per conversion (CPA). To compare apples-to-apples, we convert all bids to eCPM: expected revenue per 1000 impressions.

The conversion formulas are as follows:

$$ \begin{array}{ll} \text{CPM bid:} & eCPM = CPM (direct) \\ \text{CPC bid:} & eCPM = CPC \times CTR \times 1000 \\ \text{CPA bid:} & eCPM = CPA \times conversion\_rate \times CTR \times 1000 \end{array} $$

This normalizes bids across pricing models: eCPM represents expected revenue per 1000 impressions, accounting for how likely users are to click.

Why this matters: A higher CPC bid with low CTR (5%) may earn less than a lower CPC bid with high CTR (15%). The platform maximizes revenue by selecting the highest eCPM, not highest raw bid.

Winner Selection:

$$w = \arg\max_{i \in [1,N]} \text{eCPM}_i$$

Price Determination (First-Price):

The winner pays their bid (not the second-highest bid):

$$p_w = b_w$$

This is fundamentally different from second-price auctions where winners paid just enough to beat the runner-up.
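The whole mechanism fits in a few lines. This sketch (types and fields are illustrative; predicted CTR/CVR come from the ML models of Part 1) normalizes heterogeneous bids to eCPM, applies a reserve floor, and charges the first-price winner their own bid:

```java
import java.util.Comparator;
import java.util.List;

public final class FirstPriceAuction {
    enum PricingModel { CPM, CPC, CPA }
    record Bid(String advertiser, PricingModel model, double amount,
               double predictedCtr, double predictedCvr) {
        double ecpm() {
            return switch (model) {
                case CPM -> amount;                                    // already per-mille
                case CPC -> amount * predictedCtr * 1000;              // CPC × CTR × 1000
                case CPA -> amount * predictedCvr * predictedCtr * 1000;
            };
        }
    }
    record Result(String winner, double pricePaid) {}

    static Result run(List<Bid> bids, double reserveEcpm) {
        return bids.stream()
            .filter(b -> b.ecpm() >= reserveEcpm)           // reserve-price filter
            .max(Comparator.comparingDouble(Bid::ecpm))     // rank by eCPM
            .map(w -> new Result(w.advertiser(), w.amount())) // first-price: pay own bid
            .orElse(null);                                  // unsold -> house ad
    }
}
```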

Example:

| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | B_a | 0.10 | B_a × 0.10 × 1000 = 100 × B_a | 2 |
| B | B_b | 0.15 | B_b × 0.15 × 1000 = 150 × B_b | 1 |
| C | B_c | 0.05 | B_c × 0.05 × 1000 = 50 × B_c | 3 |

Winner: Advertiser B (highest eCPM multiplier: 150× vs 100× vs 50×)

Price paid by B in first-price auction: $$p_B = b_B = B_b$$

Advertiser B pays their full bid amount.

Comparison: Second-Price vs First-Price

In a second-price auction (historical approach), Advertiser B would have paid just enough to beat A’s eCPM (by a small increment). In first-price, they pay their full bid.

The Bid Shading Response:

First-price auctions incentivize bid shading - DSPs use machine learning to predict the minimum bid needed to win and bid slightly above that. This recovers much of the economic efficiency of second-price auctions while maintaining transparency. (See “Bid Shading in First-Price Auctions” section below for details.)

Quality Score and Ad Rank

Ads are ranked by eCPM = bid × CTR, but in practice ad quality also matters for user experience.

The Quality Problem:

Consider two advertisers: X bids lower with an excellent landing page (10/10 quality); Y bids 40% higher with a poor landing page (6/10).

Should Y win just because they bid more? This degrades user experience.

Google’s Solution: Quality Score

Since ~2005, Google Ads has incorporated Quality Score into auction ranking:

$$\text{Ad Rank} = \text{Bid} \times \text{Quality Score}$$

Quality Score Components (1-10 scale):

Google evaluates three components, though exact weights are not publicly disclosed:

  1. Expected CTR (highest impact): Historical click-through rate for this keyword/ad combination
  2. Landing Page Experience (highest impact): Page load speed, mobile-friendliness, content relevance, security (HTTPS)
  3. Ad Relevance (moderate impact): How well ad text matches search query intent

Note: Research shows improving CTR or Landing Page Experience has roughly twice the impact of improving Ad Relevance. Focus optimization efforts on the top two components.

Modified Auction Ranking:

Instead of ranking by eCPM alone, rank by Ad Rank:

$$\text{Ad Rank}_i = b_i \times \text{CTR}_i \times \text{QualityScore}_i \times 1000$$

Example: Quality Beats Price

| Advertiser | Bid | CTR | Quality Score | Ad Rank | Position | Winner? |
|---|---|---|---|---|---|---|
| X | B_low | 0.15 | 10/10 (excellent) | Quality-adjusted eCPM: high | 1 | Yes |
| Y | B_high (40% higher) | 0.15 | 6/10 (poor landing page) | Quality-adjusted eCPM: lower | 2 | No |

Advertiser X wins despite lower raw bid because of higher quality (10/10 vs 6/10).

System Design Implications:

1. Data Pipeline Requirements: historical CTR per advertiser-keyword pair, landing-page quality signals, and billions of auction events in a time-series store to train against.

2. Computation Architecture:

Quality Score is computed offline by ML model, cached, and served at auction time:

    
```mermaid
graph TB
    subgraph "Offline Pipeline - Runs Daily/Weekly"
        direction BT
        CACHE_WRITE["Cache Update<br/>Redis/Memcached<br/>Atomic Swap"]
        PREDICT["Quality Score Prediction<br/>All Advertiser-Keyword Pairs<br/>Millions of Combinations"]
        TRAIN["ML Model Training<br/>XGBoost/Neural Net<br/>Hours of Batch Processing"]
        HD[("Historical Data Store<br/>Time-Series DB<br/>Billions of Auction Events")]
        HD --> TRAIN
        TRAIN --> PREDICT
        PREDICT --> CACHE_WRITE
    end
    subgraph "Online Pipeline - Real-Time <100ms"
        direction TB
        AUCTION["Auction Request<br/>User Query + Bids<br/>N Advertisers"]
        CACHE_LOOKUP{"Cache Lookup<br/>Redis Read<br/>< 1ms"}
        CACHE_HIT["Quality Score Retrieved<br/>99%+ Hit Rate"]
        CACHE_MISS["Cache Miss<br/>Use Default Score = 7/10<br/>< 1% Rate"]
        COMPUTE["Compute Ad Rank<br/>Bid × CTR × QualityScore<br/>< 1ms"]
        FIRST_PRICE["First-Price Auction<br/>Rank & Select Winner<br/>< 5ms"]
        RESULT["Auction Result<br/>Winner + Price<br/>Click/Impression Event"]
        AUCTION --> CACHE_LOOKUP
        CACHE_LOOKUP -->|Hit| CACHE_HIT
        CACHE_LOOKUP -->|Miss| CACHE_MISS
        CACHE_HIT --> COMPUTE
        CACHE_MISS --> COMPUTE
        COMPUTE --> FIRST_PRICE
        FIRST_PRICE --> RESULT
    end
    style HD fill:#e1f5ff
    style TRAIN fill:#e1f5ff
    style PREDICT fill:#e1f5ff
    style CACHE_WRITE fill:#e1f5ff
    style AUCTION fill:#fff4e1
    style CACHE_LOOKUP fill:#fffacd
    style CACHE_HIT fill:#d4edda
    style CACHE_MISS fill:#f8d7da
    style COMPUTE fill:#fff4e1
    style FIRST_PRICE fill:#fff4e1
    style RESULT fill:#fff4e1
```

3. Performance Considerations:

4. ML Model Deployment:

Relationship to First-Price Auctions:

Quality-adjusted first-price auctions work the same way:

The quality score affects ranking (who wins) but not the fundamental pricing (winner pays bid). This encourages advertisers to improve landing pages, ad relevance, and user experience to achieve better ad positions at lower bids.

Computational Complexity

First-Price Auction Complexity:

For \(N = 50\) DSPs:

Latency Impact:

At 5ms budget for auction logic:

Implementation Note: First-price auctions are computationally identical to second-price auctions (both O(N log N)). The difference is purely in pricing: first-price returns the winner’s bid, while second-price calculates the minimum bid needed to beat the runner-up.

Reserve Prices and Floor Prices

The Problem:

Without a reserve price (minimum bid), your auction might sell ad slots for very low prices when competition is low. Consider a scenario where only one advertiser bids far below market value for a premium slot - you’d rather show a house ad (promoting your own content) than sell it that cheaply.

What is a Reserve Price?

A reserve price \(r\) is the minimum eCPM required to participate in the auction. If no bids exceed \(r\), the impression is not sold (or filled with a house ad).

The Revenue Trade-off:

Setting the reserve price is a balancing act:

Reserve PriceWhat HappensExample
Too low
(0.25× market rate)
Sell almost all impressions, but accept low-value bids95% fill rate × low avg eCPM = suboptimal revenue
Optimal
(market rate)
Balance between fill rate and price70% fill rate × good avg eCPM = optimal revenue
Too high
(5× market rate)
Only premium bids qualify, but most impressions go unsold20% fill rate × high avg eCPM = suboptimal revenue

Mathematical Formulation:

Expected revenue per impression with reserve price \(r\):

$$\text{Revenue}(r) = r \times P(\text{bid} \geq r)$$

where \(P(\text{bid} \geq r)\) is the probability that at least one bid exceeds the reserve.

Optimal Reserve Price:

Find \(r^*\) that maximizes expected revenue. If bids follow a known distribution with CDF \(F(v)\):

$$r^* = \arg\max_r \left[ r \times (1 - F(r)) \right]$$

Interpretation:

Concrete Example:

Suppose historical bids range uniformly from zero to maximum bid B_max. What’s the optimal reserve?

For a uniform distribution on \([0, B_{max}]\): \(P(\text{bid} \geq r) = 1 - \frac{r}{B_{max}}\)

Expected revenue: $$\text{Revenue}(r) = r \times \left(1 - \frac{r}{B_{max}}\right) = r - \frac{r^2}{B_{max}}$$

Maximize by setting the derivative to zero: $$\frac{d}{dr}\left(r - \frac{r^2}{B_{max}}\right) = 1 - \frac{2r}{B_{max}} = 0$$

$$r^* = \frac{B_{max}}{2}$$

Result: Optimal reserve is half the maximum bid value (when bids are uniformly distributed).

Practical Approach:

Rather than assuming a distribution, use empirical data:
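A minimal sketch of this empirical approach, assuming we can sample the highest eCPM bid from historical auctions (the NumPy grid search and simulated data are illustrative):

```python
import numpy as np

def optimal_reserve(historical_top_bids: np.ndarray, grid_points: int = 200) -> float:
    """Pick the reserve r maximizing r * P(top bid >= r) from observed data."""
    candidates = np.linspace(0.0, historical_top_bids.max(), grid_points)
    # Empirical sell-through probability at each candidate reserve.
    p_sell = np.array([(historical_top_bids >= r).mean() for r in candidates])
    expected_revenue = candidates * p_sell
    return float(candidates[expected_revenue.argmax()])

# Simulated uniform bids on [0, 10]: the optimum should land near B_max / 2 = 5.
rng = np.random.default_rng(7)
print(round(optimal_reserve(rng.uniform(0, 10, 100_000)), 2))
```

On uniform data this recovers \(r^* \approx B_{max}/2\), matching the closed-form result above, but the same sweep works for any empirical bid distribution.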

Multi-Dimensional Reserve Prices:

In practice, reserve prices are often segmented by:

Implementation Note:

Reserve prices work identically in first-price and second-price auctions - they filter out bids below the threshold before ranking. The difference is only in what the winner pays (their bid vs. second-highest bid).

Bid Shading in First-Price Auctions

With first-price auctions, DSPs face a strategic problem: bidding true value guarantees zero profit (you pay exactly what the impression is worth to you). This creates the bid shading optimization problem.

The Bid Shading Problem:

In first-price auctions:

How Bid Shading Works:

DSPs use machine learning to predict the competitive landscape and bid strategically:

  1. Collect historical data: Track wins, losses, and winning prices across millions of auctions
  2. Build bid landscape model: For each impression context (user, publisher, time), predict:
    • Probability of winning at price \(p\): \(P(\text{win} | \text{bid} = p)\)
    • Distribution of competitor bids
  3. Optimize bid: Choose bid \(b\) that maximizes expected profit:

$$b^* = \arg\max_b \left[ (v - b) \times P(\text{win} | b) \right]$$

where \(v\) is the true value of the impression to the advertiser.

Example:

Suppose an advertiser values an impression at V_imp (based on predicted conversion rate). The bid landscape model predicts:

Optimal bid: 0.70 × V_imp (maximizes expected profit at 0.18 × V_imp per auction)
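A sketch of the bid-shading optimization under an assumed logistic win-probability curve; the curve shape and steepness are hypothetical stand-ins for a model that a real DSP would fit from auction logs:

```python
import numpy as np

def shade_bid(value: float, win_prob, grid_points: int = 500) -> tuple[float, float]:
    """Choose bid b maximizing expected profit (value - b) * P(win | b)."""
    bids = np.linspace(0.0, value, grid_points)
    profit = (value - bids) * win_prob(bids)
    i = profit.argmax()
    return float(bids[i]), float(profit[i])

V = 1.0  # normalized impression value V_imp
# Hypothetical bid landscape: win probability rises steeply around 0.6 * value.
win_prob = lambda b: 1.0 / (1.0 + np.exp(-12 * (b - 0.6 * V)))
bid, expected_profit = shade_bid(V, win_prob)
print(f"optimal bid = {bid:.2f} x V_imp, expected profit = {expected_profit:.2f} x V_imp")
```

With this particular curve the optimum lands near 0.7 × V_imp; different landscapes shift the shading point.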

Why First-Price + Bid Shading ≈ Second-Price:

Bid shading recovers much of the economic efficiency of second-price auctions:

The small difference represents the DSP’s uncertainty about the competitive landscape. As bid landscape models improve, first-price with shading converges toward second-price revenue.

System Design Implications:

From the SSP (supply-side platform) perspective:

Implementation Note: SSPs don’t implement bid shading - that’s the DSP’s responsibility. The SSP simply runs a first-price auction (rank by eCPM, winner pays bid). The complexity of bid optimization happens on the demand side.

Historical Context: Second-Price Auctions

Before 2019, the programmatic advertising industry primarily used second-price auctions (specifically, Generalized Second-Price or GSP auctions). Understanding this history helps explain design decisions in legacy systems and why the industry shifted to first-price.

Why Second-Price Was Popular (2000s-2018):

  1. Theoretical elegance: Encouraged truthful bidding (in theory)
  2. Simpler for advertisers: “Bid your true value” was easier to explain than bid shading
  3. Google’s influence: Google Search Ads used GSP successfully, setting industry precedent
  4. Established ecosystem: Bidding algorithms optimized for second-price dynamics

How Second-Price (GSP) Works:

In a second-price auction, the winner pays just enough to beat the second-highest bidder:

$$p_w = \frac{\text{eCPM}_{2nd}}{\text{CTR}_w \times 1000} + \epsilon$$

where \(\epsilon\) is a small increment.

Example:

| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | B_a | 0.10 | 100 × B_a | 2 |
| B | B_b | 0.15 | 150 × B_b | 1 |
| C | B_c | 0.05 | 50 × B_c | 3 |

Winner: Advertiser B (highest eCPM multiplier: 150×)

Price paid by B in second-price: $$p_B = \frac{100 \times B_a}{0.15 \times 1000} = 0.67 \times B_a$$

Advertiser B only pays enough to beat A’s eCPM (not their full bid B_b).
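For contrast with the first-price sketch earlier, here is a standalone GSP pricing sketch (the equal $1.00 bids are again an illustrative assumption):

```python
def gsp_price(bids: list[tuple[str, float, float]], epsilon: float = 0.01) -> tuple[str, float]:
    """bids are (advertiser, bid, ctr); winner pays eCPM_2nd / (CTR_w x 1000) + epsilon."""
    ranked = sorted(bids, key=lambda b: b[1] * b[2] * 1000, reverse=True)
    winner_name, _, winner_ctr = ranked[0]
    _, bid2, ctr2 = ranked[1]
    ecpm_2nd = bid2 * ctr2 * 1000
    return winner_name, ecpm_2nd / (winner_ctr * 1000) + epsilon

# Hypothetical equal $1.00 bids as above: B wins but pays ~0.67 x B_a, not B_b.
print(gsp_price([("A", 1.0, 0.10), ("B", 1.0, 0.15), ("C", 1.0, 0.05)]))
```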

Why the Industry Shifted to First-Price (2017-2019):

Several factors drove the migration:

  1. Header bidding transparency: Publishers could see all bids simultaneously, making second-price “bid reduction” visible and contentious
  2. Price floor manipulation: SSPs could manipulate second-price auctions by setting floors strategically
  3. Complexity: Second-price pricing logic was opaque (“Why did I pay less than my bid?”)
  4. DSP preference: Major DSPs (Google DV360, The Trade Desk) preferred first-price with their own bid shading
  5. Revenue impact: First-price with bid shading generates 5-15% higher revenue in practice (DSPs shade conservatively)

Timeline:

GSP Still Used for Sponsored Search:

Google Search Ads, Microsoft Ads, and Amazon Sponsored Products still use GSP (second-price) because:

Key Difference: Search vs. Display:

| Auction Type | Used For | Pricing |
|---|---|---|
| GSP (Second-Price) | Sponsored search (Google Search Ads) | Winner pays second-highest + small increment |
| First-Price | Programmatic display/video/CTV (RTB) | Winner pays their bid |

This blog focuses on first-price auctions because they are the modern standard for Real-Time Bidding (RTB) and programmatic display advertising - the architecture described in this document.


Budget Pacing: Distributed Spend Control

Architectural Driver: Financial Accuracy - Pre-allocation pattern with Redis atomic counters ensures budget consistency across regions. Max over-delivery bounded to 1% of daily budget (acceptable legal risk) while avoiding centralized bottleneck.

Problem: Advertisers set daily budgets (e.g., daily limit). In a distributed system serving 1M QPS, how do we prevent over-delivery without centralizing every spend decision?

Challenge:

Centralized approach (single database tracks spend):

Solution: Pre-Allocation with Periodic Reconciliation

    
    graph TD
        ADV[Advertiser X<br/>Daily Budget: B_daily]
        ADV --> BUDGET[Atomic Pacing Service]
        BUDGET --> REDIS[(Redis<br/>Atomic Counters)]
        BUDGET --> CRDB[(CockroachDB<br/>Billing Ledger<br/>HLC Timestamps)]
        BUDGET -->|Allocate amount_1| AS1[Ad Server 1]
        BUDGET -->|Allocate amount_2| AS2[Ad Server 2]
        BUDGET -->|Allocate amount_3| AS3[Ad Server 3]
        AS1 -->|Spent: S1<br/>Return: unused_1| BUDGET
        AS2 -->|Spent: S2<br/>Return: unused_2| BUDGET
        AS3 -->|Spent: S3<br/>Return: unused_3| BUDGET
        BUDGET -->|Periodic reconciliation<br/>HLC timestamped| CRDB
        TIMEOUT[Timeout Monitor<br/>5min intervals] -.->|Release stale<br/>allocations| REDIS
        REDIS -->|Budget < 10%| THROTTLE[Dynamic Throttle]
        THROTTLE -.->|Reduce allocation<br/>size dynamically| BUDGET
        classDef server fill:#e3f2fd,stroke:#1976d2
        classDef budget fill:#fff3e0,stroke:#f57c00
        classDef advertiser fill:#e8f5e9,stroke:#4caf50
        class AS1,AS2,AS3 server
        class BUDGET,REDIS,CRDB,TIMEOUT,THROTTLE budget
        class ADV advertiser

How it works:

  1. Atomic Pacing Service pre-allocates budget chunks to Ad Servers (variable allocation amounts)
  2. Ad Servers spend from local allocation using Redis atomic counters (no coordination needed)
  3. Periodic reconciliation (every 30 seconds): Ad Servers return unused budget to Atomic Pacing Service
  4. CockroachDB records all spend events with HLC (Hybrid Logical Clock) timestamps for globally ordered audit trail
  5. Timeout Monitor releases stale allocations after 5 minutes (handles server crashes)
  6. Dynamic Throttle reduces allocation size when budget < 10% remaining (prevents over-delivery)

Budget Allocation Operations:

Allocation request (Ad Server requests budget chunk):

Reconciliation (Ad Server returns unused budget):
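A hedged Python sketch of both operations using redis-py atomic counters; the key naming and chunk semantics are assumptions, not the production schema:

```python
import redis

r = redis.Redis()  # assumes a reachable Redis; key names below are illustrative

def allocate_chunk(campaign_id: str, chunk: float) -> float:
    """Ad Server requests a budget chunk, carved atomically from the campaign pool."""
    pool_key = f"budget:{campaign_id}:pool"
    remaining = r.incrbyfloat(pool_key, -chunk)  # atomic decrement (negative delta)
    if remaining < 0:
        r.incrbyfloat(pool_key, chunk)  # pool exhausted: roll the decrement back
        return 0.0
    return chunk

def return_unused(campaign_id: str, unused: float) -> None:
    """Reconciliation (every 30s): Ad Server returns its unspent allocation."""
    r.incrbyfloat(f"budget:{campaign_id}:pool", unused)
```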

Key Properties:

Mathematical Analysis:

Over-Delivery Bound:

Maximum over-delivery: $$\text{OverDelivery}_{max} = S \times A$$

where \(S\) = number of servers, \(A\) = allocation chunk size.

Example: 100 servers with allocation A each → max 100 × A over-delivery (10% of 1000 × A daily budget).

Mitigation: Dynamic allocation sizing.

When budget remaining drops below 10%: $$A_{new} = \frac{B_r}{S \times 10}$$

This reduces max over-delivery to ~1% of budget.

The Critical Path: Synchronous Budget Check (5ms)

The Missing Piece: The pre-allocation strategy above handles periodic budget allocation (every 30-60s), but doesn’t explain the per-request budget check that must happen on EVERY ad request at 1M QPS. This synchronous check is the critical path component that ensures financial accuracy while meeting the 150ms SLO.

The Challenge at 1M QPS:

Naive approach (query CockroachDB on every request):

Solution: Bounded Micro-Ledger (BML) Architecture

The BML system provides three-tier budget enforcement that achieves both low latency (3-5ms) and financial accuracy (bounded overspend):

Three-Tier BML Architecture (Critical Financial Atomicity Mechanism):

Tier 1: Synchronous Budget Check (Redis Lua Script - 3ms)

Tier 2: Asynchronous Delta Propagation (Redis → Kafka)

Tier 3: Reconciliation Processor (Flink/Kafka Streams → CockroachDB)

Why This Three-Tier Architecture is Required:

    
    graph TB
        subgraph "Synchronous Tier (3ms - Critical Path)"
            REQ[Ad Request<br/>1M QPS] --> AUCTION[Auction Selects Winner<br/>Ad from Campaign X<br/>Cost: C]
            AUCTION --> BML_CHECK{BML: Atomic<br/>Check & Deduct}
            BML_CHECK -->|Budget OK| REDIS_LUA[Redis Lua Script<br/>ATOMIC:<br/>if spend+cost < limit+bound<br/>then deduct<br/>Latency: 3ms]
            REDIS_LUA -->|SUCCESS| SERVE[Serve Ad<br/>Revenue: C]
            BML_CHECK -->|Budget EXHAUSTED| NEXT[Try Next Bidder<br/>or House Ad]
        end
        subgraph "Asynchronous Tier (Reconciliation)"
            REDIS_LUA -.->|Emit delta<br/>every 5s| KAFKA[Kafka<br/>Spend Events]
            KAFKA -.-> FLINK[Flink<br/>Aggregate]
            FLINK -.->|Batch commit<br/>every 30s| CRDB[(CockroachDB<br/>Billing Ledger<br/>Source of Truth)]
        end
        CRDB -.->|Periodic sync<br/>every 60s| REDIS_LUA
        classDef critical fill:#ffcccc,stroke:#cc0000,stroke-width:2px
        classDef async fill:#ccffcc,stroke:#00cc00,stroke-dasharray: 5 5
        class REQ,AUCTION,BML_CHECK,REDIS_LUA,SERVE critical
        class KAFKA,FLINK,CRDB async

Bounded Micro-Ledger (BML) Components:

1. Synchronous Tier: Redis Atomic Counter (3ms Budget)

Purpose: Fast, atomic check-and-deduct for every ad request

Atomic Check-and-Deduct Algorithm:

The algorithm executes atomically within Redis (single-threaded, no concurrent modifications possible):

Inputs:

  • campaign_id: identifies the Redis hash holding current_spend and budget_limit
  • cost: the price of the impression about to be served
  • inaccuracy_bound: maximum tolerated overspend (the \(\Delta\) analyzed below)

Algorithm Steps:

  1. Read current state from Redis hash for this campaign:

    • current_spend: How much already spent today
    • budget_limit: Daily budget cap
  2. Calculate remaining budget:

    • remaining = budget_limit - current_spend
  3. Atomic decision: Check if spend is allowed

    • CRITICAL CONDITION (Key to ≤1% billing accuracy):
      current_spend + cost ≤ budget_limit + inaccuracy_bound
      
    • If TRUE (budget available):
      • Increment spend counter by cost atomically
      • Return SUCCESS with new remaining budget
    • If FALSE (budget exhausted):
      • Do NOT modify spend counter
      • Return BUDGET_EXHAUSTED with current remaining

Critical Design Property: The inaccuracy_bound parameter in the Lua script condition is the mathematical enforcement mechanism that guarantees ≤1% billing accuracy. By setting inaccuracy_bound = 0.01 × budget_limit, we ensure maximum overspend is bounded to 1% of daily budget. This is what makes bounded financial accuracy achievable while maintaining 3ms latency at 1M QPS.

Why This is Atomic:

Redis executes the entire algorithm as a single atomic operation (Lua script runs single-threaded). Even if 1,000 requests arrive simultaneously, Redis processes them serially one-at-a-time, guaranteeing no race conditions.
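A minimal Python + Lua sketch of the check-and-deduct operation using redis-py's `register_script`; the hash field names and key pattern are assumptions consistent with the description above:

```python
import redis

# Lua body mirroring the steps above; runs atomically inside Redis.
CHECK_AND_DEDUCT = """
local spend = tonumber(redis.call('HGET', KEYS[1], 'current_spend') or '0')
local limit = tonumber(redis.call('HGET', KEYS[1], 'budget_limit') or '0')
local cost  = tonumber(ARGV[1])
local bound = tonumber(ARGV[2])
-- Critical condition: current_spend + cost <= budget_limit + inaccuracy_bound
if spend + cost <= limit + bound then
    redis.call('HINCRBYFLOAT', KEYS[1], 'current_spend', ARGV[1])
    return {1, tostring(limit - spend - cost)}   -- SUCCESS, remaining budget
end
return {0, tostring(limit - spend)}              -- BUDGET_EXHAUSTED, no mutation
"""

r = redis.Redis(decode_responses=True)
check_and_deduct = r.register_script(CHECK_AND_DEDUCT)

def try_spend(campaign_id: str, cost: float, inaccuracy_bound: float):
    ok, remaining = check_and_deduct(
        keys=[f"campaign:{campaign_id}:budget"],
        args=[cost, inaccuracy_bound],
    )
    return bool(ok), float(remaining)
```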

Key Properties:

2. Asynchronous Tier: Reconciliation to CockroachDB

Purpose: Periodic sync to source of truth for audit trail and accuracy

Reconciliation Process (Flink Stream Processing Job, runs every 30s):

Step 1: Aggregate Spending Deltas

Step 2: Batch Commit to CockroachDB

Step 3: Sync Redis from Source of Truth (every 60s)
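Flink specifics aside, the reconciliation step reduces to "aggregate deltas, batch-commit". A schematic Python sketch, where `spend_events` stands in for one drained Kafka window and the `daily_spend` table plus psycopg-style `db_conn` are hypothetical:

```python
from collections import defaultdict

def reconcile_batch(spend_events, db_conn):
    """Aggregate one 30s window of (campaign_id, cost) events, then batch-commit."""
    deltas = defaultdict(float)
    for campaign_id, cost in spend_events:   # Step 1: aggregate per-campaign deltas
        deltas[campaign_id] += cost
    with db_conn.cursor() as cur:            # Step 2: one upsert per campaign
        for campaign_id, delta in deltas.items():
            cur.execute(
                "INSERT INTO daily_spend (campaign_id, day, spend) "
                "VALUES (%s, current_date, %s) "
                "ON CONFLICT (campaign_id, day) "
                "DO UPDATE SET spend = daily_spend.spend + excluded.spend",
                (campaign_id, delta),
            )
    db_conn.commit()
```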

Why Two-Tier Works:

3. Integration with Request Flow

The budget check sits in the Auction Logic phase:

Before:

After (with BML):

Updated Request Flow Timing:

Complete request path latency breakdown:

| Component | Latency | Notes |
|---|---|---|
| Network + Gateway | 15ms | |
| User Profile | 10ms | |
| Integrity Check | 5ms | Fraud detection (BEFORE RTB) |
| Feature Store | 10ms | |
| Ad Selection | 15ms | |
| ML Inference | 40ms | (parallel execution) |
| RTB Auction | 100ms | (parallel, critical path) |
| Auction + Budget Check | 8ms | Budget enforcement |
| Response | 5ms | |
| Total | 143ms | (5-7ms buffer to 150ms SLO) |

Mathematical Proof: Bounded Overspend of $5 per Campaign

Theorem: Maximum overspend per campaign is bounded to the inaccuracy_bound value ($5).

Proof:

Define:

  • \(S(t)\): cumulative spend recorded in Redis at time \(t\)
  • \(B\): the campaign's daily budget limit
  • \(c_i\): the cost of request \(i\)
  • \(\Delta\): the inaccuracy_bound ($5)

The Lua script allows spend if: $$S(t) + c_i \leq B + \Delta$$

Worst case scenario: Multiple concurrent requests hit Redis simultaneously before the spend counter updates.

Maximum concurrent overshoot:

At most \(\Delta\) dollars can be spent beyond the limit because:

  1. Once \(S(t) > B\), the Lua script rejects ALL future requests
  2. The maximum “in-flight” spend that can sneak through is bounded by \(\Delta\)
  3. Even if 1000 requests arrive at the exact same nanosecond, Redis executes Lua scripts serially

Mathematical upper bound:

$$\text{Total Spend} \leq B + \Delta$$

$$\text{Overspend} = \max(0, \text{Total Spend} - B) \leq \Delta = \$5$$

Practical example:

Campaign has a $1000 daily budget with a $5 inaccuracy bound:

  • The script admits a request only while \(S(t) + c_i \leq \$1005\), so recorded spend can never exceed $1005
  • Worst case overspend is therefore $5, or 0.5% of the daily budget, after which every further request is rejected

Alternative Explanation: In-Flight Requests Model

The inaccuracy_bound parameter ($5) can also be derived from system characteristics rather than configured arbitrarily. This approach calculates the bound based on request latency and throughput.

Parameters:

  • \(Q_{campaign} = 1{,}000\) QPS: request rate against this campaign's budget counter
  • \(T_{req} = 0.15\) s: end-to-end request latency (the 150ms SLO)
  • \(L = \$0.005\): average cost per served impression

In-flight requests calculation:

When a budget counter hits zero, there are already requests in-flight that checked the budget as “available”:

$$R_{inflight} = Q_{campaign} \times T_{req} = 1,000 \times 0.15 = 150 \text{ requests}$$

Maximum overspend from in-flight requests:

If all in-flight requests complete (worst case):

$$Overspend_{max} = R_{inflight} \times L = 150 \times \$0.005 = \$0.75$$

Connecting both models:

The inaccuracy_bound parameter ($5) provides a roughly 6.7× safety margin over the calculated in-flight overspend ($0.75):

Both models are valid:

For typical campaigns ($1,000-$10,000 daily budgets), both approaches yield overspend ≤0.5%, meeting the ≤1% financial accuracy requirement.

4. Handling Reconciliation Drift

Problem: Redis counter drifts from CockroachDB source of truth due to:

Solution: Periodic Sync Procedure (runs every 60s):

Algorithm:

  1. Query Source of Truth

    • For each active campaign, query CockroachDB billing ledger
    • Compute true cumulative spend: SUM(spend) WHERE campaign_id = X
    • This is the authoritative value (immutable audit trail)
  2. Update Redis Cache

    • Write true spend value to Redis hash for this campaign
    • Overwrite any stale or drifted value
    • Redis now reflects accurate state from source of truth
  3. Detect and Alert on Drift

    • Read current Redis value before overwriting
    • Calculate drift: |true_spend - redis_spend|
    • If drift exceeds threshold ($50):
      • Alert operations team via PagerDuty
      • Log discrepancy for investigation
      • Common causes: Redis node restart, delayed reconciliation, split-brain scenario
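A sketch of this sync procedure; the `billing_ledger` schema, psycopg-style `db_conn`, and `alert_oncall` hook are assumptions, and `r` is a Redis client created with `decode_responses=True`:

```python
DRIFT_ALERT_THRESHOLD = 50.0  # dollars, matching the $50 threshold above

def sync_campaign(campaign_id: str, db_conn, r) -> None:
    """Every 60s: overwrite the Redis counter with the ledger's true spend."""
    with db_conn.cursor() as cur:
        cur.execute(
            "SELECT COALESCE(SUM(spend), 0) FROM billing_ledger WHERE campaign_id = %s",
            (campaign_id,),
        )
        true_spend = float(cur.fetchone()[0])
    key = f"campaign:{campaign_id}:budget"
    cached = float(r.hget(key, "current_spend") or 0)
    if abs(true_spend - cached) > DRIFT_ALERT_THRESHOLD:
        alert_oncall(campaign_id, true_spend, cached)  # hypothetical PagerDuty hook
    r.hset(key, "current_spend", true_spend)  # overwrite any stale/drifted value
```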

Why Drift Happens:

Why Drift is Acceptable:

Failure Mode: Tier 3 Reconciliation Outage

If Flink job or Kafka become unavailable:

  1. Tier 1 continues operating: Budget checks work normally (Redis is independent)
  2. Impact: Audit trail writing to CockroachDB is paused
  3. Detection: Periodic sync (60s) detects drift > $50, alerts operations team via PagerDuty
  4. Recovery: When Flink recovers, processes backlog from Kafka (Kafka retains events for 7 days)
  5. Maximum data loss: None - Kafka retention ensures event replay capability
  6. Bounded risk: Redis continues enforcing spend limits, preventing unbounded overspend

This failure mode demonstrates graceful degradation: critical path (Tier 1) remains operational while audit trail temporarily lags. Financial accuracy is maintained via bounded inaccuracy, audit completeness is recovered via Kafka replay.

Why This Works at 1M QPS:

  1. Sharding: Redis cluster sharded by campaign_id (100+ shards)
  2. Per-shard throughput: 10K QPS per shard (well within Redis capacity)
  3. Latency: Lua script execution: 1-3ms, network RTT: 1-2ms = 3-5ms total
  4. Bounded inaccuracy: $5 overspend is legally acceptable (0.05-0.5% of typical campaign budgets)

Why CockroachDB Alone Doesn’t Work:

Trade-offs:

| Approach | Latency | Accuracy | Cost | Scalability |
|---|---|---|---|---|
| CockroachDB only | 10-15ms (slow) | Perfect | High | Limited |
| Redis only | 5ms | Bounded ($5) | Low | Excellent |
| BML (both tiers) | 5ms | Bounded + audited | Medium | Excellent |

Idempotency Protection: Defending Against Double-Debits (CRITICAL)

The Problem: Double-Debit Risk

The BML architecture above handles budget enforcement correctly during normal operation, but lacks defense against a critical failure scenario: message replay after service crashes.

Failure scenario:

  1. Ad Server Orchestrator (AS) processes ad request, runs auction, selects winning ad
  2. AS calls Atomic Pacing Service → Redis Lua script successfully debits campaign budget
  3. AS crashes before sending response to client (network issue, pod restart, out-of-memory)
  4. Client doesn’t receive response, retries the same request (standard retry behavior)
  5. AS processes retry, runs auction again, debits budget AGAIN (double-debit for single impression)
  6. Result: Double-debit violates ≤1% billing accuracy constraint

Why this violates financial integrity:

Solution: Idempotency Key Store

The Atomic Pacing Service must implement idempotent budget deductions using a Redis-backed idempotency key mechanism.

Architecture:

    
    graph TB
        REQ[Ad Request<br/>client_request_id: abc123] --> AS[Ad Server Orchestrator]
        AS --> GEN[Generate Idempotency Key<br/>UUID + Timestamp<br/>Key: idem:campaign_X:abc123]
        GEN --> LUA[Redis Lua Script<br/>Atomic Check-and-Set]
        LUA --> CHECK{Key exists?}
        CHECK -->|YES| CACHED[Return cached result<br/>DEDUP: Budget NOT debited<br/>Return previous debit amount]
        CHECK -->|NO| DEBIT[Debit budget: -$2.50<br/>Store key with TTL=30s<br/>Value: debit_amount=$2.50]
        CACHED --> RESP1[Return to client<br/>Idempotent response]
        DEBIT --> RESP2[Return to client<br/>Fresh debit]
        TTL[TTL Expiration<br/>After 30 seconds] -.->|Auto-delete key| CLEANUP[Key removed<br/>Prevents memory leak]
        style CHECK fill:#fff3e0,stroke:#f57c00
        style CACHED fill:#c8e6c9,stroke:#4caf50
        style DEBIT fill:#ffccbc,stroke:#ff5722

Implementation: Enhanced Redis Lua Script

The Lua script must perform atomic check-and-set to guarantee exactly-once semantics:

Enhanced Lua script logic:

The script performs atomic check-and-set operations in a single Redis transaction (a sketch follows this list):

  1. Check idempotency key: GET operation on the idempotency key
  2. If key exists: Return cached result (deduplication - budget was already debited)
    • Signals to caller: deduplicated=true, returns previous debit amount
    • Critical: Budget is NOT debited again (exactly-once guarantee)
  3. If key doesn’t exist:
    • Check budget: current_spend + cost <= budget_limit + inaccuracy_bound
    • If budget OK: Debit budget AND store idempotency key atomically
      • DECRBY operation: Deduct cost from budget counter
      • SETEX operation: Store idempotency key with TTL (30 seconds)
      • Key value contains: debit amount, timestamp, transaction metadata
    • If budget exhausted: Return error (no debit, no key stored)
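A minimal sketch of the enhanced script, keeping the spend-counter convention from the Tier 1 sketch (the DECRBY variant described above is equivalent); key patterns and field names are assumptions:

```python
import redis

# Lua body: atomic idempotency check + budget check + debit + key store.
IDEMPOTENT_DEBIT = """
local cached = redis.call('GET', KEYS[1])          -- KEYS[1] = idempotency key
if cached then
    return {1, cached}                              -- dedup: budget NOT debited again
end
local spend = tonumber(redis.call('HGET', KEYS[2], 'current_spend') or '0')
local limit = tonumber(redis.call('HGET', KEYS[2], 'budget_limit') or '0')
if spend + tonumber(ARGV[1]) <= limit + tonumber(ARGV[2]) then
    redis.call('HINCRBYFLOAT', KEYS[2], 'current_spend', ARGV[1])
    redis.call('SETEX', KEYS[1], ARGV[3], ARGV[1])  -- store debit amount, 30s TTL
    return {0, ARGV[1]}                             -- fresh debit
end
return {-1, '0'}                                    -- exhausted: no debit, no key
"""

r = redis.Redis(decode_responses=True)
idempotent_debit = r.register_script(IDEMPOTENT_DEBIT)

def debit_once(campaign_id: str, request_id: str, cost: float,
               bound: float, ttl: int = 30):
    return idempotent_debit(
        keys=[f"idem:campaign_{campaign_id}:{request_id}",
              f"campaign:{campaign_id}:budget"],
        args=[cost, bound, ttl],
    )
```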

Idempotency Key Naming Convention:

Keys follow a hierarchical pattern for efficient sharding and collision prevention:

Pattern: idem:campaign_{campaign_id}:{client_request_id}_{timestamp_bucket}

Components:

Example: idem:campaign_12345:req_abc123_1704067200

Why this format works:

TTL Rationale (30 seconds):

Memory overhead:

Why Lua Script is Critical:

Redis Lua scripts provide atomic execution guarantee - the foundation of idempotency protection.

Without Lua (separate GET + DECRBY operations), race conditions are inevitable:

  1. Thread A: GET key → not found
  2. Thread B: GET key → not found (race window - both threads see “not found”)
  3. Thread A: DECRBY budget
  4. Thread B: DECRBY budget (double-debit! - both threads deduct)

Lua script runs single-threaded in Redis, eliminating race conditions:

Client-Side Requirements:

Idempotency requires client cooperation - the contract between client and server (a client-side sketch follows this list):

  1. Generate stable request IDs: Client must use consistent ID for retries

    • Use UUID v4 generated once per original request (industry standard: Stripe, PayPal, AWS use this pattern)
    • Include in retry attempts: same request_id for all retries of original request
    • Why stable IDs matter: Different ID on retry = treated as new request = double-debit
  2. Include request ID in API call:

    • HTTP header (recommended): X-Request-ID: abc123-def456-ghi789 (a widely used de facto convention)
    • Or request body: request_id field in JSON payload
    • Server must validate: Reject requests with missing/malformed IDs in strict mode
  3. Retry policy with exponential backoff (prevents thundering herd):

    • 1st retry: 100ms + random jitter (0-50ms)
    • 2nd retry: 500ms + random jitter (0-250ms)
    • 3rd retry: 2s + random jitter (0-1s)
    • Max retries: 3 (total window: ~3s, well within 30s TTL)
    • Jitter prevents: Synchronized retries from multiple clients overwhelming server
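A client-side sketch implementing this contract; the endpoint URL and payload are placeholders, and the `requests` library stands in for whatever HTTP client the caller uses:

```python
import random
import time
import uuid

import requests  # HTTP client; the ad-server endpoint below is hypothetical

BACKOFFS = [0.1, 0.5, 2.0]  # base delays per the retry schedule above

def place_request(url: str, payload: dict) -> requests.Response:
    """Send a request with a stable X-Request-ID, retrying with jittered backoff."""
    request_id = str(uuid.uuid4())  # generated ONCE; reused verbatim on every retry
    attempts = len(BACKOFFS) + 1    # 1 original try + 3 retries, within the 30s TTL
    for attempt in range(attempts):
        try:
            resp = requests.post(url, json=payload, timeout=1.0,
                                 headers={"X-Request-ID": request_id})
            if resp.status_code < 500:
                return resp  # success or non-retryable client error
        except requests.RequestException:
            pass  # network failure: retryable
        if attempt < len(BACKOFFS):
            base = BACKOFFS[attempt]
            time.sleep(base + random.uniform(0, base / 2))  # jitter avoids herds
    raise RuntimeError(f"request {request_id} failed after {attempts} attempts")
```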

Edge Cases and Failure Modes:

Real-world systems must handle imperfect clients and infrastructure failures:

Case 1: Client doesn’t provide request_id (Legacy client or API misuse)

Case 2: Redis key expires during retry window (Timing edge case)

Case 3: Redis unavailable (Failover scenario)

Monitoring:

Track idempotency metrics:

Alerts:

Industry Comparison: How This Matches Best Practices

Our idempotency design aligns with proven patterns from leading payment and financial platforms:

| Aspect | Our Design | Stripe | AWS | PayPal | Industry Best Practice |
|---|---|---|---|---|---|
| Request ID Source | Client-generated UUID | Client-generated UUID | Client-generated UUID | Client-generated UUID | Client-controlled |
| ID Header | X-Request-ID | Idempotency-Key | x-amz-idempotency-token | Custom header | HTTP header |
| Storage | Redis (30s TTL) | Database (24h TTL) | DynamoDB (1h TTL) | Database (24h) | Persistent store with TTL |
| Atomicity | Lua script | Database transaction | DynamoDB ConditionExpression | Database transaction | Atomic check-and-set |
| Scope | Per campaign | Per API key | Per request type | Per merchant | Scoped to prevent conflicts |
| Retry behavior | Return cached result | Return cached result (HTTP 200) | Return cached result | Return cached result | Idempotent response |
| TTL rationale | 30s (high-frequency) | 24h (low-frequency) | 1h (moderate) | 24h (low-frequency) | Context-dependent |

Why our TTL differs (30s vs industry’s 24h):

Alternative approaches considered:

  1. Database-backed idempotency (Stripe’s approach)

    • Pros: Longer TTL (24h+), stronger durability guarantees
    • Cons: 10-15ms latency (violates our 5ms budget), poor scalability at 1M QPS
    • Decision: Rejected - latency unacceptable for critical path
  2. DynamoDB with conditional writes (AWS approach)

    • Pros: Managed service, strong consistency, regional replication
    • Cons: 8ms p99 latency (vs Redis 3ms), higher cost ($1000/month vs Redis $200/month)
    • Decision: Rejected - Redis already deployed for budget counters, reuse existing infrastructure
  3. In-memory only (no persistence) (Dangerous pattern)

    • Pros: Ultra-low latency (<1ms)
    • Cons: Lost on server restart, no failover protection
    • Decision: Rejected - violates financial integrity requirements

Why Redis + Lua is optimal for our use case:

Impact Assessment:

Without idempotency protection:

With idempotency protection:

ROI: Overwhelming - the implementation cost (minimal Redis overhead) is negligible compared to preventing systematic financial integrity violations that could result in platform-wide advertiser churn and regulatory liability.

Conclusion:

The Bounded Micro-Ledger architecture achieves the “impossible trinity” of:

  1. Low latency (5ms budget check)
  2. Financial accuracy (mathematically proven $5 max overspend + idempotency protection against double-debits)
  3. High throughput (1M+ QPS)

Critical addition: Idempotency protection is non-negotiable for production deployment. Without it, the system violates financial integrity guarantees during routine failure scenarios (crashes, retries, network issues).

This is the only viable architecture for real-time budget pacing at scale while maintaining financial integrity.

Summary: Data Consistency Meets Revenue Optimization

This post explored the three critical data systems that enable real-time ad platforms to serve 1M+ QPS with sub-150ms latency while maintaining financial accuracy: distributed caching for fast reads, eCPM-based auctions for fair price comparison, and atomic budget control for spend accuracy.

Three Critical Systems:

1. Distributed Caching Architecture

Problem: Serve 1M QPS without overwhelming databases Solution: Two-tier cache architecture with database fallback

| Layer | Technology | Latency | Use Case | Cache Hit Rate |
|---|---|---|---|---|
| L1 Cache | Caffeine (in-process) | 0.001ms | Hot user profiles | 60% |
| L2 Cache | Valkey (distributed) | 5ms | Warm data, feature vectors | 25% |
| Database | CockroachDB | 20ms | Source of truth (cache miss) | 15% of requests |

Key decisions:

Performance impact:

2. Auction Mechanism Design

Problem: Compare $10 CPM bid with $0.50 CPC bid - which is worth more?
Solution: eCPM normalization using CTR prediction

eCPM formula:

$$ \begin{array}{ll} \text{CPM bid:} & \text{eCPM} = \text{CPM (direct)} \\ \text{CPC bid:} & \text{eCPM} = \text{CPC} \times \text{CTR} \times 1000 \\ \text{CPA bid:} & \text{eCPM} = \text{CPA} \times \text{CVR} \times \text{CTR} \times 1000 \end{array} $$

Example:
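A minimal normalization sketch answering the $10 CPM vs $0.50 CPC question (the 2.5% CTR is an assumed value):

```python
def ecpm(pricing: str, price: float, ctr: float = 0.0, cvr: float = 0.0) -> float:
    """Normalize heterogeneous bids to eCPM (expected value per 1000 impressions)."""
    if pricing == "CPM":
        return price
    if pricing == "CPC":
        return price * ctr * 1000
    if pricing == "CPA":
        return price * cvr * ctr * 1000
    raise ValueError(f"unknown pricing model: {pricing}")

# $10 CPM vs $0.50 CPC at an assumed 2.5% CTR: the CPC bid is worth more.
print(ecpm("CPM", 10.0))              # 10.0
print(ecpm("CPC", 0.50, ctr=0.025))   # 12.5
```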

Auction type decision: First-Price

Latency: 3ms for auction logic (ranking only; the budget check is accounted for separately)

3. Budget Pacing: Bounded Micro-Ledger

Problem: Prevent budget overspend across 300 distributed ad servers without centralizing every spend decision

Solution: Bounded Micro-Ledger with Redis atomic counters (detailed in Budget Pacing: Distributed Spend Control)

Core Architecture:

  1. Pre-allocation: Daily budget → allocate proportional hourly amounts to Redis counters
  2. Atomic deduction: DECRBY campaign:123:budget <cost> (5ms p99)
  3. Idempotency: Redis cache of request IDs prevents double-debits during retries
  4. Reconciliation: Every 10min, compare Redis totals vs CockroachDB source of truth
  5. Bounded overspend: Mathematical guarantee ≤0.1% per campaign (≤1% aggregate)

Why this works:

Performance Impact:

| Metric | Without Budget Pacing | With Bounded Micro-Ledger | Improvement |
|---|---|---|---|
| Latency | Centralized DB check (50-100ms) | Redis atomic counters (3ms avg, 5ms p99) | 17-30× faster |
| Overspend | Unbounded (race conditions) | ≤0.1% per campaign (mathematical guarantee) | Bounded |
| Availability | Single point of failure | Distributed Redis (multi-region) | No bottleneck |

Key Trade-offs:

See detailed implementation for Lua scripts, reconciliation algorithms, idempotency protection, and mathematical proofs.

