Caching, Auctions & Budget Control: Revenue Optimization at Scale
Introduction: Where Data Meets Revenue
Real-time ad platforms operate under extreme constraints: serve 1M+ queries per second, respond in under 150ms, run ML inference and external auctions, and maintain perfect financial accuracy. The revenue engine (RTB + ML inference) generates the bids, but three critical data systems determine whether the platform succeeds or fails:
The three data challenges that make or break ad platforms:
- Cache performance: Can we serve 1M QPS without overwhelming the database?
  - Problem: Database reads take 40-60ms. At 1M QPS, that’s 40-60K concurrent DB connections.
  - Constraint: Only 10ms latency budget for user profile and feature lookups
  - Solution needed: Multi-tier caching with 85%+ cache hit rate (only 15% query database)
- Auction fairness: How do we compare CPM bid with CPC bid - which is worth more?
  - Problem: Different pricing models (CPM/CPC/CPA) aren’t directly comparable
  - Constraint: Must rank all ads fairly to maximize revenue
  - Solution needed: eCPM normalization using predicted CTR
- Budget accuracy: How do we prevent overspend across 300 distributed ad servers?
  - Problem: Each server independently serves ads, but budgets must be enforced globally
  - Constraint: Can’t centralize every spend decision (creates bottleneck + latency)
  - Solution needed: Distributed atomic counters with proven accuracy bounds
Why these systems are interdependent:
Every ad request follows this critical path:
- User profile lookup (10ms budget) → ML features → CTR prediction
- ML features lookup (10ms budget) → CTR prediction → eCPM calculation
- Auction logic (3ms budget) → rank all ads by eCPM → select winner
- Budget check (3ms budget) → atomic deduction → confirm spend allowed
Miss any of these and revenue suffers:
- Slow caching (>10ms) → violate latency budget → timeouts → blank ads
- Unfair auctions → suboptimal ad selection → leave 15-25% revenue on table
- Budget overspend → advertiser complaints → legal liability → platform shutdown
What this post covers:
This post builds the three data systems that enable revenue optimization:
- Distributed Caching Architecture - L1/L2 cache tiers with intelligent invalidation strategies. Achieving 85% cache hit rate with 4.25ms average latency (only 15% of requests query the database). Technology choices: Caffeine (L1 in-process), Valkey (L2 distributed), CockroachDB (persistent database). Trade-offs between consistency, latency, and cost.
- Auction Mechanism Design - eCPM normalization for fair comparison across CPM/CPC/CPA pricing models. First-price vs second-price auction analysis. Why first-price auctions won in modern programmatic advertising (2017-2019 industry shift). How predicted CTR converts CPC bids into comparable eCPM for ranking.
- Distributed Budget Pacing - Bounded Micro-Ledger architecture using Redis atomic counters (DECRBY). Mathematical proof of ≤1% budget overspend guarantee. Why idempotency protection is non-negotiable for financial integrity. Pre-allocation pattern that eliminates centralized bottleneck while maintaining accuracy.
Broader applicability:
These patterns - multi-tier caching, fair comparison across heterogeneous inputs, distributed atomic operations with bounded error - apply beyond ad tech. High-throughput systems with strict latency budgets and financial accuracy requirements face similar challenges:
- E-commerce inventory management (prevent overselling)
- Trading platforms (fair order execution across order types)
- Rate limiting systems (distributed quota enforcement)
- Gaming platforms (virtual currency spend control)
The core insight is how these three systems integrate to deliver both speed (sub-10ms data access) and accuracy (≤1% financial variance) at massive scale (1M+ QPS).
Let’s explore how each system is designed and how they work together.
Distributed Caching Architecture
Multi-Tier Cache Hierarchy
To achieve high cache hit rates with sub-10ms latency, implement two cache tiers plus the database (target: 85% combined cache hit rate, with L2 serving ~25% of requests, so only 15% of requests query the database):
Technology Selection: Cache Tier Choices
L1 Cache Options:
| Technology | Latency | Throughput | Memory | Pros | Cons |
|---|---|---|---|---|---|
| Caffeine (JVM) | ~1μs | 10M ops/sec | In-heap | Window TinyLFU eviction, lock-free reads | JVM-only, GC pressure |
| Guava Cache | ~1.5μs | 5M ops/sec | In-heap | Simple API, widely used | LRU only, lower hit rate |
| Ehcache | ~1.5μs | 8M ops/sec | In/off-heap | Off-heap option reduces GC | More complex configuration |
Decision: Caffeine - Superior eviction algorithm (Window TinyLFU) yields 10-15% higher hit rates than LRU-based alternatives. Benchmarks show ~2x throughput vs. Guava.
L2 Cache: Redis vs Memcached
The L2 cache choice came down to one requirement: atomic operations for budget counters. Memcached is faster (3ms vs 5ms p99) and cheaper (~30% less memory), but its counters are limited to basic unsigned incr/decr - there is no Lua scripting or multi-key atomicity, so the conditional check-and-deduct logic that budget counters require can’t be executed as a single atomic operation. Without that, budget counters would have race conditions - multiple servers could allocate from stale budget values, causing unbounded over-delivery.
Redis also gives us:
- Rich data structures (sorted sets for ad recency, hashes for attributes)
- Persistence for crash recovery (avoids cold cache startup)
- Lua scripting for complex operations
The 30% memory premium over Memcached is worth it to avoid budget race conditions. Hazelcast (8ms latency) was too slow to consider seriously.
Valkey Alternative (Redis Fork):
In 2024, Redis Labs changed licensing from BSD to dual-license (SSPL + proprietary), creating uncertainty for commercial users. The Linux Foundation forked Redis into Valkey with permissive BSD-3 license:
- API-compatible: Drop-in replacement for Redis
- Clear licensing: BSD-3 (no SSPL restrictions)
- Industry backing: AWS, Google Cloud, Oracle backing Linux Foundation project
- Migration path: AWS ElastiCache transitioning to Valkey
Recommendation: Use Valkey for new deployments to avoid licensing ambiguity. Migration from Redis is trivial (same protocol, same commands, same performance).
L3 Persistent Store Options:
Note: Write throughput numbers reflect cluster-level performance at production scale (20-80 nodes for distributed databases). Single-node performance is 5-20K writes/sec (SSD RAID10, 32GB RAM) depending on workload characteristics.
| Technology | Read Latency (p99) | Write Throughput (cluster-level) | Scalability | Consistency | Cross-Region ACID | HLC Built-in | Pros | Cons |
|---|---|---|---|---|---|---|---|---|
| CockroachDB | 10-15ms | 400K writes/sec (60-80 nodes) | Horizontal (Raft) | Serializable | Yes | Yes | SQL, JOINs, multi-region transactions | Operational complexity (self-hosted) |
| YugabyteDB | 10-15ms | 400K writes/sec (60-80 nodes) | Horizontal (Raft) | Serializable | Yes | Yes | PostgreSQL-compatible | Smaller ecosystem |
| Cassandra | 20ms | 500K writes/sec (100+ nodes) | Linear (peer-to-peer) | Tunable (eventual) | No | No | Multi-DC, mature | No JOINs, eventual consistency |
| PostgreSQL | 15ms | 50K writes/sec (single node) | Vertical + sharding | ACID | No | No | SQL, JOINs, strong consistency | Manual sharding complex |
| DynamoDB | 10ms | 1M writes/sec (auto-scaled) | Fully managed | Strong per-region MRSC (2024) | No | No | Auto-scaling, fully managed | No cross-region transactions, no JOINs, NoSQL limitations |
Why CockroachDB
The persistent store must handle 400M user profiles (4TB+) with strong consistency for billing data. While Cassandra offers higher write throughput (500K vs 400K writes/sec) and battle-tested scale, eventual consistency is problematic for financial data and would require custom HLC implementation, reconciliation logic, and auditor explanations.
CockroachDB advantages:
- Serializable ACID transactions (financial accuracy requirement)
- Built-in HLC for timestamp ordering across regions
- Multi-region geo-partitioning with quorum writes
- Full SQL + JOINs (vs learning CQL)
- Better read latency: 10-15ms vs Cassandra’s 20ms
Why Not DynamoDB?
Despite being fully managed and highly scalable, DynamoDB lacks critical features for our financial accuracy requirements:
- No cross-region ACID transactions: DynamoDB’s 2024 MRSC feature provides strong consistency for reads/writes within each region, but transactions (`TransactWriteItems`) only work within a single region. Budget enforcement requires atomic operations across user profiles + campaign ledger + audit log - this cannot be guaranteed across regions.
- No HLC or causal ordering: DynamoDB uses “last writer wins” conflict resolution based on internal timestamps. Without HLC, we can’t guarantee causal ordering across regions for financial audit trails. Example failure: Budget update in us-east-1 and spend deduction in eu-west-1 arrive out-of-order, causing temporary overspend that violates financial accuracy constraints.
- NoSQL limitations: No SQL JOINs, no complex queries. Ad selection queries like “find all active campaigns for advertiser X targeting users in age group Y with budget remaining > Z” require multiple round-trips and application-level joins, adding latency and complexity.
- Schema evolution complexity: Requires dual-write patterns and application-level migration logic. CockroachDB supports online schema changes (`ALTER TABLE` without blocking).
DynamoDB is excellent for:
- Workloads that don’t require cross-region transactions
- Key-value access patterns without complex queries
- Teams prioritizing operational simplicity over feature requirements
Alternatives:
- YugabyteDB: Similar architecture, PostgreSQL-compatible. CockroachDB chosen for slightly more mature multi-region tooling.
- PostgreSQL: Doesn’t scale horizontally without manual sharding. Citus adds complexity without HLC or native multi-region support.
- Google Spanner: Provides TrueTime for global consistency, but is only available on Google Cloud (TrueTime depends on Google’s proprietary atomic-clock/GPS infrastructure) and is more expensive than CRDB Serverless.
Database cost comparison at 8B requests/day (Nov 2024 pricing):
| Database Option | Relative Cost | Operational Model | Trade-offs |
|---|---|---|---|
| DynamoDB | 100% (baseline) | Fully managed (AWS) | No cross-region transactions, NoSQL limitations, vendor lock-in |
| CockroachDB Serverless | 80-100% of DynamoDB | Fully managed (Cockroach Labs) | Pay-per-use, auto-scaling, same features as self-hosted |
| CockroachDB Dedicated | 60-80% of DynamoDB | Managed by Cockroach Labs | Reserved capacity, SLAs, predictable pricing |
| CockroachDB Self-Hosted | 40-50% of DynamoDB (infra only) | Self-managed | Lowest infra cost, requires dedicated ops team (cost varies by geography/expertise) |
| PostgreSQL (sharded) | 30-40% of DynamoDB (infra only) | Self-managed | No native multi-region, complex sharding, no HLC |
Note: AWS reduced DynamoDB on-demand pricing by 50% in November 2024, significantly improving its cost competitiveness. CockroachDB Dedicated still offers savings, but the gap narrowed considerably.
Key insight: CockroachDB Dedicated provides 20-40% cost savings over DynamoDB while maintaining full feature parity (cross-region transactions, HLC, SQL) without operational overhead. Serverless pricing is now comparable to DynamoDB due to recent AWS price reductions. Self-hosted CockroachDB provides 50-60% savings (2-2.5× cheaper) but requires operational expertise.
Decision Framework: Avoiding “Spreadsheet Engineering”
The comparison above shows infrastructure costs only. Here’s the complete decision framework:
For most teams (< 5B requests/day): Choose CockroachDB Dedicated or DynamoDB
Reasons:
- CockroachDB Dedicated: 20-40% cheaper than DynamoDB, full feature parity (cross-region transactions, HLC, SQL), zero operational overhead
- DynamoDB: Fully managed by AWS, simpler for teams without SQL expertise, trade off features for operational simplicity
- Both options avoid self-hosting complexity
For high-scale teams: Self-Hosted Break-Even Analysis
Self-hosted becomes economically viable when infrastructure savings exceed operational costs. The break-even point varies significantly based on team structure and geography.
Break-even formula:
$$\text{Break-even requests per year} = \frac{\text{Annual SRE Cost}}{\text{Cost Savings per Request}}$$
(divide by 365 to get the break-even daily request volume used in the table below)
Example calculation at 8B requests/day:
- DynamoDB: 100% baseline cost (reference pricing from AWS)
- CRDB self-hosted: ~44% of DynamoDB cost (60 compute nodes)
- Infrastructure savings: ~56% vs managed database
Operational cost scenarios:
Define SRE cost baseline as 1.0× = fully loaded senior SRE in high-cost region (California/NYC/Seattle).
| Team Structure | Annual SRE Cost (relative) | Break-Even Daily Requests | Notes |
|---|---|---|---|
| US Team: 3-5 SREs | 3.0-5.1× baseline | 20-30B req/day | High-cost regions: California, NYC, Seattle |
| Global Team: 2-3 SREs | 1.1-1.8× baseline | 8-12B req/day | Mixed US/Eastern Europe, leveraging time zones |
| Regional Team: 2 SREs | 0.5-0.9× baseline | 4-8B req/day | Eastern Europe/India/LatAm rates, experienced engineers |
| Existing Expertise: +1 SRE | 0.35-0.7× baseline | 2-5B req/day | Marginal cost when team already has database expertise |
Key variables affecting break-even:
- Geographic SRE costs: 0.18-0.55× baseline (non-US regions) vs 1.0× baseline (US high-cost)
- Team efficiency: 1-2 experienced SREs with automation vs 3-5 without
- Existing expertise: If team already operates databases, marginal cost is lower
- Tooling maturity: CockroachDB Dedicated (managed but self-deployed) vs full self-hosted
When self-hosted may make sense:
- Infrastructure savings exceed your specific operational costs (calculate with formula above)
- Team has existing database operations expertise (reduces marginal cost significantly)
- Mature operational practices already in place (monitoring, automation, runbooks)
- Geographic arbitrage possible (distributed team, non-US talent)
When managed options are preferred:
- Early stage (operational risk > cost savings)
- Small team without dedicated ops capacity
- Rapid growth phase (operational complexity compounds)
- Cost savings don’t justify hiring/training database specialists
Why DynamoDB remains a valid choice despite limitations:
For workloads that don’t require:
- Cross-region ACID transactions
- Complex SQL queries
- Causal ordering guarantees
DynamoDB’s operational simplicity (zero management) may outweigh feature limitations. Many ad tech companies successfully use DynamoDB by:
- Keeping transactions within single region
- Using application-level consistency checks
- Accepting eventual consistency trade-offs
Our choice: CockroachDB Serverless for Day 1, evaluate self-hosted only if we reach 15-25B+ requests/day with dedicated ops team.
graph TB
subgraph "Request Flow"
REQ[Cache Request
user_id: 12345]
end
subgraph "L1: In-Process Cache"
L1[Caffeine JVM Cache
10-second TTL
1μs lookup
100MB per server]
L1_HIT{Hit?}
L1_STATS[L1 Statistics
Hit Rate: 60%
Avg Latency: 1μs]
end
subgraph "L2: Distributed Cache"
L2[Redis Cluster
30-second TTL
5ms lookup
800GB usable capacity]
L2_HIT{Hit?}
L2_STATS[L2 Statistics
Hit Rate: 25%
Avg Latency: 5ms]
end
subgraph "L3: Persistent Store"
L3[CockroachDB Cluster
Multi-Region ACID
10-15ms read
Strong Consistency]
L3_STATS[L3 Statistics
Hit Rate: 15%
Avg Latency: 12ms]
end
subgraph "Hot Key Detection"
MONITOR[Stream Processor
Kafka Streams
Count-Min Sketch]
REPLICATE[Dynamic Replication
3x copies for hot keys]
end
REQ --> L1
L1 --> L1_HIT
L1_HIT -->|60% Hit| RESP1[Response
~1μs]
L1_HIT -->|40% Miss| L2
L2 --> L2_HIT
L2_HIT -->|25% Hit| POPULATE_L1[Populate L1]
POPULATE_L1 --> RESP2[Response
~5ms]
L2_HIT -->|15% Miss| L3
L3 --> POPULATE_L2[Populate L2 + L1]
POPULATE_L2 --> RESP3[Response
~20ms]
L2 -.->|0.1% sampling| MONITOR
MONITOR -.->|Detect hot keys| REPLICATE
REPLICATE -.->|Replicate to nodes| L2
subgraph "Overall Performance"
PERF[Cache Hit Rate: 85%
Average Latency: 4.25ms
p99 Latency: 25ms]
end
classDef cache fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef source fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef monitor fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
class L1,L2 cache
class L3 source
class MONITOR,REPLICATE monitor
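To make the read-through flow above concrete, here is a minimal sketch in Python. It is illustrative only: the production L1 is Caffeine on the JVM and L2 is a Valkey cluster, so `cachetools.TTLCache` stands in for both tiers and `fetch_from_db` is a placeholder for the CockroachDB read.

```python
from cachetools import TTLCache

# Stand-ins for the real tiers: L1 = Caffeine (10s TTL), L2 = Valkey (30s TTL).
l1_cache = TTLCache(maxsize=100_000, ttl=10)
l2_cache = TTLCache(maxsize=10_000_000, ttl=30)

def fetch_from_db(user_id: str) -> dict:
    # Placeholder for a CockroachDB read (~10-20ms in production).
    return {"user_id": user_id, "interests": ["tech"], "age_range": "25-34"}

def get_profile(user_id: str) -> dict:
    profile = l1_cache.get(user_id)          # ~1us; ends here for ~60% of requests
    if profile is not None:
        return profile
    profile = l2_cache.get(user_id)          # ~5ms network hop in production
    if profile is None:
        profile = fetch_from_db(user_id)     # ~20ms; the ~15% miss path
        l2_cache[user_id] = profile          # populate L2 on the way back
    l1_cache[user_id] = profile              # always repopulate L1
    return profile

print(get_profile("12345"))  # first call misses both tiers; repeats hit L1
```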
GDPR Right-to-Deletion Implementation
Architectural Driver: Legal Compliance - GDPR Article 17 mandates deletion within 30 days, but industry practice expects 7-14 days. With user data distributed across CockroachDB, Valkey, S3 Parquet files, and ML model weights, deletion requires a coordinated three-step workflow.
Regulatory context: GDPR Article 17 “Right to Erasure” requires organizations to delete personal data “without undue delay” - interpreted as 30 days maximum by regulators, but major platforms (Google, Meta) complete deletions in 7-14 days, setting user expectations higher than legal minimums.
Technical challenge: User data doesn’t live in one database - it’s distributed across operational stores, caches, cold storage, and ML models. Deleting from all locations requires coordinating multiple systems with different deletion mechanisms.
Data Distribution Challenge
Where User Data Lives:
a. Operational Databases (CockroachDB)
- User profiles: Demographics (age range, gender), interests (sports, tech, travel), browsing history
- Billing events: Impression logs, click logs (includes `user_id` for attribution)
- Storage: 400M user profiles × 10KB = 4TB
- Deletion mechanism: SQL DELETE or UPDATE to null all fields (tombstone approach)
b. Cache Layers (Valkey + Caffeine)
- L1 (in-process Caffeine): 300 Ad Server instances, 100MB each = 30GB total
- L2 (distributed Valkey): 20 nodes, 800GB usable capacity
- Data: Cached copies of user profiles from CockroachDB (same data, faster access)
- Deletion mechanism: Cache invalidation (pub/sub + direct DEL commands)
c. Data Lake (S3 Parquet Files)
- Historical analytics: Compressed Parquet with millions of users per file
- Volume: 500TB+ daily data × 7-year retention (regulatory requirement)
- Challenge: Immutable files - can’t delete single row from 100GB Parquet file
- Deletion mechanism: Either Parquet rewrite (expensive) or tombstone markers (less compliant)
d. ML Training Data
- Model weights: User data embedded in trained GBDT models (CTR prediction from Part 2)
- Feature Store: Historical features from user behavior (1-hour click rate, 7-day CTR)
- Challenge: Retraining computationally expensive, individual user contributes ~0.00000025% to model (1 / 400M users)
- Deletion mechanism: Either retrain (impractical) or aggregate defense (legal interpretation)
Step 1: Real-Time Deletion (< 1 Hour)
Goal: Stop serving user data immediately after deletion request
a. Mark User as Deleted in CockroachDB
Deletion strategy: Tombstone approach - mark as deleted and nullify personal fields, keeping non-personal audit data.
Database operation (conceptual example - production tables may have different schemas):
-- Idea: Keep audit trail, nullify personal data
UPDATE user_profiles
SET deleted_at = NOW(), -- Mark deletion timestamp
demographics = NULL, -- Remove personal field
interests = NULL, -- Remove personal field
browsing_history = NULL -- Remove personal field
-- Keep: user_id (pseudonymous identifier)
-- Keep: created_at, account_tier (non-personal audit fields)
WHERE user_id = 'xxx';
Why this approach:
- `deleted_at` column acts as deletion marker (queries can filter `WHERE deleted_at IS NULL`)
- Personal fields (`demographics`, `interests`, `browsing_history`) are nullified per GDPR requirements
- Non-personal fields (`user_id`, `created_at`, `account_tier`) remain for audit trail and foreign key integrity
- `user_id` itself is a pseudonymous hash, not personally identifiable once associated personal data is removed
Real schema note: Actual production tables may have 50-100+ columns. The key principle: nullify all columns containing personal data (PII), keep system fields needed for audit, billing reconciliation, and referential integrity.
Latency: 10-15ms (single database write with strong consistency)
b. Invalidate All Cache Tiers
L1 Caffeine Cache Invalidation:
- Mechanism: Pub/sub message to all 300 Ad Server instances
- Message content: `{"event": "user_deleted", "user_id": "xxx"}`
- Each instance executes: `cache.invalidate(user_id)`
- Propagation time: < 60 seconds (message delivery + processing across 300 instances)
L2 Valkey Cache Invalidation:
- Operation: `DEL user:xxx:profile`
- Effect: Immediate removal from distributed cache
- Latency: < 1ms (Redis/Valkey DEL operation)
Why pub/sub for L1, direct DEL for L2:
- L1 is in-process (no network access from central service), requires messaging pattern
- L2 is networked (central deletion service can directly execute DEL command)
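A sketch of the two invalidation paths using redis-py (the same client protocol works against Valkey). The channel name, key format, and host are assumptions mirroring the description above.

```python
import json
import redis  # redis-py works against both Redis and Valkey

r = redis.Redis(host="valkey.internal", port=6379)  # assumed endpoint

def invalidate_user_caches(user_id: str) -> None:
    # L2 (Valkey): the deletion service has network access, so delete directly.
    r.delete(f"user:{user_id}:profile")
    # L1 (Caffeine, in-process on 300 Ad Servers): no direct access, so broadcast
    # an event; each instance's subscriber calls cache.invalidate(user_id) locally.
    r.publish(
        "cache-invalidation",
        json.dumps({"event": "user_deleted", "user_id": user_id}),
    )
```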
c. Add to Deletion Tombstone List
Bloom Filter Implementation:
- Data structure: `deleted_users` Bloom filter (10M capacity, 0.1% false positive rate)
- Storage: Valkey (replicated across all regions)
- Check on every request: If `user_id` is in `deleted_users` → return error (block ad serving)
- Update frequency: Bloom filter updated immediately on deletion (async replication to all nodes)
Why Bloom filter:
- Fast membership check: O(1), ~100 CPU cycles (sub-microsecond)
- Memory efficient: 10M users = 18MB (14.378 bits per item with 0.1% FPR)
- Acceptable false positive: 0.1% incorrectly flagged as deleted (resolved by a CockroachDB lookup that confirms actual deletion status)
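The sizing figures above follow directly from the standard Bloom filter formulas; a quick check (function name and output formatting are just for illustration):

```python
import math

def bloom_filter_sizing(n_items: int, false_positive_rate: float):
    # Optimal bits per item: m/n = -ln(p) / (ln 2)^2; optimal hash count: k = (m/n) * ln 2
    bits_per_item = -math.log(false_positive_rate) / (math.log(2) ** 2)
    total_mb = n_items * bits_per_item / 8 / 1e6
    hash_functions = round(bits_per_item * math.log(2))
    return bits_per_item, total_mb, hash_functions

bits, mb, k = bloom_filter_sizing(10_000_000, 0.001)
print(f"{bits:.3f} bits/item, {mb:.1f} MB, {k} hash functions")
# -> 14.378 bits/item, ~18.0 MB, 10 hash functions
```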
Result: User data no longer served within 1 hour (Caffeine cache TTL = 10 seconds, but propagation across 300 instances takes up to 60 seconds)
GDPR compliance: “Without undue delay” satisfied (1 hour is acceptable, regulators expect days not hours)
Deletion Workflow Diagram:
graph TB
REQUEST[User Deletion Request
GDPR Article 17]
subgraph "Step 1: Real-Time (< 1 Hour)"
DB[CockroachDB
SET deleted_at=NOW, data=NULL]
L1[L1 Cache Invalidation
Pub/sub to 300 instances]
L2[L2 Cache Invalidation
DEL user:xxx:profile]
BLOOM[Add to Bloom Filter
deleted_users]
end
subgraph "Step 2: Batch Deletion (7-30 Days)"
TIER1[Tier 1: 0-90 days
Parquet rewrite
True deletion]
TIER2[Tier 2: 90d-2yr
Tombstone markers
Pseudonymization]
TIER3[Tier 3: 2+ years
S3 object delete
Glacier cleanup]
end
subgraph "Step 3: ML Training Data"
AGGREGATE[Aggregate Defense
Do NOT retrain
Legal: < 0.0001% contribution]
end
subgraph "Audit Trail"
LOG[Immutable Deletion Log
CockroachDB append-only
7-year retention]
end
REQUEST --> DB
REQUEST --> L1
REQUEST --> L2
REQUEST --> BLOOM
DB --> TIER1
DB --> TIER2
DB --> TIER3
DB --> AGGREGATE
REQUEST --> LOG
style DB fill:#ffcccc
style BLOOM fill:#ffdddd
style AGGREGATE fill:#ffffcc
style LOG fill:#e6ffe6
Step 2: Batch Deletion (7-30 Days)
Goal: Purge historical data from data lake
Challenge: Parquet Immutability
Parquet format characteristics:
- Columnar storage: Data organized by columns for analytics (not rows)
- Compressed: 5-10× compression ratio (100GB uncompressed → 10-20GB Parquet)
- Immutable: Once written, cannot modify (append-only design)
- Cannot delete single row: Must rewrite entire file to exclude one user
Options: Rewrite vs Tombstone
Option A: Tombstone Markers (Preferred for Cost)
Concept: Instead of physically deleting data from immutable Parquet files, maintain a separate “deletion marker” table and filter deleted users at query time.
Implementation:
The pattern is straightforward: maintain a compact deleted_users table (in CockroachDB) that stores (user_id, deleted_at, deletion_request_id) tuples. When a deletion request arrives, insert a marker row. Historical Parquet files in S3 remain unchanged—no expensive rewrites needed.
Query-time filtering: Analytics queries join against the deletion marker table to exclude deleted users. For example, a LEFT OUTER JOIN with a WHERE deleted_users.user_id IS NULL clause filters out any user who has a deletion marker. Production pipelines encapsulate filtering in views/CTEs (best practice) so every query doesn’t repeat the JOIN logic:
- Implement partition pruning by comparing `deletion_date` vs `partition_date` to skip entire files when users were deleted before the data was collected
- Use Bloom filters for fast “probably not deleted” checks before expensive JOINs
This approach balances GDPR compliance (data becomes inaccessible in analytics) with cost efficiency (no Parquet rewrites).
The key principle: Query-time filtering via JOIN against deletion marker table, not physical deletion from Parquet.
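As a sketch of that query-time filter, here is the anti-join pattern in PySpark. The S3 prefixes and dataset names are illustrative (the marker table would be exported from CockroachDB), and production pipelines would wrap this in a shared view as described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-filtered-analytics").getOrCreate()

# Immutable historical Parquet stays in place; only the small marker table changes.
impressions = spark.read.parquet("s3://ads-platform-us-east-1/impressions/")      # billions of rows
deleted_users = spark.read.parquet("s3://ads-platform-us-east-1/deleted_users/")  # thousands of rows

# left_anti keeps only impressions whose user_id has NO deletion marker --
# the same effect as LEFT OUTER JOIN ... WHERE deleted_users.user_id IS NULL.
visible_impressions = impressions.join(deleted_users, on="user_id", how="left_anti")

visible_impressions.groupBy("campaign_id").count().show()
```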
Trade-offs:
- Pro: Fast (no file rewriting), cheap (no compute cost), simple (single table join)
- Con: Data still exists physically (encrypted, inaccessible to queries, but not physically removed from disk)
- Legal interpretation: GDPR allows “pseudonymization” where re-identification is infeasible (encrypted data without decryption keys)
Option B: Parquet Rewrite (True Deletion)
Implementation:
- Read Parquet file → filter out deleted user rows → write new file
- Replace old file with new file in S3
- Delete old file
Cost analysis:
- For 1TB daily data: 10-20 hours compute time (Spark job reading, filtering, writing)
- Per-deletion overhead: 100 cores for 10-20 hours
- At scale (1,000 deletions/day): substantial operational overhead
- Amortization: Batch deletions weekly (accumulate 7 days of deletion requests, rewrite once per week)
Recommended Tiered Approach:
| Data Age | Method | Rationale |
|---|---|---|
| 0-90 days (Tier 1) | Parquet rewrite | Recent data = regulatory scrutiny, true deletion required |
| 90d-2yr (Tier 2) | Tombstone markers | Archived data, pseudonymization acceptable |
| 2+ years (Tier 3) | True deletion (S3 object delete) | Cold storage (Glacier), infrequently accessed, delete entire daily files older than 2 years |
Timeline:
- Tier 1: 7 days (weekly batch job rewrites Parquet files for last 90 days)
- Tier 2: 14 days (biweekly batch job adds tombstones)
- Tier 3: 30 days (monthly archival process deletes old cold storage)
Step 3: ML Training Data
Challenge: User data embedded in model weights
Problem:
- GBDT model trained on 400M users
- Individual user contributes ~0.00000025% to the model (1 / 400M = 2.5 × 10⁻⁹)
- Deleting one user requires full retrain (removing from training dataset)
Option A: Retrain Without User (Impractical)
- Cost: Prohibitively expensive (100-500 GPU-hours plus 40-80 engineering hours per retrain)
- Frequency: Daily deletions (100-1,000 users) → prohibitively expensive at scale
- Timeline: 24 hours per retrain (blocks model updates, increases CTR prediction staleness)
Option B: Model Unlearning (Research Area, Not Production-Ready)
- Concept: Machine unlearning techniques to “forget” training examples without full retrain
- Status as of 2025: Research papers exist (SISA, FISHER, etc.), not production-ready at scale
- Risk: Unproven at 400M user scale, uncertain regulatory acceptance
Option C: Aggregate Defense (Practical, Legally Defensible)
Legal Rationale:
- GDPR Article 11: Doesn’t apply when “impossible to identify data subject”
- Individual contribution: < 0.0001% of model (1 user in 400M)
- Mathematical anonymity: Extracting single user’s data from aggregate weights is infeasible (model compression means individual training examples not recoverable)
- CJEU precedent: GDPR allows aggregated data exception when individual not identifiable
Implementation:
- Do NOT retrain model on deletion
- Document aggregate defense rationale (legal memo prepared by counsel)
- Obtain legal opinion supporting approach (external data privacy counsel review)
- Annual legal review (regulatory landscape changes, update approach if needed)
Trade-off Disclosure:
- Not perfect deletion: Data influence remains in weights (user contributed ~0.00000025% to model parameters)
- Legally defensible: As of 2025 interpretation, GDPR Article 11 exempts aggregated models
- Cost-efficient: Avoids prohibitive per-deletion costs (delivers substantial monthly savings at 100-1000 daily deletions)
Recommendation:
- Use Option C (aggregate defense) for MVP and ongoing operations
- Monitor model unlearning research (Option B future consideration when production-ready)
- Document legal rationale and obtain annual counsel review
Audit Trail
Requirement: Prove deletion occurred (for regulatory audits and advertiser disputes)
Implementation:
a. Immutable Deletion Log
- Storage: CockroachDB append-only table OR S3 WORM (Write-Once-Read-Many) bucket
- Schema: `{user_id, deletion_request_timestamp, completion_timestamp, audit_trail}`
- Audit trail content: “Profile deleted (1h), Cache invalidated (1h), Data lake tombstone (7d), ML aggregate defense (documented)”
b. Retention Period
- Duration: 7 years (regulatory requirement for financial records)
- Paradox: Delete user data, but keep deletion logs for 7 years
- Resolution: Logs contain `user_id` (hashed/pseudonymized) + timestamps only, no personal data
c. Compliance Reporting
- Monthly report: Count of deletion requests received, processed, pending
- Annual audit: Provide deletion logs to auditor for GDPR compliance verification
- GDPR Article 30: Record of processing activities (includes deletion procedures)
Data Residency (EU Users)
GDPR Requirement: EU user data must stay in EU region (no cross-border transfer to US)
CockroachDB Implementation:
REGIONAL BY ROW Pattern:
CockroachDB’s REGIONAL BY ROW locality pattern enables GDPR-compliant data residency by pinning each row to its home region based on a column value.
Conceptual schema example (simplified for illustration - production schemas have 50-100+ columns):
-- Example: Configure table to use regional locality
ALTER TABLE user_profiles
SET LOCALITY REGIONAL BY ROW AS region;
-- The 'region' column determines physical storage location
-- CockroachDB automatically routes queries to correct region
Minimal example columns (real tables have many more fields):
- `user_id` (primary key) - User identifier
- `region` (string: ‘us’ or ‘eu’, required) - Locality column that determines storage region
- `demographics` (JSON) - Age range, gender, etc.
- `interests` (JSON) - Topics, categories
- `browsing_history` (JSON) - Recent activity
Production schema note: Real user_profiles tables typically have 50-100+ columns including timestamps, account metadata, consent flags, privacy settings, feature flags, and audit fields. This example shows only the essential concept: the region column controls physical data placement.
How it works:
- Row with `region = 'eu'` → CockroachDB stores data on eu-west-1 nodes only
- Row with `region = 'us'` → CockroachDB stores data on us-east-1 nodes only
- CockroachDB automatically pins rows to specified region (no manual partitioning needed)
- No automatic cross-region replication (data stays in home region)
- Queries automatically route to the correct regional nodes based on the `region` column value
Valkey (Redis) Partitioning:
Separate Clusters per Region:
- EU Valkey cluster: Deployed in eu-west-1, stores only EU user cache
- US Valkey cluster: Deployed in us-east-1, stores only US user cache
- No cross-region cache sharing: Isolation enforced at deployment level
Latency Impact of Data Residency:
Cross-Region Request Scenario:
- EU user requests ad from us-east-1 Ad Server (GeoDNS routing failure or VPN usage)
- Ad Server must fetch user profile from eu-west-1 CockroachDB
- Latency: 10-15ms (local) → 80-120ms (cross-region RTT: NY-London)
Mitigation:
- GeoDNS routes EU users to eu-west-1 gateway (avoids cross-region by default)
- Fallback: If cross-region required, serve contextual ad (no user profile, no latency penalty, privacy-compliant)
- Trade-off: 1-2% of EU requests serve less-targeted ads (acceptable vs GDPR violation)
S3 Data Lake Residency:
- EU bucket: `s3://ads-platform-eu-west-1` (EU data only, no cross-region replication)
- US bucket: `s3://ads-platform-us-east-1` (US data only)
Data Residency Enforcement Diagram:
graph TB
subgraph "EU Region (eu-west-1)"
EU_USER[EU User Request]
EU_GW[EU Gateway]
EU_CRDB[(CockroachDB EU Nodes
REGIONAL BY ROW: 'eu')]
EU_VALKEY[(Valkey EU Cluster
EU cache only)]
EU_S3[(S3 EU Bucket
No cross-region replication)]
end
subgraph "US Region (us-east-1)"
US_USER[US User Request]
US_GW[US Gateway]
US_CRDB[(CockroachDB US Nodes
REGIONAL BY ROW: 'us')]
US_VALKEY[(Valkey US Cluster
US cache only)]
US_S3[(S3 US Bucket
No cross-region replication)]
end
EU_USER -->|GeoDNS routes to EU| EU_GW
EU_GW --> EU_CRDB
EU_GW --> EU_VALKEY
EU_CRDB -.-> EU_S3
US_USER -->|GeoDNS routes to US| US_GW
US_GW --> US_CRDB
US_GW --> US_VALKEY
US_CRDB -.-> US_S3
EU_CRDB -.->|NO cross-region replication| US_CRDB
EU_S3 -.->|NO cross-region replication| US_S3
style EU_CRDB fill:#cce5ff
style EU_VALKEY fill:#cce5ff
style EU_S3 fill:#cce5ff
style US_CRDB fill:#ffe5cc
style US_VALKEY fill:#ffe5cc
style US_S3 fill:#ffe5cc
Subsection Conclusion
GDPR right-to-deletion requires three-step workflow:
- Real-time (< 1 hour): CockroachDB nullification, cache invalidation (L1 pub/sub + L2 DEL), Bloom filter tombstone
- Batch deletion (7-30 days): Tiered approach (Parquet rewrite for recent data, tombstones for archives, full deletion for cold storage)
- ML training data: Aggregate defense (legally defensible, cost-efficient, individual contribution < 0.0001%)
Audit trail: Immutable deletion logs (7-year retention), monthly compliance reports, annual auditor review
Data residency: CockroachDB REGIONAL BY ROW + regional Valkey clusters enforce GDPR data locality (EU data stays in EU, US data stays in US)
Trade-offs acknowledged:
- Parquet tombstones (pseudonymized data remains encrypted) vs Parquet rewrite (substantial operational overhead at 1K deletions/day)
- ML aggregate defense (data influence remains) vs retraining (prohibitive monthly costs)
- Cross-region fallback (1-2% contextual ads) vs GDPR violation
Cross-references:
- Part 1’s API authentication prevents unauthorized access, supporting GDPR access control
- Part 4’s compliance section covers broader GDPR requirements (consent management, data breach notification)
- Part 5’s CockroachDB configuration implements REGIONAL BY ROW for data residency
Legal disclaimer: This implementation reflects common industry practice and 2025 GDPR interpretation, but is not formal legal advice. The ML model “aggregate defense” approach (not retraining on deletion) is based on GDPR Article 11’s infeasibility exception, but has not been formally adjudicated by courts. Individual circumstances vary - organizations must consult qualified data privacy counsel for legal guidance specific to their jurisdiction and use case. The regulatory landscape continues to evolve, and annual legal review with external counsel is strongly recommended.
Cache Performance Analysis
Cache Architecture Clarification:
The system has two cache tiers plus the database:
- L1 Cache: In-process (Caffeine) - serves hot data instantly
- L2 Cache: Distributed (Valkey) - serves warm data across instances
- Database: CockroachDB - source of truth (not a cache)
Cache Hit Rate Calculation:
Let \(H_i\) be the conditional hit rate of cache tier \(i\):
$$H_{cache} = H_1 + (1 - H_1) \times H_2$$
Target configuration (25% L2 coverage as shown in optimization table below):
- \(H_1 = 0.60\) (60% served from L1 in-process cache)
- \(H_2 = 0.625\) (62.5% conditional hit rate - hits L2 given L1 miss)
- L2 serves: \(0.40 \times 0.625 = 25%\) of total requests
- Combined cache hit rate = 85% (60% + 25%)
- Database queries = 15% (cache miss → query CockroachDB)
Data Availability:
Of the 15% requests that miss both caches and query the database:
- 99%+ have data (14.85% of total) - established users with profiles
- ~1% genuinely missing (0.15% of total) - new users, anonymous users, deleted profiles
Effective data found rate: 99.85% (85% from cache + 14.85% from database)
Average Latency:
$$\mathbb{E}[L] = H_1 L_1 + (1-H_1)H_2 L_2 + (1-H_1)(1-H_2) L_{db}$$
With latencies \(L_1 = 0.001ms\), \(L_2 = 5ms\), \(L_{db} = 20ms\):
$$\mathbb{E}[L] = 0.60 \times 0.001 + 0.40 \times 0.625 \times 5 + 0.40 \times 0.375 \times 20 = 4.25ms$$
Key Insight: 85% cache hit rate means only 15% of requests query the database (20ms penalty). This is the critical metric - not whether data exists (which is ~100% for established users), but whether we can serve it from cache.
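The two formulas above are easy to sanity-check in a few lines of Python:

```python
def combined_hit_rate(h1: float, h2_given_l1_miss: float) -> float:
    # H_cache = H1 + (1 - H1) * H2
    return h1 + (1 - h1) * h2_given_l1_miss

def expected_latency_ms(h1, h2, l1_ms, l2_ms, db_ms) -> float:
    # E[L] = H1*L1 + (1-H1)*H2*L2 + (1-H1)*(1-H2)*L_db
    return h1 * l1_ms + (1 - h1) * h2 * l2_ms + (1 - h1) * (1 - h2) * db_ms

print(combined_hit_rate(0.60, 0.625))                  # 0.85 -> 85% cache hit rate
print(expected_latency_ms(0.60, 0.625, 0.001, 5, 20))  # ~4.25 ms average latency
```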
Cache Cost Optimization: The Economic Tradeoff
Architectural Driver: Financial Accuracy + Latency - Cache sizing is not just a performance problem but an economic optimization. At scale, every GB of Redis costs money, every cache miss hits the database (cost + latency), and every millisecond of added latency costs revenue. The optimal cache size balances these three factors.
The Fundamental Tradeoff:
At 1M QPS with 400M users, cache sizing decisions have massive financial impact:
- Too small cache: High miss rate → database overload + latency spikes → revenue loss
- Too large cache: Paying for Redis memory that delivers diminishing returns
- Optimal size: Maximizes profit = revenue - (cache cost + database cost + latency cost)
Cost Model:
The total cost function combines three components:
$$C_{total} = C_{cache}(S) + C_{db}(S) + C_{latency}(S)$$
where \(S\) = cache size (GB)
Component 1: Cache Memory Cost
$$C_{cache}(S) = S \times P_{memory} \times N_{nodes}$$
where:
- \(S\) = cache size per node (GB)
- \(P_{memory}\) = cost per GB-month (baseline cache cost unit)
- \(N_{nodes}\) = number of Redis nodes
Cache pricing note: Managed cache services (ElastiCache, Valkey) cost 10-12× per GB compared to self-hosted instances. Self-hosted Redis on standard instances is cheaper but adds operational overhead.
Example: 1000 nodes × 16GB/node × baseline GB-month rate = baseline cache cost
Component 2: Database Query Cost
Cache misses hit CockroachDB, which costs both compute and I/O:
$$C_{db}(S) = Q_{total} \times (1 - H(S)) \times C_{query}$$
where:
- \(Q_{total}\) = total queries/month
- \(H(S)\) = hit rate as function of cache size
- \(C_{query}\) = cost per database query (baseline query cost unit)
Example: 2.6B queries/month × 5% miss rate × baseline query cost = query cost component
Component 3: Revenue Loss from Latency
Every cache miss adds ~15ms latency (database read vs cache hit). As established in Part 1, Amazon’s study found 100ms latency = 1% revenue loss.
$$C_{latency}(S) = R_{monthly} \times (1 - H(S)) \times \frac{\Delta L}{100ms} \times 0.01$$
where:
- \(R_{monthly}\) = monthly revenue baseline
- \(\Delta L\) = latency penalty per miss (15ms)
- 0.01 = 1% revenue loss per 100ms
Example: Revenue baseline × 5% miss rate × (15ms/100ms) × 1% = latency cost component
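Putting the three components together, here is a small cost-model sketch. All unit prices are placeholder baseline values (the post deliberately avoids absolute dollar figures), so only the structure of the calculation is meaningful; plug in your own prices to reproduce the relative-cost tables later in this section.

```python
def caching_cost_components(cache_gb: float, hit_rate: float, *,
                            queries_per_month: float,
                            price_per_gb_month: float,
                            cost_per_db_query: float,
                            monthly_revenue: float,
                            latency_penalty_ms: float = 15.0):
    """Return the (cache, db, latency) components of C_total(S)."""
    miss_rate = 1.0 - hit_rate
    c_cache = cache_gb * price_per_gb_month
    c_db = queries_per_month * miss_rate * cost_per_db_query
    # 1% of revenue lost per 100ms of added latency, applied to the miss fraction.
    c_latency = monthly_revenue * miss_rate * (latency_penalty_ms / 100.0) * 0.01
    return c_cache, c_db, c_latency

# Placeholder values only exercise the function; substitute real unit prices.
print(caching_cost_components(800, 0.83, queries_per_month=2.6e9,
                              price_per_gb_month=1.0, cost_per_db_query=0.001,
                              monthly_revenue=5_000_000))
```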
Modeling User Access Patterns: Why Zipfian Distribution?
Real-world user access patterns in web systems follow a power law distribution, not a uniform distribution. A small fraction of users (or items) account for a disproportionately large fraction of traffic.
Zipfian distribution (named after linguist George Zipf) models this phenomenon:
- The most popular item receives the largest share of accesses
- The 2nd most popular item is accessed about \(\frac{1}{2}\) as often as the most popular one
- The nth most popular item is accessed about \(\frac{1}{n}\) as often as the most popular one
Why Zipfian over alternatives:
| Distribution | When It Applies | Why NOT for Cache Sizing |
|---|---|---|
| Uniform | All items accessed equally | Unrealistic - power users exist, not all users access platform equally |
| Normal (Gaussian) | Symmetric data around mean | User access has long tail, not bell curve. Most users low-activity, few users very high-activity |
| Exponential | Time between events | Models timing/intervals, not popularity ranking |
| Zipfian (power law) | Popularity ranking | Matches empirical data (validated below) |
Empirical validation for ad platforms:
- Content platforms: YouTube (2016): 10% of videos account for 80% of views. Facebook (2013): Top 1% of users generate 30% of content interactions.
- User behavior: Power users (daily active) access the platform far more frequently than casual users (weekly/monthly)
- Advertiser concentration: Large advertisers (Procter & Gamble, Unilever) run continuous campaigns; small advertisers run sporadic 1-week campaigns
Parameter choice: \(\alpha = 1.0\) (classic Zipf’s law) is standard for web caching literature. Higher \(\alpha\) (e.g., 1.5) means more concentration at the top; lower \(\alpha\) (e.g., 0.7) means flatter distribution.
Hit Rate as Function of Cache Size (Zipfian Distribution):
User access follows Zipfian distribution with \(\alpha = 1.0\) (power law):
$$P(\text{rank } r) = \frac{1/r}{\sum_{i=1}^{N} 1/i} \approx \frac{1}{r \times \ln(N)}$$
Cache hit rate is the cumulative access probability of the items that fit in the cache:
$$H(S) = \sum_{r=1}^{R(S)} P(\text{rank } r)$$
where \(R(S)\) is the number of items a cache of size \(S\) can hold.
For Zipfian(\(\alpha=1.0\)) with realistic LRU cache behavior:
| Cache Coverage | L2-Only Hit Rate (Theoretical) | Cumulative L1+L2 (Realistic) | Cache Size |
|---|---|---|---|
| Top 1% | 40-45% | 55-60% | 40GB |
| Top 5% | 55-60% | 65-70% | 200GB |
| Top 10% | 65-70% | 75-80% | 400GB |
| Top 20% | 68-78% | 78-88% | 800GB (optimal) |
| Top 40% | 78-85% | 90-95% | 1.6TB |
Key insight: Zipfian distribution means diminishing returns after ~20% coverage.
Note: “Cumulative L1+L2” includes L1 in-process cache (60% hit rate on hot data) plus L2 distributed cache. L2-only rates assume LRU eviction (0.85× theoretical LFU performance). See detailed validation methodology below for calculation derivation.
Marginal Cost Analysis:
The optimal cache size occurs where marginal cost equals marginal benefit:
$$\frac{dC_{total}}{dS} = 0$$
Marginal cost (adding 1 GB of cache): $$MC_{cache} = 1GB \times P_{memory} \times N_{nodes}$$
Marginal benefit (hit rate improvement):
For Zipfian distribution, adding cache beyond 20% coverage yields <0.5% hit rate improvement:
$$MB = \Delta H \times (C_{db} + C_{latency})$$
Example:
- Going from 20% → 30% coverage: +0.5% hit rate
- Benefit: 0.005 × (query cost + latency cost components) ≈ small benefit
- Cost: 10% × 4TB = 400GB additional cache × cluster size = very large cost
Not worth it - marginal cost far exceeds marginal benefit beyond 20% coverage.
Optimal Cache Size Calculation:
Given our constraints:
- Total dataset: 4TB (400M users × 10KB/user)
- Monthly revenue: baseline (illustrative example for 1M QPS platform)
- Redis cost: baseline cache cost per GB-month
- Database query cost: baseline query cost
- Latency penalty: 1% revenue per 100ms
Optimize:
$$\min_{S} \left[ C_{cache}(S) + C_{db}(S) + C_{latency}(S) \right]$$
Subject to:
- \(H(S) \geq 0.80\) (minimum acceptable hit rate)
- \(L_{p99} \leq 10ms\) (latency SLA)
Solution (relative costs as % of total caching infrastructure):
| L2 Cache Size (% of 4TB total) | Cumulative L1+L2 Hit Rate | Cost Breakdown (relative %) | Total Cost vs Baseline* | Analysis |
|---|---|---|---|---|
| 5% (200GB) | 65-70% | Cache: 15%, DB: 54%, Latency: 31% | 100% (baseline) | High DB+latency penalties |
| 10% (400GB) | 75-80% | Cache: 37%, DB: 40%, Latency: 23% | 81% | Better balance |
| 20% (800GB) | 78-88% | Cache: 74%, DB: 16%, Latency: 10% | 80% (optimal) | Best total cost |
| 40% (1.6TB) | 90-95% | Cache: 93%, DB: 5%, Latency: 2% | 128% | Expensive for marginal gain |
*Total cost relative to 5% coverage baseline (100%). Lower is better.
Optimal choice: 20% coverage (800GB L2 cache)
- 20% coverage is the clear winner at 80% of the 5%-coverage cost
- Provides 78-88% cumulative L1+L2 cache hit rate following Zipfian power-law distribution (α≈1.0)
- Theoretical baseline: Zipfian simulation (α=1.0, 400M users) shows 20% coverage captures 76-80% of requests
- Production adjustment: L1 temporal locality + workload clustering adds 2-8% improvement
- Range accounts for: Workload diversity (uniform access = 78%, highly skewed = 88%)
- Remaining 12-22% requests query database (CockroachDB with ~20ms latency)
- Best total cost optimization: Balances cache, database, and latency costs
Hit Rate Validation Methodology
Why Zipf Distribution Applies:
User access patterns in digital systems follow power-law distributions (Zipf-like): a small fraction of users generate disproportionate traffic. Research shows:
- Web caching: Breslau et al. (1999) found Zipf-like distributions in proxy traces
- Content delivery: Netflix, YouTube report α ≈ 0.8-1.2 for viewing patterns
- Ad tech: Campaign budgets and user engagement follow similar power laws
Zipf Distribution Definition:
For N total items (users), the probability of accessing item ranked i is:
$$P(i) = \frac{1/i^{\alpha}}{\sum_{j=1}^{N} 1/j^{\alpha}} = \frac{1/i^{\alpha}}{H(N, \alpha)}$$
where \(H(N, \alpha)\) is the generalized harmonic number (normalization constant).
Cache Hit Rate Calculation:
For a cache holding the top C most popular items (LFU/static caching):
$$\text{Hit Rate} = \frac{\sum_{i=1}^{C} P(i)}{\sum_{i=1}^{N} P(i)} = \frac{H(C, \alpha)}{H(N, \alpha)}$$
Step-by-Step for Our System:
Parameters:
- N = 400M total users in system
- C = 20% coverage = 80M users cached
- α = 1.0 (standard Zipf, conservative estimate)
Step 1: Calculate harmonic numbers
For α=1.0, \(H(N, 1) \approx \ln(N) + \gamma\) where γ ≈ 0.5772 (Euler-Mascheroni constant)
- \(H(80M, 1) \approx \ln(80M) + 0.5772 \approx 18.2 + 0.6 = 18.8\)
- \(H(400M, 1) \approx \ln(400M) + 0.5772 \approx 19.8 + 0.6 = 20.4\)
Step 2: Calculate base hit rate (L2 cache only)
$$\text{L2 Hit Rate} = \frac{18.8}{20.4} \approx 0.92 \text{ or } 92%$$
Wait, this seems too high! The issue: this assumes perfect LFU and independent requests.
Step 3: Apply real-world corrections
Real systems deviate from theoretical Zipf:
- Imperfect ranking: LRU (Least Recently Used) cache doesn’t perfectly track popularity
  - LRU hit rate ≈ 0.8-0.9 × LFU theoretical rate (Berger et al. 2015)
  - Correction factor: 0.85
- Temporal clustering: User sessions create bursts
  - Positive effect: L1 cache absorbs repeated requests within sessions
  - L1 adds +10-15% effective hit rate on top of L2
- Workload variation: α varies by vertical (e-commerce vs gaming)
  - α = 0.9-1.1 typical range
  - Lower α → flatter distribution → lower hit rate
Step 4: Combined L1 + L2 hit rate
L2 realistic hit rate: \(0.92 \times 0.85 \approx 0.78\) (78%)
L1 contribution: Caffeine in-process cache with 60% hit rate captures hot subset
Combined rate: \(H_{total} = H_{L1} + (1 - H_{L1}) \times H_{L2}\)
$$H_{total} = 0.60 + (1 - 0.60) \times 0.78 = 0.60 + 0.31 = 0.91 \text{ or } 91%$$
But: L1 size is tiny (2-4GB), only caches ~1M hottest users (0.25% coverage)
Recalculating with realistic L1:
- L1 covers 0.25% of users → ~50-60% of requests (ultra-hot)
- L2 covers remaining: \((1 - 0.60) \times 0.78 \approx 0.31\) (31%)
- Total: 60% + 31% = 91%
Wait, still too high compared to our 78-88% claim!
Step 5: Conservative adjustments
To get 78-88% range, we account for:
- Worst-case α = 0.9 (flatter distribution than α=1.0)
  - Recalculating with α=0.9: \(H(80M, 0.9) / H(400M, 0.9) \approx 0.88\)
  - With 0.85 LRU correction: \(0.88 \times 0.85 \approx 0.75\) (75%)
  - Plus L1 (60%): \(0.60 + 0.40 \times 0.75 = 0.90\) (still 90%!)
- Real issue: Our 20% L2 coverage doesn’t cache the top 80M individual users
  - Reality: L2 caches ~800GB of serialized profile data
  - Average profile size: ~1-10KB depending on richness
  - Effective user coverage: 80M - 800M users depending on profile size
  - If profiles avg 4KB: 800GB / 4KB = 200M users (50% coverage, not 20%!)
Reconciliation: The “20% coverage” refers to storage capacity (800GB / 4TB), not user count!
With 50% user coverage (C = 200M):
- \(H(200M, 1) / H(400M, 1) \approx \ln(200M) / \ln(400M) \approx 19.1 / 19.8 = 0.96\) (96% theoretical)
- With LRU correction (0.85): \(0.96 \times 0.85 = 0.82\) (82%)
- Plus L1 (60%): \(0.60 + 0.40 \times 0.82 = 0.93\) (93%)
Conservative range 78-88%:
- Lower bound (78%): Assumes α=0.9, cold start, no L1 benefit
- Mid-point (83%): Typical α=1.0, LRU cache, moderate L1
- Upper bound (88%): Assumes α=1.1, warmed cache, strong temporal locality
Validation sources:
- Breslau et al. (1999) “Web Caching and Zipf-like Distributions” - established Zipf-like patterns in web traces
- Berger et al. (2015) “Maximizing Cache Hit Ratios by Variance Reduction” - LRU vs LFU correction factors
- ArXiv cs/0303014 “Theoretical study of cache systems” - harmonic number approximations for Zipf
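The harmonic-number arithmetic in Steps 1-4 is easy to reproduce; a short sketch for α = 1.0 only, using the ln(N) + γ approximation from Step 1:

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(n: float) -> float:
    # H(n, alpha=1) ~ ln(n) + gamma; accurate enough for n in the millions.
    return math.log(n) + EULER_GAMMA

def l2_hit_rate(cached_users: float, total_users: float,
                lru_correction: float = 0.85) -> float:
    # Theoretical LFU hit rate for a Zipf(1.0) workload, discounted for LRU.
    return harmonic(cached_users) / harmonic(total_users) * lru_correction

def combined_hit_rate(l1_hit: float, l2_hit: float) -> float:
    return l1_hit + (1 - l1_hit) * l2_hit

l2 = l2_hit_rate(200e6, 400e6)            # ~0.96 theoretical -> ~0.82 after LRU correction
print(round(l2, 2), round(combined_hit_rate(0.60, l2), 2))   # ~0.82, ~0.93
```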
Trade-off accepted: We choose 20% coverage (800GB distributed across cluster) because:
- Lowest total cost: Optimal point on cost curve (80% of 5%-coverage baseline)
- 78-88% cache hit rate meets 80%+ requirement with safety margin (mid-range = 83%)
- Only 12-22% requests incur database query penalty (acceptable for 20ms budget)
- Latency cost minimized (reduces latency penalty 59% vs 10% coverage)
- Worth paying higher cache cost to save significantly on database and latency costs
TTL Optimization: Freshness vs Hit Rate Tradeoff
Time-to-live (TTL) settings create a second optimization problem:
- Short TTL (10s): Fresh data, but more cache misses after expiration
- Long TTL (300s): High hit rate, but stale data
Staleness Cost Model:
$$C_{staleness} = P(\text{stale}) \times C_{error}$$
For user profiles:
- 1% of profiles update per hour
- Average TTL/2 staleness window
- Cost of stale ad: targeting quality degradation
Example: 30s TTL
- Average staleness: 15s
- Probability stale: 0.01 × (15/3600) = 0.0042%
- Cost: Low staleness penalty (baseline)
Example: 300s TTL
- Average staleness: 150s
- Probability stale: 0.01 × (150/3600) = 0.042%
- Cost: 10× higher staleness penalty
Optimal TTL: 30-60 seconds
Balances freshness cost with reasonable hit rate. Longer TTLs increase staleness cost 10×.
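The staleness arithmetic above generalizes to a one-line function (the 1%-per-hour update rate is the assumption carried over from the examples):

```python
def staleness_probability(update_rate_per_hour: float, ttl_seconds: float) -> float:
    # Average age of a cached entry is TTL/2; multiply by the hourly update rate.
    return update_rate_per_hour * (ttl_seconds / 2) / 3600

print(f"{staleness_probability(0.01, 30):.4%}")   # ~0.0042% of reads stale at 30s TTL
print(f"{staleness_probability(0.01, 300):.4%}")  # ~0.0417% of reads stale at 300s TTL
```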
Multi-Tier Architecture: Performance vs Complexity Trade-off
Question: Does adding L1 in-process cache (Caffeine) justify the added complexity?
L1 Cache Overhead:
- Memory: ~100MB per server (negligible, in-heap allocation)
- CPU: ~2% overhead for cache management
- Operational complexity: Additional monitoring, cache invalidation logic
L1 Cache Benefits:
Performance gains:
- L1 hit rate: 60% of all requests served from in-process memory
- Latency improvement: 5ms (Redis) → <0.001ms (in-process) = ~5ms saved per hit
- Average latency improvement: 60% × 5ms = ~3ms across all requests
At 150ms total latency budget, 3ms represents ~2% improvement - marginal performance benefit.
However: L1 cache provides critical resilience during L2 failures:
| Scenario | L1 Cache | Impact |
|---|---|---|
| Redis healthy | 60% L1 hit, 40% L2 hit | Optimal latency |
| Redis degraded (p99 >15ms) | 60% L1 hit, 40% cold start | -4-6% targeting accuracy, system stays online |
| Redis down | 60% L1 hit, 40% database | Database load manageable (40% instead of 100%) |
| No L1 cache | 100% cache miss on Redis failure | Database overload → cascading failure |
Decision: Keep L1 for resilience and fault tolerance, not performance optimization. The 2% CPU overhead is insurance against catastrophic L2 cache failures.
Cost Summary (relative to total caching infrastructure):
| Component | Relative Cost | Notes |
|---|---|---|
| L1 Cache (Caffeine) | ~0% | In-process, negligible memory |
| L2 Cache (Redis/Valkey) | 58% | 800GB at 20% coverage, 78-88% hit rate |
| L3 Database infrastructure (CockroachDB) | 22-29% | 60-80 nodes baseline |
| Database query cost (cache misses) | 13% | 12-22% miss rate × query volume |
| Cache miss latency cost | 8% | Revenue loss from slow queries |
| Total caching infrastructure | 100% | Optimized for 78-88% hit rate at 20% coverage |
Alternative (no caching):
- Database infrastructure: 23-28% (more nodes for load)
- Database query cost: 49% (all queries hit database)
- Latency cost: 28% (all queries at 15ms latency penalty)
- Total: 380-400% of optimized caching cost + poor user experience
Savings from caching: 73-75% cost reduction vs no-cache alternative
Redis Cluster: Consistent Hashing and Sharding
Cluster Configuration:
- 1000 Redis nodes
- 16,384 hash slots (Redis default)
- Consistent hashing with virtual nodes
Hash Slot Assignment:
For key \(k\), compute hash: $$\text{slot}(k) = \text{CRC16}(k) \mod 16384$$
Slot-to-node mapping maintained in cluster state.
Virtual Nodes:
Each physical node handles \(\frac{16384}{1000} \approx 16\) hash slots.
Load Distribution:
With a uniform hash function and virtual nodes, the relative load imbalance across physical nodes is approximately $$\frac{\sigma_{\text{load}}}{\mu} \approx \frac{1}{\sqrt{v}}$$
where:
- \(\mu\) = average load per node
- \(\sigma_{\text{load}}\) = standard deviation of per-node load
- \(v\) = number of virtual nodes per physical node
Example: 1000 QPS spread across 1000 nodes with 16 virtual nodes each → standard deviation ≈ \(1/\sqrt{16}\) = 25% of mean load.
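Slot assignment itself is a one-liner; Python's `binascii.crc_hqx` computes the CRC-16/XMODEM variant that Redis Cluster documents for key hashing (hash tags are ignored in this sketch, and the slot-to-node map is a stand-in for real cluster state):

```python
import binascii

HASH_SLOTS = 16384

def key_slot(key: str) -> int:
    # slot(k) = CRC16(k) mod 16384; crc_hqx is CRC-16/XMODEM, matching Redis Cluster.
    return binascii.crc_hqx(key.encode(), 0) % HASH_SLOTS

def owning_node(key: str, slot_to_node: dict) -> str:
    # slot_to_node mirrors cluster state: each of the 16384 slots maps to a node id.
    return slot_to_node[key_slot(key)]

print(key_slot("user:12345:profile"))  # deterministic value in [0, 16383]
```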
Hot Partition Problem and Mitigation
Problem Definition:
A “celebrity user” generates 100x normal traffic:
- Normal user: 10 requests/second
- Celebrity user: 1,000 requests/second
Single Redis node cannot handle spike → becomes bottleneck.
Detection: Count-Min Sketch
Count-Min Sketch is a probabilistic data structure that tracks key frequencies in constant memory (~5KB for millions of keys) with O(1) operations. It provides conservative frequency estimates (never under-counts, may over-estimate), making it ideal for detecting hot keys without storing exact counters. Trade-off: tunable accuracy vs memory footprint.
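A minimal Count-Min Sketch in Python; the width/depth values and the blake2b-salt hashing scheme are illustrative choices, not the only reasonable ones:

```python
import hashlib

class CountMinSketch:
    """Width x depth grid of counters; estimate(key) never under-counts."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Take the minimum over rows: hash collisions can only inflate counts.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("user:celebrity:profile")
cms.add("user:normal:profile")
print(cms.estimate("user:celebrity:profile") >= 1000)  # True (conservative estimate)
```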
Dynamic Hot Key Replication:
Goal: Prevent hot keys (e.g., celebrity users, viral content) from overwhelming a single cache node and creating bottlenecks.
Approach:
- Detection threshold: Configure the request rate that triggers replication
  - Too low = unnecessary replication overhead (memory waste across multiple nodes)
  - Too high = hot keys cause bottlenecks before mitigation kicks in
  - Determine based on single-node capacity and typical access patterns
- Replication factor selection: Choose how many replicas to create
  - Calculate: \(\text{replicas\_needed} = \lceil \frac{\text{hot\_key\_traffic}}{\text{single\_node\_capacity}} \rceil\)
  - Trade-off: More replicas = better load distribution but higher memory overhead
  - Consider network topology (replicate across availability zones for resilience)
- Load distribution: Spread reads across replicas
  - Random selection = simple, uniform distribution
  - Locality-aware = lower latency but more complex routing
How to determine values:
- Measure your cache node’s request handling capacity under load
- Profile your key access distribution (use histograms or probabilistic counters)
- Set detection threshold at 60-80% of single-node capacity to trigger before saturation
- Calculate replication factor dynamically: \(\max\left(2, \lceil \frac{\text{observed\_traffic}}{\text{node\_capacity}} \rceil\right)\)
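The replication-factor formula above translates directly into code; the traffic and capacity numbers below are placeholders:

```python
import math
import random

def replicas_needed(hot_key_qps: float, node_capacity_qps: float) -> int:
    # max(2, ceil(traffic / capacity)): always keep at least one extra copy.
    return max(2, math.ceil(hot_key_qps / node_capacity_qps))

def read_replica_key(base_key: str, n_replicas: int) -> str:
    # Writes fan out to every replica key; reads pick one at random to spread load.
    return f"{base_key}:rep{random.randrange(n_replicas)}"

n = replicas_needed(hot_key_qps=1_000, node_capacity_qps=300)   # -> 4 replicas
print(n, read_replica_key("user:celebrity:profile", n))
```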
Workload Isolation: Separating Batch from Serving Traffic
One critical lesson from large-scale systems: never let batch workloads interfere with serving traffic.
The Problem:
Hourly batch jobs updating user profiles in CockroachDB (millions of writes/hour) can interfere with serving layer reads for ad personalization. Without isolation, batch writes can:
- Saturate disk I/O (batch writes compete with serving reads)
- Fill up queues and increase latency (p99 latency spikes from 20ms to 200ms)
- Trigger compactions that block reads
Solution: Read/Write Replica Separation
Goal: Isolate batch write workloads from latency-sensitive serving reads to prevent I/O contention, queue buildup, and compaction-induced stalls.
Approach:
- Workload characterization: Measure your read/write ratio and latency requirements
  - Serving traffic: high-volume reads, strict latency SLAs (e.g., <20ms p99)
  - Batch jobs: bursty writes, throughput-focused, can tolerate higher latency
- Capacity allocation strategy: Dedicate infrastructure based on workload intensity
  - Calculate: \(\text{batch\_capacity} = \frac{\text{batch\_write\_throughput} \times \text{replication\_factor}}{\text{node\_write\_capacity}}\)
  - Calculate: \(\text{serving\_capacity} = \frac{\text{serving\_read\_throughput} \times \text{safety\_margin}}{\text{node\_read\_capacity}}\)
  - Trade-off: Over-provisioning batch capacity wastes resources; under-provisioning causes spillover that degrades serving latency
- Consistency vs staleness trade-off: Decide what staleness is acceptable for serving reads
  - Strong consistency = all reads hit the write leader (no isolation benefit, full contention)
  - Eventual consistency = reads from local replicas (isolation achieved, but data may be slightly stale)
  - Determine staleness tolerance based on business requirements (user profiles can tolerate seconds of lag, financial data may require strong consistency)
- Topology design: Pin workloads to specific regions/nodes
  - Use database-specific primitives (range leases, follower reads, read replicas)
  - Concentrate batch writes on dedicated infrastructure
  - Serve reads from separate replicas that aren’t absorbing write load
How to determine capacity split:
- Profile your workload: measure read QPS, write QPS, and their respective resource consumption
- Calculate resource needs: \(\text{serving\_nodes} = \lceil \frac{\text{read\_load}}{\text{node\_capacity} \times \text{target\_utilization}} \rceil\)
- Calculate batch needs: \(\text{batch\_nodes} = \lceil \frac{\text{write\_load} \times \text{replication\_factor}}{\text{node\_write\_capacity}} \rceil\)
- Validate with load testing that serving latency remains stable during batch job execution
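A minimal sketch of the capacity-split arithmetic above; every constant is an illustrative assumption standing in for your own profiling numbers:

```python
import math

# Illustrative profiling results (assumptions)
READ_QPS = 800_000              # serving read load
WRITE_QPS = 50_000              # batch write load
NODE_READ_CAPACITY = 40_000     # reads/sec per node at the latency SLA
NODE_WRITE_CAPACITY = 15_000    # writes/sec per node
REPLICATION_FACTOR = 3
TARGET_UTILIZATION = 0.6        # headroom so p99 stays stable during bursts

serving_nodes = math.ceil(READ_QPS / (NODE_READ_CAPACITY * TARGET_UTILIZATION))
batch_nodes = math.ceil((WRITE_QPS * REPLICATION_FACTOR) / NODE_WRITE_CAPACITY)

print(f"serving nodes: {serving_nodes}, batch nodes: {batch_nodes}")  # -> 34, 10
```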
Cost of isolation: You’re essentially paying for separate infrastructure to prevent contention. The cost is proportional to your batch workload intensity. If batch jobs consume 30% of total database operations, expect to provision roughly 30-40% additional capacity for isolation (accounting for replication overhead).
Monitoring the gap:
Track replication lag between batch and serving replicas:
$$\text{Replication lag} = \text{Timestamp}_{\text{batch (write) replica}} - \text{Timestamp}_{\text{serving replica}}$$
If lag exceeds 5 minutes, investigate: scale the serving replicas or throttle batch writes.
Cache Invalidation Strategies
Problem: When user data updates (e.g., profile change), how to invalidate stale cache?
Strategy 1: TTL-Based (Passive)
Set time-to-live on cache entries: $$\text{Staleness} \leq \text{TTL}$$
Pros:
- Simple implementation
- No coordination required
Cons:
- Guaranteed staleness up to TTL
- Unnecessary cache misses after TTL
Strategy 2: Active Invalidation (Event-Driven)
On data update:
- Publish invalidation event to Kafka topic
- All cache servers subscribe and evict key from L1/L2
Latency:
- Kafka publish: ~5ms
- Consumer processing: ~10ms
- Total invalidation propagation: ~15ms
Pros:
- Low staleness (< 100ms)
- No unnecessary evictions
Cons:
- Requires event streaming infrastructure
- Network overhead for invalidation messages
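A minimal sketch of the consumer side, assuming kafka-python, an assumed `cache-invalidation` topic, and a plain dict standing in for the in-process L1 cache; each server uses its own consumer group so every server sees every invalidation:

```python
from kafka import KafkaConsumer

l1_cache: dict[str, bytes] = {}          # stand-in for the real in-process cache

consumer = KafkaConsumer(
    "cache-invalidation",                # assumed topic name
    bootstrap_servers=["kafka:9092"],
    group_id="ad-server-42",             # unique group per server: every server sees every event
    auto_offset_reset="latest",          # only new invalidations matter after a restart
)

for message in consumer:
    key = message.value.decode("utf-8")  # e.g. "user:12345"
    l1_cache.pop(key, None)              # evict from L1; L2 eviction handled by its own consumer
```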
Strategy 3: Versioned Caching
Include version in cache key:
- Cache key format: `user_id:version` (e.g., `user123:v2`)
On update:
- Increment version in metadata store
- New requests fetch new version
- Old version expires naturally via TTL
Pros:
- No explicit invalidation needed
- Multiple versions coexist temporarily
Cons:
- Metadata store becomes critical path
- Higher cache memory usage (duplicate versions)
Hybrid Approach (Recommended):
Architectural Drivers: Latency vs Financial Accuracy - We use eventual consistency (30s TTL) for user preferences to meet latency targets, but strong consistency (active invalidation) for GDPR opt-outs where legal compliance is non-negotiable.
- Normal updates: TTL = 30s (passive invalidation)
- Critical updates (e.g., GDPR opt-out): Active invalidation via Kafka
- Version metadata for tracking update history
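A minimal sketch of the hybrid policy, assuming cachetools for the TTL-bounded L1 entries and kafka-python for the critical-update fan-out; the topic name and write path are illustrative:

```python
from cachetools import TTLCache
from kafka import KafkaProducer

l1_cache = TTLCache(maxsize=100_000, ttl=30)      # 30s passive staleness bound

producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

def update_user_profile(user_id: str, profile: dict, critical: bool = False) -> None:
    """Write-through update; only critical changes (e.g. GDPR opt-out) fan out."""
    # ... persist the profile to the database here ...
    l1_cache[f"user:{user_id}"] = profile
    if critical:
        # Active invalidation: every ad server evicts this key within ~15ms
        producer.send("cache-invalidation", f"user:{user_id}".encode("utf-8"))
```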
Privacy-Preserving Attribution: SKAdNetwork & Privacy Sandbox
Architectural Driver: Signal Availability - When 40-60% of traffic lacks stable user_id (ATT opt-out, Privacy Sandbox), traditional click-to-conversion attribution breaks. SKAdNetwork (iOS) and Attribution Reporting API (Chrome) provide privacy-preserving alternatives with delayed, aggregated conversion data.
The Attribution Challenge
Traditional attribution: User clicks ad → store user_id + click_id → user converts → match conversion to click via user_id → attribute revenue.
This fails when:
- iOS user opts out of ATT → no IDFA to link click and conversion
- Chrome Privacy Sandbox → third-party cookies unavailable
- Cross-device journeys → user clicks on phone, converts on desktop
Privacy frameworks provide attribution without persistent identifiers:
SKAdNetwork Postback Handling (iOS)
Apple’s SKAdNetwork provides conversion data for ATT opt-out users through delayed postbacks. When a user clicks an ad and installs an app, iOS starts a privacy timer (24-72 hours, randomized). If the user converts within the app during this window, the app signals the conversion to SKAdNetwork. After the timer expires, Apple sends an aggregated postback to the ad network containing campaign-level attribution data.
Critical architectural constraints:
The postback contains only campaign identifier and a 6-bit conversion value (0-63) - no user identity, device ID, or precise conversion details. This forces a fundamentally different attribution model:
- Campaign-level aggregation only: Individual user journeys are invisible; optimization happens at campaign cohorts
- Delayed feedback loop: 1-3 day lag between conversion and attribution means ML models train on stale data
- Coarse conversion signals: 64 possible values must encode all conversion types (trials, purchases, subscription tiers)
- No creative/keyword attribution: Cannot determine which ad variant drove the conversion
Data pipeline integration:
graph TB
SKAN[SKAdNetwork Postback
HTTPS webhook]
KAFKA[Kafka Topic
skan-postbacks]
FLINK[Flink Processor
Aggregate by campaign]
CRDB[CockroachDB
campaign_conversions table]
SKAN -->|Parse & validate| KAFKA
KAFKA --> FLINK
FLINK -->|campaign_id, conversion_value, count| CRDB
style SKAN fill:#f9f,stroke:#333
style KAFKA fill:#ff9,stroke:#333
style FLINK fill:#9ff,stroke:#333
style CRDB fill:#9f9,stroke:#333
Storage and aggregation:
Postbacks arrive as HTTPS webhooks, get queued in Kafka for reliability, then aggregated by Flink into campaign-level conversion metrics. The database stores daily aggregates partitioned by date: campaign identifier, conversion value, postback count, and revenue estimates.
Conversion value interpretation:
Advertisers map the 64 possible conversion values (6 bits) to their business model. Common patterns include quartile-based revenue brackets (0-15 for trials/signups, 16-31 for small purchases, 32-47 for medium, 48-63 for high-value conversions) or subscription tier encoding. The mapping is a critical product decision since it defines what the ML models can optimize for.
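A sketch of one possible mapping along the quartile pattern above; the bracket boundaries and revenue estimates are illustrative product choices, not part of SKAdNetwork itself:

```python
def conversion_value(event: str, revenue_usd: float = 0.0) -> int:
    """Encode a conversion into the 6-bit (0-63) SKAdNetwork value space."""
    if event == "trial_signup":
        return 8                            # 0-15: trials / signups
    if event == "purchase":
        if revenue_usd < 10:
            return 20                       # 16-31: small purchases
        if revenue_usd < 50:
            return 40                       # 32-47: medium purchases
        return 56                           # 48-63: high-value purchases
    return 0                                # no meaningful conversion

def estimated_revenue(value: int) -> float:
    """Reverse mapping used when aggregating postbacks into revenue estimates."""
    brackets = {range(0, 16): 0.0, range(16, 32): 5.0,
                range(32, 48): 25.0, range(48, 64): 75.0}
    return next((rev for r, rev in brackets.items() if value in r), 0.0)
```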
Trade-offs accepted:
- No user-level attribution: Only campaign-level aggregates
- Delayed reporting: 1-3 days lag before optimization possible
- Coarse signals: 64 possible conversion values for all events
- Revenue: SKAdNetwork campaigns achieve 60-70% of IDFA campaign performance due to delayed optimization
Privacy Sandbox Attribution Reporting API (Chrome)
Chrome’s Attribution Reporting API offers two distinct privacy models: event-level reports that link individual clicks to conversions with heavy noise (only 3 bits of conversion data, delayed 2-30 days), and aggregate reports that provide detailed conversion statistics across many users protected by differential privacy. The browser mediates all attribution, storing click events locally and generating reports after random delays to prevent timing attacks.
Integration approach:
Reports arrive at a dedicated endpoint, flow through the same Kafka-Flink-CockroachDB pipeline as SKAdNetwork postbacks, and aggregate into unified campaign-level metrics. This allows treating iOS and Chrome privacy-preserving attribution as a single conceptual layer despite different underlying mechanisms.
Maturity considerations:
Privacy Sandbox is evolving through 2024/2025. Attribution Reporting API is in origin trials (pre-production testing), Topics API is already integrated for contextual interest signals (Part 1), and Protected Audience API (formerly FLEDGE) for on-device auctions remains on the roadmap. The architecture must accommodate API changes as specifications stabilize.
Operational impact:
| Attribution Method | Coverage | Latency | Granularity | Revenue Performance |
|---|---|---|---|---|
| Traditional (cookie/IDFA) | 40-60% (declining) | Real-time | User-level | 100% baseline |
| SKAdNetwork | iOS opt-out users | 24-72 hours | Campaign-level | 60-70% of baseline |
| Privacy Sandbox | Chrome users | 2-30 days | Event-level (noised) or aggregate | 50-80% of baseline (evolving) |
| Contextual-only | All users | Real-time | Request-level | 50-70% of baseline |
Our approach:
- Layer attribution methods: traditional where available, privacy-preserving fallbacks
- Accept delayed optimization for privacy-compliant inventory
- Focus optimization on high-signal traffic (logged-in users, first-party data)
Immutable Financial Audit Log: Compliance Architecture
The Compliance Gap
The CockroachDB operational ledger is mutable by design: it is optimized for operational efficiency, but that mutability violates financial compliance requirements:
- Budget corrections: UPDATE operations modify balances retroactively
- Schema evolution: ALTER TABLE changes data structure
- Data cleanup: DELETE removes old transaction records
- Admin access: DBAs can modify or delete historical financial data
Regulatory violations:
- SOX (Sarbanes-Oxley): Requires immutable audit trail for financial reporting accuracy
- Tax regulations: 7-year retention of unmodifiable transaction records (IRS Circular 230, EU tax directives)
- Advertiser disputes: Need cryptographically verifiable billing history for dispute resolution
- Payment processor compliance: Visa/Mastercard mandates immutable transaction logs
Solution: Dual-Ledger Architecture
Separate operational concerns (performance) from compliance concerns (immutability) using distinct systems:
Operational Ledger (CockroachDB):
- Purpose: Real-time transactional system for budget checks and billing writes
- Mutability: YES (optimized for corrections, cleanup, operational flexibility)
- Query patterns: Current balance, recent transactions, hot campaign data
- Retention: 90 days (then archived to cold storage for cost optimization)
- Performance: 3ms budget deduction writes, 10ms transactional reads
Immutable Audit Log (Kafka → ClickHouse):
- Purpose: Permanent compliance record, non-repudiable financial history
- Mutability: NO (append-only storage with cryptographic hash chaining)
- Query patterns: Historical spend analysis, dispute investigation, tax reporting, audit queries
- Retention: 7 years (minimum tax compliance requirement)
- Performance: Asynchronous ingestion (<5s lag), no impact on operational latency
graph TB
subgraph OPERATIONAL["Operational Systems (Real-Time)"]
BUDGET[Budget Service
3ms latency]
BILLING[Billing Service
Charges & Refunds]
CRDB[(CockroachDB
Operational Ledger
Mutable
90-day retention)]
end
subgraph PIPELINE["Event Pipeline"]
KAFKA[Kafka Topic
financial-events
30-day retention
3x replication]
end
subgraph AUDIT["Immutable Audit Log"]
CH_KAFKA[ClickHouse
Kafka Engine Table]
CH_MV[Materialized View
Transform JSON]
CH_STORAGE[(ClickHouse
MergeTree Storage
Immutable
7-year retention
Hash chaining)]
end
subgraph QUERY["Query Interfaces"]
RECON[Daily Reconciliation Job
Automated 2AM UTC]
METABASE[Metabase Dashboard
Finance Team]
SQL[SQL Client
External Auditors]
EXPORT[Parquet Export
Quarterly Audits]
end
BUDGET -->|Async publish
non-blocking| KAFKA
BILLING -->|Async publish
non-blocking| KAFKA
BUDGET -->|Sync write
3ms| CRDB
BILLING -->|Sync write
5ms| CRDB
KAFKA -->|Real-time consume
5s lag| CH_KAFKA
CH_KAFKA --> CH_MV
CH_MV --> CH_STORAGE
RECON -.->|Query operational| CRDB
RECON -.->|Query audit| CH_STORAGE
METABASE -.->|Ad-hoc queries| CH_STORAGE
SQL -.->|Read-only access| CH_STORAGE
EXPORT -.->|Quarterly extract| CH_STORAGE
style BUDGET fill:#e3f2fd
style BILLING fill:#e3f2fd
style CRDB fill:#fff3e0
style KAFKA fill:#f3e5f5
style CH_STORAGE fill:#e8f5e9
style RECON fill:#ffebee
Event Pipeline Architecture
Event Flow: Budget Service and Billing Service emit structured financial events (budget deductions, impression charges, refunds, allocations) to Kafka financial-events topic asynchronously. Each event contains event type, campaign/advertiser IDs, amount, timestamp, and correlation IDs for traceability.
Kafka Buffer: Topic configured with 30-day retention (safety buffer during ClickHouse downtime), partitioned by campaignId for ordering guarantees, 3× replication for durability. Capacity: 100K events/sec (10% of platform QPS generating financial events).
ClickHouse Ingestion: Kafka Engine table consumes events directly, Materialized View transforms JSON into columnar schema optimized for analytics. MergeTree storage provides append-only immutability with automatic ZSTD compression (65% reduction). Ingestion lag: <5 seconds from event generation to queryable.
Audit Query Patterns
ClickHouse OLAP optimization enables sub-second queries for compliance scenarios:
Campaign Spend History (Tax Reporting): Aggregate all budget deductions for specific campaign over annual period. Common during tax filing season when advertisers request detailed spending breakdowns by campaign, geography, and time period. ClickHouse columnar storage and partition pruning enable sub-500ms queries across billions of events when filtering by campaign and time-range.
Dispute Investigation (Billing Accuracy):
Trace complete event sequence for specific request ID when advertiser disputes charge. Requires chronological ordering of all events (budget deduction, impression charge, click attribution, refund if applicable) to reconstruct exact billing calculation. Bloom filter index on requestId enables <100ms single-request retrieval even across multi-year dataset.
Reconciliation Analysis (Data Integrity): Compare daily aggregate spend between operational ledger (CockroachDB) and audit log (ClickHouse) to detect discrepancies. Requires grouping by campaign with tolerance for rounding differences. ClickHouse materialized views pre-compute daily aggregates for instant reconciliation queries.
Compliance Audit Trail (SOX/Regulatory): External auditors query complete financial history for specific advertiser or time period. Requires filtering by advertiser ID, event type (budget allocations, deductions, refunds), and date range with multi-dimensional grouping. ClickHouse query performance remains sub-second for most audit scenarios due to partition pruning and columnar compression.
Query Access Control
Access Restriction Policy: Financial audit log is classified data with restricted access per Segregation of Duties (SOX compliance). Default access: NONE. Only designated roles below have explicit permissions:
Automated Systems:
- Daily Reconciliation (Airflow service account): Compares operational vs audit ledger aggregates, alerts on variance >$0.01 or >0.001%
- Quarterly Export (scheduled job): Generates Parquet files with cryptographic hash verification for compliance audits
Finance Team: Read-only Metabase access (SSO auth, 30s timeout, 100K row limit). Authorized queries: campaign spend trends, refund analysis, advertiser billing summaries, budget utilization reports. Handles all billing dispute investigations requiring financial data access.
External Auditors: Temporary credentials (expire post-audit) with pre-approved query templates for: annual tax reporting, SOX compliance verification, advertiser reconciliation. Complex queries scheduled off-peak. All auditor activity logged separately for compliance record.
Break-Glass Access: Emergency investigation (data corruption, critical billing bug) requires VP Finance + VP Engineering approval, limited to 1-hour window, full session recording, mandatory post-incident compliance review.
ClickHouse Storage Design
MergeTree Configuration: Ordering key `(campaignId, timestamp)` optimizes campaign history queries. Monthly partitioning by `toYYYYMM(timestamp)` enables efficient pruning for tax/annual reports. ZSTD compression achieves 65% reduction (200GB/day → 70GB/day). Bloom filter index on `requestId` enables <100ms dispute lookups.
Immutability Enforcement: MergeTree prohibits UPDATE/DELETE operations by design. Administrative changes require explicit ALTER TABLE DROP PARTITION (logged separately). Each row includes SHA-256 previousHash creating tamper-evident chain - modification breaks hash sequence, detected during quarterly verification.
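A minimal sketch of the hash-chaining idea, assuming each row stores the previous row's hash (`previousHash`) plus its own hash (`rowHash` is an illustrative extra field); verification walks the rows in order and fails on any break:

```python
import hashlib
import json

def chain_hash(previous_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical encoding of this event."""
    payload = previous_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def verify_chain(rows: list[dict]) -> bool:
    """Quarterly verification: any modified, deleted, or reordered row breaks the chain."""
    previous = "0" * 64                       # genesis value for the first row
    for row in rows:
        event = {k: v for k, v in row.items() if k not in ("previousHash", "rowHash")}
        if row["previousHash"] != previous or chain_hash(previous, event) != row["rowHash"]:
            return False
        previous = row["rowHash"]
    return True
```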
Performance & Cost: Asynchronous write path (1-2ms Kafka publish, <5s ingestion lag) has zero operational latency impact. Query performance: <500ms simple aggregations, 1-3s complex analytics, <100ms dispute lookups. Storage: 180TB for 7-year retention (70GB/day × 2,555 days), approximately 15-20% of database infrastructure cost. Sub-second queries over billions of rows via columnar OLAP optimization.
Daily Reconciliation Process
Automated verification ensuring operational and audit ledgers remain synchronized. This process validates data integrity and detects system issues before they compound into billing disputes.
Reconciliation Job (Airflow DAG, scheduled 2:00 AM UTC daily):
Step 1: Extract Daily Aggregates from Both Systems
Query operational ledger (CockroachDB) and audit log (ClickHouse) for previous 24 hours, aggregating spend per campaign. Operational ledger contains real-time mutable data (90-day retention), while audit log contains immutable append-only events (7-year retention). Aggregation groups by campaign ID, summing budget deductions and impression charges while excluding refunds (handled separately).
Step 2: Compare Aggregates with Tolerance
Per-campaign validation accepts minor differences due to rounding and microsecond-level timing variations. Match tolerance is set at 1 cent OR 0.001% of the campaign total (whichever is greater). For example, a campaign with $10,000 of spend allows up to 10 cents of variance, while a small campaign with $5 of spend allows 1 cent. This tolerance accounts for floating-point rounding in budget calculations and clock skew between systems.
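A minimal sketch of this tolerance check (function and variable names are illustrative):

```python
def within_tolerance(operational_total: float, audit_total: float) -> bool:
    """Match if the delta is under $0.01 OR 0.001% of the campaign total, whichever is greater."""
    campaign_total = max(operational_total, audit_total)
    tolerance = max(0.01, 0.00001 * campaign_total)
    return abs(operational_total - audit_total) <= tolerance

# $10,000 campaign: tolerance = max($0.01, $0.10) = $0.10
# $5 campaign:      tolerance = max($0.01, $0.00005) = $0.01
```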
Step 3: Alert on Significant Discrepancies
P1 PagerDuty alert triggered when campaign variance exceeds threshold. Alert includes: affected campaign IDs, operational vs audit totals, percentage variance, and trend analysis (has this campaign had previous mismatches?). Dashboard visualization shows aggregate delta across all campaigns, enabling quick identification of systemic issues (e.g., Kafka consumer lag affecting all campaigns vs isolated campaign-specific bug).
Step 4: Forensic Investigation
Drill-down analysis retrieves complete event sequence for mismatched campaign from both systems. Event correlation matches operational ledger entries with audit log events by request ID to identify missing events (operational wrote but Kafka publish failed), duplicate events (retry caused double-write), or timing mismatches (event arrived after reconciliation window). Most common root causes:
- Kafka lag (85% of discrepancies): Consumer backlog delays event ingestion >24 hours, resolves automatically when ClickHouse catches up
- Schema mismatch (10%): Field rename in event schema without updating ClickHouse parser, requires parser fix and backfill
- Event emission bug (5%): Edge case where service fails to publish event, requires code fix and manual backfill with audit justification
Step 5: Automated Resolution Tracking
Reconciliation job stores results in dedicated tracking table: campaign ID, discrepancy amount, detection timestamp, resolution status. Daily report summarizes: total campaigns reconciled, mismatch count, average variance, unresolved discrepancy age. Historical trend analysis detects degrading data quality (increasing mismatch rate signals systemic problem requiring investigation).
Historical Success Rate: 99.9%+ of campaigns match daily (typically 0-3 discrepancies out of 10,000+ active campaigns). Most discrepancies resolve automatically within 24-48 hours as delayed Kafka events arrive. Only 1-2 cases per month require manual intervention (code bug fixes, schema corrections, or manual backfill with approval workflow).
Auction Mechanism Design
First-Price Auctions: Industry Standard for RTB
Since 2019, the programmatic advertising industry has standardized on first-price auctions for Real-Time Bidding (RTB) and display advertising. In a first-price auction, the winner pays their bid - not the second-highest bid.
Why First-Price Became Standard:
The industry shifted from second-price to first-price auctions to address transparency concerns and bid landscape visibility. Key drivers:
- Header bidding transparency: Publishers could see all bids, making second-price manipulation visible
- Simpler economics: “Winner pays bid” is easier to explain than second-price mechanisms
- DSP preference: Major demand-side platforms (Google DV360, The Trade Desk) prefer first-price with bid shading
- Revenue impact: First-price with bid shading generates 5-15% higher revenue in practice (theoretical revenue neutrality assumes perfect shading, but DSPs shade conservatively)
Auction Setup:
- \(N\) advertisers submit bids \(b_1, b_2, \ldots, b_N\)
- Each ad has predicted CTR (Click-Through Rate): \(\text{CTR}_1, \text{CTR}_2, \ldots, \text{CTR}_N\) - the probability a user clicks the ad when shown
- Single ad slot to allocate
Effective Bid (eCPM - effective Cost Per Mille):
Advertisers use different pricing models - some pay per impression (CPM), others per click (CPC), others per conversion (CPA). To compare apples-to-apples, we convert all bids to eCPM: expected revenue per 1000 impressions.
The conversion formulas are as follows:
$$ \begin{array}{ll} \text{CPM bid:} & eCPM = CPM (direct) \\ \text{CPC bid:} & eCPM = CPC \times CTR \times 1000 \\ \text{CPA bid:} & eCPM = CPA \times conversion\_rate \times CTR \times 1000 \end{array} $$
This normalizes bids across pricing models: eCPM represents expected revenue per 1000 impressions, accounting for how likely users are to click.
Why this matters: A higher CPC bid with low CTR (5%) may earn less than a lower CPC bid with high CTR (15%). The platform maximizes revenue by selecting the highest eCPM, not highest raw bid.
Winner Selection:
$$w = \arg\max_{i \in [1,N]} \text{eCPM}_i$$
Price Determination (First-Price):
The winner pays their bid (not the second-highest bid):
$$p_w = b_w$$
This is fundamentally different from second-price auctions where winners paid just enough to beat the runner-up.
Example:
| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | B_a | 0.10 | B_a × 0.10 × 1000 = 100 × B_a | 2 |
| B | B_b | 0.15 | B_b × 0.15 × 1000 = 150 × B_b | 1 |
| C | B_c | 0.05 | B_c × 0.05 × 1000 = 50 × B_c | 3 |
Winner: Advertiser B (highest eCPM multiplier: 150× vs 100× vs 50×)
Price paid by B in first-price auction: $$p_B = b_B = B_b$$
Advertiser B pays their full bid amount.
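A minimal sketch of eCPM normalization and first-price winner selection; the dataclass fields and the concrete numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Bid:
    advertiser: str
    model: str            # "CPM", "CPC", or "CPA"
    amount: float         # bid in the advertiser's own pricing model
    ctr: float            # predicted click-through rate
    cvr: float = 0.0      # predicted conversion rate (CPA only)

def ecpm(bid: Bid) -> float:
    """Expected revenue per 1000 impressions."""
    if bid.model == "CPM":
        return bid.amount
    if bid.model == "CPC":
        return bid.amount * bid.ctr * 1000
    return bid.amount * bid.cvr * bid.ctr * 1000      # CPA

def run_first_price_auction(bids: list[Bid]) -> tuple[Bid, float]:
    """Rank by eCPM; the winner pays their own bid (first-price)."""
    winner = max(bids, key=ecpm)
    return winner, winner.amount

bids = [
    Bid("A", "CPC", amount=2.00, ctr=0.10),    # eCPM = 200
    Bid("B", "CPC", amount=1.80, ctr=0.15),    # eCPM = 270  <- wins despite the lower bid
    Bid("C", "CPM", amount=2.50, ctr=0.05),    # eCPM = 2.50
]
winner, price = run_first_price_auction(bids)
print(winner.advertiser, price)                # -> B 1.8
```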
Comparison: Second-Price vs First-Price
In a second-price auction (historical approach), Advertiser B would have paid just enough to beat A’s eCPM (by a small increment). In first-price, they pay their full bid.
The Bid Shading Response:
First-price auctions incentivize bid shading - DSPs use machine learning to predict the minimum bid needed to win and bid slightly above that. This recovers much of the economic efficiency of second-price auctions while maintaining transparency. (See “Bid Shading in First-Price Auctions” section below for details.)
Quality Score and Ad Rank
Ads are ranked by eCPM = bid × CTR, but in practice ad quality also matters for user experience.
The Quality Problem:
Consider two advertisers:
- Advertiser X: Higher bid, fast landing page, relevant ad copy → users happy
- Advertiser Y: Slightly higher bid, slow landing page, misleading ad → users complain
Should Y win just because they bid more? This degrades user experience.
Google’s Solution: Quality Score
Since ~2005, Google Ads has incorporated Quality Score into auction ranking:
$$\text{Ad Rank} = \text{Bid} \times \text{Quality Score}$$
Quality Score Components (1-10 scale):
Google evaluates three components, though exact weights are not publicly disclosed:
- Expected CTR (highest impact): Historical click-through rate for this keyword/ad combination
- Landing Page Experience (highest impact): Page load speed, mobile-friendliness, content relevance, security (HTTPS)
- Ad Relevance (moderate impact): How well ad text matches search query intent
Note: Research shows improving CTR or Landing Page Experience has roughly twice the impact of improving Ad Relevance. Focus optimization efforts on the top two components.
Modified Auction Ranking:
Instead of ranking by eCPM alone, rank by Ad Rank:
$$\text{Ad Rank}_i = b_i \times \text{CTR}_i \times \text{QualityScore}_i \times 1000$$
Example: Quality Beats Price
| Advertiser | Bid | CTR | Quality Score | Ad Rank | Position | Winner? |
|---|---|---|---|---|---|---|
| X | B_low | 0.15 | 10/10 (excellent) | Quality-adjusted eCPM_high | 1 | Yes |
| Y | B_high (40% higher) | 0.15 | 6/10 (poor landing page) | Quality-adjusted eCPM_lower | 2 | No |
Advertiser X wins despite lower raw bid because of higher quality (10/10 vs 6/10).
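A minimal sketch of quality-adjusted ranking, using the cache-miss fallback (default score 7/10) described under Performance Considerations below; names and numbers are illustrative:

```python
DEFAULT_QUALITY = 7.0    # fallback score on cache miss (see Performance Considerations)

def ad_rank(bid: float, ctr: float, quality_score: float | None) -> float:
    """Ad Rank = bid × CTR × QualityScore × 1000."""
    q = quality_score if quality_score is not None else DEFAULT_QUALITY
    return bid * ctr * q * 1000

# Quality beats price: X's 10/10 outranks Y's 40%-higher bid at 6/10
print(ad_rank(1.00, 0.15, 10.0) > ad_rank(1.40, 0.15, 6.0))   # -> True
```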
System Design Implications:
1. Data Pipeline Requirements:
- Historical CTR tracking: Store click/impression data per advertiser-keyword pair
- Landing page metrics: Collect page load times, bounce rates, mobile scores
- Real-time signals: HTTPS status, page availability checks
- Storage: Time-series database for CTR history, key-value store for current quality scores
2. Computation Architecture:
Quality Score is computed offline by ML model, cached, and served at auction time:
graph
subgraph "Offline Pipeline - Runs Daily/Weekly"
direction BT
CACHE_WRITE[Cache Update
Redis/Memcached
Atomic Swap]
PREDICT[Quality Score Prediction
All Advertiser-Keyword Pairs
Millions of Combinations]
TRAIN[ML Model Training
XGBoost/Neural Net
Hours of Batch Processing]
HD[(Historical Data Store
Time-Series DB
Billions of Auction Events)]
HD --> TRAIN
TRAIN --> PREDICT
PREDICT --> CACHE_WRITE
end
subgraph "Online Pipeline - Real-Time <100ms"
direction TB
AUCTION[Auction Request
User Query + Bids
N Advertisers]
CACHE_LOOKUP{Cache Lookup
Redis Read
< 1ms}
CACHE_HIT[Quality Score Retrieved
99%+ Hit Rate]
CACHE_MISS[Cache Miss
Use Default Score = 7/10
< 1% Rate]
COMPUTE[Compute Ad Rank
Bid × CTR × QualityScore
< 1ms]
FIRST_PRICE[First-Price Auction
Rank & Select Winner
< 5ms]
RESULT[Auction Result
Winner + Price
Click/Impression Event]
AUCTION --> CACHE_LOOKUP
CACHE_LOOKUP -->|Hit| CACHE_HIT
CACHE_LOOKUP -->|Miss| CACHE_MISS
CACHE_HIT --> COMPUTE
CACHE_MISS --> COMPUTE
COMPUTE --> FIRST_PRICE
FIRST_PRICE --> RESULT
end
style HD fill:#e1f5ff
style TRAIN fill:#e1f5ff
style PREDICT fill:#e1f5ff
style CACHE_WRITE fill:#e1f5ff
style AUCTION fill:#fff4e1
style CACHE_LOOKUP fill:#fffacd
style CACHE_HIT fill:#d4edda
style CACHE_MISS fill:#f8d7da
style COMPUTE fill:#fff4e1
style FIRST_PRICE fill:#fff4e1
style RESULT fill:#fff4e1
3. Performance Considerations:
- Latency impact: Quality score lookup adds ~0.5-1ms to auction (cache hit)
- Cache warming: Pre-compute scores for active advertisers (99%+ hit rate)
- Fallback: Default quality score (e.g., 7/10) if cache miss
- Update frequency: Quality scores change slowly (update daily, not per-auction)
4. ML Model Deployment:
- Training data: Billions of historical auctions (click events, landing page metrics)
- Features: Ad-keyword relevance (NLP embeddings), historical CTR, page speed metrics
- Model serving: Offline batch prediction, not real-time inference (too slow for auction latency)
- A/B testing: Shadow scoring to test model changes before production
Relationship to First-Price Auctions:
Quality-adjusted first-price auctions work the same way:
- Rank by: Bid × CTR × Quality Score (Ad Rank)
- Pay: Your bid (first-price)
The quality score affects ranking (who wins) but not the fundamental pricing (winner pays bid). This encourages advertisers to improve landing pages, ad relevance, and user experience to achieve better ad positions at lower bids.
Computational Complexity
First-Price Auction Complexity:
- Sort advertisers by eCPM: \(O(N \log N)\)
- Select winner and compute price: \(O(1)\) (winner pays bid - no second-price calculation needed)
- Total: \(O(N \log N)\)
For \(N = 50\) DSPs:
- First-price: ~282 operations (sort + select)
Latency Impact:
At 5ms budget for auction logic:
- First-price auction: easily achievable
- Sorting 50 DSPs by eCPM: <1ms with optimized comparisons
- Winner selection: <0.1ms (just pick highest eCPM)
Implementation Note: First-price auctions are computationally identical to second-price auctions (both O(N log N)). The difference is purely in pricing: first-price returns the winner’s bid, while second-price calculates the minimum bid needed to beat the runner-up.
Reserve Prices and Floor Prices
The Problem:
Without a reserve price (minimum bid), your auction might sell ad slots for very low prices when competition is low. Consider a scenario where only one advertiser bids far below market value for a premium slot - you’d rather show a house ad (promoting your own content) than sell it that cheaply.
What is a Reserve Price?
A reserve price \(r\) is the minimum eCPM required to participate in the auction. If no bids exceed \(r\), the impression is not sold (or filled with a house ad).
The Revenue Trade-off:
Setting the reserve price is a balancing act:
| Reserve Price | What Happens | Example |
|---|---|---|
| Too low (0.25× market rate) | Sell almost all impressions, but accept low-value bids | 95% fill rate × low avg eCPM = suboptimal revenue |
| Optimal (market rate) | Balance between fill rate and price | 70% fill rate × good avg eCPM = optimal revenue |
| Too high (5× market rate) | Only premium bids qualify, but most impressions go unsold | 20% fill rate × high avg eCPM = suboptimal revenue |
Mathematical Formulation:
Expected revenue per impression with reserve price \(r\):
$$\text{Revenue}(r) = r \times P(\text{bid} \geq r)$$
where \(P(\text{bid} \geq r)\) is the probability that at least one bid exceeds the reserve.
Optimal Reserve Price:
Find \(r^*\) that maximizes expected revenue. If bids follow a known distribution with CDF \(F(v)\):
$$r^* = \arg\max_r \left[ r \times (1 - F(r)) \right]$$
Interpretation:
- \(r\) = revenue when impression sells
- \((1 - F(r))\) = probability impression sells (fraction of bids above \(r\))
Concrete Example:
Suppose historical bids range uniformly from zero to maximum bid B_max. What’s the optimal reserve?
For a uniform distribution on \([0, B_{max}]\): \(P(\text{bid} \geq r) = 1 - \frac{r}{B_{max}}\)
Expected revenue: $$\text{Revenue}(r) = r \times \left(1 - \frac{r}{B_{max}}\right) = r - \frac{r^2}{B_{max}}$$
Maximize by setting the derivative to zero: $$\frac{d}{dr}\left(r - \frac{r^2}{B_{max}}\right) = 1 - \frac{2r}{B_{max}} = 0$$
$$r^* = \frac{B_{max}}{2}$$
Result: Optimal reserve is half the maximum bid value (when bids are uniformly distributed).
Practical Approach:
Rather than assuming a distribution, use empirical data:
- Analyze historical bid distribution from past auctions
- Simulate different reserve prices offline and estimate revenue impact
- Run A/B tests with small traffic percentages to validate optimal reserve
- Monitor fill rate vs. revenue trade-off continuously
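A minimal sketch of the empirical approach: estimate \(\text{Revenue}(r) = r \times P(\text{bid} \geq r)\) from historical top bids and pick the best candidate reserve. The bid data and candidate grid are illustrative:

```python
def expected_revenue(top_bids_ecpm: list[float], reserve: float) -> float:
    """Revenue(r) = r × P(bid ≥ r), estimated from the empirical bid distribution."""
    p_sell = sum(b >= reserve for b in top_bids_ecpm) / len(top_bids_ecpm)
    return reserve * p_sell

def best_reserve(top_bids_ecpm: list[float], candidates: list[float]) -> float:
    return max(candidates, key=lambda r: expected_revenue(top_bids_ecpm, r))

historical_top_bids = [0.8, 1.2, 2.5, 3.1, 0.4, 5.0, 2.2]     # eCPM of past top bids
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
print(best_reserve(historical_top_bids, candidates))          # -> 2.0 on this data
```

Offline simulation like this only narrows the candidate range; the A/B tests on a small traffic slice mentioned above validate the choice before rollout.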
Multi-Dimensional Reserve Prices:
In practice, reserve prices are often segmented by:
- Geo: Higher reserves for premium markets (US/UK) vs. developing markets
- Device: Mobile vs. desktop vs. CTV (Connected TV)
- User segment: High-value users (purchase intent) vs. casual browsers
- Time of day: Peak hours vs. off-peak
- Inventory quality: Above-the-fold vs. below-the-fold
Implementation Note:
Reserve prices work identically in first-price and second-price auctions - they filter out bids below the threshold before ranking. The difference is only in what the winner pays (their bid vs. second-highest bid).
Bid Shading in First-Price Auctions
With first-price auctions, DSPs face a strategic problem: bidding true value guarantees zero profit (you pay exactly what the impression is worth to you). This creates the bid shading optimization problem.
The Bid Shading Problem:
In first-price auctions:
- Bid too high: You win but overpay (negative ROI)
- Bid too low: You lose to competitors (missed opportunity)
- Optimal strategy: Bid just above the second-highest bidder (but you don’t know their bid!)
How Bid Shading Works:
DSPs use machine learning to predict the competitive landscape and bid strategically:
- Collect historical data: Track wins, losses, and winning prices across millions of auctions
- Build bid landscape model: For each impression context (user, publisher, time), predict:
- Probability of winning at price \(p\): \(P(\text{win} | \text{bid} = p)\)
- Distribution of competitor bids
- Optimize bid: Choose bid \(b\) that maximizes expected profit:
$$b^* = \arg\max_b \left[ (v - b) \times P(\text{win} | b) \right]$$
where \(v\) is the true value of the impression to the advertiser.
Example:
Suppose an advertiser values an impression at V_imp (based on predicted conversion rate). The bid landscape model predicts:
- Bid V_imp: 90% win rate (no profit - paying true value)
- Bid 0.80 × V_imp: 75% win rate (expected profit: 0.20 × V_imp × 75% = 0.15 × V_imp)
- Bid 0.70 × V_imp: 60% win rate (expected profit: 0.30 × V_imp × 60% = 0.18 × V_imp)
- Bid 0.60 × V_imp: 40% win rate (expected profit: 0.40 × V_imp × 40% = 0.16 × V_imp)
Optimal bid: 0.70 × V_imp (maximizes expected profit at 0.18 × V_imp per auction)
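A minimal sketch of this optimization from the DSP's side; the win-rate curve is a stand-in for a learned bid-landscape model:

```python
def optimal_bid(value: float, win_rate: dict[float, float]) -> float:
    """Pick the bid fraction that maximizes (value - bid) × P(win | bid)."""
    def expected_profit(frac: float) -> float:
        bid = frac * value
        return (value - bid) * win_rate[frac]
    return max(win_rate, key=expected_profit) * value

# The example above: shading to 70% of value maximizes expected profit
win_rate = {1.0: 0.90, 0.8: 0.75, 0.7: 0.60, 0.6: 0.40}
print(optimal_bid(10.0, win_rate))    # -> 7.0
```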
Why First-Price + Bid Shading ≈ Second-Price:
Bid shading recovers much of the economic efficiency of second-price auctions:
- Second-price: Winner pays second-highest bid
- First-price + shading: Winner bids slightly above predicted second-price
The small difference represents the DSP’s uncertainty about the competitive landscape. As bid landscape models improve, first-price with shading converges toward second-price revenue.
System Design Implications:
From the SSP (supply-side platform) perspective:
- Expect strategic bidding: DSPs will NOT bid true value - this is intentional and economically efficient
- Bid landscape opacity: Don’t share winning bid distributions (preserves auction integrity)
- Revenue impact: First-price with bid shading can generate approximately 5-15% higher revenue than second-price in practice, though exact figures vary by market conditions and DSP sophistication. The revenue lift comes from imperfect bid shading - DSPs tend to shade conservatively to avoid losing auctions, resulting in slightly higher clearing prices.
Implementation Note: SSPs don’t implement bid shading - that’s the DSP’s responsibility. The SSP simply runs a first-price auction (rank by eCPM, winner pays bid). The complexity of bid optimization happens on the demand side.
Historical Context: Second-Price Auctions
Before 2019, the programmatic advertising industry primarily used second-price auctions (specifically, Generalized Second-Price or GSP auctions). Understanding this history helps explain design decisions in legacy systems and why the industry shifted to first-price.
Why Second-Price Was Popular (2000s-2018):
- Theoretical elegance: Encouraged truthful bidding (in theory)
- Simpler for advertisers: “Bid your true value” was easier to explain than bid shading
- Google’s influence: Google Search Ads used GSP successfully, setting industry precedent
- Established ecosystem: Bidding algorithms optimized for second-price dynamics
How Second-Price (GSP) Works:
In a second-price auction, the winner pays just enough to beat the second-highest bidder:
$$p_w = \frac{\text{eCPM}_{2nd}}{\text{CTR}_w \times 1000} + \epsilon$$
where \(\epsilon\) is a small increment.
Example:
| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | B_a | 0.10 | 100 × B_a | 2 |
| B | B_b | 0.15 | 150 × B_b | 1 |
| C | B_c | 0.05 | 50 × B_c | 3 |
Winner: Advertiser B (highest eCPM multiplier: 150×)
Price paid by B in second-price: $$p_B = \frac{100 \times B_a}{0.15 \times 1000} = 0.67 \times B_a$$
Advertiser B only pays enough to beat A’s eCPM (not their full bid B_b).
Why the Industry Shifted to First-Price (2017-2019):
Several factors drove the migration:
- Header bidding transparency: Publishers could see all bids simultaneously, making second-price “bid reduction” visible and contentious
- Price floor manipulation: SSPs could manipulate second-price auctions by setting floors strategically
- Complexity: Second-price pricing logic was opaque (“Why did I pay less than my bid?”)
- DSP preference: Major DSPs (Google DV360, The Trade Desk) preferred first-price with their own bid shading
- Revenue impact: First-price with bid shading generates 5-15% higher revenue in practice (DSPs shade conservatively)
Timeline:
- 2017: AppNexus (now Xandr) pioneered first-price for programmatic
- 2018: Google AdX announced transition to first-price
- 2019: Industry-wide shift complete - first-price became standard for RTB
GSP Still Used for Sponsored Search:
Google Search Ads, Microsoft Ads, and Amazon Sponsored Products still use GSP (second-price) because:
- Established advertiser ecosystems
- Different transparency requirements (no header bidding)
- Decades of advertiser education and tooling
- Network effects (switching cost too high)
Key Difference: Search vs. Display:
| Auction Type | Used For | Pricing |
|---|---|---|
| GSP (Second-Price) | Sponsored search (Google Search Ads) | Winner pays second-highest + small increment |
| First-Price | Programmatic display/video/CTV (RTB) | Winner pays their bid |
This blog focuses on first-price auctions because they are the modern standard for Real-Time Bidding (RTB) and programmatic display advertising - the architecture described in this document.
Budget Pacing: Distributed Spend Control
Architectural Driver: Financial Accuracy - Pre-allocation pattern with Redis atomic counters ensures budget consistency across regions. Max over-delivery bounded to 1% of daily budget (acceptable legal risk) while avoiding centralized bottleneck.
Problem: Advertisers set daily budgets (e.g., daily limit). In a distributed system serving 1M QPS, how do we prevent over-delivery without centralizing every spend decision?
Challenge:
Centralized approach (single database tracks spend):
- Latency: ~10ms per spend check
- Throughput bottleneck: ~100K QPS max
- Single point of failure
Solution: Pre-Allocation with Periodic Reconciliation
graph TD
ADV[Advertiser X
Daily Budget: B_daily]
ADV --> BUDGET[Atomic Pacing Service]
BUDGET --> REDIS[(Redis
Atomic Counters)]
BUDGET --> CRDB[(CockroachDB
Billing Ledger
HLC Timestamps)]
BUDGET -->|Allocate amount_1| AS1[Ad Server 1]
BUDGET -->|Allocate amount_2| AS2[Ad Server 2]
BUDGET -->|Allocate amount_3| AS3[Ad Server 3]
AS1 -->|Spent: S1
Return: unused_1| BUDGET
AS2 -->|Spent: S2
Return: unused_2| BUDGET
AS3 -->|Spent: S3
Return: unused_3| BUDGET
BUDGET -->|Periodic reconciliation
HLC timestamped| CRDB
TIMEOUT[Timeout Monitor
5min intervals] -.->|Release stale
allocations| REDIS
REDIS -->|Budget < 10%| THROTTLE[Dynamic Throttle]
THROTTLE -.->|Reduce allocation
size dynamically| BUDGET
classDef server fill:#e3f2fd,stroke:#1976d2
classDef budget fill:#fff3e0,stroke:#f57c00
classDef advertiser fill:#e8f5e9,stroke:#4caf50
class AS1,AS2,AS3 server
class BUDGET,REDIS,CRDB,TIMEOUT,THROTTLE budget
class ADV advertiser
How it works:
- Atomic Pacing Service pre-allocates budget chunks to Ad Servers (variable allocation amounts)
- Ad Servers spend from local allocation using Redis atomic counters (no coordination needed)
- Periodic reconciliation (every 30 seconds): Ad Servers return unused budget to Atomic Pacing Service
- CockroachDB records all spend events with HLC (Hybrid Logical Clock) timestamps for globally ordered audit trail
- Timeout Monitor releases stale allocations after 5 minutes (handles server crashes)
- Dynamic Throttle reduces allocation size when budget < 10% remaining (prevents over-delivery)
Budget Allocation Operations:
Allocation request (Ad Server requests budget chunk):
- Operation: Atomically decrement global budget counter (deduct allocation amount)
- Returns: Remaining budget or error if insufficient
- Frequency: Every 30-60 seconds per Ad Server
Reconciliation (Ad Server returns unused budget):
- Operation: Atomically increment global budget counter (return unused amount)
- Returns: Updated budget total
- Frequency: When allocation period expires or Ad Server scales down
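A minimal sketch of these two operations, assuming redis-py and a per-campaign integer counter held in cents (key name and units are illustrative):

```python
import redis

r = redis.Redis()

def allocate_chunk(campaign_id: str, chunk_cents: int) -> bool:
    """Ad Server requests a budget chunk; roll back if the counter would go negative."""
    key = f"budget_remaining:{campaign_id}"
    remaining = r.decrby(key, chunk_cents)      # atomic deduction
    if remaining < 0:
        r.incrby(key, chunk_cents)              # insufficient budget: undo the deduction
        return False
    return True

def return_unused(campaign_id: str, unused_cents: int) -> None:
    """Reconciliation: return whatever the Ad Server did not spend."""
    r.incrby(f"budget_remaining:{campaign_id}", unused_cents)
```

Integer cents avoid floating-point drift in the counter, and the brief negative dip before the rollback is tolerable because allocation requests happen only every 30-60 seconds per server.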
Key Properties:
- Atomic operations: `DECRBY` is atomic, prevents race conditions
- No coordination latency: Each Ad Server decides locally
- Bounded over-delivery: Maximum over-delivery = (# servers) × (allocation size)
- Self-healing: Timeout monitor recovers from server failures
Mathematical Analysis:
Over-Delivery Bound:
Maximum over-delivery: $$\text{OverDelivery}_{max} = S \times A$$
where \(S\) = number of servers, \(A\) = allocation chunk size.
Example: 100 servers with an allocation of A each → max over-delivery of 100 × A; with a daily budget of 1000 × A, that is 10% of the budget.
Mitigation: Dynamic allocation sizing.
When the remaining budget \(B_r\) drops below 10% of the daily budget: $$A_{new} = \frac{B_r}{S \times 10}$$
Maximum over-delivery then becomes \(S \times A_{new} = B_r / 10\), and since \(B_r\) is at most 10% of the daily budget at that point, over-delivery is capped at roughly 1% of the budget.
The Critical Path: Synchronous Budget Check (5ms)
The Missing Piece: The pre-allocation strategy above handles periodic budget allocation (every 30-60s), but doesn’t explain the per-request budget check that must happen on EVERY ad request at 1M QPS. This synchronous check is the critical path component that ensures financial accuracy while meeting the 150ms SLO.
The Challenge at 1M QPS:
Naive approach (query CockroachDB on every request):
- Latency: 10-15ms per query (p99)
- Result: Violates 150ms SLO (adds 10-15ms to critical path)
- Throughput: Creates massive database contention
Solution: Bounded Micro-Ledger (BML) Architecture
The BML system provides three-tier budget enforcement that achieves both low latency (3-5ms) and financial accuracy (bounded overspend):
Three-Tier BML Architecture (Critical Financial Atomicity Mechanism):
Tier 1: Synchronous Budget Check (Redis Lua Script - 3ms)
- Component: Atomic Pacing Service executes Lua script in Redis
- Function: Check if `current_spend + cost ≤ budget_limit + INACCURACY_BOUND`
- Critical Property: The `INACCURACY_BOUND` (typically 0.5-1% of `budget_limit`) is the mathematical guarantee that ensures ≤1% billing accuracy
- Atomicity: Lua script runs single-threaded in Redis, preventing race conditions
- Latency: 3ms avg (5ms p99) - fits within critical path budget
Tier 2: Asynchronous Delta Propagation (Redis → Kafka)
- Component: Atomic Pacing Service publishes spend deltas (read from Redis) to a Kafka topic
- Function: Stream of spend events for audit trail and reconciliation
- Frequency: Every 5 seconds per campaign or on cumulative threshold
- Event format: `{campaign_id, spend_delta, timestamp, transaction_id}`
- Implementation: After the Lua script completes successfully, the Atomic Pacing Service emits the event to Kafka asynchronously (non-blocking, does not impact the 3ms budget)
Tier 3: Reconciliation Processor (Flink/Kafka Streams → CockroachDB)
- Component: Flink job consumes Kafka stream and batch-commits to CockroachDB
- Function: Maintain strong-consistency ledger as source of truth
- Batch window: 30-second aggregation window
- Strong consistency: CockroachDB ACID transactions with HLC timestamps
- Periodic sync: Every 60s, sync Redis counters from CockroachDB to correct drift
Why This Three-Tier Architecture is Required:
- Tier 1 alone: Fast but lacks audit trail and drift correction
- Tier 3 alone: Accurate but too slow (10-15ms) for 1M QPS critical path
- Combined: 3ms latency + mathematical bounded overspend + immutable audit trail
graph TB
subgraph "Synchronous Tier (3ms - Critical Path)"
REQ[Ad Request
1M QPS] --> AUCTION[Auction Selects Winner
Ad from Campaign X
Cost: C]
AUCTION --> BML_CHECK{BML: Atomic
Check & Deduct}
BML_CHECK -->|Budget OK| REDIS_LUA[Redis Lua Script
ATOMIC:
if spend+cost < limit+bound
then deduct
Latency: 3ms]
REDIS_LUA -->|SUCCESS| SERVE[Serve Ad
Revenue: C]
BML_CHECK -->|Budget EXHAUSTED| NEXT[Try Next Bidder
or House Ad]
end
subgraph "Asynchronous Tier (Reconciliation)"
REDIS_LUA -.->|Emit delta
every 5s| KAFKA[Kafka
Spend Events]
KAFKA -.-> FLINK[Flink
Aggregate]
FLINK -.->|Batch commit
every 30s| CRDB[(CockroachDB
Billing Ledger
Source of Truth)]
end
CRDB -.->|Periodic sync
every 60s| REDIS_LUA
classDef critical fill:#ffcccc,stroke:#cc0000,stroke-width:2px
classDef async fill:#ccffcc,stroke:#00cc00,stroke-dasharray: 5 5
class REQ,AUCTION,BML_CHECK,REDIS_LUA,SERVE critical
class KAFKA,FLINK,CRDB async
Bounded Micro-Ledger (BML) Components:
1. Synchronous Tier: Redis Atomic Counter (3ms Budget)
Purpose: Fast, atomic check-and-deduct for every ad request
Atomic Check-and-Deduct Algorithm:
The algorithm executes atomically within Redis (single-threaded, no concurrent modifications possible):
Inputs:
- `campaign_id`: Which campaign to check
- `cost`: Amount to spend for this ad impression (e.g., impression cost C)
- `inaccuracy_bound`: Safety buffer to prevent unbounded overspend (typically 0.5-1% of budget)
Algorithm Steps:
- Read current state from the Redis hash for this campaign:
  - `current_spend`: How much has already been spent today
  - `budget_limit`: Daily budget cap
- Calculate remaining budget: `remaining = budget_limit - current_spend`
- Atomic decision: Check if spend is allowed
  - CRITICAL CONDITION (key to ≤1% billing accuracy): `current_spend + cost ≤ budget_limit + inaccuracy_bound`
  - If TRUE (budget available):
    - Increment the spend counter by `cost` atomically
    - Return SUCCESS with the new remaining budget
  - If FALSE (budget exhausted):
    - Do NOT modify the spend counter
    - Return BUDGET_EXHAUSTED with the current remaining budget
Critical Design Property: The `inaccuracy_bound` parameter in the Lua script condition is the mathematical enforcement mechanism that guarantees ≤1% billing accuracy. By setting `inaccuracy_bound = 0.01 × budget_limit`, we ensure maximum overspend is bounded to 1% of the daily budget. This is the ONLY way to achieve bounded financial accuracy while maintaining 3ms latency at 1M QPS.
Why This is Atomic:
Redis executes the entire algorithm as a single atomic operation (Lua script runs single-threaded). Even if 1,000 requests arrive simultaneously, Redis processes them serially one-at-a-time, guaranteeing no race conditions.
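A minimal sketch of this check-and-deduct, assuming redis-py and a per-campaign Redis hash with `current_spend` and `budget_limit` fields; the key layout is illustrative, not the production schema:

```python
import redis

CHECK_AND_DEDUCT = """
local key   = KEYS[1]                    -- e.g. budget:campaign_12345
local cost  = tonumber(ARGV[1])          -- cost of this impression
local bound = tonumber(ARGV[2])          -- inaccuracy bound (e.g. 1% of budget)

local spend = tonumber(redis.call('HGET', key, 'current_spend') or '0')
local limit = tonumber(redis.call('HGET', key, 'budget_limit') or '0')

if spend + cost <= limit + bound then
    redis.call('HINCRBYFLOAT', key, 'current_spend', cost)   -- atomic deduct
    return 1                                                  -- SUCCESS
end
return 0                                                      -- BUDGET_EXHAUSTED
"""

r = redis.Redis()
check_and_deduct = r.register_script(CHECK_AND_DEDUCT)

def try_spend(campaign_id: str, cost: float, budget_limit: float) -> bool:
    bound = 0.01 * budget_limit                      # 1% inaccuracy bound
    return check_and_deduct(
        keys=[f"budget:campaign_{campaign_id}"],
        args=[cost, bound],
    ) == 1
```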
Key Properties:
- Atomic: Lua script executes atomically in Redis (single-threaded execution)
- Fast: 5ms p99 total latency (3ms script execution + 2ms network RTT)
- Bounded inaccuracy: The `inaccuracy_bound` ($5) prevents unbounded overspend
- High throughput: Redis handles 1M+ ops/sec per shard
2. Asynchronous Tier: Reconciliation to CockroachDB
Purpose: Periodic sync to source of truth for audit trail and accuracy
Reconciliation Process (Flink Stream Processing Job, runs every 30s):
Step 1: Aggregate Spending Deltas
- Flink consumes spend events from Kafka stream
- Groups events by `campaign_id`
- Example: Campaign 12345 spent $2.50 + $3.00 + $1.75 = $7.25 in this window
Step 2: Batch Commit to CockroachDB
- Open distributed transaction across CockroachDB cluster
- For each campaign with spending activity:
- Insert new spending record with HLC timestamp (for global ordering)
- If campaign record exists, increment cumulative spend counter
- If campaign record doesn’t exist, create new entry
- Commit transaction atomically across all shards
- CockroachDB ensures ACID guarantees and audit trail
Step 3: Sync Redis from Source of Truth (every 60s)
- Query CockroachDB for true cumulative spend per campaign
- Update Redis hash with authoritative spend values
- Detect drift: if Redis and CockroachDB differ by >$50, alert operations team
- This corrects any Redis cache inconsistencies (restarts, clock skew, missed events)
Why Two-Tier Works:
- Redis: Fast but eventually consistent (acceptable for bounded inaccuracy)
- CockroachDB: Slow but strongly consistent (source of truth for billing)
- Reconciliation: Bridges the gap, keeping Redis approximately correct while maintaining perfect audit trail
3. Integration with Request Flow
The budget check sits in the Auction Logic phase:
Before:
- Auction Logic (5ms): Sort by eCPM, select winner
After (with BML):
- Auction Logic (8ms avg, 10ms p99):
- Sort by eCPM, select winner: 3ms
- Budget check (BML): 3ms avg (5ms p99) ← NEW
- Overhead: 2ms
- If budget OK: serve ad
- If budget exhausted: try next bidder (repeat check)
Updated Request Flow Timing:
Complete request path latency breakdown:
| Component | Latency | Notes |
|---|---|---|
| Network + Gateway | 15ms | |
| User Profile | 10ms | |
| Integrity Check | 5ms | Fraud detection (BEFORE RTB) |
| Feature Store | 10ms | |
| Ad Selection | 15ms | |
| ML Inference | 40ms | (parallel execution) |
| RTB Auction | 100ms | (parallel, critical path) |
| Auction + Budget Check | 8ms | Budget enforcement |
| Response | 5ms | |
| Total | 143ms | (5-7ms buffer to 150ms SLO) |
Mathematical Proof: Bounded Overspend of $5 per Campaign
Theorem: Maximum overspend per campaign is bounded to the inaccuracy_bound value ($5).
Proof:
Define:
- \(B\) = Daily budget limit
- \(S(t)\) = Recorded spend at time \(t\)
- \(\Delta\) = Inaccuracy bound ($5)
- \(c_i\) = Cost of request \(i\)
The Lua script allows spend if: $$S(t) + c_i \leq B + \Delta$$
Worst case scenario: Multiple concurrent requests hit Redis simultaneously before the spend counter updates.
Maximum concurrent overshoot:
At most \(\Delta\) dollars can be spent beyond the limit because:
- Every accepted request satisfies \(S(t) + c_i \leq B + \Delta\) at the moment it is applied, so the counter never exceeds \(B + \Delta\)
- Once \(S(t) + c_i > B + \Delta\), the Lua script rejects the request
- Even if 1,000 requests arrive at the exact same nanosecond, Redis executes Lua scripts serially, so no two requests can act on the same stale counter value
Mathematical upper bound:
$$\text{Total Spend} \leq B + \Delta$$
$$\text{Overspend} = \max(0, \text{Total Spend} - B) \leq \Delta = \$5$$
Practical example:
Campaign has $1000 daily budget with $5 inaccuracy bound:
- True limit in Lua script: $1005
- Maximum possible spend: $1005
- Maximum overspend: $5 (0.5% of budget)
- Legally acceptable under standard advertising contracts
Alternative Explanation: In-Flight Requests Model
The inaccuracy_bound parameter ($5) can also be derived from system characteristics rather than configured arbitrarily. This approach calculates the bound based on request latency and throughput.
Parameters:
- \(Q_{campaign}\) = Requests per second for this campaign (e.g., 1,000 QPS)
- \(T_{req}\) = Request latency (150ms P99)
- \(L\) = Average ad cost ($0.005 per impression)
In-flight requests calculation:
When a budget counter hits zero, there are already requests in-flight that checked the budget as “available”:
$$R_{inflight} = Q_{campaign} \times T_{req} = 1,000 \times 0.15 = 150 \text{ requests}$$
Maximum overspend from in-flight requests:
If all in-flight requests complete (worst case):
$$Overspend_{max} = R_{inflight} \times L = 150 \times \$0.005 = \$0.75$$
Connecting both models:
The inaccuracy_bound parameter ($5) provides a roughly 7× safety margin over the calculated in-flight overspend ($0.75):
- Configuration parameter:
inaccuracy_bound = $5(set in Lua script) - Actual worst-case: ~$0.75 from in-flight requests
- Why the gap?: Accounts for traffic bursts, retry storms, circuit breaker delays
Both models are valid:
inaccuracy_boundmodel: What we configure in the system (Lua script parameter)- In-flight requests model: Why that configuration is sufficient (derived from system behavior)
For typical campaigns ($1,000-$10,000 daily budgets), both approaches yield overspend ≤0.5%, meeting the ≤1% financial accuracy requirement.
4. Handling Reconciliation Drift
Problem: Redis counter drifts from CockroachDB source of truth due to:
- Redis cache misses/restarts
- Delayed reconciliation
- Clock skew
Solution: Periodic Sync Procedure (runs every 60s):
Algorithm:
-
Query Source of Truth
- For each active campaign, query CockroachDB billing ledger
- Compute true cumulative spend:
SUM(spend) WHERE campaign_id = X - This is the authoritative value (immutable audit trail)
-
Update Redis Cache
- Write true spend value to Redis hash for this campaign
- Overwrite any stale or drifted value
- Redis now reflects accurate state from source of truth
-
Detect and Alert on Drift
- Read current Redis value before overwriting
- Calculate drift:
|true_spend - redis_spend| - If drift exceeds threshold ($50):
- Alert operations team via PagerDuty
- Log discrepancy for investigation
- Common causes: Redis node restart, delayed reconciliation, split-brain scenario
Why Drift Happens:
- Redis restarts: Counter resets to 0, reconciliation hasn’t caught up yet
- Reconciliation lag: 30-60s delay between spend and CockroachDB commit
- Network partition: Redis shard temporarily isolated from reconciliation stream
Why Drift is Acceptable:
- Maximum drift bounded by reconciliation window: $X spent in 60s
- For typical campaign ($1,000/day budget): 60s ≈ $0.70 at uniform pacing
- Actual drift usually <$10 (well within $5 inaccuracy bound per transaction)
- Periodic sync corrects drift before it accumulates
Failure Mode: Tier 3 Reconciliation Outage
If Flink job or Kafka become unavailable:
- Tier 1 continues operating: Budget checks work normally (Redis is independent)
- Impact: Audit trail writing to CockroachDB is paused
- Detection: Periodic sync (60s) detects drift > $50, alerts operations team via PagerDuty
- Recovery: When Flink recovers, processes backlog from Kafka (Kafka retains events for 7 days)
- Maximum data loss: None - Kafka retention ensures event replay capability
- Bounded risk: Redis continues enforcing spend limits, preventing unbounded overspend
This failure mode demonstrates graceful degradation: critical path (Tier 1) remains operational while audit trail temporarily lags. Financial accuracy is maintained via bounded inaccuracy, audit completeness is recovered via Kafka replay.
Why This Works at 1M QPS:
- Sharding: Redis cluster sharded by `campaign_id` (100+ shards)
- Latency: Lua script execution: 1-3ms, network RTT: 1-2ms = 3-5ms total
- Bounded inaccuracy: $5 overspend is legally acceptable (0.05-0.5% of typical campaign budgets)
Why CockroachDB Alone Doesn’t Work:
- Latency: 10-15ms p99 (too slow for critical path)
- Throughput: Would require complex sharding strategy
- Contention: Hot campaigns would create write bottlenecks
- Cost: 3× more expensive than Redis for high-frequency operations
Trade-offs:
| Approach | Latency | Accuracy | Cost | Scalability |
|---|---|---|---|---|
| CockroachDB only | 10-15ms (slow) | Perfect | High | Limited |
| Redis only | 5ms | Bounded ($5) | Low | Excellent |
| BML (both tiers) | 5ms | Bounded + audited | Medium | Excellent |
Idempotency Protection: Defending Against Double-Debits (CRITICAL)
The Problem: Double-Debit Risk
The BML architecture above handles budget enforcement correctly during normal operation, but lacks defense against a critical failure scenario: message replay after service crashes.
Failure scenario:
- Ad Server Orchestrator (AS) processes ad request, runs auction, selects winning ad
- AS calls Atomic Pacing Service → Redis Lua script successfully debits campaign budget
- AS crashes before sending response to client (network issue, pod restart, out-of-memory)
- Client doesn’t receive response, retries the same request (standard retry behavior)
- AS processes retry, runs auction again, debits budget AGAIN (double-debit for single impression)
- Result: Double-debit violates ≤1% billing accuracy constraint
Why this violates financial integrity:
- At 1M QPS with 0.1% retry rate: 1,000 retries/second (0.1% of total traffic)
- Without idempotency protection: 100% of retries = double billing on that traffic segment
- Impact magnitude: 0.1% traffic × 2× billing = +0.1% gross overbilling = systematic >10× violation of ≤1% accuracy constraint
- Consequence: Catastrophic for advertiser trust, payment processor compliance, potential regulatory/legal liability
Solution: Idempotency Key Store
The Atomic Pacing Service must implement idempotent budget deductions using a Redis-backed idempotency key mechanism.
Architecture:
graph TB
REQ[Ad Request
client_request_id: abc123] --> AS[Ad Server Orchestrator]
AS --> GEN[Generate Idempotency Key
UUID + Timestamp
Key: idem:campaign_X:abc123]
GEN --> LUA[Redis Lua Script
Atomic Check-and-Set]
LUA --> CHECK{Key exists?}
CHECK -->|YES| CACHED[Return cached result
DEDUP: Budget NOT debited
Return previous debit amount]
CHECK -->|NO| DEBIT[Debit budget: -$2.50
Store key with TTL=30s
Value: debit_amount=$2.50]
CACHED --> RESP1[Return to client
Idempotent response]
DEBIT --> RESP2[Return to client
Fresh debit]
TTL[TTL Expiration
After 30 seconds] -.->|Auto-delete key| CLEANUP[Key removed
Prevents memory leak]
style CHECK fill:#fff3e0,stroke:#f57c00
style CACHED fill:#c8e6c9,stroke:#4caf50
style DEBIT fill:#ffccbc,stroke:#ff5722
Implementation: Enhanced Redis Lua Script
The Lua script must perform an atomic check-and-set to guarantee exactly-once semantics.
Enhanced Lua script logic (a minimal sketch follows this list):
The script performs the check-and-set in a single Redis transaction:
- Check idempotency key: GET on the idempotency key
- If the key exists: return the cached result (deduplication - the budget was already debited)
  - Signals to the caller: `deduplicated=true`, returns the previous debit amount
  - Critical: the budget is NOT debited again (exactly-once guarantee)
- If the key doesn't exist:
  - Check the budget: `current_spend + cost <= budget_limit + inaccuracy_bound`
  - If the budget is OK: debit the budget AND store the idempotency key atomically
    - DECRBY: deduct the cost from the budget counter
    - SETEX: store the idempotency key with a 30-second TTL
    - Key value contains: debit amount, timestamp, transaction metadata
  - If the budget is exhausted: return an error (no debit, no key stored)
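To make that concrete, here is a minimal sketch of the check-and-set script plus a Python wrapper using redis-py's `register_script`. The key names, micro-dollar units, and the convention that the counter holds the campaign's remaining allocation are illustrative assumptions, not the production implementation:

```python
import redis  # redis-py client; Valkey speaks the same protocol

# Hypothetical connection; in production this would be the pacing Redis cluster.
r = redis.Redis(host="localhost", port=6379)

# KEYS[1] = budget counter (remaining allocation, micro-dollars)
# KEYS[2] = idempotency key
# ARGV[1] = cost (micro-dollars), ARGV[2] = inaccuracy bound, ARGV[3] = key TTL (s)
DEBIT_LUA = """
local cached = redis.call('GET', KEYS[2])
if cached then
  return {1, cached}                    -- retry: return previous debit, no new DECRBY
end
local remaining = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
if remaining + tonumber(ARGV[2]) < cost then
  return {2, '0'}                       -- exhausted beyond the inaccuracy bound
end
redis.call('DECRBY', KEYS[1], cost)     -- debit and record the idempotency key atomically
redis.call('SETEX', KEYS[2], tonumber(ARGV[3]), ARGV[1])
return {0, ARGV[1]}
"""
debit_script = r.register_script(DEBIT_LUA)

STATUS = {0: "debited", 1: "deduplicated", 2: "exhausted"}

def atomic_debit(campaign_id: str, request_id: str, cost_micros: int,
                 bound_micros: int = 5_000_000, ttl_s: int = 30) -> tuple[str, int]:
    """Debit a campaign at most once per client request id."""
    keys = [f"budget:campaign_{campaign_id}",
            f"idem:campaign_{campaign_id}:{request_id}"]
    status, amount = debit_script(keys=keys, args=[cost_micros, bound_micros, ttl_s])
    return STATUS[int(status)], int(amount)
```

Because the whole block runs as one script, the GET, DECRBY, and SETEX cannot interleave with another request's debit.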
Idempotency Key Naming Convention:
Keys follow a hierarchical pattern for efficient sharding and collision prevention:
Pattern: idem:campaign_{campaign_id}:{client_request_id}_{timestamp_bucket}
Components:
- Prefix (`idem`): namespace for idempotency keys (separates them from budget counters)
- `campaign_id`: scopes keys per campaign (supports Redis cluster sharding)
- `client_request_id`: unique identifier from the client (UUID v4, trace ID, or request hash)
- `timestamp_bucket`: rounded timestamp (prevents collisions across time windows)
Example: `idem:campaign_12345:req_abc123_1704067200`
Why this format works (a small key-builder sketch follows):
- Sharding: scoping keys by campaign ID lets a campaign's keys be co-located on one Redis shard (via a hash tag on the campaign portion in cluster mode)
- Uniqueness: the combination of campaign + request_id + timestamp eliminates collisions
- Queryability: pattern matching enables monitoring (`SCAN idem:campaign_12345:*`)
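As an illustration of the convention, a small helper might look like the following; here the bucket is derived from the client-supplied original request timestamp so that retries land in the same bucket (an assumption, since the post does not pin down where the timestamp comes from):

```python
def idempotency_key(campaign_id: int, client_request_id: str,
                    request_ts: int, bucket_seconds: int = 30) -> str:
    """Build idem:campaign_{id}:{request_id}_{timestamp_bucket}.

    request_ts is the client's original request timestamp (epoch seconds),
    so every retry of the same request maps to the same bucket.
    """
    bucket = request_ts // bucket_seconds * bucket_seconds
    return f"idem:campaign_{campaign_id}:{client_request_id}_{bucket}"

# Matches the example above:
assert idempotency_key(12345, "req_abc123", 1704067200) == \
    "idem:campaign_12345:req_abc123_1704067200"
```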
TTL Rationale (30 seconds):
- Too short (5s): Client retries beyond TTL window → double-debit
- Too long (5min): Memory waste, prevents legitimate repeat requests from same client
- 30s: Balances retry window coverage (typical client timeout: 5-15s, allows 2-3 retry attempts) with memory efficiency
Memory overhead:
- Key size: ~80 bytes
- Value size: ~20 bytes (debit amount + metadata)
- Total per key: ~100 bytes
- At 1M QPS with 0.1% retry rate: 1K keys/sec × 30s TTL = 30K active keys × 100 bytes = 3MB
- Negligible compared to Redis capacity
Why Lua Script is Critical:
Redis Lua scripts provide atomic execution guarantee - the foundation of idempotency protection.
Without Lua (separate GET + DECRBY operations), race conditions are inevitable:
- Thread A: GET key → not found
- Thread B: GET key → not found (race window - both threads see “not found”)
- Thread A: DECRBY budget
- Thread B: DECRBY budget (double-debit! - both threads deduct)
Lua script runs single-threaded in Redis, eliminating race conditions:
- Redis blocks all other operations while Lua script executes
- GET + DECRBY + SETEX become a single atomic transaction
- Industry standard: This pattern is used by Stripe, GitHub, Shopify for financial operations
Client-Side Requirements:
Idempotency requires client cooperation - the contract between client and server (a client-side retry sketch follows this list):
- Generate stable request IDs: the client must reuse the same ID for retries
  - Use a UUID v4 generated once per original request (the pattern used by Stripe, PayPal, and AWS)
  - Include it in every retry: the same request_id for all retries of the original request
  - Why stable IDs matter: a different ID on retry is treated as a new request, causing a double-debit
- Include the request ID in the API call:
  - HTTP header (recommended): `X-Request-ID: abc123-def456-ghi789` (a widely adopted convention)
  - Or in the request body: a `request_id` field in the JSON payload
  - The server must validate: reject requests with missing or malformed IDs in strict mode
- Retry policy with exponential backoff (prevents thundering herd):
  - 1st retry: 100ms + random jitter (0-50ms)
  - 2nd retry: 500ms + random jitter (0-250ms)
  - 3rd retry: 2s + random jitter (0-1s)
  - Max retries: 3 (total window ~3s, well within the 30s TTL)
  - Jitter prevents synchronized retries from multiple clients overwhelming the server
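A client-side sketch of this contract, assuming a plain HTTP JSON endpoint (the URL and payload shape are invented for illustration):

```python
import random
import time
import uuid

import requests  # any HTTP client works; requests is used here for brevity

AD_ENDPOINT = "https://ads.example.com/v1/serve"               # hypothetical endpoint
ATTEMPTS = [(0.0, 0.0), (0.1, 0.05), (0.5, 0.25), (2.0, 1.0)]  # (base delay, max jitter)

def request_ad(payload: dict) -> dict:
    request_id = str(uuid.uuid4())                    # generated once, reused on every retry
    headers = {"X-Request-ID": request_id}
    last_error = "no attempt made"
    for base, jitter in ATTEMPTS:                     # original attempt + 3 retries
        time.sleep(base + random.uniform(0, jitter))  # exponential backoff with jitter
        try:
            resp = requests.post(AD_ENDPOINT, json=payload, headers=headers, timeout=5)
            if resp.status_code < 500:                # success, or a non-retryable client error
                return resp.json()
            last_error = f"HTTP {resp.status_code}"
        except requests.RequestException as exc:      # timeout / connection reset: retry
            last_error = str(exc)
    raise RuntimeError(f"ad request {request_id} failed: {last_error}")
```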
Edge Cases and Failure Modes:
Real-world systems must handle imperfect clients and infrastructure failures:
Case 1: Client doesn't provide a request_id (legacy client or API misuse)
- Server-side fallback: generate a deterministic ID from a request hash (sketched below)
- Formula: `SHA256(campaign_id + user_id + ad_id + timestamp_bucket)`
- Behavior: prevents the same user clicking the same ad within a 30s window from producing duplicate debits
- Trade-off: different users clicking the same ad get different IDs (correct - these are genuinely different requests)
- Best practice: log missing-request-id events to track non-compliant clients
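A sketch of that server-side fallback (field names are assumptions):

```python
import hashlib

def fallback_request_id(campaign_id: int, user_id: str, ad_id: str,
                        request_ts: int, bucket_seconds: int = 30) -> str:
    """Deterministic stand-in for a missing X-Request-ID:
    SHA256(campaign_id + user_id + ad_id + timestamp_bucket)."""
    bucket = request_ts // bucket_seconds * bucket_seconds
    raw = f"{campaign_id}:{user_id}:{ad_id}:{bucket}".encode()
    return hashlib.sha256(raw).hexdigest()

# Two identical clicks inside one 30-second bucket collapse to the same key:
a = fallback_request_id(12345, "user_42", "ad_99", 1704067205)
b = fallback_request_id(12345, "user_42", "ad_99", 1704067214)
assert a == b
```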
Case 2: Redis key expires during the retry window (timing edge case)
- Scenario: the client retries more than 30s after the original request
- Frequency: rare - requires extreme network delays or a hanging client
- Behavior: treated as a new request; the budget is debited again
- Mitigation: log as `expired-key-retry` for the audit trail and monitor the frequency
- Acceptable risk: the client has already timed out by app standards (5-15s) and is unlikely to complete the transaction
- Industry precedent: Stripe's idempotency keys expire after 24 hours with the same behavior
Case 3: Redis unavailable (Failover scenario)
- Scenario: Redis cluster failover, network partition, or master election
- Impact: Idempotency protection temporarily unavailable (<5s typical failover time)
- Behavior: Requests processed without deduplication during failover window
- Consequences: Small window of potential double-debits
- Mitigation strategies:
- Monitor Redis availability, alert on failover events
- Circuit breaker: reject requests during known Redis outages, trading availability for correctness (a minimal breaker is sketched after this list)
- Post-hoc reconciliation: Detect duplicate transactions in audit trail, issue refunds
- Design decision: Accept <5s vulnerability window vs rejecting all traffic (99.9% availability = 43 minutes/month downtime acceptable)
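One of the mitigations above is a circuit breaker around the pacing call; a minimal, illustrative version (the thresholds are assumptions) could look like this:

```python
import time

class PacingCircuitBreaker:
    """Reject budget debits while Redis looks unhealthy, instead of silently
    skipping deduplication (trades availability for correctness)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 5.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False              # open: fail fast, serve no billable ad
            self.failures = 0             # half-open: let one probe request through
        return True

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures == self.failure_threshold:
            self.opened_at = time.monotonic()
```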
Monitoring:
Track idempotency metrics (a minimal instrumentation sketch follows the alert thresholds):
- Deduplication rate: `deduplicated_requests / total_requests` (expect ~0.1%, matching the retry rate)
- Key hit rate: percentage of requests that hit existing keys (should match the retry rate)
- Key expiry before use: keys that expire before a retry arrives (should be rare)
- Memory usage: active idempotency keys (should stay below 10MB)
Alerts:
- P1: Deduplication rate > 1% (abnormal retry rate, possible client bug or attack)
- P2: Key expiry rate > 5% (TTL too short, increase to 60s)
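A sketch of how these metrics could be exported with prometheus_client (the metric names are invented for illustration):

```python
from prometheus_client import Counter, Gauge

# Hypothetical metric names, emitted by the Atomic Pacing Service.
DEBITS = Counter("pacing_debits_total", "Budget debit outcomes", ["outcome"])
IDEM_KEYS = Gauge("pacing_idempotency_keys_active", "Idempotency keys currently in Redis")

def record_debit(outcome: str) -> None:
    """outcome is one of: debited, deduplicated, exhausted."""
    DEBITS.labels(outcome=outcome).inc()

# Deduplication rate for the P1 alert (> 1%), as a PromQL expression:
#   rate(pacing_debits_total{outcome="deduplicated"}[5m])
#     / rate(pacing_debits_total[5m])
```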
Industry Comparison: How This Matches Best Practices
Our idempotency design aligns with proven patterns from leading payment and financial platforms:
| Aspect | Our Design | Stripe | AWS | PayPal | Industry Best Practice |
|---|---|---|---|---|---|
| Request ID Source | Client-generated UUID | Client-generated UUID | Client-generated UUID | Client-generated UUID | Client-controlled |
| ID Header | X-Request-ID | Idempotency-Key | x-amz-idempotency-token | Custom header | HTTP header |
| Storage | Redis (30s TTL) | Database (24h TTL) | DynamoDB (1h TTL) | Database (24h) | Persistent store with TTL |
| Atomicity | Lua script | Database transaction | DynamoDB ConditionExpression | Database transaction | Atomic check-and-set |
| Scope | Per campaign | Per API key | Per request type | Per merchant | Scoped to prevent conflicts |
| Retry behavior | Return cached result | Return cached result (HTTP 200) | Return cached result | Return cached result | Idempotent response |
| TTL rationale | 30s (high-frequency) | 24h (low-frequency) | 1h (moderate) | 24h (low-frequency) | Context-dependent |
Why our TTL differs (30s vs industry’s 24h):
- Request frequency: Ad serving = 1M QPS vs payments = 1K QPS (1000× higher volume)
- Memory constraints: ~30K active keys vs ~86M keys (24h retention at our scale ≈ 8.6GB at the ~100 bytes/key estimate above)
- Use case: Real-time ad auctions complete in <3s vs payment settlement in hours/days
- Trade-off accepted: Small risk of late retries (>30s) vs memory efficiency at scale
Alternative approaches considered:
- Database-backed idempotency (Stripe's approach)
  - Pros: longer TTL (24h+), stronger durability guarantees
  - Cons: 10-15ms latency (violates our 5ms budget), poor scalability at 1M QPS
  - Decision: rejected - latency unacceptable for the critical path
- DynamoDB with conditional writes (AWS approach)
  - Pros: managed service, strong consistency, regional replication
  - Cons: 8ms p99 latency (vs Redis 3ms), higher cost ($1000/month vs Redis $200/month)
  - Decision: rejected - Redis is already deployed for budget counters; reuse existing infrastructure
- In-memory only, no persistence (dangerous pattern)
  - Pros: ultra-low latency (<1ms)
  - Cons: lost on server restart, no failover protection
  - Decision: rejected - violates financial integrity requirements
Why Redis + Lua is optimal for our use case:
- Already deployed for budget counters (infrastructure reuse)
- Sub-5ms latency fits critical path budget
- Atomic operations via Lua scripts (proven pattern)
- TTL-based cleanup (memory efficiency)
- Cluster mode supports 1M+ QPS
- Trade-off: Shorter TTL (30s) vs database approaches (24h), but acceptable for real-time auctions
Impact Assessment:
Without idempotency protection:
- Retry rate: 1M QPS × 0.1% = 1K retries/sec (typical under load)
- Even if only 10% of those retries slip through as double-debits: 100 billing errors/sec (~8.6M per day)
- Billing accuracy impact: roughly 0.01% of requests systematically billed twice - charges for impressions the advertiser never received
- Consequence: unlike bounded pacing inaccuracy, this error is systematic, scales with the retry rate, and concentrates on whichever campaigns see retries - catastrophic for financial integrity and payment processor compliance
With idempotency protection:
- Double-debits prevented: 100% of retry-induced billing errors eliminated
- Implementation overhead: ~3MB Redis memory + 0.5ms latency (30s TTL × 1K keys/sec)
- Operational cost: Negligible - adds 10% to existing Redis footprint
- Business value: Prevents systematic billing violations that would be catastrophic for advertiser trust and payment processor compliance
ROI: Infinite - The implementation cost (minimal Redis overhead) is negligible compared to preventing systematic financial integrity violations that could result in platform-wide advertiser churn and regulatory liability.
Conclusion:
The Bounded Micro-Ledger architecture achieves the “impossible trinity” of:
- Low latency (5ms budget check)
- Financial accuracy (mathematically proven $5 max overspend + idempotency protection against double-debits)
- High throughput (1M+ QPS)
Critical addition: Idempotency protection is non-negotiable for production deployment. Without it, the system violates financial integrity guarantees during routine failure scenarios (crashes, retries, network issues).
This is the only viable architecture for real-time budget pacing at scale while maintaining financial integrity.
Summary: Data Consistency Meets Revenue Optimization
This post explored the three critical data systems that enable real-time ad platforms to serve 1M+ QPS with sub-150ms latency while maintaining financial accuracy: distributed caching for fast reads, eCPM-based auctions for fair price comparison, and atomic budget control for spend accuracy.
Three Critical Systems:
1. Distributed Caching Architecture
Problem: Serve 1M QPS without overwhelming the database
Solution: Two-tier cache architecture with database fallback
| Layer | Technology | Latency | Use Case | Share of Requests |
|---|---|---|---|---|
| L1 Cache | Caffeine (in-process) | 0.001ms | Hot user profiles | 60% |
| L2 Cache | Valkey (distributed) | 5ms | Warm data, feature vectors | 25% |
| Database | CockroachDB | 20ms | Source of truth (cache miss) | 15% |
Key decisions:
- Cache-aside pattern: the application controls caching (vs cache-through) - sketched below
- TTL-based invalidation: 5min for profiles, 1hour for features (vs event-driven)
- Write-through for financial data: budget updates bypass the cache and hit the database first
- Read-heavy optimization: 95% read, 5% write workload
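A minimal cache-aside read path, sketched in Python with cachetools standing in for Caffeine's L1 role (key names, TTLs, and the database call are illustrative):

```python
import json

import redis
from cachetools import TTLCache       # stand-in for Caffeine in this Python sketch

l1 = TTLCache(maxsize=100_000, ttl=300)        # in-process, 5-minute TTL
l2 = redis.Redis(host="localhost", port=6379)  # Valkey speaks the Redis protocol

def load_profile_from_db(user_id: str) -> dict:
    # Placeholder for the CockroachDB lookup (source of truth on a cache miss).
    return {"user_id": user_id}

def get_user_profile(user_id: str) -> dict:
    if user_id in l1:                                   # L1 hit (~60% of requests)
        return l1[user_id]
    cached = l2.get(f"profile:{user_id}")               # L2 hit (~25%)
    if cached is not None:
        profile = json.loads(cached)
    else:                                               # miss (~15%): go to the database
        profile = load_profile_from_db(user_id)
        l2.setex(f"profile:{user_id}", 300, json.dumps(profile))
    l1[user_id] = profile                               # populate L1 on the way out
    return profile
```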
Performance impact:
- 85% cache hit rate (L1: 60% + L2: 25%)
- 15% of requests query the database (cache miss)
- Avg latency: \(0.60 \times 0.001\text{ms} + 0.25 \times 5\text{ms} + 0.15 \times 20\text{ms} = 4.25\text{ms}\)
- vs database-only: ~40-60ms average
- A roughly 10-14× latency reduction, which is what makes the sub-10ms budget for User Profile and Feature Store lookups achievable
2. Auction Mechanism Design
Problem: Compare a $10 CPM bid with a $0.50 CPC bid - which is worth more?
Solution: eCPM normalization using CTR prediction (a small normalization helper is sketched after the example below)
eCPM formula:
$$ \begin{array}{ll} \text{CPM bid:} & eCPM = \text{CPM (direct)}\\ \text{CPC bid:} & eCPM = \text{CPC} \times \text{CTR} \times 1000 \\ \text{CPA bid:} & eCPM = \text{CPA} \times \text{conversion rate} \times \text{CTR} \times 1000 \end{array} $$
Example:
- Ad A: $10 CPM → eCPM = $10
- Ad B: $0.50 CPC, predicted CTR = 2% → eCPM = $0.50 × 0.02 × 1000 = $10
- Fair competition: Both have equal expected revenue
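The normalization reduces to a few lines; this helper and the assertions mirror the example above (the function name and signature are illustrative):

```python
def ecpm(pricing_model: str, bid: float, ctr: float = 0.0, cvr: float = 0.0) -> float:
    """Normalize a bid to eCPM: expected revenue per 1,000 impressions."""
    if pricing_model == "CPM":
        return bid
    if pricing_model == "CPC":
        return bid * ctr * 1000
    if pricing_model == "CPA":
        return bid * cvr * ctr * 1000
    raise ValueError(f"unknown pricing model: {pricing_model}")

assert ecpm("CPM", 10.0) == 10.0                         # Ad A
assert abs(ecpm("CPC", 0.50, ctr=0.02) - 10.0) < 1e-9    # Ad B: fair tie with Ad A
```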
Auction type decision: First-Price
- Simplicity: Winner pays their bid (vs second-price complexity)
- Transparency: Advertisers see exact costs
- Revenue: DSPs bid conservatively, but combined with ML-scored internal inventory, captures full value
- Industry trend: Programmatic advertising moved from second-price to first-price (2017-2019)
Latency: 3ms for the auction logic (ranking only; the budget check has its own separate 3ms budget)
3. Budget Pacing: Bounded Micro-Ledger
Problem: Prevent budget overspend across 300 distributed ad servers without centralizing every spend decision
Solution: Bounded Micro-Ledger with Redis atomic counters (detailed in Budget Pacing: Distributed Spend Control)
Core Architecture:
- Pre-allocation: daily budget → proportional hourly amounts allocated to Redis counters (sketched below)
- Atomic deduction: `DECRBY campaign:123:budget <cost>` (5ms p99)
- Idempotency: a Redis cache of request IDs prevents double-debits during retries
- Reconciliation: every 10min, compare Redis totals against the CockroachDB source of truth
- Bounded overspend: mathematical guarantee of ≤0.1% per campaign (≤1% aggregate)
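A sketch of the pre-allocation step, splitting a daily budget into hourly Redis allocations in micro-dollars (the traffic curve and key layout are assumptions):

```python
def hourly_allocations(daily_budget_usd: float, hourly_traffic_share: list[float]) -> list[int]:
    """Split a daily budget into 24 per-hour allocations (micro-dollars),
    proportional to the expected traffic share of each hour."""
    assert len(hourly_traffic_share) == 24
    assert abs(sum(hourly_traffic_share) - 1.0) < 1e-6
    return [round(daily_budget_usd * share * 1_000_000) for share in hourly_traffic_share]

# Flat traffic curve for a $1,200/day campaign → $50 (50,000,000 micro-dollars) per hour.
allocs = hourly_allocations(1200.0, [1 / 24] * 24)
# At the top of each hour, the pacer would SET campaign:{id}:budget to that hour's allocation
# and let the reconciliation job true it up against CockroachDB.
```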
Why this works:
- No centralized bottleneck: Redis is distributed across regions
- Atomic operations: DECRBY prevents race conditions
- Low latency: 3ms avg, 5ms p99 (vs 50-100ms for distributed transactions)
- Financial accuracy: mathematically proven bounds using two complementary models:
  - Configuration model: the `inaccuracy_bound` parameter (e.g., $5) in the Lua script
  - Behavioral model: in-flight requests (150 requests × $0.005 = $0.75 typical overspend)
Performance Impact:
| Metric | Without Budget Pacing | With Bounded Micro-Ledger | Improvement |
|---|---|---|---|
| Latency | Centralized DB check (50-100ms) | Redis atomic counters (3ms avg, 5ms p99) | 17-30× faster |
| Overspend | Unbounded (race conditions) | ≤0.1% per campaign (mathematical guarantee) | Bounded |
| Availability | Single point of failure | Distributed Redis (multi-region) | No bottleneck |
Key Trade-offs:
- Redis over Memcached: +30% memory cost → atomic DECRBY prevents race conditions
- Idempotency cache: +0.5ms latency, +500MB Redis → eliminates 100 billing errors/sec
- Pre-allocation: +10min reconciliation overhead → enables distributed 3ms spend checks
- Bounded inaccuracy: Accept ≤1% variance → avoid 50-100ms centralized DB latency
See the detailed implementation for the Lua scripts, reconciliation algorithms, idempotency protection, and mathematical proofs.