Caching, Auctions & Budget Control: Revenue Optimization at Scale
Introduction: Where Data Meets Revenue
Real-time ad platforms operate under extreme constraints: serve 1M+ queries per second, respond in under 150ms, run ML inference and external auctions, and maintain perfect financial accuracy. The revenue engine (RTB + ML inference) generates the bids, but three critical data systems determine whether the platform succeeds or fails:
The three data challenges that make or break ad platforms:
- Cache performance: Can we serve 1M QPS without overwhelming the database?
  - Problem: Database reads take 40-60ms. At 1M QPS, that’s 40-60K concurrent DB connections.
  - Constraint: Only 10ms latency budget for user profile and feature lookups
  - Solution needed: Multi-tier caching with 85%+ cache hit rate (only 15% query database)
- Auction fairness: How do we compare CPM bid with CPC bid - which is worth more?
  - Problem: Different pricing models (CPM/CPC/CPA) aren’t directly comparable
  - Constraint: Must rank all ads fairly to maximize revenue
  - Solution needed: eCPM normalization using predicted CTR
- Budget accuracy: How do we prevent overspend across 300 distributed ad servers?
  - Problem: Each server independently serves ads, but budgets must be enforced globally
  - Constraint: Can’t centralize every spend decision (creates bottleneck + latency)
  - Solution needed: Distributed atomic counters with proven accuracy bounds
Why these systems are interdependent:
Every ad request follows this critical path:
- User profile lookup (10ms budget) → ML features → CTR prediction
- ML features lookup (10ms budget) → CTR prediction → eCPM calculation
- Auction logic (3ms budget) → rank all ads by eCPM → select winner
- Budget check (3ms budget) → atomic deduction → confirm spend allowed
Miss any of these and revenue suffers:
- Slow caching (>10ms) → violate latency budget → timeouts → blank ads
- Unfair auctions → suboptimal ad selection → leave 15-25% revenue on table
- Budget overspend → advertiser complaints → legal liability → platform shutdown
What this post covers:
This post builds the three data systems that enable revenue optimization:
- Distributed Caching Architecture - L1/L2 cache tiers with intelligent invalidation strategies. Achieving 85% cache hit rate with 4.25ms average latency (only 15% of requests query the database). Technology choices: Caffeine (L1 in-process), Valkey (L2 distributed), CockroachDB (persistent database). Trade-offs between consistency, latency, and cost.
- Auction Mechanism Design - eCPM normalization for fair comparison across CPM/CPC/CPA pricing models. First-price vs second-price auction analysis. Why first-price auctions won in modern programmatic advertising (2017-2019 industry shift). How predicted CTR converts CPC bids into comparable eCPM for ranking.
- Distributed Budget Pacing - Bounded Micro-Ledger architecture using Redis atomic counters (DECRBY). Mathematical proof of ≤1% budget overspend guarantee. Why idempotency protection is non-negotiable for financial integrity. Pre-allocation pattern that eliminates centralized bottleneck while maintaining accuracy.
Broader applicability:
These patterns - multi-tier caching, fair comparison across heterogeneous inputs, distributed atomic operations with bounded error - apply beyond ad tech. High-throughput systems with strict latency budgets and financial accuracy requirements face similar challenges:
- E-commerce inventory management (prevent overselling)
- Trading platforms (fair order execution across order types)
- Rate limiting systems (distributed quota enforcement)
- Gaming platforms (virtual currency spend control)
The core insight is how these three systems integrate to deliver both speed (sub-10ms data access) and accuracy (≤1% financial variance) at massive scale (1M+ QPS).
Let’s explore how each system is designed and how they work together.
Distributed Caching Architecture
Multi-Tier Cache Hierarchy
To achieve high cache hit rates with sub-10ms latency, implement two cache tiers plus the database (target: 85% combined cache hit rate, with L2 serving ~25% of requests, so only 15% of requests query the database):
Technology Selection: Cache Tier Choices
L1 Cache Options:
| Technology | Latency | Throughput | Memory | Pros | Cons |
|---|---|---|---|---|---|
| Caffeine (JVM) | ~1μs | 10M ops/sec | In-heap | Window TinyLFU eviction, lock-free reads | JVM-only, GC pressure |
| Guava Cache | ~1.5μs | 5M ops/sec | In-heap | Simple API, widely used | LRU only, lower hit rate |
| Ehcache | ~1.5μs | 8M ops/sec | In/off-heap | Off-heap option reduces GC | More complex configuration |
Decision: Caffeine - Superior eviction algorithm (Window TinyLFU) yields 10-15% higher hit rates than LRU-based alternatives. Benchmarks show ~2x throughput vs. Guava.
L2 Cache: Redis vs Memcached
The L2 cache choice came down to one requirement: atomic operations for budget counters. Memcached is faster (3ms vs 5ms p99) and cheaper (~30% less memory), but its counters are limited to basic unsigned incr/decr - there is no Lua scripting or multi-key atomicity, so the conditional check-and-deduct logic that budget counters require can’t be executed as a single atomic operation. Without that, budget counters would have race conditions - multiple servers could allocate from stale budget values, causing unbounded over-delivery.
Redis also gives us:
- Rich data structures (sorted sets for ad recency, hashes for attributes)
- Persistence for crash recovery (avoids cold cache startup)
- Lua scripting for complex operations
The 30% memory premium over Memcached is worth it to avoid budget race conditions. Hazelcast (8ms latency) was too slow to consider seriously.
Valkey Alternative (Redis Fork):
In 2024, Redis Labs changed licensing from BSD to dual-license (SSPL + proprietary), creating uncertainty for commercial users. The Linux Foundation forked Redis into Valkey with permissive BSD-3 license:
- API-compatible: Drop-in replacement for Redis
- Clear licensing: BSD-3 (no SSPL restrictions)
- Industry backing: AWS, Google Cloud, Oracle backing Linux Foundation project
- Migration path: AWS ElastiCache transitioning to Valkey
Recommendation: Use Valkey for new deployments to avoid licensing ambiguity. Migration from Redis is trivial (same protocol, same commands, same performance).
L3 Persistent Store Options:
Note: Write throughput numbers reflect cluster-level performance at production scale (20-80 nodes for distributed databases). Single-node performance is 5-20K writes/sec (SSD RAID10, 32GB RAM) depending on workload characteristics.
| Technology | Read Latency (p99) | Write Throughput (cluster-level) | Scalability | Consistency | Cross-Region ACID | HLC Built-in | Pros | Cons |
|---|---|---|---|---|---|---|---|---|
| CockroachDB | 10-15ms | 400K writes/sec (60-80 nodes) | Horizontal (Raft) | Serializable | Yes | Yes | SQL, JOINs, multi-region transactions | Operational complexity (self-hosted) |
| YugabyteDB | 10-15ms | 400K writes/sec (60-80 nodes) | Horizontal (Raft) | Serializable | Yes | Yes | PostgreSQL-compatible | Smaller ecosystem |
| Cassandra | 20ms | 500K writes/sec (100+ nodes) | Linear (peer-to-peer) | Tunable (eventual) | No | No | Multi-DC, mature | No JOINs, eventual consistency |
| PostgreSQL | 15ms | 50K writes/sec (single node) | Vertical + sharding | ACID | No | No | SQL, JOINs, strong consistency | Manual sharding complex |
| DynamoDB | 10ms | 1M writes/sec (auto-scaled) | Fully managed | Strong per-region MRSC (2024) | No | No | Auto-scaling, fully managed | No cross-region transactions, no JOINs, NoSQL limitations |
Why CockroachDB
The persistent store must handle 400M user profiles (4TB+) with strong consistency for billing data. While Cassandra offers higher write throughput (500K vs 400K writes/sec) and battle-tested scale, eventual consistency is problematic for financial data and would require custom HLC implementation, reconciliation logic, and auditor explanations.
CockroachDB advantages:
- Serializable ACID transactions (financial accuracy requirement)
- Built-in HLC for timestamp ordering across regions
- Multi-region geo-partitioning with quorum writes
- Full SQL + JOINs (vs learning CQL)
- Better read latency: 10-15ms vs Cassandra’s 20ms
Why Not DynamoDB?
Despite being fully managed and highly scalable, DynamoDB lacks critical features for our financial accuracy requirements:
- No cross-region ACID transactions: DynamoDB’s 2024 MRSC feature provides strong consistency for reads/writes within each region, but transactions (`TransactWriteItems`) only work within a single region. Budget enforcement requires atomic operations across user profiles + campaign ledger + audit log - this cannot be guaranteed across regions.
- No HLC or causal ordering: DynamoDB uses “last writer wins” conflict resolution based on internal timestamps. Without HLC, we can’t guarantee causal ordering across regions for financial audit trails. Example failure: Budget update in us-east-1 and spend deduction in eu-west-1 arrive out-of-order, causing temporary overspend that violates financial accuracy constraints.
- NoSQL limitations: No SQL JOINs, no complex queries. Ad selection queries like “find all active campaigns for advertiser X targeting users in age group Y with budget remaining > Z” require multiple round-trips and application-level joins, adding latency and complexity.
- Schema evolution complexity: Requires dual-write patterns and application-level migration logic. CockroachDB supports online schema changes (`ALTER TABLE` without blocking).
DynamoDB is excellent for:
- Workloads that don’t require cross-region transactions
- Key-value access patterns without complex queries
- Teams prioritizing operational simplicity over feature requirements
Alternatives:
- YugabyteDB: Similar architecture, PostgreSQL-compatible. CockroachDB chosen for slightly more mature multi-region tooling.
- PostgreSQL: Doesn’t scale horizontally without manual sharding. Citus adds complexity without HLC or native multi-region support.
- Google Spanner: Provides TrueTime for global consistency, but is only available on Google Cloud (TrueTime depends on Google’s proprietary atomic-clock/GPS infrastructure) and is more expensive than CRDB Serverless.
Database cost comparison at 8B requests/day (Nov 2024 pricing):
| Database Option | Relative Cost | Operational Model | Trade-offs |
|---|---|---|---|
| DynamoDB | 100% (baseline) | Fully managed (AWS) | No cross-region transactions, NoSQL limitations, vendor lock-in |
| CockroachDB Serverless | 80-100% of DynamoDB | Fully managed (Cockroach Labs) | Pay-per-use, auto-scaling, same features as self-hosted |
| CockroachDB Dedicated | 60-80% of DynamoDB | Managed by Cockroach Labs | Reserved capacity, SLAs, predictable pricing |
| CockroachDB Self-Hosted | 40-50% of DynamoDB (infra only) | Self-managed | Lowest infra cost, requires dedicated ops team (cost varies by geography/expertise) |
| PostgreSQL (sharded) | 30-40% of DynamoDB (infra only) | Self-managed | No native multi-region, complex sharding, no HLC |
Note: AWS reduced DynamoDB on-demand pricing by 50% in November 2024, significantly improving its cost competitiveness. CockroachDB Dedicated still offers savings, but the gap narrowed considerably.
Key insight: CockroachDB Dedicated provides 20-40% cost savings over DynamoDB while maintaining full feature parity (cross-region transactions, HLC, SQL) without operational overhead. Serverless pricing is now comparable to DynamoDB due to recent AWS price reductions. Self-hosted CockroachDB provides 50-60% savings (2-2.5× cheaper) but requires operational expertise.
Decision Framework: Avoiding “Spreadsheet Engineering”
The comparison above shows infrastructure costs only. Here’s the complete decision framework:
For most teams (< 5B requests/day): Choose CockroachDB Dedicated or DynamoDB
Reasons:
- CockroachDB Dedicated: 20-40% cheaper than DynamoDB, full feature parity (cross-region transactions, HLC, SQL), zero operational overhead
- DynamoDB: Fully managed by AWS, simpler for teams without SQL expertise, trade off features for operational simplicity
- Both options avoid self-hosting complexity
For high-scale teams: Self-Hosted Break-Even Analysis
Self-hosted becomes economically viable when infrastructure savings exceed operational costs. The break-even point varies significantly based on team structure and geography.
Break-even formula:
$$\text{Break-even requests per year} = \frac{\text{Annual SRE Cost}}{\text{Cost Savings per Request}}$$
(divide by 365 to get the break-even daily request volume used in the table below)
Example calculation at 8B requests/day:
- DynamoDB: 100% baseline cost (reference pricing from AWS)
- CRDB self-hosted: ~44% of DynamoDB cost (60 compute nodes)
- Infrastructure savings: ~56% vs managed database
Operational cost scenarios:
Define SRE cost baseline as 1.0× = fully loaded senior SRE in high-cost region (California/NYC/Seattle).
| Team Structure | Annual SRE Cost (relative) | Break-Even Daily Requests | Notes |
|---|---|---|---|
| US Team: 3-5 SREs | 3.0-5.1× baseline | 20-30B req/day | High-cost regions: California, NYC, Seattle |
| Global Team: 2-3 SREs | 1.1-1.8× baseline | 8-12B req/day | Mixed US/Eastern Europe, leveraging time zones |
| Regional Team: 2 SREs | 0.5-0.9× baseline | 4-8B req/day | Eastern Europe/India/LatAm rates, experienced engineers |
| Existing Expertise: +1 SRE | 0.35-0.7× baseline | 2-5B req/day | Marginal cost when team already has database expertise |
Key variables affecting break-even:
- Geographic SRE costs: 0.18-0.55× baseline (non-US regions) vs 1.0× baseline (US high-cost)
- Team efficiency: 1-2 experienced SREs with automation vs 3-5 without
- Existing expertise: If team already operates databases, marginal cost is lower
- Tooling maturity: CockroachDB Dedicated (managed but self-deployed) vs full self-hosted
When self-hosted may make sense:
- Infrastructure savings exceed your specific operational costs (calculate with formula above)
- Team has existing database operations expertise (reduces marginal cost significantly)
- Mature operational practices already in place (monitoring, automation, runbooks)
- Geographic arbitrage possible (distributed team, non-US talent)
When managed options are preferred:
- Early stage (operational risk > cost savings)
- Small team without dedicated ops capacity
- Rapid growth phase (operational complexity compounds)
- Cost savings don’t justify hiring/training database specialists
Why DynamoDB remains a valid choice despite limitations:
For workloads that don’t require:
- Cross-region ACID transactions
- Complex SQL queries
- Causal ordering guarantees
DynamoDB’s operational simplicity (zero management) may outweigh feature limitations. Many ad tech companies successfully use DynamoDB by:
- Keeping transactions within single region
- Using application-level consistency checks
- Accepting eventual consistency trade-offs
Our choice: CockroachDB Serverless for Day 1, evaluate self-hosted only if we reach 15-25B+ requests/day with dedicated ops team.
graph TB
subgraph "Request Flow"
REQ[Cache Request
user_id: 12345]
end
subgraph "L1: In-Process Cache"
L1[Caffeine JVM Cache
10-second TTL
1μs lookup
100MB per server]
L1_HIT{Hit?}
L1_STATS[L1 Statistics
Hit Rate: 60%
Avg Latency: 1μs]
end
subgraph "L2: Distributed Cache"
L2[Redis Cluster
30-second TTL
5ms lookup
800GB usable capacity]
L2_HIT{Hit?}
L2_STATS[L2 Statistics
Hit Rate: 25%
Avg Latency: 5ms]
end
subgraph "L3: Persistent Store"
L3[CockroachDB Cluster
Multi-Region ACID
10-15ms read
Strong Consistency]
L3_STATS[L3 Statistics
Hit Rate: 15%
Avg Latency: 12ms]
end
subgraph "Hot Key Detection"
MONITOR[Stream Processor
Kafka Streams
Count-Min Sketch]
REPLICATE[Dynamic Replication
3x copies for hot keys]
end
REQ --> L1
L1 --> L1_HIT
L1_HIT -->|60% Hit| RESP1[Response
~1μs]
L1_HIT -->|40% Miss| L2
L2 --> L2_HIT
L2_HIT -->|25% Hit| POPULATE_L1[Populate L1]
POPULATE_L1 --> RESP2[Response
~5ms]
L2_HIT -->|15% Miss| L3
L3 --> POPULATE_L2[Populate L2 + L1]
POPULATE_L2 --> RESP3[Response
~20ms]
L2 -.->|0.1% sampling| MONITOR
MONITOR -.->|Detect hot keys| REPLICATE
REPLICATE -.->|Replicate to nodes| L2
subgraph "Overall Performance"
PERF[Cache Hit Rate: 85%
Average Latency: 4.25ms
p99 Latency: 25ms]
end
classDef cache fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef source fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef monitor fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
class L1,L2 cache
class L3 source
class MONITOR,REPLICATE monitor
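To make the read-through flow above concrete, here is a minimal sketch in Python. It is illustrative only: the production L1 is Caffeine on the JVM and L2 is a Valkey cluster, so `cachetools.TTLCache` stands in for both tiers and `fetch_from_db` is a placeholder for the CockroachDB read.

```python
from cachetools import TTLCache

# Stand-ins for the real tiers: L1 = Caffeine (10s TTL), L2 = Valkey (30s TTL).
l1_cache = TTLCache(maxsize=100_000, ttl=10)
l2_cache = TTLCache(maxsize=10_000_000, ttl=30)

def fetch_from_db(user_id: str) -> dict:
    # Placeholder for a CockroachDB read (~10-20ms in production).
    return {"user_id": user_id, "interests": ["tech"], "age_range": "25-34"}

def get_profile(user_id: str) -> dict:
    profile = l1_cache.get(user_id)          # ~1us; ends here for ~60% of requests
    if profile is not None:
        return profile
    profile = l2_cache.get(user_id)          # ~5ms network hop in production
    if profile is None:
        profile = fetch_from_db(user_id)     # ~20ms; the ~15% miss path
        l2_cache[user_id] = profile          # populate L2 on the way back
    l1_cache[user_id] = profile              # always repopulate L1
    return profile

print(get_profile("12345"))  # first call misses both tiers; repeats hit L1
```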
GDPR Right-to-Deletion Implementation
Architectural Driver: Legal Compliance - GDPR Article 17 mandates deletion within 30 days, but industry practice expects 7-14 days. With user data distributed across CockroachDB, Valkey, S3 Parquet files, and ML model weights, deletion requires a coordinated three-step workflow.
Regulatory context: GDPR Article 17 “Right to Erasure” requires organizations to delete personal data “without undue delay” - interpreted as 30 days maximum by regulators, but major platforms (Google, Meta) complete deletions in 7-14 days, setting user expectations higher than legal minimums.
Technical challenge: User data doesn’t live in one database - it’s distributed across operational stores, caches, cold storage, and ML models. Deleting from all locations requires coordinating multiple systems with different deletion mechanisms.
Data Distribution Challenge
Where User Data Lives:
a. Operational Databases (CockroachDB)
- User profiles: Demographics (age range, gender), interests (sports, tech, travel), browsing history
- Billing events: Impression logs, click logs (includes `user_id` for attribution)
- Storage: 400M user profiles × 10KB = 4TB
- Deletion mechanism: SQL DELETE or UPDATE to null all fields (tombstone approach)
b. Cache Layers (Valkey + Caffeine)
- L1 (in-process Caffeine): 300 Ad Server instances, 100MB each = 30GB total
- L2 (distributed Valkey): 20 nodes, 800GB usable capacity
- Data: Cached copies of user profiles from CockroachDB (same data, faster access)
- Deletion mechanism: Cache invalidation (pub/sub + direct DEL commands)
c. Data Lake (S3 Parquet Files)
- Historical analytics: Compressed Parquet with millions of users per file
- Volume: 500TB+ daily data × 7-year retention (regulatory requirement)
- Challenge: Immutable files - can’t delete single row from 100GB Parquet file
- Deletion mechanism: Either Parquet rewrite (expensive) or tombstone markers (less compliant)
d. ML Training Data
- Model weights: User data embedded in trained GBDT models (CTR prediction from Part 2)
- Feature Store: Historical features from user behavior (1-hour click rate, 7-day CTR)
- Challenge: Retraining computationally expensive, individual user contributes ~0.00000025% to model (1 / 400M users)
- Deletion mechanism: Either retrain (impractical) or aggregate defense (legal interpretation)
Step 1: Real-Time Deletion (< 1 Hour)
Goal: Stop serving user data immediately after deletion request
a. Mark User as Deleted in CockroachDB
Deletion strategy: Tombstone approach - mark as deleted and nullify personal fields, keeping non-personal audit data.
Database operation (conceptual example - production tables may have different schemas):
-- Idea: Keep audit trail, nullify personal data
UPDATE user_profiles
SET deleted_at = NOW(), -- Mark deletion timestamp
demographics = NULL, -- Remove personal field
interests = NULL, -- Remove personal field
browsing_history = NULL -- Remove personal field
-- Keep: user_id (pseudonymous identifier)
-- Keep: created_at, account_tier (non-personal audit fields)
WHERE user_id = 'xxx';
Why this approach:
- `deleted_at` column acts as deletion marker (queries can filter `WHERE deleted_at IS NULL`)
- Personal fields (`demographics`, `interests`, `browsing_history`) are nullified per GDPR requirements
- Non-personal fields (`user_id`, `created_at`, `account_tier`) remain for audit trail and foreign key integrity
- `user_id` itself is a pseudonymous hash, not personally identifiable once associated personal data is removed
Real schema note: Actual production tables may have 50-100+ columns. The key principle: nullify all columns containing personal data (PII), keep system fields needed for audit, billing reconciliation, and referential integrity.
Latency: 10-15ms (single database write with strong consistency)
b. Invalidate All Cache Tiers
L1 Caffeine Cache Invalidation:
- Mechanism: Pub/sub message to all 300 Ad Server instances
- Message content: `{"event": "user_deleted", "user_id": "xxx"}`
- Each instance executes: `cache.invalidate(user_id)`
- Propagation time: < 60 seconds (message delivery + processing across 300 instances)
L2 Valkey Cache Invalidation:
- Operation: `DEL user:xxx:profile`
- Effect: Immediate removal from distributed cache
- Latency: < 1ms (Redis/Valkey DEL operation)
Why pub/sub for L1, direct DEL for L2:
- L1 is in-process (no network access from central service), requires messaging pattern
- L2 is networked (central deletion service can directly execute DEL command)
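A sketch of the two invalidation paths using redis-py (the same client protocol works against Valkey). The channel name, key format, and host are assumptions mirroring the description above.

```python
import json
import redis  # redis-py works against both Redis and Valkey

r = redis.Redis(host="valkey.internal", port=6379)  # assumed endpoint

def invalidate_user_caches(user_id: str) -> None:
    # L2 (Valkey): the deletion service has network access, so delete directly.
    r.delete(f"user:{user_id}:profile")
    # L1 (Caffeine, in-process on 300 Ad Servers): no direct access, so broadcast
    # an event; each instance's subscriber calls cache.invalidate(user_id) locally.
    r.publish(
        "cache-invalidation",
        json.dumps({"event": "user_deleted", "user_id": user_id}),
    )
```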
c. Add to Deletion Tombstone List
Bloom Filter Implementation:
- Data structure: `deleted_users` Bloom filter (10M capacity, 0.1% false positive rate)
- Storage: Valkey (replicated across all regions)
- Check on every request: If `user_id` is in `deleted_users` → return error (block ad serving)
- Update frequency: Bloom filter updated immediately on deletion (async replication to all nodes)
Why Bloom filter:
- Fast membership check: O(1), ~100 CPU cycles (sub-microsecond)
- Memory efficient: 10M users = 18MB (14.378 bits per item with 0.1% FPR)
- Acceptable false positive: 0.1% incorrectly flagged as deleted (resolved by a CockroachDB lookup that confirms actual deletion status)
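The sizing figures above follow directly from the standard Bloom filter formulas; a quick check (function name and output formatting are just for illustration):

```python
import math

def bloom_filter_sizing(n_items: int, false_positive_rate: float):
    # Optimal bits per item: m/n = -ln(p) / (ln 2)^2; optimal hash count: k = (m/n) * ln 2
    bits_per_item = -math.log(false_positive_rate) / (math.log(2) ** 2)
    total_mb = n_items * bits_per_item / 8 / 1e6
    hash_functions = round(bits_per_item * math.log(2))
    return bits_per_item, total_mb, hash_functions

bits, mb, k = bloom_filter_sizing(10_000_000, 0.001)
print(f"{bits:.3f} bits/item, {mb:.1f} MB, {k} hash functions")
# -> 14.378 bits/item, ~18.0 MB, 10 hash functions
```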
Result: User data no longer served within 1 hour (Caffeine cache TTL = 10 seconds, but propagation across 300 instances takes up to 60 seconds)
GDPR compliance: “Without undue delay” satisfied (1 hour is acceptable, regulators expect days not hours)
Deletion Workflow Diagram:
graph TB
REQUEST[User Deletion Request
GDPR Article 17]
subgraph "Step 1: Real-Time (< 1 Hour)"
DB[CockroachDB
SET deleted_at=NOW, data=NULL]
L1[L1 Cache Invalidation
Pub/sub to 300 instances]
L2[L2 Cache Invalidation
DEL user:xxx:profile]
BLOOM[Add to Bloom Filter
deleted_users]
end
subgraph "Step 2: Batch Deletion (7-30 Days)"
TIER1[Tier 1: 0-90 days
Parquet rewrite
True deletion]
TIER2[Tier 2: 90d-2yr
Tombstone markers
Pseudonymization]
TIER3[Tier 3: 2+ years
S3 object delete
Glacier cleanup]
end
subgraph "Step 3: ML Training Data"
AGGREGATE[Aggregate Defense
Do NOT retrain
Legal: < 0.0001% contribution]
end
subgraph "Audit Trail"
LOG[Immutable Deletion Log
CockroachDB append-only
7-year retention]
end
REQUEST --> DB
REQUEST --> L1
REQUEST --> L2
REQUEST --> BLOOM
DB --> TIER1
DB --> TIER2
DB --> TIER3
DB --> AGGREGATE
REQUEST --> LOG
style DB fill:#ffcccc
style BLOOM fill:#ffdddd
style AGGREGATE fill:#ffffcc
style LOG fill:#e6ffe6
Step 2: Batch Deletion (7-30 Days)
Goal: Purge historical data from data lake
Challenge: Parquet Immutability
Parquet format characteristics:
- Columnar storage: Data organized by columns for analytics (not rows)
- Compressed: 5-10× compression ratio (100GB uncompressed → 10-20GB Parquet)
- Immutable: Once written, cannot modify (append-only design)
- Cannot delete single row: Must rewrite entire file to exclude one user
Options: Rewrite vs Tombstone
Option A: Tombstone Markers (Preferred for Cost)
Concept: Instead of physically deleting data from immutable Parquet files, maintain a separate “deletion marker” table and filter deleted users at query time.
Implementation:
The pattern is straightforward: maintain a compact deleted_users table (in CockroachDB) that stores (user_id, deleted_at, deletion_request_id) tuples. When a deletion request arrives, insert a marker row. Historical Parquet files in S3 remain unchanged—no expensive rewrites needed.
Query-time filtering: Analytics queries join against the deletion marker table to exclude deleted users. For example, a LEFT OUTER JOIN with a WHERE deleted_users.user_id IS NULL clause filters out any user who has a deletion marker. Production pipelines encapsulate filtering in views/CTEs (best practice) so every query doesn’t repeat the JOIN logic:
- Implement partition pruning by comparing `deletion_date` vs `partition_date` to skip entire files when users were deleted before the data was collected
- Use Bloom filters for fast “probably not deleted” checks before expensive JOINs
This approach balances GDPR compliance (data becomes inaccessible in analytics) with cost efficiency (no Parquet rewrites).
The key principle: Query-time filtering via JOIN against deletion marker table, not physical deletion from Parquet.
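As a sketch of that query-time filter, here is the anti-join pattern in PySpark. The S3 prefixes and dataset names are illustrative (the marker table would be exported from CockroachDB), and production pipelines would wrap this in a shared view as described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-filtered-analytics").getOrCreate()

# Immutable historical Parquet stays in place; only the small marker table changes.
impressions = spark.read.parquet("s3://ads-platform-us-east-1/impressions/")      # billions of rows
deleted_users = spark.read.parquet("s3://ads-platform-us-east-1/deleted_users/")  # thousands of rows

# left_anti keeps only impressions whose user_id has NO deletion marker --
# the same effect as LEFT OUTER JOIN ... WHERE deleted_users.user_id IS NULL.
visible_impressions = impressions.join(deleted_users, on="user_id", how="left_anti")

visible_impressions.groupBy("campaign_id").count().show()
```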
Trade-offs:
- Pro: Fast (no file rewriting), cheap (no compute cost), simple (single table join)
- Con: Data still exists physically (encrypted, inaccessible to queries, but not physically removed from disk)
- Legal interpretation: GDPR allows “pseudonymization” where re-identification is infeasible (encrypted data without decryption keys)
Option B: Parquet Rewrite (True Deletion)
Implementation:
- Read Parquet file → filter out deleted user rows → write new file
- Replace old file with new file in S3
- Delete old file
Cost analysis:
- For 1TB daily data: 10-20 hours compute time (Spark job reading, filtering, writing)
- Per-deletion overhead: 100 cores for 10-20 hours
- At scale (1,000 deletions/day): substantial operational overhead
- Amortization: Batch deletions weekly (accumulate 7 days of deletion requests, rewrite once per week)
Recommended Tiered Approach:
| Data Age | Method | Rationale |
|---|---|---|
| 0-90 days (Tier 1) | Parquet rewrite | Recent data = regulatory scrutiny, true deletion required |
| 90d-2yr (Tier 2) | Tombstone markers | Archived data, pseudonymization acceptable |
| 2+ years (Tier 3) | True deletion (S3 object delete) | Cold storage (Glacier), infrequently accessed, delete entire daily files older than 2 years |
Timeline:
- Tier 1: 7 days (weekly batch job rewrites Parquet files for last 90 days)
- Tier 2: 14 days (biweekly batch job adds tombstones)
- Tier 3: 30 days (monthly archival process deletes old cold storage)
Step 3: ML Training Data
Challenge: User data embedded in model weights
Problem:
- GBDT model trained on 400M users
- Individual user contributes ~0.00000025% to the model (1 / 400M = 2.5 × 10⁻⁹)
- Deleting one user requires full retrain (removing from training dataset)
Option A: Retrain Without User (Impractical)
- Cost: Prohibitively expensive (100-500 GPU-hours plus 40-80 engineering hours per retrain)
- Frequency: Daily deletions (100-1,000 users) → prohibitively expensive at scale
- Timeline: 24 hours per retrain (blocks model updates, increases CTR prediction staleness)
Option B: Model Unlearning (Research Area, Not Production-Ready)
- Concept: Machine unlearning techniques to “forget” training examples without full retrain
- Status as of 2025: Research papers exist (SISA, FISHER, etc.), not production-ready at scale
- Risk: Unproven at 400M user scale, uncertain regulatory acceptance
Option C: Aggregate Defense (Practical, Legally Defensible)
Legal Rationale:
- GDPR Article 11: Doesn’t apply when “impossible to identify data subject”
- Individual contribution: < 0.0001% of model (1 user in 400M)
- Mathematical anonymity: Extracting single user’s data from aggregate weights is infeasible (model compression means individual training examples not recoverable)
- CJEU precedent: GDPR allows aggregated data exception when individual not identifiable
Implementation:
- Do NOT retrain model on deletion
- Document aggregate defense rationale (legal memo prepared by counsel)
- Obtain legal opinion supporting approach (external data privacy counsel review)
- Annual legal review (regulatory landscape changes, update approach if needed)
Trade-off Disclosure:
- Not perfect deletion: Data influence remains in weights (user contributed ~0.00000025% to model parameters)
- Legally defensible: As of 2025 interpretation, GDPR Article 11 exempts aggregated models
- Cost-efficient: Avoids prohibitive per-deletion costs (delivers substantial monthly savings at 100-1000 daily deletions)
Recommendation:
- Use Option C (aggregate defense) for MVP and ongoing operations
- Monitor model unlearning research (Option B future consideration when production-ready)
- Document legal rationale and obtain annual counsel review
Audit Trail
Requirement: Prove deletion occurred (for regulatory audits and advertiser disputes)
Implementation:
a. Immutable Deletion Log
- Storage: CockroachDB append-only table OR S3 WORM (Write-Once-Read-Many) bucket
- Schema: `{user_id, deletion_request_timestamp, completion_timestamp, audit_trail}`
- Audit trail content: “Profile deleted (1h), Cache invalidated (1h), Data lake tombstone (7d), ML aggregate defense (documented)”
b. Retention Period
- Duration: 7 years (regulatory requirement for financial records)
- Paradox: Delete user data, but keep deletion logs for 7 years
- Resolution: Logs contain `user_id` (hashed/pseudonymized) + timestamps only, no personal data
c. Compliance Reporting
- Monthly report: Count of deletion requests received, processed, pending
- Annual audit: Provide deletion logs to auditor for GDPR compliance verification
- GDPR Article 30: Record of processing activities (includes deletion procedures)
Data Residency (EU Users)
GDPR Requirement: EU user data must stay in EU region (no cross-border transfer to US)
CockroachDB Implementation:
REGIONAL BY ROW Pattern:
CockroachDB’s REGIONAL BY ROW locality pattern enables GDPR-compliant data residency by pinning each row to its home region based on a column value.
Conceptual schema example (simplified for illustration - production schemas have 50-100+ columns):
-- Example: Configure table to use regional locality
ALTER TABLE user_profiles
SET LOCALITY REGIONAL BY ROW AS region;
-- The 'region' column determines physical storage location
-- CockroachDB automatically routes queries to correct region
Minimal example columns (real tables have many more fields):
- `user_id` (primary key) - User identifier
- `region` (string: ‘us’ or ‘eu’, required) - Locality column that determines storage region
- `demographics` (JSON) - Age range, gender, etc.
- `interests` (JSON) - Topics, categories
- `browsing_history` (JSON) - Recent activity
Production schema note: Real user_profiles tables typically have 50-100+ columns including timestamps, account metadata, consent flags, privacy settings, feature flags, and audit fields. This example shows only the essential concept: the region column controls physical data placement.
How it works:
- Row with `region = 'eu'` → CockroachDB stores data on eu-west-1 nodes only
- Row with `region = 'us'` → CockroachDB stores data on us-east-1 nodes only
- CockroachDB automatically pins rows to specified region (no manual partitioning needed)
- No automatic cross-region replication (data stays in home region)
- Queries automatically route to the correct regional nodes based on the `region` column value
Valkey (Redis) Partitioning:
Separate Clusters per Region:
- EU Valkey cluster: Deployed in eu-west-1, stores only EU user cache
- US Valkey cluster: Deployed in us-east-1, stores only US user cache
- No cross-region cache sharing: Isolation enforced at deployment level
Latency Impact of Data Residency:
Cross-Region Request Scenario:
- EU user requests ad from us-east-1 Ad Server (GeoDNS routing failure or VPN usage)
- Ad Server must fetch user profile from eu-west-1 CockroachDB
- Latency: 10-15ms (local) → 80-120ms (cross-region RTT: NY-London)
Mitigation:
- GeoDNS routes EU users to eu-west-1 gateway (avoids cross-region by default)
- Fallback: If cross-region required, serve contextual ad (no user profile, no latency penalty, privacy-compliant)
- Trade-off: 1-2% of EU requests serve less-targeted ads (acceptable vs GDPR violation)
S3 Data Lake Residency:
- EU bucket: `s3://ads-platform-eu-west-1` (EU data only, no cross-region replication)
- US bucket: `s3://ads-platform-us-east-1` (US data only)
Data Residency Enforcement Diagram:
graph TB
subgraph "EU Region (eu-west-1)"
EU_USER[EU User Request]
EU_GW[EU Gateway]
EU_CRDB[(CockroachDB EU Nodes
REGIONAL BY ROW: 'eu')]
EU_VALKEY[(Valkey EU Cluster
EU cache only)]
EU_S3[(S3 EU Bucket
No cross-region replication)]
end
subgraph "US Region (us-east-1)"
US_USER[US User Request]
US_GW[US Gateway]
US_CRDB[(CockroachDB US Nodes
REGIONAL BY ROW: 'us')]
US_VALKEY[(Valkey US Cluster
US cache only)]
US_S3[(S3 US Bucket
No cross-region replication)]
end
EU_USER -->|GeoDNS routes to EU| EU_GW
EU_GW --> EU_CRDB
EU_GW --> EU_VALKEY
EU_CRDB -.-> EU_S3
US_USER -->|GeoDNS routes to US| US_GW
US_GW --> US_CRDB
US_GW --> US_VALKEY
US_CRDB -.-> US_S3
EU_CRDB -.->|NO cross-region replication| US_CRDB
EU_S3 -.->|NO cross-region replication| US_S3
style EU_CRDB fill:#cce5ff
style EU_VALKEY fill:#cce5ff
style EU_S3 fill:#cce5ff
style US_CRDB fill:#ffe5cc
style US_VALKEY fill:#ffe5cc
style US_S3 fill:#ffe5cc
Subsection Conclusion
GDPR right-to-deletion requires three-step workflow:
- Real-time (< 1 hour): CockroachDB nullification, cache invalidation (L1 pub/sub + L2 DEL), Bloom filter tombstone
- Batch deletion (7-30 days): Tiered approach (Parquet rewrite for recent data, tombstones for archives, full deletion for cold storage)
- ML training data: Aggregate defense (legally defensible, cost-efficient, individual contribution < 0.0001%)
Audit trail: Immutable deletion logs (7-year retention), monthly compliance reports, annual auditor review
Data residency: CockroachDB REGIONAL BY ROW + regional Valkey clusters enforce GDPR data locality (EU data stays in EU, US data stays in US)
Trade-offs acknowledged:
- Parquet tombstones (pseudonymized data remains encrypted) vs Parquet rewrite (substantial operational overhead at 1K deletions/day)
- ML aggregate defense (data influence remains) vs retraining (prohibitive monthly costs)
- Cross-region fallback (1-2% contextual ads) vs GDPR violation
Cross-references:
- Part 1’s API authentication prevents unauthorized access, supporting GDPR access control
- Part 4’s compliance section covers broader GDPR requirements (consent management, data breach notification)
- Part 5’s CockroachDB configuration implements REGIONAL BY ROW for data residency
Legal disclaimer: This implementation reflects common industry practice and 2025 GDPR interpretation, but is not formal legal advice. The ML model “aggregate defense” approach (not retraining on deletion) is based on GDPR Article 11’s infeasibility exception, but has not been formally adjudicated by courts. Individual circumstances vary - organizations must consult qualified data privacy counsel for legal guidance specific to their jurisdiction and use case. The regulatory landscape continues to evolve, and annual legal review with external counsel is strongly recommended.
Cache Performance Analysis
Cache Architecture Clarification:
The system has two cache tiers plus the database:
- L1 Cache: In-process (Caffeine) - serves hot data instantly
- L2 Cache: Distributed (Valkey) - serves warm data across instances
- Database: CockroachDB - source of truth (not a cache)
Cache Hit Rate Calculation:
Let \(H_i\) be the conditional hit rate of cache tier \(i\):
$$H_{cache} = H_1 + (1 - H_1) \times H_2$$
Target configuration (25% L2 coverage as shown in optimization table below):
- \(H_1 = 0.60\) (60% served from L1 in-process cache)
- \(H_2 = 0.625\) (62.5% conditional hit rate - hits L2 given L1 miss)
- L2 serves: \(0.40 \times 0.625 = 25%\) of total requests
- Combined cache hit rate = 85% (60% + 25%)
- Database queries = 15% (cache miss → query CockroachDB)
Data Availability:
Of the 15% requests that miss both caches and query the database:
- 99%+ have data (14.85% of total) - established users with profiles
- ~1% genuinely missing (0.15% of total) - new users, anonymous users, deleted profiles
Effective data found rate: 99.85% (85% from cache + 14.85% from database)
Average Latency:
$$\mathbb{E}[L] = H_1 L_1 + (1-H_1)H_2 L_2 + (1-H_1)(1-H_2) L_{db}$$
With latencies \(L_1 = 0.001ms\), \(L_2 = 5ms\), \(L_{db} = 20ms\):
$$\mathbb{E}[L] = 0.60 \times 0.001 + 0.40 \times 0.625 \times 5 + 0.40 \times 0.375 \times 20 = 4.25ms$$
Key Insight: 85% cache hit rate means only 15% of requests query the database (20ms penalty). This is the critical metric - not whether data exists (which is ~100% for established users), but whether we can serve it from cache.
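The two formulas above are easy to sanity-check in a few lines of Python:

```python
def combined_hit_rate(h1: float, h2_given_l1_miss: float) -> float:
    # H_cache = H1 + (1 - H1) * H2
    return h1 + (1 - h1) * h2_given_l1_miss

def expected_latency_ms(h1, h2, l1_ms, l2_ms, db_ms) -> float:
    # E[L] = H1*L1 + (1-H1)*H2*L2 + (1-H1)*(1-H2)*L_db
    return h1 * l1_ms + (1 - h1) * h2 * l2_ms + (1 - h1) * (1 - h2) * db_ms

print(combined_hit_rate(0.60, 0.625))                  # 0.85 -> 85% cache hit rate
print(expected_latency_ms(0.60, 0.625, 0.001, 5, 20))  # ~4.25 ms average latency
```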
Cache Cost Optimization: The Economic Tradeoff
Architectural Driver: Financial Accuracy + Latency - Cache sizing is not just a performance problem but an economic optimization. At scale, every GB of Redis costs money, every cache miss hits the database (cost + latency), and every millisecond of added latency costs revenue. The optimal cache size balances these three factors.
The Fundamental Tradeoff:
At 1M QPS with 400M users, cache sizing decisions have massive financial impact:
- Too small cache: High miss rate → database overload + latency spikes → revenue loss
- Too large cache: Paying for Redis memory that delivers diminishing returns
- Optimal size: Maximizes profit = revenue - (cache cost + database cost + latency cost)
Cost Model:
The total cost function combines three components:
$$C_{total} = C_{cache}(S) + C_{db}(S) + C_{latency}(S)$$
where \(S\) = cache size (GB)
Component 1: Cache Memory Cost
$$C_{cache}(S) = S \times P_{memory} \times N_{nodes}$$
where:
- \(S\) = cache size per node (GB)
- \(P_{memory}\) = cost per GB-month (baseline cache cost unit)
- \(N_{nodes}\) = number of Redis nodes
Cache pricing note: Managed cache services (ElastiCache, Valkey) cost 10-12× per GB compared to self-hosted instances. Self-hosted Redis on standard instances is cheaper but adds operational overhead.
Example: 1000 nodes × 16GB/node × baseline GB-month rate = baseline cache cost
Component 2: Database Query Cost
Cache misses hit CockroachDB, which costs both compute and I/O:
$$C_{db}(S) = Q_{total} \times (1 - H(S)) \times C_{query}$$
where:
- \(Q_{total}\) = total queries/month
- \(H(S)\) = hit rate as function of cache size
- \(C_{query}\) = cost per database query (baseline query cost unit)
Example: 2.6B queries/month × 5% miss rate × baseline query cost = query cost component
Component 3: Revenue Loss from Latency
Every cache miss adds ~15ms latency (database read vs cache hit). As established in Part 1, Amazon’s study found 100ms latency = 1% revenue loss.
$$C_{latency}(S) = R_{monthly} \times (1 - H(S)) \times \frac{\Delta L}{100ms} \times 0.01$$
where:
- \(R_{monthly}\) = monthly revenue baseline
- \(\Delta L\) = latency penalty per miss (15ms)
- 0.01 = 1% revenue loss per 100ms
Example: Revenue baseline × 5% miss rate × (15ms/100ms) × 1% = latency cost component
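Putting the three components together, here is a small cost-model sketch. All unit prices are placeholder baseline values (the post deliberately avoids absolute dollar figures), so only the structure of the calculation is meaningful; plug in your own prices to reproduce the relative-cost tables later in this section.

```python
def caching_cost_components(cache_gb: float, hit_rate: float, *,
                            queries_per_month: float,
                            price_per_gb_month: float,
                            cost_per_db_query: float,
                            monthly_revenue: float,
                            latency_penalty_ms: float = 15.0):
    """Return the (cache, db, latency) components of C_total(S)."""
    miss_rate = 1.0 - hit_rate
    c_cache = cache_gb * price_per_gb_month
    c_db = queries_per_month * miss_rate * cost_per_db_query
    # 1% of revenue lost per 100ms of added latency, applied to the miss fraction.
    c_latency = monthly_revenue * miss_rate * (latency_penalty_ms / 100.0) * 0.01
    return c_cache, c_db, c_latency

# Placeholder values only exercise the function; substitute real unit prices.
print(caching_cost_components(800, 0.83, queries_per_month=2.6e9,
                              price_per_gb_month=1.0, cost_per_db_query=0.001,
                              monthly_revenue=5_000_000))
```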
Modeling User Access Patterns: Why Zipfian Distribution?
Real-world user access patterns in web systems follow a power law distribution, not a uniform distribution. A small fraction of users (or items) account for a disproportionately large fraction of traffic.
Zipfian distribution (named after linguist George Zipf) models this phenomenon:
- The most popular item receives the largest share of accesses
- The 2nd most popular item is accessed about \(\frac{1}{2}\) as often as the most popular one
- The nth most popular item is accessed about \(\frac{1}{n}\) as often as the most popular one
Why Zipfian over alternatives:
| Distribution | When It Applies | Why NOT for Cache Sizing |
|---|---|---|
| Uniform | All items accessed equally | Unrealistic - power users exist, not all users access platform equally |
| Normal (Gaussian) | Symmetric data around mean | User access has long tail, not bell curve. Most users low-activity, few users very high-activity |
| Exponential | Time between events | Models timing/intervals, not popularity ranking |
| Zipfian (power law) | Popularity ranking | Matches empirical data (validated below) |
Empirical validation for ad platforms:
- Content platforms: YouTube (2016): 10% of videos account for 80% of views. Facebook (2013): Top 1% of users generate 30% of content interactions.
- User behavior: Power users (daily active) access the platform far more frequently than casual users (weekly/monthly)
- Advertiser concentration: Large advertisers (Procter & Gamble, Unilever) run continuous campaigns; small advertisers run sporadic 1-week campaigns
Parameter choice: \(\alpha = 1.0\) (classic Zipf’s law) is standard for web caching literature. Higher \(\alpha\) (e.g., 1.5) means more concentration at the top; lower \(\alpha\) (e.g., 0.7) means flatter distribution.
Hit Rate as Function of Cache Size (Zipfian Distribution):
User access follows Zipfian distribution with \(\alpha = 1.0\) (power law):
$$P(\text{rank } r) = \frac{1/r}{\sum_{i=1}^{N} 1/i} \approx \frac{1}{r \times \ln(N)}$$
Cache hit rate is the cumulative access probability of the items that fit in the cache:
$$H(S) = \sum_{r=1}^{R(S)} P(\text{rank } r)$$
where \(R(S)\) is the number of items a cache of size \(S\) can hold.
For Zipfian(\(\alpha=1.0\)) with realistic LRU cache behavior:
| Cache Coverage | L2-Only Hit Rate (Theoretical) | Cumulative L1+L2 (Realistic) | Cache Size |
|---|---|---|---|
| Top 1% | 40-45% | 55-60% | 40GB |
| Top 5% | 55-60% | 65-70% | 200GB |
| Top 10% | 65-70% | 75-80% | 400GB |
| Top 20% | 68-78% | 78-88% | 800GB (optimal) |
| Top 40% | 78-85% | 90-95% | 1.6TB |
Key insight: Zipfian distribution means diminishing returns after ~20% coverage.
Note: “Cumulative L1+L2” includes L1 in-process cache (60% hit rate on hot data) plus L2 distributed cache. L2-only rates assume LRU eviction (0.85× theoretical LFU performance). See detailed validation methodology below for calculation derivation.
Marginal Cost Analysis:
The optimal cache size occurs where marginal cost equals marginal benefit:
$$\frac{dC_{total}}{dS} = 0$$
Marginal cost (adding 1 GB of cache): $$MC_{cache} = 1GB \times P_{memory} \times N_{nodes}$$
Marginal benefit (hit rate improvement):
For Zipfian distribution, adding cache beyond 20% coverage yields <0.5% hit rate improvement:
$$MB = \Delta H \times (C_{db} + C_{latency})$$
Example:
- Going from 20% → 30% coverage: +0.5% hit rate
- Benefit: 0.005 × (query cost + latency cost components) ≈ small benefit
- Cost: 10% × 4TB = 400GB additional cache × cluster size = very large cost
Not worth it - marginal cost far exceeds marginal benefit beyond 20% coverage.
Optimal Cache Size Calculation:
Given our constraints:
- Total dataset: 4TB (400M users × 10KB/user)
- Monthly revenue: baseline (illustrative example for 1M QPS platform)
- Redis cost: baseline cache cost per GB-month
- Database query cost: baseline query cost
- Latency penalty: 1% revenue per 100ms
Optimize:
$$\min_{S} \left[ C_{cache}(S) + C_{db}(S) + C_{latency}(S) \right]$$
Subject to:
- \(H(S) \geq 0.80\) (minimum acceptable hit rate)
- \(L_{p99} \leq 10ms\) (latency SLA)
Solution (relative costs as % of total caching infrastructure):
| L2 Cache Size (% of 4TB total) | Cumulative L1+L2 Hit Rate | Cost Breakdown (relative %) | Total Cost vs Baseline* | Analysis |
|---|---|---|---|---|
| 5% (200GB) | 65-70% | Cache: 15%, DB: 54%, Latency: 31% | 100% (baseline) | High DB+latency penalties |
| 10% (400GB) | 75-80% | Cache: 37%, DB: 40%, Latency: 23% | 81% | Better balance |
| 20% (800GB) | 78-88% | Cache: 74%, DB: 16%, Latency: 10% | 80% (optimal) | Best total cost |
| 40% (1.6TB) | 90-95% | Cache: 93%, DB: 5%, Latency: 2% | 128% | Expensive for marginal gain |
*Total cost relative to 5% coverage baseline (100%). Lower is better.
Optimal choice: 20% coverage (800GB L2 cache)
- 20% coverage is the clear winner at 80% of the 5%-coverage cost
- Provides 78-88% cumulative L1+L2 cache hit rate following Zipfian power-law distribution (α≈1.0)
- Theoretical baseline: Zipfian simulation (α=1.0, 400M users) shows 20% coverage captures 76-80% of requests
- Production adjustment: L1 temporal locality + workload clustering adds 2-8% improvement
- Range accounts for: Workload diversity (uniform access = 78%, highly skewed = 88%)
- Remaining 12-22% requests query database (CockroachDB with ~20ms latency)
- Best total cost optimization: Balances cache, database, and latency costs
Hit Rate Validation Methodology
Why Zipf Distribution Applies:
User access patterns in digital systems follow power-law distributions (Zipf-like): a small fraction of users generate disproportionate traffic. Research shows:
- Web caching: Breslau et al. (1999) found Zipf-like distributions in proxy traces
- Content delivery: Netflix, YouTube report α ≈ 0.8-1.2 for viewing patterns
- Ad tech: Campaign budgets and user engagement follow similar power laws
Zipf Distribution Definition:
For N total items (users), the probability of accessing item ranked i is:
$$P(i) = \frac{1/i^{\alpha}}{\sum_{j=1}^{N} 1/j^{\alpha}} = \frac{1/i^{\alpha}}{H(N, \alpha)}$$
where \(H(N, \alpha)\) is the generalized harmonic number (normalization constant).
Cache Hit Rate Calculation:
For a cache holding the top C most popular items (LFU/static caching):
$$\text{Hit Rate} = \frac{\sum_{i=1}^{C} P(i)}{\sum_{i=1}^{N} P(i)} = \frac{H(C, \alpha)}{H(N, \alpha)}$$
Step-by-Step for Our System:
Parameters:
- N = 400M total users in system
- C = 20% coverage = 80M users cached
- α = 1.0 (standard Zipf, conservative estimate)
Step 1: Calculate harmonic numbers
For α=1.0, \(H(N, 1) \approx \ln(N) + \gamma\) where γ ≈ 0.5772 (Euler-Mascheroni constant)
- \(H(80M, 1) \approx \ln(80M) + 0.5772 \approx 18.2 + 0.6 = 18.8\)
- \(H(400M, 1) \approx \ln(400M) + 0.5772 \approx 19.8 + 0.6 = 20.4\)
Step 2: Calculate base hit rate (L2 cache only)
$$\text{L2 Hit Rate} = \frac{18.8}{20.4} \approx 0.92 \text{ or } 92%$$
Wait, this seems too high! The issue: this assumes perfect LFU and independent requests.
Step 3: Apply real-world corrections
Real systems deviate from theoretical Zipf:
- Imperfect ranking: LRU (Least Recently Used) cache doesn’t perfectly track popularity
  - LRU hit rate ≈ 0.8-0.9 × LFU theoretical rate (Berger et al. 2015)
  - Correction factor: 0.85
- Temporal clustering: User sessions create bursts
  - Positive effect: L1 cache absorbs repeated requests within sessions
  - L1 adds +10-15% effective hit rate on top of L2
- Workload variation: α varies by vertical (e-commerce vs gaming)
  - α = 0.9-1.1 typical range
  - Lower α → flatter distribution → lower hit rate
Step 4: Combined L1 + L2 hit rate
L2 realistic hit rate: \(0.92 \times 0.85 \approx 0.78\) (78%)
L1 contribution: Caffeine in-process cache with 60% hit rate captures hot subset
Combined rate: \(H_{total} = H_{L1} + (1 - H_{L1}) \times H_{L2}\)
$$H_{total} = 0.60 + (1 - 0.60) \times 0.78 = 0.60 + 0.31 = 0.91 \text{ or } 91%$$
But: L1 size is tiny (2-4GB), only caches ~1M hottest users (0.25% coverage)
Recalculating with realistic L1:
- L1 covers 0.25% of users → ~50-60% of requests (ultra-hot)
- L2 covers remaining: \((1 - 0.60) \times 0.78 \approx 0.31\) (31%)
- Total: 60% + 31% = 91%
Wait, still too high compared to our 78-88% claim!
Step 5: Conservative adjustments
To get 78-88% range, we account for:
- Worst-case α = 0.9 (flatter distribution than α=1.0)
  - Recalculating with α=0.9: \(H(80M, 0.9) / H(400M, 0.9) \approx 0.88\)
  - With 0.85 LRU correction: \(0.88 \times 0.85 \approx 0.75\) (75%)
  - Plus L1 (60%): \(0.60 + 0.40 \times 0.75 = 0.90\) (still 90%!)
- Real issue: Our 20% L2 coverage doesn’t cache the top 80M individual users
  - Reality: L2 caches ~800GB of serialized profile data
  - Average profile size: ~1-10KB depending on richness
  - Effective user coverage: 80M - 800M users depending on profile size
  - If profiles avg 4KB: 800GB / 4KB = 200M users (50% coverage, not 20%!)
Reconciliation: The “20% coverage” refers to storage capacity (800GB / 4TB), not user count!
With 50% user coverage (C = 200M):
- \(H(200M, 1) / H(400M, 1) \approx \ln(200M) / \ln(400M) \approx 19.1 / 19.8 = 0.96\) (96% theoretical)
- With LRU correction (0.85): \(0.96 \times 0.85 = 0.82\) (82%)
- Plus L1 (60%): \(0.60 + 0.40 \times 0.82 = 0.93\) (93%)
Conservative range 78-88%:
- Lower bound (78%): Assumes α=0.9, cold start, no L1 benefit
- Mid-point (83%): Typical α=1.0, LRU cache, moderate L1
- Upper bound (88%): Assumes α=1.1, warmed cache, strong temporal locality
Validation sources:
- Breslau et al. (1999) “Web Caching and Zipf-like Distributions” - established Zipf-like patterns in web traces
- Berger et al. (2015) “Maximizing Cache Hit Ratios by Variance Reduction” - LRU vs LFU correction factors
- ArXiv cs/0303014 “Theoretical study of cache systems” - harmonic number approximations for Zipf
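The harmonic-number arithmetic in Steps 1-4 is easy to reproduce; a short sketch for α = 1.0 only, using the ln(N) + γ approximation from Step 1:

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(n: float) -> float:
    # H(n, alpha=1) ~ ln(n) + gamma; accurate enough for n in the millions.
    return math.log(n) + EULER_GAMMA

def l2_hit_rate(cached_users: float, total_users: float,
                lru_correction: float = 0.85) -> float:
    # Theoretical LFU hit rate for a Zipf(1.0) workload, discounted for LRU.
    return harmonic(cached_users) / harmonic(total_users) * lru_correction

def combined_hit_rate(l1_hit: float, l2_hit: float) -> float:
    return l1_hit + (1 - l1_hit) * l2_hit

l2 = l2_hit_rate(200e6, 400e6)            # ~0.96 theoretical -> ~0.82 after LRU correction
print(round(l2, 2), round(combined_hit_rate(0.60, l2), 2))   # ~0.82, ~0.93
```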
Trade-off accepted: We choose 20% coverage (800GB distributed across cluster) because:
- Lowest total cost: Optimal point on cost curve (80% of 5%-coverage baseline)
- 78-88% cache hit rate meets 80%+ requirement with safety margin (mid-range = 83%)
- Only 12-22% requests incur database query penalty (acceptable for 20ms budget)
- Latency cost minimized (reduces latency penalty 59% vs 10% coverage)
- Worth paying higher cache cost to save significantly on database and latency costs
TTL Optimization: Freshness vs Hit Rate Tradeoff
Time-to-live (TTL) settings create a second optimization problem:
- Short TTL (10s): Fresh data, but more cache misses after expiration
- Long TTL (300s): High hit rate, but stale data
Staleness Cost Model:
$$C_{staleness} = P(\text{stale}) \times C_{error}$$
For user profiles:
- 1% of profiles update per hour
- Average TTL/2 staleness window
- Cost of stale ad: targeting quality degradation
Example: 30s TTL
- Average staleness: 15s
- Probability stale: 0.01 × (15/3600) = 0.0042%
- Cost: Low staleness penalty (baseline)
Example: 300s TTL
- Average staleness: 150s
- Probability stale: 0.01 × (150/3600) = 0.042%
- Cost: 10× higher staleness penalty
Optimal TTL: 30-60 seconds
Balances freshness cost with reasonable hit rate. Longer TTLs increase staleness cost 10×.
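The staleness arithmetic above generalizes to a one-line function (the 1%-per-hour update rate is the assumption carried over from the examples):

```python
def staleness_probability(update_rate_per_hour: float, ttl_seconds: float) -> float:
    # Average age of a cached entry is TTL/2; multiply by the hourly update rate.
    return update_rate_per_hour * (ttl_seconds / 2) / 3600

print(f"{staleness_probability(0.01, 30):.4%}")   # ~0.0042% of reads stale at 30s TTL
print(f"{staleness_probability(0.01, 300):.4%}")  # ~0.0417% of reads stale at 300s TTL
```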
Multi-Tier Architecture: Performance vs Complexity Trade-off
Question: Does adding L1 in-process cache (Caffeine) justify the added complexity?
L1 Cache Overhead:
- Memory: ~100MB per server (negligible, in-heap allocation)
- CPU: ~2% overhead for cache management
- Operational complexity: Additional monitoring, cache invalidation logic
L1 Cache Benefits:
Performance gains:
- L1 hit rate: 60% of all requests served from in-process memory
- Latency improvement: 5ms (Redis) → <0.001ms (in-process) = ~5ms saved per hit
- Average latency improvement: 60% × 5ms = ~3ms across all requests
At 150ms total latency budget, 3ms represents ~2% improvement - marginal performance benefit.
However: L1 cache provides critical resilience during L2 failures:
| Scenario | L1 Cache | Impact |
|---|---|---|
| Redis healthy | 60% L1 hit, 40% L2 hit | Optimal latency |
| Redis degraded (p99 >15ms) | 60% L1 hit, 40% cold start | -4-6% targeting accuracy, system stays online |
| Redis down | 60% L1 hit, 40% database | Database load manageable (40% instead of 100%) |
| No L1 cache | 100% cache miss on Redis failure | Database overload → cascading failure |
Decision: Keep L1 for resilience and fault tolerance, not performance optimization. The 2% CPU overhead is insurance against catastrophic L2 cache failures.
Cost Summary (relative to total caching infrastructure):
| Component | Relative Cost | Notes |
|---|---|---|
| L1 Cache (Caffeine) | ~0% | In-process, negligible memory |
| L2 Cache (Redis/Valkey) | 58% | 800GB at 20% coverage, 78-88% hit rate |
| L3 Database infrastructure (CockroachDB) | 22-29% | 60-80 nodes baseline |
| Database query cost (cache misses) | 13% | 12-22% miss rate × query volume |
| Cache miss latency cost | 8% | Revenue loss from slow queries |
| Total caching infrastructure | 100% | Optimized for 78-88% hit rate at 20% coverage |
Alternative (no caching):
- Database infrastructure: 23-28% (more nodes for load)
- Database query cost: 49% (all queries hit database)
- Latency cost: 28% (all queries at 15ms latency penalty)
- Total: 380-400% of optimized caching cost + poor user experience
Savings from caching: 73-75% cost reduction vs no-cache alternative
Redis Cluster: Consistent Hashing and Sharding
Cluster Configuration:
- 1000 Redis nodes
- 16,384 hash slots (Redis default)
- Consistent hashing with virtual nodes
Hash Slot Assignment:
For key \(k\), compute hash: $$\text{slot}(k) = \text{CRC16}(k) \mod 16384$$
Slot-to-node mapping maintained in cluster state.
Virtual Nodes:
Each physical node handles \(\frac{16384}{1000} \approx 16\) hash slots.
Load Distribution:
With a uniform hash function and virtual nodes, the relative load imbalance across physical nodes is approximately $$\frac{\sigma_{\text{load}}}{\mu} \approx \frac{1}{\sqrt{v}}$$
where:
- \(\mu\) = average load per node
- \(\sigma_{\text{load}}\) = standard deviation of per-node load
- \(v\) = number of virtual nodes per physical node
Example: 1000 QPS spread across 1000 nodes with 16 virtual nodes each → standard deviation ≈ \(1/\sqrt{16}\) = 25% of mean load.
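Slot assignment itself is a one-liner; Python's `binascii.crc_hqx` computes the CRC-16/XMODEM variant that Redis Cluster documents for key hashing (hash tags are ignored in this sketch, and the slot-to-node map is a stand-in for real cluster state):

```python
import binascii

HASH_SLOTS = 16384

def key_slot(key: str) -> int:
    # slot(k) = CRC16(k) mod 16384; crc_hqx is CRC-16/XMODEM, matching Redis Cluster.
    return binascii.crc_hqx(key.encode(), 0) % HASH_SLOTS

def owning_node(key: str, slot_to_node: dict) -> str:
    # slot_to_node mirrors cluster state: each of the 16384 slots maps to a node id.
    return slot_to_node[key_slot(key)]

print(key_slot("user:12345:profile"))  # deterministic value in [0, 16383]
```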
Hot Partition Problem and Mitigation
Problem Definition:
A “celebrity user” generates 100x normal traffic:
- Normal user: 10 requests/second
- Celebrity user: 1,000 requests/second
Single Redis node cannot handle spike → becomes bottleneck.
Detection: Count-Min Sketch
Count-Min Sketch is a probabilistic data structure that tracks key frequencies in constant memory (~5KB for millions of keys) with O(1) operations. It provides conservative frequency estimates (never under-counts, may over-estimate), making it ideal for detecting hot keys without storing exact counters. Trade-off: tunable accuracy vs memory footprint.
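A minimal Count-Min Sketch in Python; the width/depth values and the blake2b-salt hashing scheme are illustrative choices, not the only reasonable ones:

```python
import hashlib

class CountMinSketch:
    """Width x depth grid of counters; estimate(key) never under-counts."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Take the minimum over rows: hash collisions can only inflate counts.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("user:celebrity:profile")
cms.add("user:normal:profile")
print(cms.estimate("user:celebrity:profile") >= 1000)  # True (conservative estimate)
```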
Dynamic Hot Key Replication:
Goal: Prevent hot keys (e.g., celebrity users, viral content) from overwhelming a single cache node and creating bottlenecks.
Approach:
- Detection threshold: Configure the request rate that triggers replication
  - Too low = unnecessary replication overhead (memory waste across multiple nodes)
  - Too high = hot keys cause bottlenecks before mitigation kicks in
  - Determine based on single-node capacity and typical access patterns
- Replication factor selection: Choose how many replicas to create
  - Calculate: \(\text{replicas\_needed} = \lceil \frac{\text{hot\_key\_traffic}}{\text{single\_node\_capacity}} \rceil\)
  - Trade-off: More replicas = better load distribution but higher memory overhead
  - Consider network topology (replicate across availability zones for resilience)
- Load distribution: Spread reads across replicas
  - Random selection = simple, uniform distribution
  - Locality-aware = lower latency but more complex routing
How to determine values:
- Measure your cache node’s request handling capacity under load
- Profile your key access distribution (use histograms or probabilistic counters)
- Set detection threshold at 60-80% of single-node capacity to trigger before saturation
- Calculate replication factor dynamically: \(\max\left(2, \lceil \frac{\text{observed\_traffic}}{\text{node\_capacity}} \rceil\right)\)
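The replication-factor formula above translates directly into code; the traffic and capacity numbers below are placeholders:

```python
import math
import random

def replicas_needed(hot_key_qps: float, node_capacity_qps: float) -> int:
    # max(2, ceil(traffic / capacity)): always keep at least one extra copy.
    return max(2, math.ceil(hot_key_qps / node_capacity_qps))

def read_replica_key(base_key: str, n_replicas: int) -> str:
    # Writes fan out to every replica key; reads pick one at random to spread load.
    return f"{base_key}:rep{random.randrange(n_replicas)}"

n = replicas_needed(hot_key_qps=1_000, node_capacity_qps=300)   # -> 4 replicas
print(n, read_replica_key("user:celebrity:profile", n))
```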
Workload Isolation: Separating Batch from Serving Traffic
One critical lesson from large-scale systems: never let batch workloads interfere with serving traffic.
The Problem:
Hourly batch jobs updating user profiles in CockroachDB (millions of writes/hour) can interfere with serving layer reads for ad personalization. Without isolation, batch writes can:
- Saturate disk I/O (batch writes compete with serving reads)
- Fill up queues and increase latency (p99 latency spikes from 20ms to 200ms)
- Trigger compactions that block reads
Solution: Read/Write Replica Separation
Goal: Isolate batch write workloads from latency-sensitive serving reads to prevent I/O contention, queue buildup, and compaction-induced stalls.
Approach:
- Workload characterization: Measure your read/write ratio and latency requirements
  - Serving traffic: high-volume reads, strict latency SLAs (e.g., <20ms p99)
  - Batch jobs: bursty writes, throughput-focused, can tolerate higher latency
- Capacity allocation strategy: Dedicate infrastructure based on workload intensity
  - Calculate: \(\text{batch\_capacity} = \frac{\text{batch\_write\_throughput} \times \text{replication\_factor}}{\text{node\_write\_capacity}}\)
  - Calculate: \(\text{serving\_capacity} = \frac{\text{serving\_read\_throughput} \times \text{safety\_margin}}{\text{node\_read\_capacity}}\)
  - Trade-off: Over-provisioning batch capacity wastes resources; under-provisioning causes spillover that degrades serving latency
- Consistency vs staleness trade-off: Decide what staleness is acceptable for serving reads
  - Strong consistency = all reads hit the write leader (no isolation benefit, full contention)
  - Eventual consistency = reads from local replicas (isolation achieved, but data may be slightly stale)
  - Determine staleness tolerance based on business requirements (user profiles can tolerate seconds of lag, financial data may require strong consistency)
- Topology design: Pin workloads to specific regions/nodes
  - Use database-specific primitives (range leases, follower reads, read replicas)
  - Concentrate batch writes on dedicated infrastructure
  - Serve reads from separate replicas that aren’t absorbing write load
How to determine capacity split:
- Profile your workload: measure read QPS, write QPS, and their respective resource consumption
- Calculate resource needs: \(\text{serving\_nodes} = \lceil \frac{\text{read\_load}}{\text{node\_capacity} \times \text{target\_utilization}} \rceil\)
- Calculate batch needs: \(\text{batch\_nodes} = \lceil \frac{\text{write\_load} \times \text{replication\_factor}}{\text{node\_write\_capacity}} \rceil\)
- Validate with load testing that serving latency remains stable during batch job execution
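A minimal sketch of the capacity-split arithmetic above; every constant is an illustrative assumption standing in for your own profiling numbers:

```python
import math

# Illustrative profiling results (assumptions)
READ_QPS = 800_000              # serving read load
WRITE_QPS = 50_000              # batch write load
NODE_READ_CAPACITY = 40_000     # reads/sec per node at the latency SLA
NODE_WRITE_CAPACITY = 15_000    # writes/sec per node
REPLICATION_FACTOR = 3
TARGET_UTILIZATION = 0.6        # headroom so p99 stays stable during bursts

serving_nodes = math.ceil(READ_QPS / (NODE_READ_CAPACITY * TARGET_UTILIZATION))
batch_nodes = math.ceil((WRITE_QPS * REPLICATION_FACTOR) / NODE_WRITE_CAPACITY)

print(f"serving nodes: {serving_nodes}, batch nodes: {batch_nodes}")  # -> 34, 10
```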
Cost of isolation: You’re essentially paying for separate infrastructure to prevent contention. The cost is proportional to your batch workload intensity. If batch jobs consume 30% of total database operations, expect to provision roughly 30-40% additional capacity for isolation (accounting for replication overhead).
Monitoring the gap:
Track replication lag between batch and serving replicas:
$$\text{Replication lag} = \text{Timestamp}_{\text{batch (write) replica}} - \text{Timestamp}_{\text{serving replica}}$$
If lag exceeds 5 minutes, investigate: scale the serving replicas or throttle batch writes.
Cache Invalidation Strategies
Problem: When user data updates (e.g., profile change), how to invalidate stale cache?
Strategy 1: TTL-Based (Passive)
Set time-to-live on cache entries: $$\text{Staleness} \leq \text{TTL}$$
Pros:
- Simple implementation
- No coordination required
Cons:
- Guaranteed staleness up to TTL
- Unnecessary cache misses after TTL
Strategy 2: Active Invalidation (Event-Driven)
On data update:
- Publish invalidation event to Kafka topic
- All cache servers subscribe and evict key from L1/L2
Latency:
- Kafka publish: ~5ms
- Consumer processing: ~10ms
- Total invalidation propagation: ~15ms
Pros:
- Low staleness (< 100ms)
- No unnecessary evictions
Cons:
- Requires event streaming infrastructure
- Network overhead for invalidation messages
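A minimal sketch of the consumer side, assuming kafka-python, an assumed `cache-invalidation` topic, and a plain dict standing in for the in-process L1 cache; each server uses its own consumer group so every server sees every invalidation:

```python
from kafka import KafkaConsumer

l1_cache: dict[str, bytes] = {}          # stand-in for the real in-process cache

consumer = KafkaConsumer(
    "cache-invalidation",                # assumed topic name
    bootstrap_servers=["kafka:9092"],
    group_id="ad-server-42",             # unique group per server: every server sees every event
    auto_offset_reset="latest",          # only new invalidations matter after a restart
)

for message in consumer:
    key = message.value.decode("utf-8")  # e.g. "user:12345"
    l1_cache.pop(key, None)              # evict from L1; L2 eviction handled by its own consumer
```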
Strategy 3: Versioned Caching
Include version in cache key:
- Cache key format: `user_id:version` (e.g., `user123:v2`)
On update:
- Increment version in metadata store
- New requests fetch new version
- Old version expires naturally via TTL
Pros:
- No explicit invalidation needed
- Multiple versions coexist temporarily
Cons:
- Metadata store becomes critical path
- Higher cache memory usage (duplicate versions)
Hybrid Approach (Recommended):
Architectural Drivers: Latency vs Financial Accuracy - We use eventual consistency (30s TTL) for user preferences to meet latency targets, but strong consistency (active invalidation) for GDPR opt-outs where legal compliance is non-negotiable.
- Normal updates: TTL = 30s (passive invalidation)
- Critical updates (e.g., GDPR opt-out): Active invalidation via Kafka
- Version metadata for tracking update history
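A minimal sketch of the hybrid policy, assuming cachetools for the TTL-bounded L1 entries and kafka-python for the critical-update fan-out; the topic name and write path are illustrative:

```python
from cachetools import TTLCache
from kafka import KafkaProducer

l1_cache = TTLCache(maxsize=100_000, ttl=30)      # 30s passive staleness bound

producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

def update_user_profile(user_id: str, profile: dict, critical: bool = False) -> None:
    """Write-through update; only critical changes (e.g. GDPR opt-out) fan out."""
    # ... persist the profile to the database here ...
    l1_cache[f"user:{user_id}"] = profile
    if critical:
        # Active invalidation: every ad server evicts this key within ~15ms
        producer.send("cache-invalidation", f"user:{user_id}".encode("utf-8"))
```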
Privacy-Preserving Attribution: SKAdNetwork & Privacy Sandbox
Architectural Driver: Signal Availability - When 40-60% of traffic lacks stable user_id (ATT opt-out, Privacy Sandbox), traditional click-to-conversion attribution breaks. SKAdNetwork (iOS) and Attribution Reporting API (Chrome) provide privacy-preserving alternatives with delayed, aggregated conversion data.
The Attribution Challenge
Traditional attribution: User clicks ad → store user_id + click_id → user converts → match conversion to click via user_id → attribute revenue.
This fails when:
- iOS user opts out of ATT → no IDFA to link click and conversion
- Chrome Privacy Sandbox → third-party cookies unavailable
- Cross-device journeys → user clicks on phone, converts on desktop
Privacy frameworks provide attribution without persistent identifiers:
SKAdNetwork Postback Handling (iOS)
Apple’s SKAdNetwork provides conversion data for ATT opt-out users through delayed postbacks. When a user clicks an ad and installs an app, iOS starts a privacy timer (24-72 hours, randomized). If the user converts within the app during this window, the app signals the conversion to SKAdNetwork. After the timer expires, Apple sends an aggregated postback to the ad network containing campaign-level attribution data.
Critical architectural constraints:
The postback contains only campaign identifier and a 6-bit conversion value (0-63) - no user identity, device ID, or precise conversion details. This forces a fundamentally different attribution model:
- Campaign-level aggregation only: Individual user journeys are invisible; optimization happens at campaign cohorts
- Delayed feedback loop: 1-3 day lag between conversion and attribution means ML models train on stale data
- Coarse conversion signals: 64 possible values must encode all conversion types (trials, purchases, subscription tiers)
- No creative/keyword attribution: Cannot determine which ad variant drove the conversion
Data pipeline integration:
graph TB
SKAN[SKAdNetwork Postback
HTTPS webhook]
KAFKA[Kafka Topic
skan-postbacks]
FLINK[Flink Processor
Aggregate by campaign]
CRDB[CockroachDB
campaign_conversions table]
SKAN -->|Parse & validate| KAFKA
KAFKA --> FLINK
FLINK -->|campaign_id, conversion_value, count| CRDB
style SKAN fill:#f9f,stroke:#333
style KAFKA fill:#ff9,stroke:#333
style FLINK fill:#9ff,stroke:#333
style CRDB fill:#9f9,stroke:#333
Storage and aggregation:
Postbacks arrive as HTTPS webhooks, get queued in Kafka for reliability, then aggregated by Flink into campaign-level conversion metrics. The database stores daily aggregates partitioned by date: campaign identifier, conversion value, postback count, and revenue estimates.
Conversion value interpretation:
Advertisers map the 64 possible conversion values (6 bits) to their business model. Common patterns include quartile-based revenue brackets (0-15 for trials/signups, 16-31 for small purchases, 32-47 for medium, 48-63 for high-value conversions) or subscription tier encoding. The mapping is a critical product decision since it defines what the ML models can optimize for.
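A sketch of one possible mapping along the quartile pattern above; the bracket boundaries and revenue estimates are illustrative product choices, not part of SKAdNetwork itself:

```python
def conversion_value(event: str, revenue_usd: float = 0.0) -> int:
    """Encode a conversion into the 6-bit (0-63) SKAdNetwork value space."""
    if event == "trial_signup":
        return 8                            # 0-15: trials / signups
    if event == "purchase":
        if revenue_usd < 10:
            return 20                       # 16-31: small purchases
        if revenue_usd < 50:
            return 40                       # 32-47: medium purchases
        return 56                           # 48-63: high-value purchases
    return 0                                # no meaningful conversion

def estimated_revenue(value: int) -> float:
    """Reverse mapping used when aggregating postbacks into revenue estimates."""
    brackets = {range(0, 16): 0.0, range(16, 32): 5.0,
                range(32, 48): 25.0, range(48, 64): 75.0}
    return next((rev for r, rev in brackets.items() if value in r), 0.0)
```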
Trade-offs accepted:
- No user-level attribution: Only campaign-level aggregates
- Delayed reporting: 1-3 days lag before optimization possible
- Coarse signals: 64 possible conversion values for all events
- Revenue: SKAdNetwork campaigns achieve 60-70% of IDFA campaign performance due to delayed optimization
Privacy Sandbox Attribution Reporting API (Chrome)
Chrome’s Attribution Reporting API offers two distinct privacy models: event-level reports that link individual clicks to conversions with heavy noise (only 3 bits of conversion data, delayed 2-30 days), and aggregate reports that provide detailed conversion statistics across many users protected by differential privacy. The browser mediates all attribution, storing click events locally and generating reports after random delays to prevent timing attacks.
Integration approach:
Reports arrive at a dedicated endpoint, flow through the same Kafka-Flink-CockroachDB pipeline as SKAdNetwork postbacks, and aggregate into unified campaign-level metrics. This allows treating iOS and Chrome privacy-preserving attribution as a single conceptual layer despite different underlying mechanisms.
Maturity considerations:
Privacy Sandbox is evolving through 2024/2025. Attribution Reporting API is in origin trials (pre-production testing), Topics API is already integrated for contextual interest signals (Part 1), and Protected Audience API (formerly FLEDGE) for on-device auctions remains on the roadmap. The architecture must accommodate API changes as specifications stabilize.
Operational impact:
| Attribution Method | Coverage | Latency | Granularity | Revenue Performance |
|---|---|---|---|---|
| Traditional (cookie/IDFA) | 40-60% (declining) | Real-time | User-level | 100% baseline |
| SKAdNetwork | iOS opt-out users | 24-72 hours | Campaign-level | 60-70% of baseline |
| Privacy Sandbox | Chrome users | 2-30 days | Event-level (noised) or aggregate | 50-80% of baseline (evolving) |
| Contextual-only | All users | Real-time | Request-level | 50-70% of baseline |
Our approach:
- Layer attribution methods: traditional where available, privacy-preserving fallbacks
- Accept delayed optimization for privacy-compliant inventory
- Focus optimization on high-signal traffic (logged-in users, first-party data)
Immutable Financial Audit Log: Compliance Architecture
The Compliance Gap
The CockroachDB operational ledger is mutable by design: it is optimized for operational efficiency, but that mutability violates financial compliance requirements:
- Budget corrections: UPDATE operations modify balances retroactively
- Schema evolution: ALTER TABLE changes data structure
- Data cleanup: DELETE removes old transaction records
- Admin access: DBAs can modify or delete historical financial data
Regulatory violations:
- SOX (Sarbanes-Oxley): Requires immutable audit trail for financial reporting accuracy
- Tax regulations: 7-year retention of unmodifiable transaction records (IRS Circular 230, EU tax directives)
- Advertiser disputes: Need cryptographically verifiable billing history for dispute resolution
- Payment processor compliance: Visa/Mastercard mandates immutable transaction logs
Solution: Dual-Ledger Architecture
Separate operational concerns (performance) from compliance concerns (immutability) using distinct systems:
Operational Ledger (CockroachDB):
- Purpose: Real-time transactional system for budget checks and billing writes
- Mutability: YES (optimized for corrections, cleanup, operational flexibility)
- Query patterns: Current balance, recent transactions, hot campaign data
- Retention: 90 days (then archived to cold storage for cost optimization)
- Performance: 3ms budget deduction writes, 10ms transactional reads
Immutable Audit Log (Kafka → ClickHouse):
- Purpose: Permanent compliance record, non-repudiable financial history
- Mutability: NO (append-only storage with cryptographic hash chaining)
- Query patterns: Historical spend analysis, dispute investigation, tax reporting, audit queries
- Retention: 7 years (minimum tax compliance requirement)
- Performance: Asynchronous ingestion (<5s lag), no impact on operational latency
graph TB
subgraph OPERATIONAL["Operational Systems (Real-Time)"]
BUDGET[Budget Service
3ms latency]
BILLING[Billing Service
Charges & Refunds]
CRDB[(CockroachDB
Operational Ledger
Mutable
90-day retention)]
end
subgraph PIPELINE["Event Pipeline"]
KAFKA[Kafka Topic
financial-events
30-day retention
3x replication]
end
subgraph AUDIT["Immutable Audit Log"]
CH_KAFKA[ClickHouse
Kafka Engine Table]
CH_MV[Materialized View
Transform JSON]
CH_STORAGE[(ClickHouse
MergeTree Storage
Immutable
7-year retention
Hash chaining)]
end
subgraph QUERY["Query Interfaces"]
RECON[Daily Reconciliation Job
Automated 2AM UTC]
METABASE[Metabase Dashboard
Finance Team]
SQL[SQL Client
External Auditors]
EXPORT[Parquet Export
Quarterly Audits]
end
BUDGET -->|Async publish
non-blocking| KAFKA
BILLING -->|Async publish
non-blocking| KAFKA
BUDGET -->|Sync write
3ms| CRDB
BILLING -->|Sync write
5ms| CRDB
KAFKA -->|Real-time consume
5s lag| CH_KAFKA
CH_KAFKA --> CH_MV
CH_MV --> CH_STORAGE
RECON -.->|Query operational| CRDB
RECON -.->|Query audit| CH_STORAGE
METABASE -.->|Ad-hoc queries| CH_STORAGE
SQL -.->|Read-only access| CH_STORAGE
EXPORT -.->|Quarterly extract| CH_STORAGE
style BUDGET fill:#e3f2fd
style BILLING fill:#e3f2fd
style CRDB fill:#fff3e0
style KAFKA fill:#f3e5f5
style CH_STORAGE fill:#e8f5e9
style RECON fill:#ffebee
Event Pipeline Architecture
Event Flow: Budget Service and Billing Service emit structured financial events (budget deductions, impression charges, refunds, allocations) to Kafka financial-events topic asynchronously. Each event contains event type, campaign/advertiser IDs, amount, timestamp, and correlation IDs for traceability.
Kafka Buffer: Topic configured with 30-day retention (safety buffer during ClickHouse downtime), partitioned by campaignId for ordering guarantees, 3× replication for durability. Capacity: 100K events/sec (10% of platform QPS generating financial events).
ClickHouse Ingestion: Kafka Engine table consumes events directly, Materialized View transforms JSON into columnar schema optimized for analytics. MergeTree storage provides append-only immutability with automatic ZSTD compression (65% reduction). Ingestion lag: <5 seconds from event generation to queryable.
Audit Query Patterns
ClickHouse OLAP optimization enables sub-second queries for compliance scenarios:
Campaign Spend History (Tax Reporting): Aggregate all budget deductions for specific campaign over annual period. Common during tax filing season when advertisers request detailed spending breakdowns by campaign, geography, and time period. ClickHouse columnar storage and partition pruning enable sub-500ms queries across billions of events when filtering by campaign and time-range.
Dispute Investigation (Billing Accuracy):
Trace complete event sequence for specific request ID when advertiser disputes charge. Requires chronological ordering of all events (budget deduction, impression charge, click attribution, refund if applicable) to reconstruct exact billing calculation. Bloom filter index on requestId enables <100ms single-request retrieval even across multi-year dataset.
Reconciliation Analysis (Data Integrity): Compare daily aggregate spend between operational ledger (CockroachDB) and audit log (ClickHouse) to detect discrepancies. Requires grouping by campaign with tolerance for rounding differences. ClickHouse materialized views pre-compute daily aggregates for instant reconciliation queries.
Compliance Audit Trail (SOX/Regulatory): External auditors query complete financial history for specific advertiser or time period. Requires filtering by advertiser ID, event type (budget allocations, deductions, refunds), and date range with multi-dimensional grouping. ClickHouse query performance remains sub-second for most audit scenarios due to partition pruning and columnar compression.
Query Access Control
Access Restriction Policy: Financial audit log is classified data with restricted access per Segregation of Duties (SOX compliance). Default access: NONE. Only designated roles below have explicit permissions:
Automated Systems:
- Daily Reconciliation (Airflow service account): Compares operational vs audit ledger aggregates, alerts on variance >$0.01 or >0.001%
- Quarterly Export (scheduled job): Generates Parquet files with cryptographic hash verification for compliance audits
Finance Team: Read-only Metabase access (SSO auth, 30s timeout, 100K row limit). Authorized queries: campaign spend trends, refund analysis, advertiser billing summaries, budget utilization reports. Handles all billing dispute investigations requiring financial data access.
External Auditors: Temporary credentials (expire post-audit) with pre-approved query templates for: annual tax reporting, SOX compliance verification, advertiser reconciliation. Complex queries scheduled off-peak. All auditor activity logged separately for compliance record.
Break-Glass Access: Emergency investigation (data corruption, critical billing bug) requires VP Finance + VP Engineering approval, limited to 1-hour window, full session recording, mandatory post-incident compliance review.
ClickHouse Storage Design
MergeTree Configuration: Ordering key `(campaignId, timestamp)` optimizes campaign history queries. Monthly partitioning by `toYYYYMM(timestamp)` enables efficient pruning for tax/annual reports. ZSTD compression achieves 65% reduction (200GB/day → 70GB/day). Bloom filter index on `requestId` enables <100ms dispute lookups.
Immutability Enforcement: MergeTree prohibits UPDATE/DELETE operations by design. Administrative changes require explicit ALTER TABLE DROP PARTITION (logged separately). Each row includes SHA-256 previousHash creating tamper-evident chain - modification breaks hash sequence, detected during quarterly verification.
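A minimal sketch of the hash-chaining idea, assuming each row stores the previous row's hash (`previousHash`) plus its own hash (`rowHash` is an illustrative extra field); verification walks the rows in order and fails on any break:

```python
import hashlib
import json

def chain_hash(previous_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical encoding of this event."""
    payload = previous_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def verify_chain(rows: list[dict]) -> bool:
    """Quarterly verification: any modified, deleted, or reordered row breaks the chain."""
    previous = "0" * 64                       # genesis value for the first row
    for row in rows:
        event = {k: v for k, v in row.items() if k not in ("previousHash", "rowHash")}
        if row["previousHash"] != previous or chain_hash(previous, event) != row["rowHash"]:
            return False
        previous = row["rowHash"]
    return True
```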
Performance & Cost: Asynchronous write path (1-2ms Kafka publish, <5s ingestion lag) has zero operational latency impact. Query performance: <500ms simple aggregations, 1-3s complex analytics, <100ms dispute lookups. Storage: 180TB for 7-year retention (70GB/day × 2,555 days), approximately 15-20% of database infrastructure cost. Sub-second queries over billions of rows via columnar OLAP optimization.
Daily Reconciliation Process
Automated verification ensuring operational and audit ledgers remain synchronized. This process validates data integrity and detects system issues before they compound into billing disputes.
Reconciliation Job (Airflow DAG, scheduled 2:00 AM UTC daily):
Step 1: Extract Daily Aggregates from Both Systems
Query operational ledger (CockroachDB) and audit log (ClickHouse) for previous 24 hours, aggregating spend per campaign. Operational ledger contains real-time mutable data (90-day retention), while audit log contains immutable append-only events (7-year retention). Aggregation groups by campaign ID, summing budget deductions and impression charges while excluding refunds (handled separately).
Step 2: Compare Aggregates with Tolerance
Per-campaign validation accepts minor differences due to rounding and microsecond-level timing variations. Match tolerance is set at 1 cent OR 0.001% of the campaign total (whichever is greater). For example, a campaign with $10,000 of spend allows up to 10 cents of variance, while a small campaign with $5 of spend allows 1 cent. This tolerance accounts for floating-point rounding in budget calculations and clock skew between systems.
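A minimal sketch of this tolerance check (function and variable names are illustrative):

```python
def within_tolerance(operational_total: float, audit_total: float) -> bool:
    """Match if the delta is under $0.01 OR 0.001% of the campaign total, whichever is greater."""
    campaign_total = max(operational_total, audit_total)
    tolerance = max(0.01, 0.00001 * campaign_total)
    return abs(operational_total - audit_total) <= tolerance

# $10,000 campaign: tolerance = max($0.01, $0.10) = $0.10
# $5 campaign:      tolerance = max($0.01, $0.00005) = $0.01
```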
Step 3: Alert on Significant Discrepancies
P1 PagerDuty alert triggered when campaign variance exceeds threshold. Alert includes: affected campaign IDs, operational vs audit totals, percentage variance, and trend analysis (has this campaign had previous mismatches?). Dashboard visualization shows aggregate delta across all campaigns, enabling quick identification of systemic issues (e.g., Kafka consumer lag affecting all campaigns vs isolated campaign-specific bug).
Step 4: Forensic Investigation
Drill-down analysis retrieves complete event sequence for mismatched campaign from both systems. Event correlation matches operational ledger entries with audit log events by request ID to identify missing events (operational wrote but Kafka publish failed), duplicate events (retry caused double-write), or timing mismatches (event arrived after reconciliation window). Most common root causes:
- Kafka lag (85% of discrepancies): Consumer backlog delays event ingestion >24 hours, resolves automatically when ClickHouse catches up
- Schema mismatch (10%): Field rename in event schema without updating ClickHouse parser, requires parser fix and backfill
- Event emission bug (5%): Edge case where service fails to publish event, requires code fix and manual backfill with audit justification
Step 5: Automated Resolution Tracking
Reconciliation job stores results in dedicated tracking table: campaign ID, discrepancy amount, detection timestamp, resolution status. Daily report summarizes: total campaigns reconciled, mismatch count, average variance, unresolved discrepancy age. Historical trend analysis detects degrading data quality (increasing mismatch rate signals systemic problem requiring investigation).
Historical Success Rate: 99.9%+ of campaigns match daily (typically 0-3 discrepancies out of 10,000+ active campaigns). Most discrepancies resolve automatically within 24-48 hours as delayed Kafka events arrive. Only 1-2 cases per month require manual intervention (code bug fixes, schema corrections, or manual backfill with approval workflow).
Auction Mechanism Design
First-Price Auctions: Industry Standard for RTB
Since 2019, the programmatic advertising industry has standardized on first-price auctions for Real-Time Bidding (RTB) and display advertising. In a first-price auction, the winner pays their bid - not the second-highest bid.
Why First-Price Became Standard:
The industry shifted from second-price to first-price auctions to address transparency concerns and bid landscape visibility. Key drivers:
- Header bidding transparency: Publishers could see all bids, making second-price manipulation visible
- Simpler economics: “Winner pays bid” is easier to explain than second-price mechanisms
- DSP preference: Major demand-side platforms (Google DV360, The Trade Desk) prefer first-price with bid shading
- Revenue impact: First-price with bid shading generates 5-15% higher revenue in practice (theoretical revenue neutrality assumes perfect shading, but DSPs shade conservatively)
Auction Setup:
- \(N\) advertisers submit bids \(b_1, b_2, \ldots, b_N\)
- Each ad has predicted CTR (Click-Through Rate): \(\text{CTR}_1, \text{CTR}_2, \ldots, \text{CTR}_N\) - the probability a user clicks the ad when shown
- Single ad slot to allocate
Effective Bid (eCPM - effective Cost Per Mille):
Advertisers use different pricing models - some pay per impression (CPM), others per click (CPC), others per conversion (CPA). To compare apples-to-apples, we convert all bids to eCPM: expected revenue per 1000 impressions.
The conversion formulas are as follows:
$$ \begin{array}{ll} \text{CPM bid:} & eCPM = CPM (direct) \\ \text{CPC bid:} & eCPM = CPC \times CTR \times 1000 \\ \text{CPA bid:} & eCPM = CPA \times conversion\_rate \times CTR \times 1000 \end{array} $$
This normalizes bids across pricing models: eCPM represents expected revenue per 1000 impressions, accounting for how likely users are to click.
Why this matters: A higher CPC bid with low CTR (5%) may earn less than a lower CPC bid with high CTR (15%). The platform maximizes revenue by selecting the highest eCPM, not highest raw bid.
Winner Selection:
$$w = \arg\max_{i \in [1,N]} \text{eCPM}_i$$
Price Determination (First-Price):
The winner pays their bid (not the second-highest bid):
$$p_w = b_w$$
This is fundamentally different from second-price auctions where winners paid just enough to beat the runner-up.
Example:
| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | B_a | 0.10 | B_a × 0.10 × 1000 = 100 × B_a | 2 |
| B | B_b | 0.15 | B_b × 0.15 × 1000 = 150 × B_b | 1 |
| C | B_c | 0.05 | B_c × 0.05 × 1000 = 50 × B_c | 3 |
Winner: Advertiser B (highest eCPM multiplier: 150× vs 100× vs 50×)
Price paid by B in first-price auction: $$p_B = b_B = B_b$$
Advertiser B pays their full bid amount.
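A minimal sketch of eCPM normalization and first-price winner selection; the dataclass fields and the concrete numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Bid:
    advertiser: str
    model: str            # "CPM", "CPC", or "CPA"
    amount: float         # bid in the advertiser's own pricing model
    ctr: float            # predicted click-through rate
    cvr: float = 0.0      # predicted conversion rate (CPA only)

def ecpm(bid: Bid) -> float:
    """Expected revenue per 1000 impressions."""
    if bid.model == "CPM":
        return bid.amount
    if bid.model == "CPC":
        return bid.amount * bid.ctr * 1000
    return bid.amount * bid.cvr * bid.ctr * 1000      # CPA

def run_first_price_auction(bids: list[Bid]) -> tuple[Bid, float]:
    """Rank by eCPM; the winner pays their own bid (first-price)."""
    winner = max(bids, key=ecpm)
    return winner, winner.amount

bids = [
    Bid("A", "CPC", amount=2.00, ctr=0.10),    # eCPM = 200
    Bid("B", "CPC", amount=1.80, ctr=0.15),    # eCPM = 270  <- wins despite the lower bid
    Bid("C", "CPM", amount=2.50, ctr=0.05),    # eCPM = 2.50
]
winner, price = run_first_price_auction(bids)
print(winner.advertiser, price)                # -> B 1.8
```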
Comparison: Second-Price vs First-Price
In a second-price auction (historical approach), Advertiser B would have paid just enough to beat A’s eCPM (by a small increment). In first-price, they pay their full bid.
The Bid Shading Response:
First-price auctions incentivize bid shading - DSPs use machine learning to predict the minimum bid needed to win and bid slightly above that. This recovers much of the economic efficiency of second-price auctions while maintaining transparency. (See “Bid Shading in First-Price Auctions” section below for details.)
Quality Score and Ad Rank
Ads are ranked by eCPM = bid × CTR, but in practice ad quality also matters for user experience.
The Quality Problem:
Consider two advertisers:
- Advertiser X: Higher bid, fast landing page, relevant ad copy → users happy
- Advertiser Y: Slightly higher bid, slow landing page, misleading ad → users complain
Should Y win just because they bid more? This degrades user experience.
Google’s Solution: Quality Score
Since ~2005, Google Ads has incorporated Quality Score into auction ranking:
$$\text{Ad Rank} = \text{Bid} \times \text{Quality Score}$$
Quality Score Components (1-10 scale):
Google evaluates three components, though exact weights are not publicly disclosed:
- Expected CTR (highest impact): Historical click-through rate for this keyword/ad combination
- Landing Page Experience (highest impact): Page load speed, mobile-friendliness, content relevance, security (HTTPS)
- Ad Relevance (moderate impact): How well ad text matches search query intent
Note: Research shows improving CTR or Landing Page Experience has roughly twice the impact of improving Ad Relevance. Focus optimization efforts on the top two components.
Modified Auction Ranking:
Instead of ranking by eCPM alone, rank by Ad Rank:
$$\text{Ad Rank}_i = b_i \times \text{CTR}_i \times \text{QualityScore}_i \times 1000$$
Example: Quality Beats Price
| Advertiser | Bid | CTR | Quality Score | Ad Rank | Position | Winner? |
|---|---|---|---|---|---|---|
| X | B_low | 0.15 | 10/10 (excellent) | Quality-adjusted eCPM_high | 1 | Yes |
| Y | B_high (40% higher) | 0.15 | 6/10 (poor landing page) | Quality-adjusted eCPM_lower | 2 | No |
Advertiser X wins despite lower raw bid because of higher quality (10/10 vs 6/10).
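A minimal sketch of quality-adjusted ranking, using the cache-miss fallback (default score 7/10) described under Performance Considerations below; names and numbers are illustrative:

```python
DEFAULT_QUALITY = 7.0    # fallback score on cache miss (see Performance Considerations)

def ad_rank(bid: float, ctr: float, quality_score: float | None) -> float:
    """Ad Rank = bid × CTR × QualityScore × 1000."""
    q = quality_score if quality_score is not None else DEFAULT_QUALITY
    return bid * ctr * q * 1000

# Quality beats price: X's 10/10 outranks Y's 40%-higher bid at 6/10
print(ad_rank(1.00, 0.15, 10.0) > ad_rank(1.40, 0.15, 6.0))   # -> True
```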
System Design Implications:
1. Data Pipeline Requirements:
- Historical CTR tracking: Store click/impression data per advertiser-keyword pair
- Landing page metrics: Collect page load times, bounce rates, mobile scores
- Real-time signals: HTTPS status, page availability checks
- Storage: Time-series database for CTR history, key-value store for current quality scores
2. Computation Architecture:
Quality Score is computed offline by ML model, cached, and served at auction time:
graph
subgraph "Offline Pipeline - Runs Daily/Weekly"
direction BT
CACHE_WRITE[Cache Update
Redis/Memcached
Atomic Swap]
PREDICT[Quality Score Prediction
All Advertiser-Keyword Pairs
Millions of Combinations]
TRAIN[ML Model Training
XGBoost/Neural Net
Hours of Batch Processing]
HD[(Historical Data Store
Time-Series DB
Billions of Auction Events)]
HD --> TRAIN
TRAIN --> PREDICT
PREDICT --> CACHE_WRITE
end
subgraph "Online Pipeline - Real-Time <100ms"
direction TB
AUCTION[Auction Request
User Query + Bids
N Advertisers]
CACHE_LOOKUP{Cache Lookup
Redis Read
< 1ms}
CACHE_HIT[Quality Score Retrieved
99%+ Hit Rate]
CACHE_MISS[Cache Miss
Use Default Score = 7/10
< 1% Rate]
COMPUTE[Compute Ad Rank
Bid × CTR × QualityScore
< 1ms]
FIRST_PRICE[First-Price Auction
Rank & Select Winner
< 5ms]
RESULT[Auction Result
Winner + Price
Click/Impression Event]
AUCTION --> CACHE_LOOKUP
CACHE_LOOKUP -->|Hit| CACHE_HIT
CACHE_LOOKUP -->|Miss| CACHE_MISS
CACHE_HIT --> COMPUTE
CACHE_MISS --> COMPUTE
COMPUTE --> FIRST_PRICE
FIRST_PRICE --> RESULT
end
style HD fill:#e1f5ff
style TRAIN fill:#e1f5ff
style PREDICT fill:#e1f5ff
style CACHE_WRITE fill:#e1f5ff
style AUCTION fill:#fff4e1
style CACHE_LOOKUP fill:#fffacd
style CACHE_HIT fill:#d4edda
style CACHE_MISS fill:#f8d7da
style COMPUTE fill:#fff4e1
style FIRST_PRICE fill:#fff4e1
style RESULT fill:#fff4e1
3. Performance Considerations:
- Latency impact: Quality score lookup adds ~0.5-1ms to auction (cache hit)
- Cache warming: Pre-compute scores for active advertisers (99%+ hit rate)
- Fallback: Default quality score (e.g., 7/10) if cache miss
- Update frequency: Quality scores change slowly (update daily, not per-auction)
4. ML Model Deployment:
- Training data: Billions of historical auctions (click events, landing page metrics)
- Features: Ad-keyword relevance (NLP embeddings), historical CTR, page speed metrics
- Model serving: Offline batch prediction, not real-time inference (too slow for auction latency)
- A/B testing: Shadow scoring to test model changes before production
Relationship to First-Price Auctions:
Quality-adjusted first-price auctions work the same way:
- Rank by: Bid × CTR × Quality Score (Ad Rank)
- Pay: Your bid (first-price)
The quality score affects ranking (who wins) but not the fundamental pricing (winner pays bid). This encourages advertisers to improve landing pages, ad relevance, and user experience to achieve better ad positions at lower bids.
Computational Complexity
First-Price Auction Complexity:
- Sort advertisers by eCPM: \(O(N \log N)\)
- Select winner and compute price: \(O(1)\) (winner pays bid - no second-price calculation needed)
- Total: \(O(N \log N)\)
For \(N = 50\) DSPs:
- First-price: ~282 operations (sort + select)
Latency Impact:
At 5ms budget for auction logic:
- First-price auction: easily achievable
- Sorting 50 DSPs by eCPM: <1ms with optimized comparisons
- Winner selection: <0.1ms (just pick highest eCPM)
Implementation Note: First-price auctions are computationally identical to second-price auctions (both O(N log N)). The difference is purely in pricing: first-price returns the winner’s bid, while second-price calculates the minimum bid needed to beat the runner-up.
Reserve Prices and Floor Prices
The Problem:
Without a reserve price (minimum bid), your auction might sell ad slots for very low prices when competition is low. Consider a scenario where only one advertiser bids far below market value for a premium slot - you’d rather show a house ad (promoting your own content) than sell it that cheaply.
What is a Reserve Price?
A reserve price \(r\) is the minimum eCPM required to participate in the auction. If no bids exceed \(r\), the impression is not sold (or filled with a house ad).
The Revenue Trade-off:
Setting the reserve price is a balancing act:
| Reserve Price | What Happens | Example |
|---|---|---|
| Too low (0.25× market rate) | Sell almost all impressions, but accept low-value bids | 95% fill rate × low avg eCPM = suboptimal revenue |
| Optimal (market rate) | Balance between fill rate and price | 70% fill rate × good avg eCPM = optimal revenue |
| Too high (5× market rate) | Only premium bids qualify, but most impressions go unsold | 20% fill rate × high avg eCPM = suboptimal revenue |
Mathematical Formulation:
Expected revenue per impression with reserve price \(r\):
$$\text{Revenue}(r) = r \times P(\text{bid} \geq r)$$
where \(P(\text{bid} \geq r)\) is the probability that at least one bid exceeds the reserve.
Optimal Reserve Price:
Find \(r^*\) that maximizes expected revenue. If bids follow a known distribution with CDF \(F(v)\):
$$r^* = \arg\max_r \left[ r \times (1 - F(r)) \right]$$
Interpretation:
- \(r\) = revenue when impression sells
- \((1 - F(r))\) = probability impression sells (fraction of bids above \(r\))
Concrete Example:
Suppose historical bids range uniformly from zero to maximum bid B_max. What’s the optimal reserve?
For a uniform distribution on \([0, B_{max}]\): \(P(\text{bid} \geq r) = 1 - \frac{r}{B_{max}}\)
Expected revenue: $$\text{Revenue}(r) = r \times \left(1 - \frac{r}{B_{max}}\right) = r - \frac{r^2}{B_{max}}$$
Maximize by setting the derivative to zero: $$\frac{d}{dr}\left(r - \frac{r^2}{B_{max}}\right) = 1 - \frac{2r}{B_{max}} = 0$$
$$r^* = \frac{B_{max}}{2}$$
Result: Optimal reserve is half the maximum bid value (when bids are uniformly distributed).
Practical Approach:
Rather than assuming a distribution, use empirical data:
- Analyze historical bid distribution from past auctions
- Simulate different reserve prices offline and estimate revenue impact
- Run A/B tests with small traffic percentages to validate optimal reserve
- Monitor fill rate vs. revenue trade-off continuously
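A minimal sketch of the empirical approach: estimate \(\text{Revenue}(r) = r \times P(\text{bid} \geq r)\) from historical top bids and pick the best candidate reserve. The bid data and candidate grid are illustrative:

```python
def expected_revenue(top_bids_ecpm: list[float], reserve: float) -> float:
    """Revenue(r) = r × P(bid ≥ r), estimated from the empirical bid distribution."""
    p_sell = sum(b >= reserve for b in top_bids_ecpm) / len(top_bids_ecpm)
    return reserve * p_sell

def best_reserve(top_bids_ecpm: list[float], candidates: list[float]) -> float:
    return max(candidates, key=lambda r: expected_revenue(top_bids_ecpm, r))

historical_top_bids = [0.8, 1.2, 2.5, 3.1, 0.4, 5.0, 2.2]     # eCPM of past top bids
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
print(best_reserve(historical_top_bids, candidates))          # -> 2.0 on this data
```

Offline simulation like this only narrows the candidate range; the A/B tests on a small traffic slice mentioned above validate the choice before rollout.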
Multi-Dimensional Reserve Prices:
In practice, reserve prices are often segmented by:
- Geo: Higher reserves for premium markets (US/UK) vs. developing markets
- Device: Mobile vs. desktop vs. CTV (Connected TV)
- User segment: High-value users (purchase intent) vs. casual browsers
- Time of day: Peak hours vs. off-peak
- Inventory quality: Above-the-fold vs. below-the-fold
Implementation Note:
Reserve prices work identically in first-price and second-price auctions - they filter out bids below the threshold before ranking. The difference is only in what the winner pays (their bid vs. second-highest bid).
Bid Shading in First-Price Auctions
With first-price auctions, DSPs face a strategic problem: bidding true value guarantees zero profit (you pay exactly what the impression is worth to you). This creates the bid shading optimization problem.
The Bid Shading Problem:
In first-price auctions:
- Bid too high: You win but overpay (negative ROI)
- Bid too low: You lose to competitors (missed opportunity)
- Optimal strategy: Bid just above the second-highest bidder (but you don’t know their bid!)
How Bid Shading Works:
DSPs use machine learning to predict the competitive landscape and bid strategically:
- Collect historical data: Track wins, losses, and winning prices across millions of auctions
- Build bid landscape model: For each impression context (user, publisher, time), predict:
- Probability of winning at price \(p\): \(P(\text{win} | \text{bid} = p)\)
- Distribution of competitor bids
- Optimize bid: Choose bid \(b\) that maximizes expected profit:
$$b^* = \arg\max_b \left[ (v - b) \times P(\text{win} | b) \right]$$
where \(v\) is the true value of the impression to the advertiser.
Example:
Suppose an advertiser values an impression at V_imp (based on predicted conversion rate). The bid landscape model predicts:
- Bid V_imp: 90% win rate (no profit - paying true value)
- Bid 0.80 × V_imp: 75% win rate (expected profit: 0.20 × V_imp × 75% = 0.15 × V_imp)
- Bid 0.70 × V_imp: 60% win rate (expected profit: 0.30 × V_imp × 60% = 0.18 × V_imp)
- Bid 0.60 × V_imp: 40% win rate (expected profit: 0.40 × V_imp × 40% = 0.16 × V_imp)
Optimal bid: 0.70 × V_imp (maximizes expected profit at 0.18 × V_imp per auction)
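A minimal sketch of this optimization from the DSP's side; the win-rate curve is a stand-in for a learned bid-landscape model:

```python
def optimal_bid(value: float, win_rate: dict[float, float]) -> float:
    """Pick the bid fraction that maximizes (value - bid) × P(win | bid)."""
    def expected_profit(frac: float) -> float:
        bid = frac * value
        return (value - bid) * win_rate[frac]
    return max(win_rate, key=expected_profit) * value

# The example above: shading to 70% of value maximizes expected profit
win_rate = {1.0: 0.90, 0.8: 0.75, 0.7: 0.60, 0.6: 0.40}
print(optimal_bid(10.0, win_rate))    # -> 7.0
```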
Why First-Price + Bid Shading ≈ Second-Price:
Bid shading recovers much of the economic efficiency of second-price auctions:
- Second-price: Winner pays second-highest bid
- First-price + shading: Winner bids slightly above predicted second-price
The small difference represents the DSP’s uncertainty about the competitive landscape. As bid landscape models improve, first-price with shading converges toward second-price revenue.
System Design Implications:
From the SSP (supply-side platform) perspective:
- Expect strategic bidding: DSPs will NOT bid true value - this is intentional and economically efficient
- Bid landscape opacity: Don’t share winning bid distributions (preserves auction integrity)
- Revenue impact: First-price with bid shading can generate approximately 5-15% higher revenue than second-price in practice, though exact figures vary by market conditions and DSP sophistication. The revenue lift comes from imperfect bid shading - DSPs tend to shade conservatively to avoid losing auctions, resulting in slightly higher clearing prices.
Implementation Note: SSPs don’t implement bid shading - that’s the DSP’s responsibility. The SSP simply runs a first-price auction (rank by eCPM, winner pays bid). The complexity of bid optimization happens on the demand side.
Historical Context: Second-Price Auctions
Before 2019, the programmatic advertising industry primarily used second-price auctions (specifically, Generalized Second-Price or GSP auctions). Understanding this history helps explain design decisions in legacy systems and why the industry shifted to first-price.
Why Second-Price Was Popular (2000s-2018):
- Theoretical elegance: Encouraged truthful bidding (in theory)
- Simpler for advertisers: “Bid your true value” was easier to explain than bid shading
- Google’s influence: Google Search Ads used GSP successfully, setting industry precedent
- Established ecosystem: Bidding algorithms optimized for second-price dynamics
How Second-Price (GSP) Works:
In a second-price auction, the winner pays just enough to beat the second-highest bidder:
$$p_w = \frac{\text{eCPM}_{2nd}}{\text{CTR}_w \times 1000} + \epsilon$$
where \(\epsilon\) is a small increment.
Example:
| Advertiser | Bid | CTR | eCPM | Rank |
|---|---|---|---|---|
| A | B_a | 0.10 | 100 × B_a | 2 |
| B | B_b | 0.15 | 150 × B_b | 1 |
| C | B_c | 0.05 | 50 × B_c | 3 |
Winner: Advertiser B (highest eCPM multiplier: 150×)
Price paid by B in second-price: $$p_B = \frac{100 \times B_a}{0.15 \times 1000} = 0.67 \times B_a$$
Advertiser B only pays enough to beat A’s eCPM (not their full bid B_b).
Why the Industry Shifted to First-Price (2017-2019):
Several factors drove the migration:
- Header bidding transparency: Publishers could see all bids simultaneously, making second-price “bid reduction” visible and contentious
- Price floor manipulation: SSPs could manipulate second-price auctions by setting floors strategically
- Complexity: Second-price pricing logic was opaque (“Why did I pay less than my bid?”)
- DSP preference: Major DSPs (Google DV360, The Trade Desk) preferred first-price with their own bid shading
- Revenue impact: First-price with bid shading generates 5-15% higher revenue in practice (DSPs shade conservatively)
Timeline:
- 2017: AppNexus (now Xandr) pioneered first-price for programmatic
- 2018: Google AdX announced transition to first-price
- 2019: Industry-wide shift complete - first-price became standard for RTB
GSP Still Used for Sponsored Search:
Google Search Ads, Microsoft Ads, and Amazon Sponsored Products still use GSP (second-price) because:
- Established advertiser ecosystems
- Different transparency requirements (no header bidding)
- Decades of advertiser education and tooling
- Network effects (switching cost too high)
Key Difference: Search vs. Display:
| Auction Type | Used For | Pricing |
|---|---|---|
| GSP (Second-Price) | Sponsored search (Google Search Ads) | Winner pays second-highest + small increment |
| First-Price | Programmatic display/video/CTV (RTB) | Winner pays their bid |
This blog focuses on first-price auctions because they are the modern standard for Real-Time Bidding (RTB) and programmatic display advertising - the architecture described in this document.
Budget Pacing: Distributed Spend Control
Architectural Driver: Financial Accuracy - Pre-allocation pattern with Redis atomic counters ensures budget consistency across regions. Max over-delivery bounded to 1% of daily budget (acceptable legal risk) while avoiding centralized bottleneck.
Problem: Advertisers set daily budgets (e.g., daily limit). In a distributed system serving 1M QPS, how do we prevent over-delivery without centralizing every spend decision?
Challenge:
Centralized approach (single database tracks spend):
- Latency: ~10ms per spend check
- Throughput bottleneck: ~100K QPS max
- Single point of failure
Solution: Pre-Allocation with Periodic Reconciliation
graph TD
ADV[Advertiser X
Daily Budget: B_daily]
ADV --> BUDGET[Atomic Pacing Service]
BUDGET --> REDIS[(Redis
Atomic Counters)]
BUDGET --> CRDB[(CockroachDB
Billing Ledger
HLC Timestamps)]
BUDGET -->|Allocate amount_1| AS1[Ad Server 1]
BUDGET -->|Allocate amount_2| AS2[Ad Server 2]
BUDGET -->|Allocate amount_3| AS3[Ad Server 3]
AS1 -->|Spent: S1
Return: unused_1| BUDGET
AS2 -->|Spent: S2
Return: unused_2| BUDGET
AS3 -->|Spent: S3
Return: unused_3| BUDGET
BUDGET -->|Periodic reconciliation
HLC timestamped| CRDB
TIMEOUT[Timeout Monitor
5min intervals] -.->|Release stale
allocations| REDIS
REDIS -->|Budget < 10%| THROTTLE[Dynamic Throttle]
THROTTLE -.->|Reduce allocation
size dynamically| BUDGET
classDef server fill:#e3f2fd,stroke:#1976d2
classDef budget fill:#fff3e0,stroke:#f57c00
classDef advertiser fill:#e8f5e9,stroke:#4caf50
class AS1,AS2,AS3 server
class BUDGET,REDIS,CRDB,TIMEOUT,THROTTLE budget
class ADV advertiser
How it works:
- Atomic Pacing Service pre-allocates budget chunks to Ad Servers (variable allocation amounts)
- Ad Servers spend from local allocation using Redis atomic counters (no coordination needed)
- Periodic reconciliation (every 30 seconds): Ad Servers return unused budget to Atomic Pacing Service
- CockroachDB records all spend events with HLC (Hybrid Logical Clock) timestamps for globally ordered audit trail
- Timeout Monitor releases stale allocations after 5 minutes (handles server crashes)
- Dynamic Throttle reduces allocation size when budget < 10% remaining (prevents over-delivery)
Budget Allocation Operations:
Allocation request (Ad Server requests budget chunk):
- Operation: Atomically decrement global budget counter (deduct allocation amount)
- Returns: Remaining budget or error if insufficient
- Frequency: Every 30-60 seconds per Ad Server
Reconciliation (Ad Server returns unused budget):
- Operation: Atomically increment global budget counter (return unused amount)
- Returns: Updated budget total
- Frequency: When allocation period expires or Ad Server scales down
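A minimal sketch of these two operations, assuming redis-py and a per-campaign integer counter held in cents (key name and units are illustrative):

```python
import redis

r = redis.Redis()

def allocate_chunk(campaign_id: str, chunk_cents: int) -> bool:
    """Ad Server requests a budget chunk; roll back if the counter would go negative."""
    key = f"budget_remaining:{campaign_id}"
    remaining = r.decrby(key, chunk_cents)      # atomic deduction
    if remaining < 0:
        r.incrby(key, chunk_cents)              # insufficient budget: undo the deduction
        return False
    return True

def return_unused(campaign_id: str, unused_cents: int) -> None:
    """Reconciliation: return whatever the Ad Server did not spend."""
    r.incrby(f"budget_remaining:{campaign_id}", unused_cents)
```

Integer cents avoid floating-point drift in the counter, and the brief negative dip before the rollback is tolerable because allocation requests happen only every 30-60 seconds per server.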
Key Properties:
- Atomic operations: `DECRBY` is atomic, prevents race conditions
- No coordination latency: Each Ad Server decides locally
- Bounded over-delivery: Maximum over-delivery = (# servers) × (allocation size)
- Self-healing: Timeout monitor recovers from server failures
Mathematical Analysis:
Over-Delivery Bound:
Maximum over-delivery: $$\text{OverDelivery}_{max} = S \times A$$
where \(S\) = number of servers, \(A\) = allocation chunk size.
Example: 100 servers with an allocation of A each → max over-delivery of 100 × A; with a daily budget of 1000 × A, that is 10% of the budget.
Mitigation: Dynamic allocation sizing.
When the remaining budget \(B_r\) drops below 10% of the daily budget: $$A_{new} = \frac{B_r}{S \times 10}$$
Maximum over-delivery then becomes \(S \times A_{new} = B_r / 10\), and since \(B_r\) is at most 10% of the daily budget at that point, over-delivery is capped at roughly 1% of the budget.
The Critical Path: Synchronous Budget Check (5ms)
The Missing Piece: The pre-allocation strategy above handles periodic budget allocation (every 30-60s), but doesn’t explain the per-request budget check that must happen on EVERY ad request at 1M QPS. This synchronous check is the critical path component that ensures financial accuracy while meeting the 150ms SLO.
The Challenge at 1M QPS:
Naive approach (query CockroachDB on every request):
- Latency: 10-15ms per query (p99)
- Result: Violates 150ms SLO (adds 10-15ms to critical path)
- Throughput: Creates massive database contention
Solution: Bounded Micro-Ledger (BML) Architecture
The BML system provides three-tier budget enforcement that achieves both low latency (3-5ms) and financial accuracy (bounded overspend):
Three-Tier BML Architecture (Critical Financial Atomicity Mechanism):
Tier 1: Synchronous Budget Check (Redis Lua Script - 3ms)
- Component: Atomic Pacing Service executes Lua script in Redis
- Function: Check if `current_spend + cost ≤ budget_limit + INACCURACY_BOUND`
- Critical Property: The `INACCURACY_BOUND` (typically 0.5-1% of `budget_limit`) is the mathematical guarantee that ensures ≤1% billing accuracy
- Atomicity: Lua script runs single-threaded in Redis, preventing race conditions
- Latency: 3ms avg (5ms p99) - fits within critical path budget
Tier 2: Asynchronous Delta Propagation (Redis → Kafka)
- Component: Atomic Pacing Service publishes spend deltas (read from Redis) to a Kafka topic
- Function: Stream of spend events for audit trail and reconciliation
- Frequency: Every 5 seconds per campaign or on cumulative threshold
- Event format: `{campaign_id, spend_delta, timestamp, transaction_id}`
- Implementation: After the Lua script completes successfully, the Atomic Pacing Service emits the event to Kafka asynchronously (non-blocking, does not impact the 3ms budget)
Tier 3: Reconciliation Processor (Flink/Kafka Streams → CockroachDB)
- Component: Flink job consumes Kafka stream and batch-commits to CockroachDB
- Function: Maintain strong-consistency ledger as source of truth
- Batch window: 30-second aggregation window
- Strong consistency: CockroachDB ACID transactions with HLC timestamps
- Periodic sync: Every 60s, sync Redis counters from CockroachDB to correct drift
Why This Three-Tier Architecture is Required:
- Tier 1 alone: Fast but lacks audit trail and drift correction
- Tier 3 alone: Accurate but too slow (10-15ms) for 1M QPS critical path
- Combined: 3ms latency + mathematical bounded overspend + immutable audit trail
graph TB
subgraph "Synchronous Tier (3ms - Critical Path)"
REQ[Ad Request
1M QPS] --> AUCTION[Auction Selects Winner
Ad from Campaign X
Cost: C]
AUCTION --> BML_CHECK{BML: Atomic
Check & Deduct}
BML_CHECK -->|Budget OK| REDIS_LUA[Redis Lua Script
ATOMIC:
if spend+cost < limit+bound
then deduct
Latency: 3ms]
REDIS_LUA -->|SUCCESS| SERVE[Serve Ad
Revenue: C]
BML_CHECK -->|Budget EXHAUSTED| NEXT[Try Next Bidder
or House Ad]
end
subgraph "Asynchronous Tier (Reconciliation)"
REDIS_LUA -.->|Emit delta
every 5s| KAFKA[Kafka
Spend Events]
KAFKA -.-> FLINK[Flink
Aggregate]
FLINK -.->|Batch commit
every 30s| CRDB[(CockroachDB
Billing Ledger
Source of Truth)]
end
CRDB -.->|Periodic sync
every 60s| REDIS_LUA
classDef critical fill:#ffcccc,stroke:#cc0000,stroke-width:2px
classDef async fill:#ccffcc,stroke:#00cc00,stroke-dasharray: 5 5
class REQ,AUCTION,BML_CHECK,REDIS_LUA,SERVE critical
class KAFKA,FLINK,CRDB async
Bounded Micro-Ledger (BML) Components:
1. Synchronous Tier: Redis Atomic Counter (3ms Budget)
Purpose: Fast, atomic check-and-deduct for every ad request
Atomic Check-and-Deduct Algorithm:
The algorithm executes atomically within Redis (single-threaded, no concurrent modifications possible):
Inputs:
- `campaign_id`: Which campaign to check
- `cost`: Amount to spend for this ad impression (e.g., impression cost C)
- `inaccuracy_bound`: Safety buffer to prevent unbounded overspend (typically 0.5-1% of budget)
Algorithm Steps:
- Read current state from the Redis hash for this campaign:
  - `current_spend`: How much has already been spent today
  - `budget_limit`: Daily budget cap
- Calculate remaining budget: `remaining = budget_limit - current_spend`
- Atomic decision: Check if spend is allowed
  - CRITICAL CONDITION (key to ≤1% billing accuracy): `current_spend + cost ≤ budget_limit + inaccuracy_bound`
  - If TRUE (budget available):
    - Increment the spend counter by `cost` atomically
    - Return SUCCESS with the new remaining budget
  - If FALSE (budget exhausted):
    - Do NOT modify the spend counter
    - Return BUDGET_EXHAUSTED with the current remaining budget
Critical Design Property: The `inaccuracy_bound` parameter in the Lua script condition is the mathematical enforcement mechanism that guarantees ≤1% billing accuracy. By setting `inaccuracy_bound = 0.01 × budget_limit`, we ensure maximum overspend is bounded to 1% of the daily budget. This is the ONLY way to achieve bounded financial accuracy while maintaining 3ms latency at 1M QPS.
Why This is Atomic:
Redis executes the entire algorithm as a single atomic operation (Lua script runs single-threaded). Even if 1,000 requests arrive simultaneously, Redis processes them serially one-at-a-time, guaranteeing no race conditions.
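A minimal sketch of this check-and-deduct, assuming redis-py and a per-campaign Redis hash with `current_spend` and `budget_limit` fields; the key layout is illustrative, not the production schema:

```python
import redis

CHECK_AND_DEDUCT = """
local key   = KEYS[1]                    -- e.g. budget:campaign_12345
local cost  = tonumber(ARGV[1])          -- cost of this impression
local bound = tonumber(ARGV[2])          -- inaccuracy bound (e.g. 1% of budget)

local spend = tonumber(redis.call('HGET', key, 'current_spend') or '0')
local limit = tonumber(redis.call('HGET', key, 'budget_limit') or '0')

if spend + cost <= limit + bound then
    redis.call('HINCRBYFLOAT', key, 'current_spend', cost)   -- atomic deduct
    return 1                                                  -- SUCCESS
end
return 0                                                      -- BUDGET_EXHAUSTED
"""

r = redis.Redis()
check_and_deduct = r.register_script(CHECK_AND_DEDUCT)

def try_spend(campaign_id: str, cost: float, budget_limit: float) -> bool:
    bound = 0.01 * budget_limit                      # 1% inaccuracy bound
    return check_and_deduct(
        keys=[f"budget:campaign_{campaign_id}"],
        args=[cost, bound],
    ) == 1
```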
Key Properties:
- Atomic: Lua script executes atomically in Redis (single-threaded execution)
- Fast: 5ms p99 total latency (3ms script execution + 2ms network RTT)
- Bounded inaccuracy: The `inaccuracy_bound` ($5) prevents unbounded overspend
- High throughput: Redis handles 1M+ ops/sec per shard
2. Asynchronous Tier: Reconciliation to CockroachDB
Purpose: Periodic sync to source of truth for audit trail and accuracy
Reconciliation Process (Flink Stream Processing Job, runs every 30s):
Step 1: Aggregate Spending Deltas
- Flink consumes spend events from Kafka stream
- Groups events by `campaign_id`
- Example: Campaign 12345 spent $2.50 + $3.00 + $1.75 = $7.25 in this window
Step 2: Batch Commit to CockroachDB
- Open distributed transaction across CockroachDB cluster
- For each campaign with spending activity:
- Insert new spending record with HLC timestamp (for global ordering)
- If campaign record exists, increment cumulative spend counter
- If campaign record doesn’t exist, create new entry
- Commit transaction atomically across all shards
- CockroachDB ensures ACID guarantees and audit trail
Step 3: Sync Redis from Source of Truth (every 60s)
- Query CockroachDB for true cumulative spend per campaign
- Update Redis hash with authoritative spend values
- Detect drift: if Redis and CockroachDB differ by >$50, alert operations team
- This corrects any Redis cache inconsistencies (restarts, clock skew, missed events)
Why Two-Tier Works:
- Redis: Fast but eventually consistent (acceptable for bounded inaccuracy)
- CockroachDB: Slow but strongly consistent (source of truth for billing)
- Reconciliation: Bridges the gap, keeping Redis approximately correct while maintaining perfect audit trail
3. Integration with Request Flow
The budget check sits in the Auction Logic phase:
Before:
- Auction Logic (5ms): Sort by eCPM, select winner
After (with BML):
- Auction Logic (8ms avg, 10ms p99):
- Sort by eCPM, select winner: 3ms
- Budget check (BML): 3ms avg (5ms p99) ← NEW
- Overhead: 2ms
- If budget OK: serve ad
- If budget exhausted: try next bidder (repeat check)
Updated Request Flow Timing:
Complete request path latency breakdown:
| Component | Latency | Notes |
|---|---|---|
| Network + Gateway | 15ms | |
| User Profile | 10ms | |
| Integrity Check | 5ms | Fraud detection (BEFORE RTB) |
| Feature Store | 10ms | |
| Ad Selection | 15ms | |
| ML Inference | 40ms | (parallel execution) |
| RTB Auction | 100ms | (parallel, critical path) |
| Auction + Budget Check | 8ms | Budget enforcement |
| Response | 5ms | |
| Total | 143ms | (5-7ms buffer to 150ms SLO) |
Mathematical Proof: Bounded Overspend of $5 per Campaign
Theorem: Maximum overspend per campaign is bounded to the inaccuracy_bound value ($5).
Proof:
Define:
- \(B\) = Daily budget limit
- \(S(t)\) = Recorded spend at time \(t\)
- \(\Delta\) = Inaccuracy bound ($5)
- \(c_i\) = Cost of request \(i\)
The Lua script allows spend if: $$S(t) + c_i \leq B + \Delta$$
Worst case scenario: Multiple concurrent requests hit Redis simultaneously before the spend counter updates.
Maximum concurrent overshoot:
At most \(\Delta\) dollars can be spent beyond the limit because:
- Every accepted request satisfies \(S(t) + c_i \leq B + \Delta\) at the moment it is applied, so the counter never exceeds \(B + \Delta\)
- Once \(S(t) + c_i > B + \Delta\), the Lua script rejects the request
- Even if 1,000 requests arrive at the exact same nanosecond, Redis executes Lua scripts serially, so no two requests can act on the same stale counter value
Mathematical upper bound:
$$\text{Total Spend} \leq B + \Delta$$
$$\text{Overspend} = \max(0, \text{Total Spend} - B) \leq \Delta = \$5$$
Practical example:
Campaign has $1000 daily budget with $5 inaccuracy bound:
- True limit in Lua script: $1005
- Maximum possible spend: $1005
- Maximum overspend: $5 (0.5% of budget)
- Legally acceptable under standard advertising contracts
Alternative Explanation: In-Flight Requests Model
The inaccuracy_bound parameter ($5) can also be derived from system characteristics rather than configured arbitrarily. This approach calculates the bound based on request latency and throughput.
Parameters:
- \(Q_{campaign}\) = Requests per second for this campaign (e.g., 1,000 QPS)
- \(T_{req}\) = Request latency (150ms P99)
- \(L\) = Average ad cost ($0.005 per impression)
In-flight requests calculation:
When a budget counter hits zero, there are already requests in-flight that checked the budget as “available”:
$$R_{inflight} = Q_{campaign} \times T_{req} = 1,000 \times 0.15 = 150 \text{ requests}$$
Maximum overspend from in-flight requests:
If all in-flight requests complete (worst case):
$$Overspend_{max} = R_{inflight} \times L = 150 \times \$0.005 = \$0.75$$
Connecting both models:
The inaccuracy_bound parameter ($5) provides a roughly 7× safety margin over the calculated in-flight overspend ($0.75):
- Configuration parameter:
inaccuracy_bound = $5(set in Lua script) - Actual worst-case: ~$0.75 from in-flight requests
- Why the gap?: Accounts for traffic bursts, retry storms, circuit breaker delays
Both models are valid:
inaccuracy_boundmodel: What we configure in the system (Lua script parameter)- In-flight requests model: Why that configuration is sufficient (derived from system behavior)
For typical campaigns ($1,000-$10,000 daily budgets), both approaches yield overspend ≤0.5%, meeting the ≤1% financial accuracy requirement.
4. Handling Reconciliation Drift
Problem: Redis counter drifts from CockroachDB source of truth due to:
- Redis cache misses/restarts
- Delayed reconciliation
- Clock skew
Solution: Periodic Sync Procedure (runs every 60s):
Algorithm:
-
Query Source of Truth
- For each active campaign, query CockroachDB billing ledger
- Compute true cumulative spend:
SUM(spend) WHERE campaign_id = X - This is the authoritative value (immutable audit trail)
-
Update Redis Cache
- Write true spend value to Redis hash for this campaign
- Overwrite any stale or drifted value
- Redis now reflects accurate state from source of truth
-
Detect and Alert on Drift
- Read current Redis value before overwriting
- Calculate drift:
|true_spend - redis_spend| - If drift exceeds threshold ($50):
- Alert operations team via PagerDuty
- Log discrepancy for investigation
- Common causes: Redis node restart, delayed reconciliation, split-brain scenario
Why Drift Happens:
- Redis restarts: Counter resets to 0, reconciliation hasn’t caught up yet
- Reconciliation lag: 30-60s delay between spend and CockroachDB commit
- Network partition: Redis shard temporarily isolated from reconciliation stream
Why Drift is Acceptable:
- Maximum drift bounded by reconciliation window: $X spent in 60s
- For typical campaign ($1,000/day budget): 60s ≈ $0.70 at uniform pacing
- Actual drift usually <$10 (well within $5 inaccuracy bound per transaction)
- Periodic sync corrects drift before it accumulates
Failure Mode: Tier 3 Reconciliation Outage
If Flink job or Kafka become unavailable:
- Tier 1 continues operating: Budget checks work normally (Redis is independent)
- Impact: Audit trail writing to CockroachDB is paused
- Detection: Periodic sync (60s) detects drift > $50, alerts operations team via PagerDuty
- Recovery: When Flink recovers, processes backlog from Kafka (Kafka retains events for 7 days)
- Maximum data loss: None - Kafka retention ensures event replay capability
- Bounded risk: Redis continues enforcing spend limits, preventing unbounded overspend
This failure mode demonstrates graceful degradation: critical path (Tier 1) remains operational while audit trail temporarily lags. Financial accuracy is maintained via bounded inaccuracy, audit completeness is recovered via Kafka replay.
Why This Works at 1M QPS:
- Sharding: Redis cluster sharded by `campaign_id` (100+ shards)
- Latency: Lua script execution: 1-3ms, network RTT: 1-2ms = 3-5ms total
- Bounded inaccuracy: $5 overspend is legally acceptable (0.05-0.5% of typical campaign budgets)
Why CockroachDB Alone Doesn’t Work:
- Latency: 10-15ms p99 (too slow for critical path)
- Throughput: Would require complex sharding strategy
- Contention: Hot campaigns would create write bottlenecks
- Cost: 3× more expensive than Redis for high-frequency operations
Trade-offs:
| Approach | Latency | Accuracy | Cost | Scalability |
|---|---|---|---|---|
| CockroachDB only | 10-15ms (slow) | Perfect | High | Limited |
| Redis only | 5ms | Bounded ($5) | Low | Excellent |
| BML (both tiers) | 5ms | Bounded + audited | Medium | Excellent |
Idempotency Protection: Defending Against Double-Debits (CRITICAL)
The Problem: Double-Debit Risk
The BML architecture above handles budget enforcement correctly during normal operation, but lacks defense against a critical failure scenario: message replay after service crashes.
Failure scenario:
- Ad Server Orchestrator (AS) processes ad request, runs auction, selects winning ad
- AS calls Atomic Pacing Service → Redis Lua script successfully debits campaign budget
- AS crashes before sending response to client (network issue, pod restart, out-of-memory)
- Client doesn’t receive response, retries the same request (standard retry behavior)
- AS processes retry, runs auction again, debits budget AGAIN (double-debit for single impression)
- Result: Double-debit violates ≤1% billing accuracy constraint
Why this violates financial integrity:
- At 1M QPS with 0.1% retry rate: 1,000 retries/second (0.1% of total traffic)
- Without idempotency protection: 100% of retries = double billing on that traffic segment
- Impact magnitude: 0.1% traffic × 2× billing = +0.1% gross overbilling = systematic >10× violation of ≤1% accuracy constraint
- Consequence: Catastrophic for advertiser trust, payment processor compliance, potential regulatory/legal liability
Solution: Idempotency Key Store
The Atomic Pacing Service must implement idempotent budget deductions using a Redis-backed idempotency key mechanism.
Architecture:
graph TB
REQ[Ad Request
client_request_id: abc123] --> AS[Ad Server Orchestrator]
AS --> GEN[Generate Idempotency Key
UUID + Timestamp
Key: idem:campaign_X:abc123]
GEN --> LUA[Redis Lua Script
Atomic Check-and-Set]
LUA --> CHECK{Key exists?}
CHECK -->|YES| CACHED[Return cached result
DEDUP: Budget NOT debited
Return previous debit amount]
CHECK -->|NO| DEBIT[Debit budget: -$2.50
Store key with TTL=30s
Value: debit_amount=$2.50]
CACHED --> RESP1[Return to client
Idempotent response]
DEBIT --> RESP2[Return to client
Fresh debit]
TTL[TTL Expiration
After 30 seconds] -.->|Auto-delete key| CLEANUP[Key removed
Prevents memory leak]
style CHECK fill:#fff3e0,stroke:#f57c00
style CACHED fill:#c8e6c9,stroke:#4caf50
style DEBIT fill:#ffccbc,stroke:#ff5722
Implementation: Enhanced Redis Lua Script
The Lua script must perform an atomic check-and-set to guarantee exactly-once semantics.
Enhanced Lua script logic (a minimal sketch follows this list):
The script performs the check-and-set in a single Redis transaction:
- Check idempotency key: GET on the idempotency key
- If the key exists: return the cached result (deduplication - the budget was already debited)
  - Signals to the caller: `deduplicated=true`, returns the previous debit amount
  - Critical: the budget is NOT debited again (exactly-once guarantee)
- If the key doesn't exist:
  - Check the budget: `current_spend + cost <= budget_limit + inaccuracy_bound`
  - If the budget is OK: debit the budget AND store the idempotency key atomically
    - DECRBY: deduct the cost from the budget counter
    - SETEX: store the idempotency key with a 30-second TTL
    - Key value contains: debit amount, timestamp, transaction metadata
  - If the budget is exhausted: return an error (no debit, no key stored)
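To make that concrete, here is a minimal sketch of the check-and-set script plus a Python wrapper using redis-py's `register_script`. The key names, micro-dollar units, and the convention that the counter holds the campaign's remaining allocation are illustrative assumptions, not the production implementation:

```python
import redis  # redis-py client; Valkey speaks the same protocol

# Hypothetical connection; in production this would be the pacing Redis cluster.
r = redis.Redis(host="localhost", port=6379)

# KEYS[1] = budget counter (remaining allocation, micro-dollars)
# KEYS[2] = idempotency key
# ARGV[1] = cost (micro-dollars), ARGV[2] = inaccuracy bound, ARGV[3] = key TTL (s)
DEBIT_LUA = """
local cached = redis.call('GET', KEYS[2])
if cached then
  return {1, cached}                    -- retry: return previous debit, no new DECRBY
end
local remaining = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
if remaining + tonumber(ARGV[2]) < cost then
  return {2, '0'}                       -- exhausted beyond the inaccuracy bound
end
redis.call('DECRBY', KEYS[1], cost)     -- debit and record the idempotency key atomically
redis.call('SETEX', KEYS[2], tonumber(ARGV[3]), ARGV[1])
return {0, ARGV[1]}
"""
debit_script = r.register_script(DEBIT_LUA)

STATUS = {0: "debited", 1: "deduplicated", 2: "exhausted"}

def atomic_debit(campaign_id: str, request_id: str, cost_micros: int,
                 bound_micros: int = 5_000_000, ttl_s: int = 30) -> tuple[str, int]:
    """Debit a campaign at most once per client request id."""
    keys = [f"budget:campaign_{campaign_id}",
            f"idem:campaign_{campaign_id}:{request_id}"]
    status, amount = debit_script(keys=keys, args=[cost_micros, bound_micros, ttl_s])
    return STATUS[int(status)], int(amount)
```

Because the whole block runs as one script, the GET, DECRBY, and SETEX cannot interleave with another request's debit.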
Idempotency Key Naming Convention:
Keys follow a hierarchical pattern for efficient sharding and collision prevention:
Pattern: idem:campaign_{campaign_id}:{client_request_id}_{timestamp_bucket}
Components:
- Prefix (`idem`): namespace for idempotency keys (separates them from budget counters)
- `campaign_id`: scopes keys per campaign (supports Redis cluster sharding)
- `client_request_id`: unique identifier from the client (UUID v4, trace ID, or request hash)
- `timestamp_bucket`: rounded timestamp (prevents collisions across time windows)
Example: `idem:campaign_12345:req_abc123_1704067200`
Why this format works (a small key-builder sketch follows):
- Sharding: scoping keys by campaign ID lets a campaign's keys be co-located on one Redis shard (via a hash tag on the campaign portion in cluster mode)
- Uniqueness: the combination of campaign + request_id + timestamp eliminates collisions
- Queryability: pattern matching enables monitoring (`SCAN idem:campaign_12345:*`)
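As an illustration of the convention, a small helper might look like the following; here the bucket is derived from the client-supplied original request timestamp so that retries land in the same bucket (an assumption, since the post does not pin down where the timestamp comes from):

```python
def idempotency_key(campaign_id: int, client_request_id: str,
                    request_ts: int, bucket_seconds: int = 30) -> str:
    """Build idem:campaign_{id}:{request_id}_{timestamp_bucket}.

    request_ts is the client's original request timestamp (epoch seconds),
    so every retry of the same request maps to the same bucket.
    """
    bucket = request_ts // bucket_seconds * bucket_seconds
    return f"idem:campaign_{campaign_id}:{client_request_id}_{bucket}"

# Matches the example above:
assert idempotency_key(12345, "req_abc123", 1704067200) == \
    "idem:campaign_12345:req_abc123_1704067200"
```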
TTL Rationale (30 seconds):
- Too short (5s): Client retries beyond TTL window → double-debit
- Too long (5min): Memory waste, prevents legitimate repeat requests from same client
- 30s: Balances retry window coverage (typical client timeout: 5-15s, allows 2-3 retry attempts) with memory efficiency
Memory overhead:
- Key size: ~80 bytes
- Value size: ~20 bytes (debit amount + metadata)
- Total per key: ~100 bytes
- At 1M QPS with 0.1% retry rate: 1K keys/sec × 30s TTL = 30K active keys × 100 bytes = 3MB
- Negligible compared to Redis capacity
Why Lua Script is Critical:
Redis Lua scripts provide atomic execution guarantee - the foundation of idempotency protection.
Without Lua (separate GET + DECRBY operations), race conditions are inevitable:
- Thread A: GET key → not found
- Thread B: GET key → not found (race window - both threads see “not found”)
- Thread A: DECRBY budget
- Thread B: DECRBY budget (double-debit! - both threads deduct)
Lua script runs single-threaded in Redis, eliminating race conditions:
- Redis blocks all other operations while Lua script executes
- GET + DECRBY + SETEX become a single atomic transaction
- Industry standard: This pattern is used by Stripe, GitHub, Shopify for financial operations
Client-Side Requirements:
Idempotency requires client cooperation - the contract between client and server (a client-side retry sketch follows this list):
- Generate stable request IDs: the client must reuse the same ID for retries
  - Use a UUID v4 generated once per original request (the pattern used by Stripe, PayPal, and AWS)
  - Include it in every retry: the same request_id for all retries of the original request
  - Why stable IDs matter: a different ID on retry is treated as a new request, causing a double-debit
- Include the request ID in the API call:
  - HTTP header (recommended): `X-Request-ID: abc123-def456-ghi789` (a widely adopted convention)
  - Or in the request body: a `request_id` field in the JSON payload
  - The server must validate: reject requests with missing or malformed IDs in strict mode
- Retry policy with exponential backoff (prevents thundering herd):
  - 1st retry: 100ms + random jitter (0-50ms)
  - 2nd retry: 500ms + random jitter (0-250ms)
  - 3rd retry: 2s + random jitter (0-1s)
  - Max retries: 3 (total window ~3s, well within the 30s TTL)
  - Jitter prevents synchronized retries from multiple clients overwhelming the server
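A client-side sketch of this contract, assuming a plain HTTP JSON endpoint (the URL and payload shape are invented for illustration):

```python
import random
import time
import uuid

import requests  # any HTTP client works; requests is used here for brevity

AD_ENDPOINT = "https://ads.example.com/v1/serve"               # hypothetical endpoint
ATTEMPTS = [(0.0, 0.0), (0.1, 0.05), (0.5, 0.25), (2.0, 1.0)]  # (base delay, max jitter)

def request_ad(payload: dict) -> dict:
    request_id = str(uuid.uuid4())                    # generated once, reused on every retry
    headers = {"X-Request-ID": request_id}
    last_error = "no attempt made"
    for base, jitter in ATTEMPTS:                     # original attempt + 3 retries
        time.sleep(base + random.uniform(0, jitter))  # exponential backoff with jitter
        try:
            resp = requests.post(AD_ENDPOINT, json=payload, headers=headers, timeout=5)
            if resp.status_code < 500:                # success, or a non-retryable client error
                return resp.json()
            last_error = f"HTTP {resp.status_code}"
        except requests.RequestException as exc:      # timeout / connection reset: retry
            last_error = str(exc)
    raise RuntimeError(f"ad request {request_id} failed: {last_error}")
```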
Edge Cases and Failure Modes:
Real-world systems must handle imperfect clients and infrastructure failures:
Case 1: Client doesn't provide a request_id (legacy client or API misuse)
- Server-side fallback: generate a deterministic ID from a request hash (sketched below)
- Formula: `SHA256(campaign_id + user_id + ad_id + timestamp_bucket)`
- Behavior: prevents the same user clicking the same ad within a 30s window from producing duplicate debits
- Trade-off: different users clicking the same ad get different IDs (correct - these are genuinely different requests)
- Best practice: log missing-request-id events to track non-compliant clients
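A sketch of that server-side fallback (field names are assumptions):

```python
import hashlib

def fallback_request_id(campaign_id: int, user_id: str, ad_id: str,
                        request_ts: int, bucket_seconds: int = 30) -> str:
    """Deterministic stand-in for a missing X-Request-ID:
    SHA256(campaign_id + user_id + ad_id + timestamp_bucket)."""
    bucket = request_ts // bucket_seconds * bucket_seconds
    raw = f"{campaign_id}:{user_id}:{ad_id}:{bucket}".encode()
    return hashlib.sha256(raw).hexdigest()

# Two identical clicks inside one 30-second bucket collapse to the same key:
a = fallback_request_id(12345, "user_42", "ad_99", 1704067205)
b = fallback_request_id(12345, "user_42", "ad_99", 1704067214)
assert a == b
```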
Case 2: Redis key expires during the retry window (timing edge case)
- Scenario: the client retries more than 30s after the original request
- Frequency: rare - requires extreme network delays or a hanging client
- Behavior: treated as a new request; the budget is debited again
- Mitigation: log as `expired-key-retry` for the audit trail and monitor the frequency
- Acceptable risk: the client has already timed out by app standards (5-15s) and is unlikely to complete the transaction
- Industry precedent: Stripe's idempotency keys expire after 24 hours with the same behavior
Case 3: Redis unavailable (Failover scenario)
- Scenario: Redis cluster failover, network partition, or master election
- Impact: Idempotency protection temporarily unavailable (<5s typical failover time)
- Behavior: Requests processed without deduplication during failover window
- Consequences: Small window of potential double-debits
- Mitigation strategies:
- Monitor Redis availability, alert on failover events
- Circuit breaker: reject requests during known Redis outages, trading availability for correctness (a minimal breaker is sketched after this list)
- Post-hoc reconciliation: Detect duplicate transactions in audit trail, issue refunds
- Design decision: Accept <5s vulnerability window vs rejecting all traffic (99.9% availability = 43 minutes/month downtime acceptable)
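One of the mitigations above is a circuit breaker around the pacing call; a minimal, illustrative version (the thresholds are assumptions) could look like this:

```python
import time

class PacingCircuitBreaker:
    """Reject budget debits while Redis looks unhealthy, instead of silently
    skipping deduplication (trades availability for correctness)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 5.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False              # open: fail fast, serve no billable ad
            self.failures = 0             # half-open: let one probe request through
        return True

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures == self.failure_threshold:
            self.opened_at = time.monotonic()
```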
Monitoring:
Track idempotency metrics (a minimal instrumentation sketch follows the alert thresholds):
- Deduplication rate: `deduplicated_requests / total_requests` (expect ~0.1%, matching the retry rate)
- Key hit rate: percentage of requests that hit existing keys (should match the retry rate)
- Key expiry before use: keys that expire before a retry arrives (should be rare)
- Memory usage: active idempotency keys (should stay below 10MB)
Alerts:
- P1: Deduplication rate > 1% (abnormal retry rate, possible client bug or attack)
- P2: Key expiry rate > 5% (TTL too short, increase to 60s)
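A sketch of how these metrics could be exported with prometheus_client (the metric names are invented for illustration):

```python
from prometheus_client import Counter, Gauge

# Hypothetical metric names, emitted by the Atomic Pacing Service.
DEBITS = Counter("pacing_debits_total", "Budget debit outcomes", ["outcome"])
IDEM_KEYS = Gauge("pacing_idempotency_keys_active", "Idempotency keys currently in Redis")

def record_debit(outcome: str) -> None:
    """outcome is one of: debited, deduplicated, exhausted."""
    DEBITS.labels(outcome=outcome).inc()

# Deduplication rate for the P1 alert (> 1%), as a PromQL expression:
#   rate(pacing_debits_total{outcome="deduplicated"}[5m])
#     / rate(pacing_debits_total[5m])
```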
Industry Comparison: How This Matches Best Practices
Our idempotency design aligns with proven patterns from leading payment and financial platforms:
| Aspect | Our Design | Stripe | AWS | PayPal | Industry Best Practice |
|---|---|---|---|---|---|
| Request ID Source | Client-generated UUID | Client-generated UUID | Client-generated UUID | Client-generated UUID | Client-controlled |
| ID Header | X-Request-ID | Idempotency-Key | x-amz-idempotency-token | Custom header | HTTP header |
| Storage | Redis (30s TTL) | Database (24h TTL) | DynamoDB (1h TTL) | Database (24h) | Persistent store with TTL |
| Atomicity | Lua script | Database transaction | DynamoDB ConditionExpression | Database transaction | Atomic check-and-set |
| Scope | Per campaign | Per API key | Per request type | Per merchant | Scoped to prevent conflicts |
| Retry behavior | Return cached result | Return cached result (HTTP 200) | Return cached result | Return cached result | Idempotent response |
| TTL rationale | 30s (high-frequency) | 24h (low-frequency) | 1h (moderate) | 24h (low-frequency) | Context-dependent |
Why our TTL differs (30s vs industry’s 24h):
- Request frequency: Ad serving = 1M QPS vs payments = 1K QPS (1000× higher volume)
- Memory constraints: ~30K active keys vs ~86M keys (24h retention at our scale ≈ 8.6GB at the ~100 bytes/key estimate above)
- Use case: Real-time ad auctions complete in <3s vs payment settlement in hours/days
- Trade-off accepted: Small risk of late retries (>30s) vs memory efficiency at scale
Alternative approaches considered:
- Database-backed idempotency (Stripe's approach)
  - Pros: longer TTL (24h+), stronger durability guarantees
  - Cons: 10-15ms latency (violates our 5ms budget), poor scalability at 1M QPS
  - Decision: rejected - latency unacceptable for the critical path
- DynamoDB with conditional writes (AWS approach)
  - Pros: managed service, strong consistency, regional replication
  - Cons: 8ms p99 latency (vs Redis 3ms), higher cost ($1000/month vs Redis $200/month)
  - Decision: rejected - Redis is already deployed for budget counters; reuse existing infrastructure
- In-memory only, no persistence (dangerous pattern)
  - Pros: ultra-low latency (<1ms)
  - Cons: lost on server restart, no failover protection
  - Decision: rejected - violates financial integrity requirements
Why Redis + Lua is optimal for our use case:
- Already deployed for budget counters (infrastructure reuse)
- Sub-5ms latency fits critical path budget
- Atomic operations via Lua scripts (proven pattern)
- TTL-based cleanup (memory efficiency)
- Cluster mode supports 1M+ QPS
- Trade-off: Shorter TTL (30s) vs database approaches (24h), but acceptable for real-time auctions
Impact Assessment:
Without idempotency protection:
- Retry rate: 1M QPS × 0.1% = 1K retries/sec (typical under load)
- Even if only 10% of those retries slip through as double-debits: 100 billing errors/sec (~8.6M per day)
- Billing accuracy impact: roughly 0.01% of requests systematically billed twice - charges for impressions the advertiser never received
- Consequence: unlike bounded pacing inaccuracy, this error is systematic, scales with the retry rate, and concentrates on whichever campaigns see retries - catastrophic for financial integrity and payment processor compliance
With idempotency protection:
- Double-debits prevented: 100% of retry-induced billing errors eliminated
- Implementation overhead: ~3MB Redis memory + 0.5ms latency (30s TTL × 1K keys/sec)
- Operational cost: Negligible - adds 10% to existing Redis footprint
- Business value: Prevents systematic billing violations that would be catastrophic for advertiser trust and payment processor compliance
ROI: Infinite - The implementation cost (minimal Redis overhead) is negligible compared to preventing systematic financial integrity violations that could result in platform-wide advertiser churn and regulatory liability.
Conclusion:
The Bounded Micro-Ledger architecture achieves the “impossible trinity” of:
- Low latency (5ms budget check)
- Financial accuracy (mathematically proven $5 max overspend + idempotency protection against double-debits)
- High throughput (1M+ QPS)
Critical addition: Idempotency protection is non-negotiable for production deployment. Without it, the system violates financial integrity guarantees during routine failure scenarios (crashes, retries, network issues).
This is the only viable architecture for real-time budget pacing at scale while maintaining financial integrity.
Summary: Data Consistency Meets Revenue Optimization
This post explored the three critical data systems that enable real-time ad platforms to serve 1M+ QPS with sub-150ms latency while maintaining financial accuracy: distributed caching for fast reads, eCPM-based auctions for fair price comparison, and atomic budget control for spend accuracy.
Three Critical Systems:
1. Distributed Caching Architecture
Problem: Serve 1M QPS without overwhelming the database
Solution: Two-tier cache architecture with database fallback
| Layer | Technology | Latency | Use Case | Share of Requests |
|---|---|---|---|---|
| L1 Cache | Caffeine (in-process) | 0.001ms | Hot user profiles | 60% |
| L2 Cache | Valkey (distributed) | 5ms | Warm data, feature vectors | 25% |
| Database | CockroachDB | 20ms | Source of truth (cache miss) | 15% |
Key decisions:
- Cache-aside pattern: the application controls caching (vs cache-through) - sketched below
- TTL-based invalidation: 5min for profiles, 1hour for features (vs event-driven)
- Write-through for financial data: budget updates bypass the cache and hit the database first
- Read-heavy optimization: 95% read, 5% write workload
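A minimal cache-aside read path, sketched in Python with cachetools standing in for Caffeine's L1 role (key names, TTLs, and the database call are illustrative):

```python
import json

import redis
from cachetools import TTLCache       # stand-in for Caffeine in this Python sketch

l1 = TTLCache(maxsize=100_000, ttl=300)        # in-process, 5-minute TTL
l2 = redis.Redis(host="localhost", port=6379)  # Valkey speaks the Redis protocol

def load_profile_from_db(user_id: str) -> dict:
    # Placeholder for the CockroachDB lookup (source of truth on a cache miss).
    return {"user_id": user_id}

def get_user_profile(user_id: str) -> dict:
    if user_id in l1:                                   # L1 hit (~60% of requests)
        return l1[user_id]
    cached = l2.get(f"profile:{user_id}")               # L2 hit (~25%)
    if cached is not None:
        profile = json.loads(cached)
    else:                                               # miss (~15%): go to the database
        profile = load_profile_from_db(user_id)
        l2.setex(f"profile:{user_id}", 300, json.dumps(profile))
    l1[user_id] = profile                               # populate L1 on the way out
    return profile
```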
Performance impact:
- 85% cache hit rate (L1: 60% + L2: 25%)
- 15% of requests query the database (cache miss)
- Avg latency: \(0.60 \times 0.001\text{ms} + 0.25 \times 5\text{ms} + 0.15 \times 20\text{ms} = 4.25\text{ms}\)
- vs database-only: ~40-60ms average
- A roughly 10-14× latency reduction, which is what makes the sub-10ms budget for User Profile and Feature Store lookups achievable
2. Auction Mechanism Design
Problem: Compare a $10 CPM bid with a $0.50 CPC bid - which is worth more?
Solution: eCPM normalization using CTR prediction (a small normalization helper is sketched after the example below)
eCPM formula:
$$ \begin{array}{ll} \text{CPM bid:} & eCPM = \text{CPM (direct)}\\ \text{CPC bid:} & eCPM = \text{CPC} \times \text{CTR} \times 1000 \\ \text{CPA bid:} & eCPM = \text{CPA} \times \text{conversion rate} \times \text{CTR} \times 1000 \end{array} $$
Example:
- Ad A: $10 CPM → eCPM = $10
- Ad B: $0.50 CPC, predicted CTR = 2% → eCPM = $0.50 × 0.02 × 1000 = $10
- Fair competition: Both have equal expected revenue
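The normalization reduces to a few lines; this helper and the assertions mirror the example above (the function name and signature are illustrative):

```python
def ecpm(pricing_model: str, bid: float, ctr: float = 0.0, cvr: float = 0.0) -> float:
    """Normalize a bid to eCPM: expected revenue per 1,000 impressions."""
    if pricing_model == "CPM":
        return bid
    if pricing_model == "CPC":
        return bid * ctr * 1000
    if pricing_model == "CPA":
        return bid * cvr * ctr * 1000
    raise ValueError(f"unknown pricing model: {pricing_model}")

assert ecpm("CPM", 10.0) == 10.0                         # Ad A
assert abs(ecpm("CPC", 0.50, ctr=0.02) - 10.0) < 1e-9    # Ad B: fair tie with Ad A
```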
Auction type decision: First-Price
- Simplicity: Winner pays their bid (vs second-price complexity)
- Transparency: Advertisers see exact costs
- Revenue: DSPs bid conservatively, but combined with ML-scored internal inventory, captures full value
- Industry trend: Programmatic advertising moved from second-price to first-price (2017-2019)
Latency: 3ms for the auction logic (ranking only; the budget check has its own separate 3ms budget)
3. Budget Pacing: Bounded Micro-Ledger
Problem: Prevent budget overspend across 300 distributed ad servers without centralizing every spend decision
Solution: Bounded Micro-Ledger with Redis atomic counters (detailed in Budget Pacing: Distributed Spend Control)
Core Architecture:
- Pre-allocation: daily budget → proportional hourly amounts allocated to Redis counters (sketched below)
- Atomic deduction: `DECRBY campaign:123:budget <cost>` (5ms p99)
- Idempotency: a Redis cache of request IDs prevents double-debits during retries
- Reconciliation: every 10min, compare Redis totals against the CockroachDB source of truth
- Bounded overspend: mathematical guarantee of ≤0.1% per campaign (≤1% aggregate)
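A sketch of the pre-allocation step, splitting a daily budget into hourly Redis allocations in micro-dollars (the traffic curve and key layout are assumptions):

```python
def hourly_allocations(daily_budget_usd: float, hourly_traffic_share: list[float]) -> list[int]:
    """Split a daily budget into 24 per-hour allocations (micro-dollars),
    proportional to the expected traffic share of each hour."""
    assert len(hourly_traffic_share) == 24
    assert abs(sum(hourly_traffic_share) - 1.0) < 1e-6
    return [round(daily_budget_usd * share * 1_000_000) for share in hourly_traffic_share]

# Flat traffic curve for a $1,200/day campaign → $50 (50,000,000 micro-dollars) per hour.
allocs = hourly_allocations(1200.0, [1 / 24] * 24)
# At the top of each hour, the pacer would SET campaign:{id}:budget to that hour's allocation
# and let the reconciliation job true it up against CockroachDB.
```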
Why this works:
- No centralized bottleneck: Redis is distributed across regions
- Atomic operations: DECRBY prevents race conditions
- Low latency: 3ms avg, 5ms p99 (vs 50-100ms for distributed transactions)
- Financial accuracy: mathematically proven bounds using two complementary models:
  - Configuration model: the `inaccuracy_bound` parameter (e.g., $5) in the Lua script
  - Behavioral model: in-flight requests (150 requests × $0.005 = $0.75 typical overspend)
Performance Impact:
| Metric | Without Budget Pacing | With Bounded Micro-Ledger | Improvement |
|---|---|---|---|
| Latency | Centralized DB check (50-100ms) | Redis atomic counters (3ms avg, 5ms p99) | 17-30× faster |
| Overspend | Unbounded (race conditions) | ≤0.1% per campaign (mathematical guarantee) | Bounded |
| Availability | Single point of failure | Distributed Redis (multi-region) | No bottleneck |
Key Trade-offs:
- Redis over Memcached: +30% memory cost → atomic DECRBY prevents race conditions
- Idempotency cache: +0.5ms latency, +500MB Redis → eliminates 100 billing errors/sec
- Pre-allocation: +10min reconciliation overhead → enables distributed 3ms spend checks
- Bounded inaccuracy: Accept ≤1% variance → avoid 50-100ms centralized DB latency
See the detailed implementation for the Lua scripts, reconciliation algorithms, idempotency protection, and mathematical proofs.