Production Operations: Fraud, Multi-Region & Operational Excellence

Introduction: From Design to Production

Architecture on paper ≠ production system. You can design the most elegant distributed architecture - perfect latency budgets, optimal caching strategies, fair auction mechanisms - and it will fail in production without addressing operational realities.

The gap between design and production:

Most architecture discussions focus on the “happy path”:

Production systems face harsher realities:

Why this matters at scale:

At 1M+ QPS serving 400M daily active users:

What this post covers:

This post addresses eight critical production topics that separate proof-of-concept from production-grade systems:

  1. Fraud Detection - Pattern-based bot detection filtering 20-30% of malicious traffic BEFORE expensive RTB fan-out. Multi-tier detection (L1 real-time, L2 behavioral, L3 ML-based) with specific patterns for click farms, SDK spoofing, and domain fraud.

  2. Multi-Region Deployment - Active-active architecture across 3 AWS regions with CockroachDB automatic failover (30-60s) and Route53 health-check routing (2min). Handling split-brain scenarios, regional budget pacing, and bounded overspend during failover.

  3. Schema Evolution - Zero-downtime migrations using dual-write patterns, backward/forward compatible schema changes, and gradual rollouts. Changing the database while serving 1M QPS without dropping a single request.

  4. Clock Synchronization - Why NTP (±50-100ms) isn’t good enough for financial ledgers. Hybrid Logical Clocks (HLC) for distributed timestamp ordering without TrueTime hardware. Preventing clock-drift-induced budget discrepancies.

  5. Observability & Operations - SLO-based monitoring with error budgets (43min/month at 99.9%). RED metrics (Rate, Errors, Duration), distributed tracing for 150ms request paths, and structured logging at 1M+ QPS. Mean Time to Recovery (MTTR) as key operational metric.

  6. Security & Compliance - Zero-trust architecture (every request authenticated/authorized), encryption (TLS 1.3, AES-256 at rest), audit trails for financial compliance (GDPR/CCPA), and defense against insider threats.

  7. Production Operations - Progressive rollouts (1% → 10% → 50% → 100%), automated rollback triggers, chaos engineering validation, and incident response playbooks. Deployment safety at scale.

  8. Resilience & Failure Scenarios - Testing the architecture under extreme conditions: regional disasters, malicious insiders, and business model pivots. Validating theoretical resilience through controlled chaos.

The core insight:

Production-grade systems require defense in depth across all dimensions:

Each dimension reinforces the others. Fraud detection protects the business. Multi-region protects availability. Zero-downtime migrations protect error budgets. Clock synchronization protects financial integrity. Observability protects MTTR. Security protects against insider threats. Progressive rollouts protect against bad deployments. Chaos testing validates it all actually works.

Broader applicability:

These patterns - fraud detection, multi-region failover, zero-downtime migrations, distributed time synchronization - apply beyond ad tech to high-stakes distributed systems:

A key insight: operational excellence isn’t bolted on after launch - it must be designed into the system from the start. Circuit breakers, observability hooks, audit trails, multi-region data replication - these aren’t implementation details, they’re architectural requirements.

Let’s explore each production topic and how they integrate into a cohesive operational strategy.


Fraud Detection: Pattern-Based Abuse Detection

Architectural Driver: Financial Accuracy - While rate limiting (covered in the architecture post) controls request volume, fraud detection identifies malicious patterns. A bot clicking 5 ads/minute might pass rate limits but shows suspicious behavioral patterns. Both mechanisms work together: rate limiting stops volume abuse, fraud detection stops sophisticated attacks.

What Fraud Detection Does (vs Rate Limiting):

Fraud detection answers: “Are you malicious?”

Rate limiting answers: “Are you requesting too much?” (covered in the architecture post)

Problem: Detect and block fraudulent ad clicks in real-time without adding significant latency.

CRITICAL: Integrity Check Service in Request Critical Path

The Integrity Check Service (L1 fraud detection) runs in the synchronous request path immediately after User Profile lookup and BEFORE the expensive RTB fan-out to 50+ DSPs. This placement is critical for cost optimization:

Cost Impact of Early Fraud Filtering: Without early filtering, RTB requests go to 50+ DSPs for ALL traffic, including the 20-30% that is bot traffic.

Solution: 5ms Integrity Check Service blocks 20-30% of fraud BEFORE RTB fan-out, eliminating this massive bandwidth waste.

Trade-off Analysis:

Fraud Types:

  1. Click Farms: Bots or paid humans generating fake clicks
  2. SDK Spoofing: Fake app installations reporting ad clicks
  3. Domain Spoofing: Fraudulent publishers misrepresenting site content
  4. Ad Stacking: Multiple ads layered, only top visible but all “viewed”

Detection Strategy: Multi-Tier Filtering

    
    graph TB
    REQ["Ad Request"] --> UP["User Profile<br/>10ms"]
    UP --> L1{"L1: Integrity Check Service<br/>5ms CRITICAL PATH<br/>Runs BEFORE RTB"}
    L1 -->|"Known bad IP"| BLOCK1["Block Request<br/>Bloom Filter<br/>No RTB call made"]
    L1 -->|"Pass"| RTB["RTB Auction<br/>100ms<br/>50+ DSPs"]
    RTB --> L2{"L2: Behavioral<br/>Post-click analysis<br/>Async"}
    L2 -->|"Suspicious pattern"| PROB["Flag for Review<br/>50% sampled"]
    L2 -->|"Pass"| L3{"L3: ML Model<br/>10ms async<br/>Post-impression"}
    L3 -->|"Fraud score > 0.8"| BLOCK2["Block User<br/>Update Bloom filter"]
    L3 -->|"Pass"| ALLOW["Legitimate Traffic"]
    PROB -->|"If sampled"| BLOCK3["Add to Training Data"]
    PROB -->|"If not sampled"| ALLOW
    BLOCK1 --> LOG["Log to fraud DB"]
    BLOCK2 --> LOG
    BLOCK3 --> LOG
    style BLOCK1 fill:#ff6b6b
    style BLOCK2 fill:#ff6b6b
    style BLOCK3 fill:#ff6b6b
    style L1 fill:#ffdddd
    style ALLOW fill:#51cf66

L1: Integrity Check Service (5ms critical path, 20-30% fraud caught)

Implementation: Bloom filter + IP reputation + Device fingerprinting

Component: Integrity Check Service - Runs in synchronous request path BEFORE RTB fan-out

Decision flow (5ms budget):

  1. Check IP against Bloom filter (~0.1ms) → if match, confirm with Redis (1ms) → BLOCK if confirmed (no RTB call)
  2. Check device ID against IMPOSSIBLE_DEVICES list → BLOCK if invalid
  3. Validate User-Agent format → BLOCK if malformed
  4. Check device/OS combination validity → BLOCK if impossible (SDK spoofing)
  5. Basic rate checks: Requests/sec from this IP/device → BLOCK if exceeds threshold
  6. PASS to RTB/ML paths if all checks pass
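
A minimal sketch of this decision flow, assuming a Bloom filter and Redis blocklist behind simple stand-ins (the helper names, regexes, and the 50 req/s threshold are illustrative assumptions, not the production values):

    # Minimal sketch of the L1 integrity check (names and thresholds are assumptions).
    import re

    BLOOM_FILTER = set()           # stand-in for a real Bloom filter over known-bad IPs
    REDIS_BAD_IPS = set()          # stand-in for the authoritative Redis blocklist
    IMPOSSIBLE_DEVICES = {"emulator-generic", "sdk_gphone_x86"}   # hypothetical examples
    VALID_UA = re.compile(r"^[\w\-./ ();:,+]+$")                  # crude UA sanity check

    def l1_integrity_check(req: dict, recent_qps: float) -> str:
        """Return 'BLOCK' or 'PASS' within the ~5ms budget."""
        ip, device, ua = req["ip"], req["device_id"], req["user_agent"]

        # 1. Bloom filter says "maybe bad" -> confirm against Redis before blocking
        if ip in BLOOM_FILTER and ip in REDIS_BAD_IPS:
            return "BLOCK"                       # no RTB call is made for this request

        # 2. Device ID on the impossible-device list
        if device in IMPOSSIBLE_DEVICES:
            return "BLOCK"

        # 3. Malformed User-Agent
        if not ua or not VALID_UA.match(ua):
            return "BLOCK"

        # 4. Impossible device/OS combination (SDK spoofing)
        if req.get("os") == "ios" and "Android" in ua:
            return "BLOCK"

        # 5. Basic per-IP/device rate check (threshold is an assumed value)
        if recent_qps > 50:
            return "BLOCK"

        return "PASS"                            # continue to RTB / ML paths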

Bloom filter characteristics:

Latency breakdown:

Examples caught by L1:

Critical Impact: Blocking at L1 saves 100-200KB bandwidth per blocked request (2-4KB × 50 DSPs) + DSP processing time

L2: Behavioral Analysis (Async post-click, 40-50% additional fraud caught)

Component: Fraud Analysis Service - Runs ASYNCHRONOUSLY after ad click/impression events, NOT in request critical path

When it runs: Triggered by click/impression events published to Kafka, analyzes patterns over time

Feature extraction (5ms per event):

  1. Redis lookup: Fetch user history (3ms) - last 100 impressions, clicks, IPs, device changes
  2. Calculate features (2ms):
    • Time patterns: clicks/hour, avg interval between clicks, timing stddev (bots have low variance)
    • CTR analysis: click rate over last 100 impressions
    • Device diversity: unique IPs in last 24h, device changes in last 7 days
    • Behavioral entropy: IP diversity, category spread
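
A sketch of the feature computation over a user's recent event history (the event schema and field names are assumptions; in production these values come from the Redis history lookup above):

    # Sketch of L2 behavioral feature extraction (event schema is an assumption).
    from statistics import pstdev
    from math import log2

    def l2_features(events: list[dict]) -> dict:
        """events: last ~100 impressions/clicks, each with ts, type, ip, category."""
        clicks = [e for e in events if e["type"] == "click"]
        impressions = [e for e in events if e["type"] == "impression"]
        intervals = [b["ts"] - a["ts"] for a, b in zip(clicks, clicks[1:])]

        ips = [e["ip"] for e in events]
        ip_counts = {ip: ips.count(ip) for ip in set(ips)}
        ip_entropy = -sum((n / len(ips)) * log2(n / len(ips)) for n in ip_counts.values()) if ips else 0.0

        window_hours = max((events[-1]["ts"] - events[0]["ts"]) / 3600, 1e-6) if events else 1e-6
        return {
            "clicks_per_hour": len(clicks) / window_hours,
            "avg_click_interval_s": sum(intervals) / len(intervals) if intervals else None,
            "click_interval_stddev_s": pstdev(intervals) if len(intervals) > 1 else None,  # bots: low variance
            "ctr_last_100": len(clicks) / len(impressions) if impressions else 0.0,
            "unique_ips_24h": len(ip_counts),
            "ip_entropy_bits": ip_entropy,
        }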

Rule-based thresholds → SUSPICIOUS if ANY:

Actions:

Processing time:

Examples caught by L2:

L3: ML Model (10ms latency, 10-15% additional fraud caught)

Gradient Boosted Decision Tree (GBDT) model:

Feature enrichment and inference (10ms):

  1. Take 15 features from L2 (time patterns, CTR, device diversity, entropy)

  2. Add 25 computed features:

    • Temporal: hour of day, day of week, is weekend
    • Historical aggregates: lifetime clicks, account age, avg session duration
    • Reputation scores: device fraud score, IP fraud score (from lookup tables)
    • Publisher context: publisher fraud rate
  3. GBDT inference (200 trees, depth 6) → fraud score 0.0-1.0
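
A sketch of the decision step, assuming a trained GBDT model exposing a predict_proba-style API (only the 0.8 block threshold appears in the diagram above; the review threshold is an assumption):

    # Sketch of the L3 decision step. The model object is assumed to expose a
    # predict_proba-style API (e.g. XGBoost/LightGBM wrappers); only the 0.8 block
    # threshold comes from the diagram above - the review threshold is assumed.
    def l3_decide(l2_features: dict, enriched: dict, model) -> str:
        features = {**l2_features, **enriched}          # 15 L2 features + 25 computed features
        row = [features[name] for name in sorted(features)]
        fraud_score = float(model.predict_proba([row])[0][1])   # 0.0 (clean) .. 1.0 (fraud)

        if fraud_score > 0.8:
            return "BLOCK_USER"        # also push the IP/device into the L1 Bloom filter
        if fraud_score > 0.5:          # assumed review threshold
            return "FLAG_FOR_REVIEW"
        return "ALLOW"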

Decision thresholds:

Model characteristics:

Examples caught by L3:

Performance Characteristics:

| Tier | Latency | Fraud Caught | False Positive Rate | Cost |
|------|---------|--------------|---------------------|------|
| L1 | 0ms | 20-30% | <0.1% | Negligible (Bloom filter) |
| L2 | 5ms | 60-80% cumulative | 1-2% | Low (Redis lookup + compute) |
| L3 | 10ms | 70-95% cumulative | 0.5-1% | Medium (GBDT inference) |
| Total | 0-15ms | 70-95% total | ~1-2% | Acceptable |

Signal Loss Impact on Fraud Detection:

Fraud detection becomes harder when user_id is unavailable (40-60% of mobile traffic due to ATT/Privacy Sandbox). Without stable identity:

Mitigation strategies for anonymous traffic:

Expected accuracy degradation: AUC drops from 0.92 (identified) to ~0.82-0.85 (anonymous) - still effective but less precise.

Latency impact on overall SLO:

Fraud detection runs in PARALLEL with ad selection (as shown in the architecture post’s critical path analysis):

Ad serve critical path:

    
    gantt
    title Ad Serve Critical Path (140ms Total)
    dateFormat x
    axisFormat %L

    section Sequential 0-25ms
    Network 10ms               :done, 0, 10
    Gateway 5ms                :done, 10, 15
    User Profile 10ms          :done, 15, 25

    section Parallel Execution
    Ad Selection + ML 65ms     :crit, 25, 90
    Fraud Detection 0-15ms     :active, 25, 40

    section RTB + Final
    RTB Auction 100ms          :crit, 90, 190
    Final Processing 10ms      :done, 190, 200

Key insight: Fraud detection (0-15ms) runs in parallel with Ad Selection + ML (65ms) and completes before the ML path finishes. This means fraud detection adds ZERO latency to the critical path—the request must wait for ML anyway, so fraud checks happen “for free” during that wait time.

Monitoring and Alerting:

Key metrics to track:

Critical alerts (P1):

Warning alerts (P2):

Impact Analysis: Fraud Prevention vs False Positives

Baseline scenario (typical ad platform at scale):

Fraud detection effectiveness:

If catching 80% of fraud:

False positive trade-off:

If 2% false positive rate:

Net impact assessment:

| Metric | Value | Notes |
|--------|-------|-------|
| Fraud prevented | ~2.4% of traffic | Prevents 5-15% revenue loss from fraud |
| False positives | ~1.94% of traffic | 1-2% revenue opportunity cost |
| Net benefit | 3-13% gross revenue protected | Net positive after false positive cost |
| Infrastructure overhead | <0.5% of infrastructure spend | Redis, HBase, ML training - negligible vs benefit |
| ROI multiplier | 50-100× | Benefit-to-infrastructure-cost ratio |
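
The table's percentages follow from simple arithmetic; the 3% baseline fraud rate below is an assumed input chosen to reproduce the figures shown, not a value stated in the text:

    # Worked arithmetic behind the net-impact table (3% baseline fraud rate is assumed).
    fraud_rate = 0.03          # assumed share of traffic that is fraudulent
    catch_rate = 0.80          # "if catching 80% of fraud"
    false_positive_rate = 0.02 # 2% of legitimate traffic incorrectly blocked

    fraud_prevented = fraud_rate * catch_rate                  # 0.024 -> ~2.4% of traffic
    false_positives = (1 - fraud_rate) * false_positive_rate   # 0.0194 -> ~1.94% of traffic

    print(f"fraud prevented:  {fraud_prevented:.2%} of traffic")
    print(f"false positives:  {false_positives:.2%} of traffic")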

Decision: Fraud detection is operationally critical. The 2% false positive rate is an acceptable trade-off to prevent fraud-induced billing disputes (which would be catastrophic for advertiser trust and payment processor relationships).

Critical Testing Requirements

Architectural Driver: Financial Accuracy & Trust - Two testing aspects are non-negotiable for ads platforms: proving financial accuracy (≤1% budget overspend) and validating performance at scale (1M+ QPS). Traditional testing approaches miss both.

The testing gap: Unit tests validate individual components. Integration tests validate service interactions. End-to-end tests validate request flows. But none of these prove the two critical claims that make or break an ads platform:

  1. Financial accuracy under distributed contention: Can 300+ servers concurrently decrement shared budgets without violating the ≤1% overspend guarantee?
  2. Performance at scale under realistic load: Does the system actually handle 1M QPS sustained load with P95 < 150ms latency, or does it collapse at 800K?

Why traditional testing is insufficient:

This section covers three specialized testing strategies required for financial systems at scale.

Financial Accuracy Validation: Proving the ≤1% Budget Overspend Claim

The Challenge

Part 1’s architecture claims ≤1% budget overspend despite distributed contention. How do we prove this claim before production deployment?

The problem: Multiple servers (300+) concurrently decrementing shared advertiser budgets at 1M QPS creates inevitable race conditions. Without proper atomic operations (Part 3’s Redis DECRBY), budget overspend could reach 50-200% as servers race to approve the last available impressions.

Claim to validate: Part 1 guarantees ≤1% overspend through atomic distributed cache operations. This must be proven under realistic contention, not assumed.

Testing Methodology

Setup:

Validation:

  1. Track approved impressions per campaign (count every impression the system serves)
  2. Calculate actual spend: actual_spend = approved_impressions × CPM
  3. Assert: (actual_spend - budget) / budget ≤ 1% for 99.5%+ campaigns
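
A reduced-scale sketch of the contention test, assuming a local Redis reachable via redis-py; worker counts, CPM, and budget are toy values standing in for the 300-server, 10M-request setup described above:

    # Sketch of the overspend test: many workers race to decrement one shared budget
    # via atomic DECRBY (assumes a local Redis and the redis-py client; scale is reduced).
    import threading
    import redis

    CPM_MICROS = 2_000_000      # $2.00 CPM in micro-dollars per 1000 impressions
    BUDGET_MICROS = 10_000_000  # $10.00 daily budget for this toy campaign
    WORKERS, REQUESTS_PER_WORKER = 32, 5_000

    r = redis.Redis()
    r.set("budget:campaign42", BUDGET_MICROS)
    approved = [0] * WORKERS

    def worker(i: int) -> None:
        cost = CPM_MICROS // 1000                      # cost of a single impression
        for _ in range(REQUESTS_PER_WORKER):
            remaining = r.decrby("budget:campaign42", cost)   # atomic on the Redis server
            if remaining >= 0:
                approved[i] += 1
            else:
                r.incrby("budget:campaign42", cost)           # give the over-draw back

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
    for t in threads: t.start()
    for t in threads: t.join()

    actual_spend = sum(approved) * (CPM_MICROS / 1000)
    overspend = (actual_spend - BUDGET_MICROS) / BUDGET_MICROS
    assert overspend <= 0.01, f"overspend {overspend:.2%} violates the <=1% guarantee"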

Why this methodology works:

What this validates:

Why critical: Real advertisers will sue if systematic overspend >1%. This test detects the bug before it costs millions in refunds and destroyed trust.

Historical results from similar systems:

Scale Validation: Performance Testing Beyond Unit Tests

The problem: Systems that pass unit tests at 1K QPS often collapse at 800K-1M QPS due to emergent behaviors invisible at low scale.

Examples of scale-only failures:

Critical Load Scenarios

Scenario 1: Baseline - 1M QPS Sustained (1 hour)

Configuration:

Success criteria:

What this validates:

Scenario 2: Burst - 1.5M QPS Spike (Black Friday Simulation)

Configuration:

Success criteria:

What this validates:

Scenario 3: Degraded Mode - Simulated Service Failures

Configuration:

Success criteria:

What this validates:

Why these scenarios matter:

Unit tests can’t simulate distributed system behavior at scale:

Testing infrastructure: Dedicated load testing cluster (separate from production) with production-equivalent configuration (same instance types, same database sizing, same network topology).

Shadow Traffic: Financial System Validation Without User Impact

Why Standard Canary Insufficient for Financial Systems

Standard canary deployment:

Problem for ads platforms:

Risk unacceptable: Cannot afford even 10-minute bug detection window for financial code.

Shadow Traffic Approach: Zero User Impact Validation

Pattern:

Implementation:

  1. Traffic sampling: API Gateway duplicates 10% of requests

    • Original request → Primary service (v1.2.3) → user receives response
    • Duplicated request → Shadow service (v1.3.0) → response logged but discarded
  2. Response comparison:

    • Billing calculation diff: Shadow charges $5.02, primary $5.00 → flag for investigation
    • Latency regression: Shadow P95 = 160ms, primary P95 = 145ms → block deployment
    • Response divergence: 0.1% of requests return different ad_id → manual review
  3. Validation metrics:

    • Billing accuracy: Shadow vs primary spend must match within 0.1%
    • Latency: Shadow P95 must be ≤ primary P95 + 5ms (no regression)
    • Error rate: Shadow errors must be ≤ primary errors (no new failures)
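
A sketch of the comparison checks, with the thresholds taken from the list above (the shape of the aggregate record per validation window is an assumption):

    # Sketch of the shadow-vs-primary comparison (record fields are assumptions).
    def compare_shadow(primary: dict, shadow: dict) -> list[str]:
        """Both dicts aggregate one validation window: spend, p95_ms, error_rate."""
        findings = []

        # Billing accuracy: shadow spend must match primary within 0.1%
        if abs(shadow["spend"] - primary["spend"]) > 0.001 * primary["spend"]:
            findings.append("BILLING_DIVERGENCE: block promotion, open investigation")

        # Latency: shadow P95 must not regress more than 5ms
        if shadow["p95_ms"] > primary["p95_ms"] + 5:
            findings.append("LATENCY_REGRESSION: block promotion")

        # Errors: shadow must not introduce new failures
        if shadow["error_rate"] > primary["error_rate"]:
            findings.append("ERROR_REGRESSION: block promotion")

        return findings or ["OK: eligible for canary promotion"]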

Ramp schedule:

| Week | Shadow Traffic % | Validation | Decision |
|------|------------------|------------|----------|
| Week 1 | 1% | Stability check (memory leaks, crashes) | Proceed or abort |
| Week 2 | 10% | Full validation (billing, latency, errors) | Proceed or abort |
| Week 3 | Canary 10% | Promote to canary only if zero billing discrepancies | Proceed or abort |
| Week 4+ | Ramp to 100% | Standard progressive rollout | Full deployment |

Why this works:

Trade-off:

Value: Preventing a single $100K billing error (30 min bug exposure at 10% canary) pays for years of shadow infrastructure costs.

Shadow Traffic Flow Diagram

    
    graph TB
    PROD["Production Traffic<br/>1M QPS"]
    GW["API Gateway<br/>Traffic Splitter"]
    PRIMARY["Primary v1.2.3<br/>Serves response to user<br/>100% traffic"]
    SHADOW["Shadow v1.3.0<br/>Logs results, discards<br/>10% sampled"]
    USER["User receives response"]
    COMPARE["Comparison Engine<br/>Latency, Errors, Response Diff<br/>Billing Calculation Validation"]
    PROD --> GW
    GW -->|"100%"| PRIMARY
    GW -->|"10% duplicate"| SHADOW
    PRIMARY --> USER
    PRIMARY -.->|"Metrics"| COMPARE
    SHADOW -.->|"Metrics"| COMPARE
    COMPARE -->|"Billing diff detected"| ALERT["Alert + Block Promotion"]
    COMPARE -->|"Latency regression"| ALERT
    COMPARE -->|"All validations pass"| PROMOTE["Promote to Canary"]
    style SHADOW fill:#ffffcc
    style COMPARE fill:#ccffff
    style ALERT fill:#ffcccc
    style PROMOTE fill:#ccffcc

Diagram explanation:

Section Conclusion

Three specialized testing strategies for financial systems:

  1. Financial accuracy testing: Validates Part 1’s ≤1% budget overspend claim under distributed contention (300 servers, 10M requests, intentional race conditions)

  2. Scale testing: Validates performance claims at production scale (1M QPS sustained, 1.5M burst, degraded mode scenarios that only manifest at high QPS)

  3. Shadow traffic: Validates financial code changes with zero user impact (catches billing bugs before canary exposure, preventing $15K+ errors)

Why critical:

Cross-references:

Trade-offs accepted:

With fraud detection protecting against malicious traffic and critical testing validating financial accuracy at scale, the platform is ready for multi-region deployment to ensure high availability.


Multi-Region Deployment and Failover

Architectural Driver: Availability - Multi-region deployment with 20% standby capacity ensures we survive full regional outages without complete service loss. At scale, even 1-hour regional outage represents significant revenue impact. Auto-failover within 90 seconds minimizes impact to <0.1% daily downtime.

Why Multi-Region:

Business drivers:

  1. Latency requirements: Sub-100ms p95 latency is physically impossible with a single region serving global traffic. Speed of light: US-East to EU ≈ 40ms one-way (~80ms round trip), already consuming 80% of our budget. Regional presence required.

  2. Availability: Single-region architecture has single point of failure. AWS historical data: major regional outages occur 1-2 times per year, averaging 2-4 hours. Single outage can cause significant revenue loss proportional to platform scale and hourly revenue rate.

  3. Regulatory compliance: GDPR requires EU user data stored in EU. Multi-region enables data locality compliance.

  4. User distribution: for example 60% US, 20% Europe, 15% Asia, 5% other. Serving from nearest region reduces latency 50-100ms.

Normal Multi-Region Operation:

Region allocation (Active-Passive Model):

| Region | User Base | Normal Traffic | Role | Capacity |
|--------|-----------|----------------|------|----------|
| US-East-1 | 40% | 400K QPS | Primary US | 100% + 20% standby |
| US-West-2 | 20% | 200K QPS | Secondary US | 100% + 20% standby |
| EU-West-1 | 30% | 300K QPS | EU Primary | 100% + 20% standby |
| AP-Southeast-1 | 10% | 100K QPS | Asia Primary | 100% + 20% standby |

Deployment model: Active-passive within region pairs. Each region serves local users (lowest latency), can handle overflow from neighbor region (geographic redundancy), but cannot handle full global traffic (cost prohibitive).

Trade-off accepted: 20% standby insufficient for full regional takeover, but enables graceful degradation. Full redundancy (200% capacity per region) would triple infrastructure costs.

Traffic Routing & DNS:

Global load balancing: AWS Route53 with geolocation-based routing + health checks.

Normal operation:

Health check mechanism:

Route53 Health Check Configuration:

Failover trigger: When health checks fail for 90 seconds (3 × 30s interval), Route53 marks region unhealthy and returns secondary region’s IP for DNS queries.

DNS TTL impact: Set to 60 seconds. After failover triggered, new DNS queries immediately return healthy region, existing client DNS caches expire within 60s (50% of clients fail over in 30s, 95% within 90s).

Why 60s TTL: Balance between fast failover and DNS query load. Lower TTL (10s) = 6× more DNS queries hitting Route53 nameservers. At high query volumes, this increases costs, but the primary concern is cache efficiency - shorter TTLs mean resolvers cache records for less time, reducing effectiveness of DNS caching infrastructure.

Health check vs TTL costs: Note that health check intervals (10s vs 30s) have different pricing tiers. The 6× query multiplier applies to DNS resolution, not health checks.
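
A hedged sketch of what the health check and geolocation record setup might look like with boto3 (the zone ID, hostnames, and IPs are placeholders; the 30s interval, 3-failure threshold, and 60s TTL mirror the values above):

    # Sketch of the Route53 health check + geolocation record (illustrative placeholders).
    import boto3

    r53 = boto3.client("route53")

    hc = r53.create_health_check(
        CallerReference="us-east-1-ads-api-v1",
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "us-east-1.ads.example.com",
            "ResourcePath": "/healthz",
            "Port": 443,
            "RequestInterval": 30,     # seconds between checks
            "FailureThreshold": 3,     # 3 x 30s = 90s to mark the region unhealthy
        },
    )

    r53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE123",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "ads.example.com",
                "Type": "A",
                "SetIdentifier": "us-east-1",
                "GeoLocation": {"CountryCode": "US"},      # geolocation-based routing
                "TTL": 60,                                  # 60s client fail-over window
                "ResourceRecords": [{"Value": "203.0.113.10"}],
                "HealthCheckId": hc["HealthCheck"]["Id"],   # unhealthy: record stops being returned
            },
        }]},
    )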

Data Replication Strategy:

CockroachDB (Billing Ledger, User Profiles):

Multi-region replication strategy:

Goal: Survive regional failures while maintaining data consistency and acceptable write latency.

Approach:

  1. Determine survival requirements: What failure scenarios must you tolerate?

    • Single AZ failure = need 3 replicas minimum (quorum of 2)
    • Single region failure = need cross-region distribution
    • Multiple concurrent failures = need higher replication factor
  2. Calculate replication factor: Based on consensus quorum requirements

    • Quorum size = floor(replicas / 2) + 1
    • To survive N failures and maintain quorum: replicas ≥ 2N + 1
    • Example: survive 1 region failure → need at least 3 replicas (quorum=2, can lose 1)
    • Example: survive 2 region failures → need at least 5 replicas (quorum=3, can lose 2)
  3. Replica placement strategy: Distribute across regions based on traffic and failure domains

    • Place more replicas in high-traffic regions (reduces read latency)
    • Ensure geographic diversity (regions should have independent failure modes)
    • Balance cost vs resilience (more replicas = higher storage cost)
  4. Trade-offs:

    • More replicas = better fault tolerance but higher cost and write latency
    • Fewer replicas = lower cost but reduced resilience
    • Write latency increases with geographic spread (cross-region = 60-225ms vs same-region = 5-20ms)
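
The quorum arithmetic from the list above, as a small helper:

    # Quorum math for Raft-style consensus: how many replicas survive N failures.
    def quorum(replicas: int) -> int:
        return replicas // 2 + 1

    def min_replicas_to_survive(failures: int) -> int:
        # To keep a quorum after losing `failures` replicas: replicas >= 2*failures + 1
        return 2 * failures + 1

    assert quorum(3) == 2 and min_replicas_to_survive(1) == 3   # survive 1 region failure
    assert quorum(5) == 3 and min_replicas_to_survive(2) == 5   # survive 2 region failures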

Write path: Writes acknowledged when quorum of replicas confirm (Raft consensus). Cross-region write latency ranges 60-225ms (dominated by network RTT).

Read path: Reads served by nearest replica with bounded staleness for eventually-consistent reads (stale reads acceptable for most use cases). Strong-consistency reads must hit the leaseholder (higher latency, but guaranteed fresh data).

Multi-Region Coordination Model

Consistency modes, read routing, and latency impacts per data type:

| Data Type | Storage | Consistency Mode | Read Routing | Write Latency Impact |
|-----------|---------|------------------|--------------|----------------------|
| Billing Ledger | CockroachDB | Strong (linearizable) | Leaseholder only (cross-region) | 60-225ms (quorum across regions) |
| Campaign Configs | CockroachDB | Strong (linearizable) | Leaseholder only (cross-region) | 60-225ms (quorum across regions) |
| User Profiles | CockroachDB | Eventual (bounded staleness) | Nearest replica (local region) | 60-225ms (async after quorum) |
| Budget Counters | Redis (regional) | Local strong (per-region) | Local region only (no replication) | <1ms (in-region atomic ops) |
| User Sessions | Redis (regional) | Local strong (per-region) | Local region only (no replication) | <1ms (in-region atomic ops) |
| ML Features | Redis (regional) | Eventual (cache) | Local region only (rebuilt from Kafka) | <1ms (local write, eventual consistency) |

Key insights: Financial data accepts 60-225ms writes for strong consistency. User Profiles use local replicas (5-10ms reads, seconds staleness). Budget Counters achieve <1ms via regional isolation, accepting bounded loss during failures.

Cross-Region Write Latency Matrix

Measured round-trip time (RTT) between region pairs, showing physical network constraints:

| From ↓ / To → | US-East-1 | US-West-2 | EU-West-1 | AP-Southeast-1 |
|---------------|-----------|-----------|-----------|----------------|
| US-East-1 | 5-10ms | 60-70ms | 65-75ms | 210-225ms |
| US-West-2 | 60-70ms | 5-10ms | 115-125ms | 160-170ms |
| EU-West-1 | 65-75ms | 115-125ms | 5-10ms | 170-180ms |
| AP-Southeast-1 | 210-225ms | 160-170ms | 170-180ms | 5-10ms |

Write latency = slowest region in quorum set. Strong consistency (Billing, Campaigns): 60-225ms quorum writes. Eventual consistency (Profiles): 5-10ms local write, async propagation. Redis: <1ms local-only, no cross-region sync.

Read Routing Strategy

| Data Type | Read Type | Routing Logic | Latency (Regional / Cross-Region) |
|-----------|-----------|---------------|-----------------------------------|
| Billing Ledger | Strong read | Route to leaseholder (may be cross-region) | 5-10ms (local) / 60-225ms (remote) |
| Campaign Configs | Strong read | Route to leaseholder (may be cross-region) | 5-10ms (local) / 60-225ms (remote) |
| User Profiles | Stale read (default) | Nearest replica (always local) | 5-10ms (local only) |
| User Profiles | Fresh read (rare) | Route to leaseholder (may be cross-region) | 5-10ms (local) / 60-225ms (remote) |
| Budget Counters | Atomic read | Local Redis only (no cross-region) | <1ms (local only) |
| ML Features | Cache read | Local Redis only (no cross-region) | <1ms (local only) |

Trade-off: User profile updates written to US-East (5-10ms) may appear stale in EU-West for seconds due to replication lag. Acceptable for targeting (outdated interests don’t impact revenue materially), but cross-region strong reads (65-75ms) would violate 10ms User Profile SLA.

Redis (Budget Pre-Allocation, User Sessions):

CRITICAL ARCHITECTURAL DECISION: Redis does NOT replicate across regions in this design.

Why no cross-region Redis replication:

  1. Latency: Redis replication is synchronous or asynchronous. Synchronous = 50-100ms write latency (violates our <1ms budget enforcement requirement). Asynchronous = data loss during failover.
  2. Complexity: Redis Cluster cross-region replication requires custom solutions (RedisLabs, custom scripts).
  3. Acceptable trade-off: Budget pre-allocations are already bounded loss (see below).

Each region has independent Redis:

Data Consistency During Regional Failover (CRITICAL):

The Budget Counter Problem: When US-East fails, what happens to budget allocations stored in US-East Redis?

Example scenario:

What happens:

  1. Immediate impact: Remaining allocation (0.15 × B_daily) in US-East Redis is lost (region unavailable)
  2. US-West takes over US-East traffic: Continues spending from its own allocation
  3. Bounded over-delivery: Max over-delivery = lost US-East allocation = 0.15 × B_daily
  4. Percentage impact: 15% over-delivery (exceeds our 1% target!)

Mitigation: CockroachDB-backed allocation tracking (implemented)

Every 60 seconds, each region writes actual spend to CockroachDB:

Heartbeat update (US-East region, every 60s while healthy):

Failover recovery process:

  1. T+0s: US-East fails
  2. T+90s: Health checks trigger failover, US-West starts receiving US-East traffic
  3. T+120s: Atomic Pacing Service detects US-East heartbeat timeout (last write was 120s ago)
  4. T+120s: Atomic Pacing Service reads last known state from CockroachDB:
    • US-East allocated: 0.30 × B_daily
    • US-East spent: 50% of allocation (written 120s ago)
    • Remaining (uncertain): ~15% of B_daily
  5. T+120s: Atomic Pacing Service marks US-East allocation as “failed” and removes from available budget
  6. Result: 15% of budget locked but not over-delivered

Bounded under-delivery: Max under-delivery = unspent allocation in failed region = 15% of budget.
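
A sketch of the heartbeat write and the pacing-service timeout check, assuming CockroachDB is reached over the Postgres wire protocol via psycopg2 (table and column names are hypothetical; UPSERT is CockroachDB syntax):

    # Sketch of the per-region heartbeat and the pacing-service timeout check.
    import psycopg2

    def write_heartbeat(dsn: str, region: str, campaign_id: str, allocated: float, spent: float) -> None:
        """Runs every 60s in each healthy region."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                """UPSERT INTO region_allocations
                       (region, campaign_id, allocated, spent, heartbeat_at)
                   VALUES (%s, %s, %s, %s, now())""",
                (region, campaign_id, allocated, spent),
            )

    def lock_stale_allocations(dsn: str) -> None:
        """Pacing service: mark allocations of regions whose heartbeat timed out as failed."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                """UPDATE region_allocations
                      SET status = 'failed'
                    WHERE heartbeat_at < now() - interval '120 seconds'
                      AND status = 'active'"""
            )
            # The unspent remainder of a failed allocation is excluded from further
            # pacing: bounded under-delivery instead of unbounded over-delivery.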

Concrete example with dollar amounts:

Campaign with $10,000 daily budget:

Why under-delivery is acceptable:

Failure Scenario: US-East Regional Outage

Scenario: Primary region (US-East) fails, handling 40% of traffic. What happens?

Failover Timeline:

| Time | Event | System State |
|------|-------|--------------|
| T+0s | Health check failures detected | DNS TTL delay (60s) |
| T+30s | 3× traffic hits US-West | CPU: 40%→85%, standby activated |
| T+60s | Auto-scaling triggered | Provisioning new capacity |
| T+90s | Cache hit degradation | Latency p95: 100ms→150ms |
| T+90s | Route53 marks US-East unhealthy | DNS failover begins |
| T+90-100s | New instances online | Capacity restored (30-40s provisioning after T+60s trigger) |
| T+120s | Atomic Pacing Service locks US-East allocations | Under-delivery protection active |

Why 20% Standby is Insufficient:

The timeline above shows a critical problem: from T+30s to T+90-100s (60-70 seconds with modern tooling), US-West is severely overloaded. To understand why, we need queueing theory.

Capacity Analysis:

Server utilization in queueing theory: $$\rho = \frac{\lambda}{c \mu}$$

where λ = request arrival rate (QPS), c = number of servers, and μ = per-server service rate (requests/sec per server).

Critical thresholds:

Normal operation (US-West):

During US-East failure (US-West receives 40% of total traffic):

Auto-scaling limitations: Kubernetes HPA triggers at T+60s, but provisioning new capacity takes 30-40 seconds for GPU-based ML inference nodes with modern tooling (pre-warmed images, model streaming, VRAM caching). Without optimization, this can extend to 90-120s (cold pulls, full model loading). During this window, the system operates at 2× over capacity, making graceful degradation essential.
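
Plugging illustrative numbers into the utilization formula above (server counts and per-server rates are assumptions chosen only to show the jump from comfortable to saturated, not figures from the text):

    # Utilization rho = lambda / (c * mu), with assumed illustrative numbers:
    # US-West normally serves 200K QPS on capacity sized for ~500K QPS.
    def utilization(arrival_qps: float, servers: int, per_server_qps: float) -> float:
        return arrival_qps / (servers * per_server_qps)

    normal = utilization(200_000, servers=250, per_server_qps=2_000)     # ~0.40
    failover = utilization(600_000, servers=300, per_server_qps=2_000)   # 20% standby activated -> 1.00

    print(f"normal rho={normal:.2f}, failover rho={failover:.2f}")
    # rho >= 1.0 means queues grow without bound until auto-scaling adds capacity,
    # which is why load shedding and graceful degradation are required during the gap.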

Mitigation: Graceful Degradation + Load Shedding

Architectural Driver: Availability - During regional failures, graceful degradation (serving stale cache, shedding low-value traffic) maintains uptime while minimizing revenue impact. Better to serve degraded ads than no ads.

The system employs a two-layer mitigation strategy:

Layer 1: Service-Level Degradation (Circuit Breakers)

  1. ML Inference: Switch to cached CTR predictions (-8% revenue impact)
  2. User Profiles: Serve stale cache with 5-minute TTL (-5% impact)
  3. RTB Auction: Reduce to top 20 DSPs only (-6% impact)

Layer 2: Load Shedding (Utilization-Based)

When utilization exceeds capacity despite degradation:

| Utilization | Action | Logic |
|-------------|--------|-------|
| <70% | Accept all | Normal operation |
| 70-90% | Accept all + degrade services | Circuit breakers active, auto-scaling triggered |
| >90% | Value-based shedding | Accept high-value (>P95), reject 50% of low-value |
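
A sketch of the shedding decision encoded by the table above (the request-value percentile is assumed to come from the bid/eCPM estimate):

    # Sketch of the utilization-based shedding decision.
    import random

    def admit(utilization: float, request_value_percentile: float) -> bool:
        if utilization < 0.70:
            return True                   # normal operation
        if utilization <= 0.90:
            return True                   # accept, but circuit breakers degrade dependencies
        # >90%: value-based shedding - always keep high-value (>P95) requests
        if request_value_percentile > 0.95:
            return True
        return random.random() >= 0.5     # reject ~50% of low-value traffic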

Combined impact during regional failover:

Failback Strategy:

After US-East recovers, gradual traffic shift back:

Automated steps:

  1. T+0: US-East infrastructure restored, health checks start passing
  2. T+5min: Route53 marks US-East healthy again, BUT weight set to 0%
  3. T+5min: Manual verification: Engineering team checks metrics, error rates
  4. T+10min: Traffic ramp begins: 5% → 10% → 25% → 50% → 100% over 30 minutes
  5. T+40min: Full traffic restored to US-East

Manual gates: Failback is semi-automatic. Requires manual approval at each stage to prevent cascade failures.

Data reconciliation:

CockroachDB: Already consistent (Raft consensus maintained across regions). Redis: Rebuild from scratch (Atomic Pacing Service re-allocates budgets based on CockroachDB source of truth, cold cache for 10-20 minutes).

Why gradual failback: Prevents “split-brain” scenario where both regions think they’re primary.

Cost Analysis: Multi-Region Economics

Infrastructure cost multipliers:

| Component | Single Region | Multi-Region (4 regions) | Multiplier |
|-----------|---------------|--------------------------|------------|
| Compute (ad servers, ML) | Baseline | 3× baseline | 3× |
| CockroachDB (5 replicas) | Baseline | 3× baseline | 3× |
| Redis (per region) | Baseline | 3× baseline | 3× |
| Cross-region data transfer | None | 30% of baseline | Significant (new cost category) |
| Route53 (health checks) | Baseline | 3× baseline | 3× |
| Total | Baseline | 3.3× baseline | 3.3× |

Cross-region data transfer breakdown:

Key cost drivers:

Economic justification:

Single region annual risk:

Multi-region infrastructure availability: 99.99%+ (survives full regional failures)

Note: Our service SLO remains 99.9% regardless of deployment strategy. Multi-region provides availability headroom - the infrastructure supports higher uptime than we commit to users, providing buffer for application-level failures.

Trade-off analysis:

Intangible benefits:

Decision: Multi-region worth the 3.3× cost multiplier for platforms where revenue rate justifies availability investment.

Note on cost multiplier breakdown: The 3.3× figure is derived from:

Industry validation: Dual-region setups cost 1.3-2× (not a full 2×) due to shared infrastructure. For 4-region deployments, the multiplier falls between 3-3.5× based on documented case studies. This estimate is order-of-magnitude accurate but workload-dependent.

Capacity conclusion: 20% standby insufficient for immediate regional takeover, but combined with auto-scaling (30-40s with modern tooling, 90-120s legacy) and graceful degradation, provides cost-effective resilience. Alternative (200% over-provisioning per region) would reach 8-10× baseline costs. Trade-off: Accept degraded performance and bounded under-delivery during rare regional failures rather than excessive capacity overhead.


Schema Evolution: Zero-Downtime Data Migration

The Challenge:

You’ve been running your CockroachDB-based user profile store for 18 months. It’s grown to 4TB across 60 nodes. Now the product team wants to add a complex new feature that requires fundamental schema changes:

The constraint: Zero downtime. You can’t take the platform offline for migration.

Why Schema Evolution in Distributed SQL:

CockroachDB (distributed SQL) provides native schema migration support with ALTER TABLE, but large-scale changes still require careful planning:

  1. Online schema changes - CockroachDB supports most DDL operations without blocking (ADD COLUMN, CREATE INDEX with CONCURRENTLY)
  2. Strong consistency - ACID guarantees mean no dual-schema reads (unlike eventual consistency systems)
  3. Massive scale - 4TB migration for index backfill = 2-4 hours with proper throttling
  4. Version compatibility - Application code should use backward-compatible queries during rolling deployment

Zero-Downtime Migration Strategy:

Phase 1: Add Column (Non-blocking - Day 1)

CockroachDB supports online schema changes with ALTER TABLE:

Schema change (non-blocking, returns immediately):

Application code updated to write to new column immediately. Reads handle both NULL (old rows) and populated (new rows) gracefully.

No dual-write complexity: ACID transactions guarantee consistency - either transaction sees new schema or old schema, never inconsistent state.
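
An illustrative Phase 1 statement and backward-compatible read (the column and table names are hypothetical, not the production schema):

    # Phase 1 sketch: online ADD COLUMN plus a backward-compatible read.
    ADD_COLUMN_SQL = """
    ALTER TABLE user_profiles
        ADD COLUMN IF NOT EXISTS interest_scores JSONB;   -- online, returns immediately
    """

    def read_interest_scores(row: dict) -> dict:
        # Old rows have NULL until they are naturally rewritten; treat NULL as empty.
        return row.get("interest_scores") or {}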

Phase 2: Add Index (Background with throttling - Week 1-2)

Create index with CONCURRENTLY to avoid blocking writes:

Index creation (concurrent, non-blocking):

Index backfill rate:

CockroachDB throttles background index creation to roughly 25% of cluster resources to avoid impacting production traffic. Assuming ~1 GB/s of aggregate scan throughput across the cluster, the 4TB backfill takes:

$$T_{index} = \frac{4000 \text{ GB}}{1000 \text{ MB/s} \times 0.25} \approx 4\text{-}6 \text{ hours}$$

Monitor progress: SHOW JOBS displays percentage complete and estimated completion time.
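
An illustrative Phase 2 statement and monitoring query (index and column names are hypothetical; the CONCURRENTLY keyword is kept from the text, and SHOW JOBS is CockroachDB's progress view):

    # Phase 2 sketch: background index build plus progress monitoring.
    CREATE_INDEX_SQL = """
    CREATE INDEX CONCURRENTLY idx_profiles_segment_activity
        ON user_profiles (segment_id, last_active_at);
    """

    MONITOR_SQL = """
    SHOW JOBS;   -- look for the SCHEMA CHANGE row: fraction completed, running status
    """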

Phase 3: Partition Restructuring (Complex - Week 2-4)

Modifying table partitioning (adding region to partition key) requires creating new table with desired partitioning, then migrating data. This is the only operation that requires dual-write pattern:

Create new partitioned table (user_profiles_v2):

Dual-write application logic (temporary, 2-4 weeks):
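
A sketch of the temporary dual-write path (the write API and column layout are assumptions; user_profiles_v2 follows the text):

    # Phase 3 sketch: temporary dual-write to user_profiles (old partitioning) and
    # user_profiles_v2 (new, region-first partitioning).
    def write_profile(cur, profile: dict) -> None:
        # Single transaction: either both tables see the write or neither does.
        cur.execute(
            "UPSERT INTO user_profiles (user_id, region, data) VALUES (%s, %s, %s)",
            (profile["user_id"], profile["region"], profile["data"]),
        )
        cur.execute(
            "UPSERT INTO user_profiles_v2 (region, user_id, data) VALUES (%s, %s, %s)",
            (profile["region"], profile["user_id"], profile["data"]),   # region-first key
        )

    # Reads stay on user_profiles until the backfill completes and shadow reads from
    # user_profiles_v2 have matched for the agreed soak period; then flip the read path.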

Why this is simpler than Cassandra:

Rollback Strategy:

At any point during migration, rollback is possible:

| Phase | Rollback Complexity | Max Data Loss |
|-------|---------------------|---------------|
| Phase 1-2 (Dual-write) | Easy - flip read source back to old schema | 0 (both schemas current) |
| Phase 3-4 (Gradual cutover) | Medium - revert traffic percentage | 0 (still dual-writing) |
| Phase 5 (Cleanup started) | Hard - restore from archive | Up to 90 days if archive corrupted |

Critical lesson: Keep dual-write active for 2+ weeks after full cutover to ensure new schema stability before cleanup.

CockroachDB-Specific Advantages:

Online schema changes:

CockroachDB performs most schema changes online without blocking - adding columns, creating indexes, and modifying constraints happen in the background while applications continue to operate normally.

Partition restructuring complexity:

Changing primary key requires full rewrite - you can’t update partition key in place:

Schema change:

This requires complete data copy to new table with reshuffling across nodes. Plan for 2-4 week migration window for large datasets (estimate varies based on data volume, cluster capacity, and acceptable impact on production traffic).

Trade-off Analysis: Zero-Downtime vs Maintenance Window Migration

Context: Database schema changes (like changing primary keys or sharding strategies) require data migration. The choice is between engineering complexity (zero-downtime) vs business impact (downtime).

Option A: Zero-downtime migration (described above)

Option B: Maintenance window migration

Decision framework:

| Factor | Zero-Downtime | Maintenance Window |
|--------|---------------|--------------------|
| Engineering cost | 0.3-0.4 engineer-years | ~0.05 engineer-years |
| Complexity | High (dual-write, background sync) | Low (direct copy) |
| Business impact | Zero downtime | 12-24 hours of revenue lost |
| Cost ratio | (baseline) | 40-70× revenue impact vs engineering cost |

Decision: For revenue-generating platforms at scale, zero-downtime migration is economically justified by 40-70×. The engineering investment (0.3-0.4 engineer-years) is negligible compared to the downtime impact: 12-24 hours offline costs 40-70× more in lost revenue than the entire migration effort.

This conclusion holds across wide parameter ranges: even if engineering costs are 2× higher or platform traffic is 5× lower, zero-downtime migration remains the optimal choice for business-critical systems.


Distributed Clock Synchronization and Time Consistency

Architectural Driver: Financial Accuracy - Clock skew across regions can cause budget double-allocation or billing disputes. HLC + bounded allocation windows guarantee deterministic ordering for financial transactions.

Problem: Multi-region systems require accurate timestamps for budget tracking and billing reconciliation. Clock drift (1-50ms/day per server) causes billing disputes, budget race conditions, and causality violations. Without synchronization, 1000 servers can diverge by 50s in one day.

Solution Spectrum: NTP → PTP → Global Clocks

| Technology | Accuracy | Cost | Use Case |
|------------|----------|------|----------|
| NTP (Network Time Protocol) | ±50ms (public), ±10ms (local) | Free | General-purpose time sync |
| PTP (Precision Time Protocol) | ±100μs | Medium (hardware switches) | High-frequency trading, telecom |
| GPS-based Clocks | ±1μs | High (GPS receivers per rack) | Critical infrastructure |
| Google Spanner TrueTime | ±7ms (bounded uncertainty) | Very high (proprietary) | Global strong consistency |
| AWS Time Sync Service | <100μs (modern instances), ±1ms (legacy) | Free (on AWS) | Cloud deployments (Nitro system 2021+) |

Multi-tier time synchronization:

Tier 1 - Event Timestamping: AWS Time Sync (<100μs with modern instances, ±1ms legacy, free). Network latency (20-100ms) dwarfs clock skew, making NTP sufficient for impressions/clicks.

Tier 2 - Financial Reconciliation: CockroachDB built-in HLC provides automatic globally-ordered timestamps: \(HLC = (t_{physical}, c_{logical}, id_{node})\). Guarantees causality preservation (if A→B then HLC(A) < HLC(B)) and deterministic ordering via logical counters + node ID tie-breaking.
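
CockroachDB ships HLC internally; the toy sketch below only illustrates the update rule behind the (physical, logical, node) tuple and is not the database's implementation:

    # Toy Hybrid Logical Clock illustrating the (physical, logical, node) ordering rule.
    import time
    from dataclasses import dataclass

    @dataclass(order=True)
    class HLC:
        physical: int     # wall-clock milliseconds
        logical: int      # counter for events within the same millisecond
        node_id: int      # deterministic tie-breaker

    class Clock:
        def __init__(self, node_id: int):
            self.ts = HLC(0, 0, node_id)

        def now(self) -> HLC:                      # local or send event
            wall = int(time.time() * 1000)
            if wall > self.ts.physical:
                self.ts = HLC(wall, 0, self.ts.node_id)
            else:
                self.ts = HLC(self.ts.physical, self.ts.logical + 1, self.ts.node_id)
            return self.ts

        def update(self, remote: HLC) -> HLC:      # receive event: never move backwards
            wall = int(time.time() * 1000)
            physical = max(wall, self.ts.physical, remote.physical)
            if physical == self.ts.physical == remote.physical:
                logical = max(self.ts.logical, remote.logical) + 1
            elif physical == self.ts.physical:
                logical = self.ts.logical + 1
            elif physical == remote.physical:
                logical = remote.logical + 1
            else:
                logical = 0
            self.ts = HLC(physical, logical, self.ts.node_id)
            return self.ts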

Clock skew mitigation: Create 200ms “dead zone” at day boundaries (23:59:59.900 to 00:00:00.100) where budget allocations are forbidden. Prevents regions with skewed clocks from over-allocating across day boundaries.

Architecture decision: AWS Time Sync (<100μs with modern instances, ±1ms legacy, free) + CockroachDB built-in HLC. Google Spanner’s TrueTime (±7ms) not worth complexity given 20-100ms network variability.

Note on AWS Time Sync accuracy: AWS upgraded Time Sync Service in 2021. Current-generation EC2 instances (Nitro system, 2021+) achieve <100μs accuracy using PTP hardware support. Older instance types (pre-2021 AMIs) see ±1ms. For this architecture, assume modern instances (<100μs). If using legacy infrastructure, adjust HLC uncertainty interval accordingly (see CockroachDB --max-offset flag).

Advantage: Eliminates ~150 lines of custom HLC code, provides battle-tested clock synchronization.

Monitoring: Alert if clock offset >100ms, HLC logical counter growth >1000/sec sustained, or budget discrepancy >0.5% of daily budget.

Global Event Ordering for Financial Ledgers: The External Consistency Challenge

Architectural Driver: Financial Accuracy - Financial audit trails require globally consistent event ordering across regions. CockroachDB’s HLC-timestamped billing ledger provides near-external consistency, ensuring that events are ordered chronologically for regulatory compliance. S3 + Athena serves as immutable cold archive for 7-year retention.

The Problem: Global Event Ordering

Budget pre-allocation (Redis) solves fast local enforcement, but billing ledgers require globally consistent event ordering across regions. Without coordinated timestamps, audit trails can show incorrect event sequences.

Example: US-East allocates budget amount A (T1), EU-West spends A exhausting budget (T2). Without coordinated timestamps, separate regional databases using local clocks might timestamp T1 after T2 due to clock skew, showing wrong ordering in audit logs.

Solution: CockroachDB HLC-Timestamped Ledger

CockroachDB provides near-external consistency using Hybrid Logical Clocks: $$HLC = (pt, c)$$ where pt = physical time, c = logical counter.

Guarantee: Causally related transactions get correctly ordered timestamps via Raft consensus. CockroachDB’s HLC uncertainty interval is dynamically bounded - legacy deployments use 500ms max_offset (default), but modern deployments with AWS Time Sync achieve <2ms uncertainty (500× improvement, see CockroachDB issue #75564). Independent transactions within this uncertainty window may have ambiguous ordering, but this is acceptable - even with 2ms uncertainty, network latency (60-225ms) already dominates, and causally related events (same campaign) are correctly ordered.

Requirements met:

Architecture Decision: Three-Tier Financial Data Storage

    
    graph LR
    ADV["Ad Server<br/>1M QPS<br/>Local budget: 0ms"]
    REDIS[("Tier 1: Redis<br/>Atomic DECRBY<br/>Allocation only")]
    CRDB[("Tier 2: CockroachDB<br/>HLC Timestamps<br/>10-15ms<br/>90-day hot")]
    S3[("Tier 3: S3 Glacier + Athena<br/>Cold Archive<br/>7-year retention")]
    ADV -.->|"Allocation request<br/>Every 30-60s (async)"| REDIS
    REDIS -->|"Reconciliation<br/>Every 5 min"| CRDB
    CRDB -->|"Nightly archive<br/>Parquet format"| S3
    classDef fast fill:#e3f2fd,stroke:#1976d2
    classDef ledger fill:#fff3e0,stroke:#f57c00
    classDef archive fill:#f3e5f5,stroke:#7b1fa2
    class REDIS fast
    class CRDB ledger
    class S3 archive

Why This Three-Tier Architecture:

| Tier | Technology | Purpose | Consistency Requirement |
|------|------------|---------|-------------------------|
| Local Counter | In-memory CAS | Per-request spend tracking (0ms) | Atomic in-memory operations |
| Tier 1: Allocation | Redis | Global budget allocation (async) | Atomic DECRBY/INCRBY |
| Tier 2: Billing Ledger | CockroachDB | Financial audit trail with global ordering | Serializable + HLC ordering |
| Tier 3: Cold Archive | S3 Glacier + Athena | 7-year regulatory retention | None (immutable archive) |

Workflow:

  1. Per-request spend (1M QPS): Local in-memory counter increment (0ms, not in critical path)
  2. Allocation request (every 30-60s): Ad Server requests budget chunk from Redis via DECRBY (async)
  3. Reconciliation (every 5min): Ad Server reports spend to CockroachDB with HLC timestamps
  4. Nightly archival: Export 90-day-old records to S3 Glacier in Parquet format (7-year retention, queryable via Athena for compliance audits)
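
A sketch of steps 1-2 of this workflow: a per-server in-memory counter refilled in chunks from Redis via atomic DECRBY (the chunk size and key naming are assumptions):

    # Sketch of the allocation workflow: fast local counter, chunked Redis refills.
    import redis

    class RegionalBudget:
        def __init__(self, r: redis.Redis, campaign_id: str, chunk_micros: int = 500_000):
            self.r, self.key, self.chunk = r, f"budget:{campaign_id}", chunk_micros
            self.local_remaining = 0        # Tier 0: in-memory, 0ms per request

        def try_spend(self, cost_micros: int) -> bool:
            if self.local_remaining < cost_micros:
                self._refill()               # async every 30-60s in production, inline here
            if self.local_remaining >= cost_micros:
                self.local_remaining -= cost_micros
                return True
            return False                     # campaign budget exhausted

        def _refill(self) -> None:
            remaining = self.r.decrby(self.key, self.chunk)   # atomic global decrement
            if remaining >= 0:
                self.local_remaining += self.chunk
            else:
                self.r.incrby(self.key, self.chunk)           # chunk not available; give it back

        # Every ~5 minutes, actual spend is reported to the CockroachDB ledger with
        # HLC timestamps (Tier 2); 90-day-old rows are archived to S3 (Tier 3).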

Cost Analysis:

| Component | Technology | Relative Cost |
|-----------|------------|---------------|
| Fast path | Redis Cluster (20 nodes) | 18-22% |
| Billing ledger (90-day hot) | CockroachDB (60-80 nodes) | 77-80% |
| Cold archive (7-year) | S3 Glacier + Athena | 1-2% |
| Total financial storage | | 100% baseline |

Why S3 Glacier + Athena over PostgreSQL:

Build vs Buy: Custom PostgreSQL + HLC implementation costs 1-1.5 engineer-years plus ongoing maintenance. CockroachDB’s premium (20-30% of financial storage baseline) eliminates upfront engineering cost and operational burden. For cold archive, S3 + Athena is the clear choice - no operational burden and 50-100× cheaper than running a database.

Financial Audit Log Reconciliation

Purpose: Verify operational ledger (CockroachDB) matches immutable audit log (ClickHouse) to detect data inconsistencies, event emission bugs, or system integrity issues before they compound into billing disputes.

Dual-Ledger Architecture:

    
    graph TB
    ADV["Budget Service<br/>Ad Server"]
    ADV -->|"1 - Direct write<br/>Transactional"| CRDB[("CockroachDB<br/>Operational Ledger<br/>90-day hot")]
    ADV -->|"2 - Publish event<br/>Async"| KAFKA[("Kafka<br/>Financial Events")]
    KAFKA -->|"Stream"| CH[("ClickHouse<br/>Immutable Audit Log<br/>7-year retention")]
    RECON["Reconciliation Job<br/>Daily 2:00 AM UTC"]
    CRDB -.->|"Aggregate yesterday"| RECON
    CH -.->|"Aggregate yesterday"| RECON
    RECON -->|"99.999% match"| OK["No action"]
    RECON -->|"Discrepancy detected"| ALERT["Alert Finance Team<br/>P1 Page"]
    ALERT --> INVESTIGATE["Investigation:<br/>- Kafka lag 85%<br/>- Schema mismatch 10%<br/>- Event bug 5%"]
    classDef operational fill:#fff3e0,stroke:#f57c00
    classDef audit fill:#e8f5e9,stroke:#388e3c
    classDef stream fill:#e3f2fd,stroke:#1976d2
    classDef check fill:#f3e5f5,stroke:#7b1fa2
    class CRDB operational
    class CH audit
    class KAFKA stream
    class RECON,ALERT,INVESTIGATE check

Daily Reconciliation Job (automated, runs 2:00 AM UTC):

Step 1: Query Both Systems

Extract previous 24 hours of financial data from both ledgers:

Step 2: Compare Aggregates

Per-campaign validation with acceptable tolerance:

Step 3: Alert on Discrepancies

Automated notification when thresholds exceeded:
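
A sketch covering steps 1-3; the per-campaign aggregates are assumed to be produced by upstream CockroachDB and ClickHouse queries, and the 0.1% per-campaign tolerance is an assumed value:

    # Sketch of the daily reconciliation (steps 1-3).
    TOLERANCE = 0.001   # assumed 0.1% per-campaign tolerance between the two ledgers

    def reconcile(operational_spend: dict[str, float], audit_spend: dict[str, float]) -> list[dict]:
        """Both inputs: campaign_id -> spend aggregated over the previous 24h."""
        discrepancies = []
        for campaign_id in operational_spend.keys() | audit_spend.keys():
            op = operational_spend.get(campaign_id, 0.0)
            audit = audit_spend.get(campaign_id, 0.0)
            base = max(op, 1e-9)
            if abs(op - audit) / base > TOLERANCE:
                discrepancies.append({
                    "campaign_id": campaign_id,
                    "operational": op,
                    "audit": audit,
                    "delta_pct": (audit - op) / base,
                })
        if discrepancies:
            page_finance_team(discrepancies)     # P1 page per the workflow above
        return discrepancies

    def page_finance_team(items: list[dict]) -> None:
        print(f"[P1] {len(items)} campaign(s) diverge between CockroachDB and ClickHouse")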

Step 4: Investigation Workflow

Forensic analysis to identify root cause:

  1. Drill-down query: Retrieve all transactions for affected campaignId from both systems ordered by timestamp
  2. Event correlation: Match requestId between operational logs and audit trail to identify missing/duplicate events
  3. Common causes identified:
    • Kafka lag (85% of discrepancies): Event delayed >24 hours due to consumer backlog → resolves automatically when ClickHouse catches up
    • Schema mismatch (10%): Field name change in event emission without updating ClickHouse parser → fix parser, backfill missing events
    • Event emission bug (5%): Edge case where Budget Service fails to emit event → fix bug, manual INSERT into ClickHouse with audit trail explanation

Step 5: Resolution

Manual intervention when automated reconciliation fails:

Compliance Verification

Quarterly Audit Preparation:

External auditor access workflow:

  1. Export ClickHouse data: Generate Parquet files for audit period (e.g., Q4 2024: Oct 1 - Dec 31)
  2. Cryptographic verification: Run hash chain validation across exported dataset, produce merkle tree root hash as tamper-evident seal
  3. Auditor query interface: Provide read-only Metabase dashboard with pre-built queries (campaign spend totals, refund analysis, dispute history)
  4. Documentation bundle: Reconciliation job logs, discrepancy resolution tickets, system architecture diagrams
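
A sketch of the tamper-evidence check referenced in step 2: a SHA-256 hash chain whose final digest acts as the seal handed to auditors (a simple chain rather than a full Merkle tree, for brevity; the record serialization is an assumption):

    # Hash-chain seal over an exported audit dataset.
    import hashlib
    import json

    def seal(records: list[dict]) -> str:
        digest = b"\x00" * 32                      # genesis value
        for record in records:                     # records must be in a canonical order
            payload = json.dumps(record, sort_keys=True).encode()
            digest = hashlib.sha256(digest + payload).digest()
        return digest.hex()

    def verify(records: list[dict], expected_seal: str) -> bool:
        return seal(records) == expected_seal      # any edit or reorder changes the seal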

SOX Control Documentation:

Segregation of Duties:

Change Audit:

Administrative operations on financial data systems logged separately:

Access Control Matrix:

| Role | CockroachDB | ClickHouse | Kafka | Permitted Operations |
|------|-------------|------------|-------|----------------------|
| Budget Service | Write-only | No access | Publish events | INSERT billing records |
| Finance Team | Read-only | Read-only | No access | Query, export, reporting |
| DBA Team | Admin | Read-only | Admin | Schema changes, performance tuning |
| Audit Team | Read-only | Read-only | Read-only | Compliance verification |
| Engineering | Read-only (production) | Read-only | Read-only | Debugging, monitoring |

Retention Policy Enforcement:

Automated Archival (runs monthly):

Data lifecycle management ensuring compliance while optimizing costs:

  1. Age detection: Identify partitions older than 7 years based on timestamp conversion to year-month format
  2. Export to cold storage: Write partition data to S3 Glacier in Parquet format with WORM (Write-Once-Read-Many) configuration
  3. External table creation: Create ClickHouse external table pointing to S3 location (data remains queryable via standard SQL but stored at 1/50th cost)
  4. Partition drop: Remove from ClickHouse hot storage after S3 export verified (logged as administrative action)
  5. Verification: Monthly job validates S3 object count matches dropped partitions, alerts if mismatch detected

Cost Impact:

Retention policy reduces storage costs while maintaining compliance accessibility:

Budget Reconciliation & Advertiser Compensation Workflow

Architectural Driver: Financial Accuracy - Automated discrepancy detection, retroactive correction, and advertiser compensation workflows ensure billing accuracy ≤1% while maintaining trust. Manual intervention only for exceptions >2% of budget.

The Problem: Budget Overspend & Underspend Edge Cases

Despite bounded micro-ledger (BML) architecture with 0.5-1% inaccuracy bounds, edge cases cause billing discrepancies:

Root causes:

  1. Redis failover: Regional failure with unsynced counter state (15% under-delivery risk per Part 4 multi-region section)
  2. Network partitions: Split-brain scenario where regions can’t sync budget state (bounded by allocation window: max 5min × allocation rate)
  3. Clock skew beyond bounds: HLC uncertainty >2ms in legacy deployments (rare with AWS Time Sync, but possible during NTP failures)
  4. Race conditions at day boundary: Multiple regions allocating final budget chunks simultaneously (mitigated by 200ms dead zone, but not eliminated)
  5. Software bugs: Event emission failures, counter drift, schema evolution issues

Financial trust requirement: This platform targets ≤1% budget variance (stricter than industry-wide ad discrepancy standards of 1-10%, per IAB guidelines). Enterprise advertisers set hard daily budgets and expect strict enforcement. Advertisers tolerate 1-2% variance without complaint, escalate at 2-5%, and demand refunds/credits >5%. Automation required to handle 95%+ of cases without manual intervention.

Preventive Measures During Edge Cases

Real-time throttling bounds overspend during active failures (reconciliation handles post-hoc correction):

Network partition throttling: Detect sync failure (CockroachDB heartbeat >120s, Redis lag >5s) → reduce allocation to 50% rate per region. With throttling: 3 regions at 50% = 0.175% overspend ($17.50 on $10K daily budget for 5-min window). Without throttling: 0.7% overspend ($70). Throttling reduces exposure by 75%.

Clock skew protection: 200ms dead zone at day boundaries (23:59:59.900 to 00:00:00.100) prevents double-allocation when region clocks differ by ±200ms.

Race condition mitigation (low budget <5%): Pessimistic locking (CockroachDB SELECT FOR UPDATE) serializes allocation requests. Failed regions retry with 50% allocation size, accepting uneven distribution over overspend.
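
A sketch of the throttling rules above as a single allocation-sizing function (the base chunk size is whatever the pacing service normally hands out; thresholds follow the text):

    # Allocation throttle during sync failures and day-boundary dead zones.
    def allocation_chunk(base_chunk: int, crdb_heartbeat_age_s: float, redis_lag_s: float,
                         in_dead_zone: bool) -> int:
        if in_dead_zone:                 # 200ms window around the day boundary
            return 0                     # no allocations at all
        if crdb_heartbeat_age_s > 120 or redis_lag_s > 5:
            return base_chunk // 2       # degraded sync: 50% allocation rate bounds overspend
        return base_chunk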

Reconciliation Architecture

The system uses a four-stage pipeline to detect, classify, correct, and compensate for budget discrepancies:

    
    graph TB
    subgraph "Stage 1: Detection (Every 5 min)"
        REDIS[("Redis Counters<br/>Live Spend")]
        CRDB[("CockroachDB<br/>Billing Ledger")]
        RECON_LIVE["Live Reconciliation Job<br/>Compare Redis vs CockroachDB"]
        REDIS --> RECON_LIVE
        CRDB --> RECON_LIVE
        RECON_LIVE -->|"Δ ≤ 1%"| OK1["No action"]
        RECON_LIVE -->|"1% < Δ ≤ 2%"| WARN["Log warning"]
        RECON_LIVE -->|"Δ > 2%"| ALERT["P2 Alert"]
    end
    subgraph "Stage 2: Classification (Daily 2 AM UTC)"
        DAILY["Daily Reconciliation<br/>Final spend vs budget"]
        CRDB --> DAILY
        DAILY -->|"Exact match"| OK2["No action"]
        DAILY -->|"Underspend"| UNDER{"Amount?"}
        DAILY -->|"Overspend"| OVER{"Amount?"}
        UNDER -->|"< 1%"| ACCEPT_U["Accept<br/>Log only"]
        UNDER -->|"≥ 1%"| CREDIT["Issue Credit"]
        OVER -->|"≤ 1%"| ACCEPT_O["Accept<br/>Bounded by design"]
        OVER -->|"> 1%"| REFUND["Issue Refund"]
    end
    subgraph "Stage 3: Correction"
        CREDIT --> LEDGER_ADJ["Ledger Adjustment<br/>CockroachDB"]
        REFUND --> LEDGER_ADJ
        LEDGER_ADJ --> AUDIT["Audit Log Entry<br/>Immutable ClickHouse"]
    end
    subgraph "Stage 4: Compensation"
        AUDIT -->|"Underspend ≥ 1%"| AUTO_CREDIT["Automated Credit<br/>Advertiser Account"]
        AUDIT -->|"Overspend > 1%"| AUTO_REFUND["Automated Refund<br/>Payment Gateway"]
        AUTO_CREDIT --> NOTIFY["Email Notification<br/>+ Dashboard Update"]
        AUTO_REFUND --> NOTIFY
        NOTIFY -->|"Any > 2%"| MANUAL_REVIEW["Manual Review<br/>Finance Team"]
    end
    classDef detection fill:#e3f2fd,stroke:#1976d2
    classDef classification fill:#fff3e0,stroke:#f57c00
    classDef correction fill:#e8f5e9,stroke:#388e3c
    classDef compensation fill:#f3e5f5,stroke:#7b1fa2
    class RECON_LIVE,REDIS,CRDB detection
    class DAILY,UNDER,OVER classification
    class LEDGER_ADJ,AUDIT correction
    class AUTO_CREDIT,AUTO_REFUND,NOTIFY,MANUAL_REVIEW compensation

Stage 1: Discrepancy Detection (Live + Daily)

Live (every 5 min): Compare Redis counters vs CockroachDB ledger. Thresholds: ≤1% no action, 1-2% log warning, >2% P2 alert.

Daily (2 AM UTC): End-of-day reconciliation of final spend vs budget. Classify as UNDERSPEND/OVERSPEND/EXACT. Process only variances >0.5%.

Stage 2: Classification & Decision Logic

Underspend scenarios:

| Variance | Root Cause | Action | Justification |
|----------|------------|--------|---------------|
| <0.5% | Normal allocation granularity | Accept | Advertiser unlikely to notice |
| 0.5-1% | Redis sync lag, allocation rounding | Accept + log | Within industry tolerance |
| 1-5% | Regional failover (bounded loss) | Issue credit | Advertiser paid for undelivered impressions |
| >5% | Software bug or manual pause | Issue credit + investigate | Significant revenue loss to advertiser |

Overspend scenarios:

| Variance | Root Cause | Action | Justification |
|----------|------------|--------|---------------|
| ≤0.5% | BML inaccuracy bound (by design) | Accept | Within contractual SLA (≤1%) |
| 0.5-1% | Day boundary race, clock skew | Accept + log | Within industry tolerance |
| 1-2% | Network partition, extended sync failure | Issue refund | Advertiser charged for unauthorized spend |
| >2% | Software bug (counter drift, event loss) | Issue refund + P1 incident | Contractual breach, potential legal risk |

Key principle: Conservative advertiser protection. Under-delivery of 1% or more triggers an automatic credit; over-delivery above 1% triggers an automatic refund, even though industry discrepancy norms (1-10% per IAB) are far looser.

Stage 3: Retroactive Correction

All corrections recorded as immutable audit trail. Atomic transaction: (1) Insert adjustment entry (type, amount, reason_code, timestamps, audit_reference), (2) Update campaign summary (corrected_spend, correction_count, timestamp), (3) Emit to ClickHouse via Kafka.

ClickHouse audit trail: Permanent record with correction_id, campaign/advertiser IDs, financial data (budget, actual, variance), classification (OVERSPEND/UNDERSPEND), root_cause, compensation_status, timestamps, metadata. Partitioned by month, append-only.

Stage 4: Advertiser Compensation Automation

Credits (underspend ≥1%): Calculate → apply to account balance → record transaction → email notification → dashboard update.

Refunds (overspend >1%): Calculate → submit to payment gateway (Stripe/Braintree) → record transaction → email notification → dashboard shows 5-10 day ETA.

Manual review (>2% variance): Create Jira ticket, Slack #finance-alerts, flag account “Under Review”, hold new campaigns. Finance team verifies root cause, confirms calculation, reviews payment history, approves/rejects compensation, documents resolution.

Advertiser Dashboard Impact

Campaign detail view shows: campaign ID/name/date, budget commitment, actual delivery, variance ($ and %), status flag (Exact/Under-delivered/Over-delivered with color coding).

Correction display (≥1% variance): Compensation type/amount, timestamp (local + UTC), status (Applied/Pending/Under Review), plain-language explanation (e.g., “infrastructure maintenance” not “Redis failover”), resolution action, next steps.

Example: $10K budget, $8.5K actual → amber “Under-delivered” flag, $1.5K credit applied, message: “Delivery interrupted due to infrastructure maintenance. We’ve automatically credited $1,500 to your account balance. This credit is available immediately.”

Platform metrics displayed: “98.2% campaigns within ±1% (target 98%), 99.6% within ±2%, avg correction time 6 hours, 97% automated”.

Error Budget & Monitoring

Financial accuracy SLO: 98% campaigns ±1% (target, aligns with Fluency benchmark), 99.5% ±2% (tolerance), 99.9% ±5% (escalation), <0.1% exceed ±5% (breach).

Monitoring: Daily variance distribution (30-day window): campaign counts by variance tier (1%, 2%, 5%), avg variance, P95/P99.

Compensation metrics: Credits 2-3% daily (median $15-50), refunds 0.1-0.5%, manual review <0.05%, total cost 0.2-0.3% gross revenue.

ROI: $50K annual cost prevents $500K+ legal risk, saves $100K finance overhead, reduces 2-5% churn → 5-10× return.


Observability and Operations

Service Level Indicators and Objectives

Key SLIs:

| Service | SLI | Target | Why |
|---------|-----|--------|-----|
| Ad API | Availability | 99.9% | Revenue tied to successful serves |
| Ad API | Latency | p95 <150ms, p99 <200ms | Mobile timeouts above 200ms |
| ML | Accuracy | AUC >0.78 | Below 0.75 = 15%+ revenue drop |
| RTB | Response Rate | >80% DSPs within 100ms | <80% = remove from rotation |
| Budget | Consistency | Over-delivery <1% | >2% = refunds, >5% = lawsuits |

Error Budget Policy (99.9% = 43 min/month):

When budget exhausted:

  1. Freeze feature launches (critical fixes only)
  2. Focus on reliability work
  3. Mandatory root cause analysis
  4. Next month: 99.95% target to rebuild trust

Incident Response Dashboard

Effective incident response requires immediate access to:

SLO deviation metrics - Latency (p95, p99) and error rate vs targets to determine severity

Resource utilization - CPU/GPU/memory metrics plus active configuration (model versions, feature flags) to distinguish capacity from configuration issues

Dependency breakdown - Per-service latency (cache, database, ML, external APIs) to isolate the actual bottleneck

Historical patterns - Similar past incidents and time-series showing when degradation began

Distributed Tracing

Single user reports “ad not loading” among 1M+ req/sec:

Request ID: 7f3a8b2c… Total latency: 287ms (VIOLATED SLO)

Trace breakdown:

Root cause: Redis node failure → cascading slowdown. Trace shows exactly why.


Security and Compliance

Service-to-Service Authentication: Zero Trust with mTLS

In distributed systems with 50+ microservices, network perimeters are insufficient. Solution: mutual TLS (mTLS) via Istio service mesh.

Every service receives a unique X.509 certificate (24-hour TTL) from Istio CA via SPIFFE/SPIRE. Envoy sidecar proxies automatically handle certificate rotation, mutual authentication, and traffic encryption - transparent to application code. All plaintext connections are rejected.

Authorization policies enforce least-privilege access: each service may call only the specific peers and endpoints its role requires, with everything else denied by default.

Defense in depth: Even if network segmentation fails, attackers cannot decrypt inter-service traffic, impersonate services, or call unauthorized endpoints.

PII Protection:

Secrets: Vault with Dynamic Credentials
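
As a sketch of what consuming dynamic credentials could look like, the snippet below uses the `hvac` client against Vault's database secrets engine; the Vault URL, the token-based auth shown here, and the `budget-service-readonly` role are assumptions:

```python
import os
import hvac

# The client must already be authenticated; in production the token would come
# from the Kubernetes auth method or a workload identity, not an env var.
client = hvac.Client(url="https://vault.internal:8200",
                     token=os.environ["VAULT_TOKEN"])

# Ask the database secrets engine for credentials tied to a named role.
# Vault creates a unique, short-lived database user and revokes it when the
# lease expires, so no long-lived passwords live in config or code.
resp = client.secrets.database.generate_credentials(name="budget-service-readonly")
username = resp["data"]["username"]
password = resp["data"]["password"]
lease_ttl_seconds = resp["lease_duration"]
```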

ML Data Poisoning Protection:

Training pipeline validates incoming events before model training:

  1. CTR anomaly detection: Quarantine events with >3σ CTR spikes (e.g., 2%→8%)
  2. IP entropy check: Flag low-diversity IP clusters (<2.0 entropy = botnet)
  3. Temporal patterns: Detect uniform timing intervals (human=bursty, bot=mechanical)
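
A simplified sketch of the first two checks (the 3σ and 2.0-bit thresholds come from the list above; the aggregation granularity and baseline source are assumptions):

```python
import math
from collections import Counter

def ctr_is_anomalous(clicks: int, impressions: int,
                     baseline_ctr: float, baseline_std: float) -> bool:
    """Check 1: flag slices whose CTR sits more than 3 sigma above baseline."""
    if impressions == 0:
        return False
    return (clicks / impressions - baseline_ctr) > 3 * baseline_std

def ip_entropy(ips: list[str]) -> float:
    """Check 2: Shannon entropy (bits) of the source-IP distribution.

    A tight botnet reuses few IPs, so entropy drops below ~2.0 bits even
    when raw event volume looks organic.
    """
    if not ips:
        return 0.0
    counts = Counter(ips)
    total = len(ips)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def should_quarantine(clicks: int, impressions: int, ips: list[str],
                      baseline_ctr: float, baseline_std: float) -> bool:
    return (ctr_is_anomalous(clicks, impressions, baseline_ctr, baseline_std)
            or ip_entropy(ips) < 2.0)
```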

Model integrity: GPG-signed models prevent loading tampered artifacts. Inference servers verify signatures before loading models, rejecting invalid signatures with immediate alerting.
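
As an illustration, a signature gate of this kind might shell out to the standard `gpg` CLI before loading an artifact; the file names are placeholders, and the signing key is assumed to already be in the server's keyring:

```python
import subprocess
import sys

def model_signature_valid(model_path: str, sig_path: str) -> bool:
    """Return True only if the detached GPG signature verifies.

    `gpg --verify` exits non-zero for missing, invalid, or unknown-key
    signatures, so a tampered artifact never reaches model loading.
    """
    result = subprocess.run(["gpg", "--verify", sig_path, model_path],
                            capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    if not model_signature_valid("ctr_model.onnx", "ctr_model.onnx.sig"):
        print("model signature verification FAILED; refusing to load",
              file=sys.stderr)
        sys.exit(1)   # alerting would hook in here
```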

Data Lifecycle and GDPR

Retention policies:

| Data | Retention | Rationale |
|------|-----------|-----------|
| Raw events | 7 days | Real-time use only; archived to S3 |
| Aggregated metrics | 90 days | Dashboard queries |
| Model training data | 30 days | Older data less predictive |
| User profiles | 365 days | GDPR; inactive profiles purged |
| Audit logs | 7 years | Legal compliance |

GDPR “Right to be Forgotten”:

Per GDPR Article 12, the platform must respond to erasure requests within one month (can be extended to three months for complex cases). Deletion is executed across 10+ systems in parallel:

Verification: All systems confirm deletion completion → send deletion certificate to user within one month of request (target: 48-72 hours for standard cases).
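
A hedged asyncio sketch of the parallel fan-out and confirmation step; the system list and the `delete_user` interface are illustrative placeholders:

```python
import asyncio

# Placeholder per-system deletion targets; in reality each entry wraps the
# system-specific API (user-profile store, feature store, analytics, CRM, ...).
SYSTEMS = ["user-profiles", "feature-store", "analytics", "crm",
           "email-provider", "ad-logs", "ml-training-data"]

async def delete_user(system: str, user_id: str) -> tuple[str, bool]:
    ...  # call the system-specific deletion endpoint
    return system, True

async def erase(user_id: str) -> dict[str, bool]:
    """Fan deletion out to all systems in parallel and collect confirmations.

    The deletion certificate is issued only once every system confirms;
    failures are retried and tracked against the one-month deadline.
    """
    results = await asyncio.gather(*(delete_user(s, user_id) for s in SYSTEMS))
    return dict(results)

# confirmations = asyncio.run(erase("user-123"))
```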

Note on financial records: GDPR allows retention of financial transaction records beyond deletion requests when required by law (SOX, MiFID). User PII (name, email, demographics) is deleted, but anonymized transaction records ($X spent on date Y) are retained in S3 cold archive for regulatory compliance.


Production Operations at Scale

Deployment Safety and Zero-Downtime Operations

The availability imperative: With 99.9% SLO providing only 43 minutes/month error budget, we cannot afford to waste any portion on planned downtime. All deployments and schema changes must be zero-downtime operations.

Progressive deployment strategy:

Rolling deployments (canary 1% → 10% → 50% → 100%) with automated gates on error rate, latency p99, and revenue metrics. Each phase must pass health checks before proceeding. Feature flags provide blast-radius control: new features start dark and are enabled gradually per user cohort.
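
A minimal sketch of such a promotion gate (metric names and thresholds are illustrative; real gates would query Prometheus and compare the canary cohort against a control):

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests in the canary cohort
    latency_p99_ms: float  # p99 latency in the canary cohort
    revenue_delta: float   # relative revenue vs. control (-0.02 = 2% drop)

def gate_passes(canary: CanaryMetrics, baseline_error_rate: float) -> bool:
    """Decide whether to promote the rollout to the next phase.

    Any single violation halts promotion and triggers rollback: errors
    meaningfully above baseline, p99 over the 200ms SLO, or a measurable
    revenue regression.
    """
    return (canary.error_rate <= baseline_error_rate * 1.5
            and canary.latency_p99_ms < 200.0
            and canary.revenue_delta > -0.01)

# Promote 1% -> 10% only if gate_passes(...) holds for the full soak period;
# otherwise roll back automatically.
```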

Zero-downtime schema migrations:

Database schema changes consume zero availability budget by combining online DDL with the expand/contract dual-write pattern: add the new schema alongside the old, dual-write, backfill, shift reads, then drop the old schema.
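
A simplified sketch of the dual-write step, assuming a psycopg2-style connection and hypothetical `spend_v1`/`spend_v2` tables; backfill, verification, and the read cutover are omitted:

```python
import psycopg2

# conn = psycopg2.connect("dbname=ads")   # placeholder connection

def record_spend(conn, campaign_id: str, amount_micros: int) -> None:
    """Dual-write phase of an expand/contract migration.

    Old readers keep using spend_v1; new readers are flipped to spend_v2
    behind a feature flag only after backfill and row-count verification,
    after which spend_v1 is dropped.
    """
    with conn:                      # one transaction: both writes commit or roll back
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO spend_v1 (campaign_id, amount) VALUES (%s, %s)",
                (campaign_id, amount_micros / 1_000_000),
            )
            cur.execute(
                "INSERT INTO spend_v2 (campaign_id, amount_micros, currency) "
                "VALUES (%s, %s, 'USD')",
                (campaign_id, amount_micros),
            )
```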

The cost trade-off is clear: zero-downtime migrations require 2-4× more engineering effort than “take the system down” approaches, but protect against wasting the precious 43-minute availability budget on planned maintenance.

Key insight: The 43 minutes/month error budget is reserved for unplanned failures (infrastructure outages, cascading failures, external dependency failures). Planned operations (deployments, migrations, configuration changes) must never consume this budget.

Error Budgets: Balancing Velocity and Reliability

Error budgets formalize the trade-off between reliability and feature velocity. For a 99.9% availability SLO, the error budget is 43.2 minutes/month of unplanned downtime.

$$\text{Error Budget} = (1 - 0.999) \times 30 \times 24 \times 60 = 43.2 \text{ minutes/month}$$

Budget allocation strategy (unplanned failures only):

| Source | Allocation | Rationale |
|--------|------------|-----------|
| Infrastructure failures | 15 min (35%) | Cloud provider incidents, hardware failures, regional outages |
| Dependency failures | 12 min (28%) | External DSP timeouts, third-party API issues |
| Code defects | 8 min (19%) | Bugs escaping progressive rollout gates |
| Unknown/buffer | 8 min (18%) | Unexpected failure modes, cascading failures |

Note: Planned deployments and schema migrations target zero downtime through progressive rollouts and online DDL operations. When deployment-related issues occur (e.g., bad code pushed past canary gates), they count against “Code defects” budget.

Burn rate alerting:

Monitor how quickly budget is consumed. Burn rate = current error rate / target error rate. A 10× burn rate means exhausting the monthly budget in ~3 days (72 hours), triggering immediate on-call escalation.
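
A small sketch of the burn-rate arithmetic (the 30-day window and 99.9% target come from this section; multi-window alert thresholds would layer on top):

```python
SLO_TARGET = 0.999
TARGET_ERROR_RATE = 1 - SLO_TARGET     # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24                 # 30-day SLO window

def burn_rate(current_error_rate: float) -> float:
    """Burn rate = observed error rate / error rate the SLO allows."""
    return current_error_rate / TARGET_ERROR_RATE

def hours_to_exhaustion(current_error_rate: float) -> float:
    """At a constant burn rate the monthly budget lasts window / burn_rate."""
    return WINDOW_HOURS / burn_rate(current_error_rate)

# A 1% error rate is a 10x burn rate: hours_to_exhaustion(0.01) == 72.0,
# i.e. the 30-day budget is gone in ~3 days, so page the on-call immediately.
```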

Policy-driven decision making:

Error budget remaining drives release velocity: a healthy budget allows normal feature launches, a depleting budget slows releases and prioritizes reliability work, and an exhausted budget triggers the freeze policy above.

Why 99.9% not 99.99%?

With zero-downtime deployments and migrations eliminating planned downtime, the 99.9% SLO (43 minutes/month) is entirely allocated to unplanned failures. Moving to 99.99% (4.3 minutes/month) would reduce our tolerance for unplanned failures from 43 to 4.3 minutes - a 10× tighter constraint.

Achieving it would require fully automatic multi-region failover with sub-minute recovery from regional outages, approximately doubling infrastructure costs. The economic question: is tolerating 39 fewer minutes of unplanned failure per month worth doubling infrastructure spend?

For advertising platforms with client-side retries and geographic distribution, the answer is usually no. Brief regional outages have limited revenue impact because automatic retries and traffic redistribution absorb them, and better ROI comes from reducing MTTR (faster detection and recovery) than from preventing every failure.

The tolerance for unplanned failures varies by domain - payment processing or healthcare systems require 99.99%+ because every transaction matters. Ad platforms operate at higher request volumes where statistical averaging and retries provide natural resilience.

Cost Management at Scale

Resource attribution with chargeback models (vCPU-hours, GPU-hours, storage IOPS per team). Standard optimizations: spot instances for training (70% cheaper), tiered storage, reserved capacity for baseline load. Track efficiency via vCPU-ms per request and investigate >15% month-over-month increases.
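
A tiny sketch of the efficiency check described above (metric plumbing is assumed; only the arithmetic is shown):

```python
def vcpu_ms_per_request(total_vcpu_seconds: float, total_requests: int) -> float:
    """Normalize compute by traffic so organic growth doesn't hide regressions."""
    return total_vcpu_seconds * 1000.0 / total_requests

def efficiency_regressed(current: float, previous_month: float,
                         threshold: float = 0.15) -> bool:
    """Flag a >15% month-over-month increase in vCPU-ms per request."""
    return (current - previous_month) / previous_month > threshold
```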


Resilience and Failure Scenarios

A robust architecture must survive catastrophic failures, security breaches, and business model pivots. This section addresses three critical scenarios:

Catastrophic Regional Failure: When an entire AWS region fails, our semi-automatic failover mechanism combines Route53 health checks (2-minute detection) with manual runbook execution to promote secondary regions. The critical challenge is budget counter consistency—asynchronous Redis replication creates potential overspend windows during failover. We mitigate this through pre-allocation patterns that limit blast radius to allocated quotas per ad server, bounded by replication lag multiplied by allocation size.
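
One hedged way to formalize that bound (the symbols and the worked numbers below are illustrative assumptions, not measured figures):

$$\text{Max overspend} \le N_{\text{servers}} \times Q_{\text{alloc}} \times \left\lceil \frac{T_{\text{lag}}}{T_{\text{alloc}}} \right\rceil$$

where N is the number of ad servers drawing on the affected budget, Q the per-server allocation quota, T_lag the replication lag during failover, and T_alloc the allocation interval. For example, 50 servers holding $10 quotas through a 60-second lag with 30-second allocation cycles bound overspend at 50 × 10 × 2 = $1,000.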

Malicious Insider Attack: Defense-in-depth through zero-trust architecture (SPIFFE/SPIRE for workload identity), mutual TLS between all services, and behavioral anomaly detection on budget operations. Critical financial operations like budget allocations require cryptographic signing with Kafka message authentication, creating an immutable audit trail. Lateral movement is constrained through Istio authorization policies enforcing least-privilege service mesh access.

Business Model Pivot to Guaranteed Inventory: Transitioning from auction-based to guaranteed delivery requires strong consistency for impression quotas. Rather than replacing our stack, we extend the existing pre-allocation pattern—CockroachDB maintains source-of-truth impression counters (leveraging the same HLC-based billing ledger) while Redis provides fast-path allocation with periodic reconciliation. This hybrid approach adds only 10-15ms to the critical path for guaranteed campaigns while preserving sub-millisecond performance for auction traffic. The 12-month evolution path reuses 80% of existing infrastructure (ML pipeline, feature store, Kafka, billing ledger) while adding campaign management and SLA tracking layers.

These scenarios validate that the architecture is not merely elegant on paper, but battle-hardened for production realities: regional disasters, adversarial threats, and fundamental business transformations.


Summary: Production Readiness Across All Dimensions

Production-grade distributed systems require more than elegant design—they demand operational rigor across eight critical dimensions. This post bridged the gap between architecture and reality by addressing how systems survive at 1M+ QPS under real-world conditions.

The eight pillars:

1. Fraud Detection - Multi-tier pattern detection (L1 Bloom filters at 0.5ms, L2 behavioral rules, L3 ML batch) catches 20-30% of bot traffic before expensive RTB calls, saving significant external DSP bandwidth costs.

2. Multi-Region Deployment - Active-active architecture across 3 AWS regions with semi-automatic failover (2min Route53 detection + manual runbook execution). Handles split-brain through pre-allocation patterns limiting overspend to <1% during replication lag windows.

3. Schema Evolution - Zero-downtime migrations using dual-write patterns and gradual cutover preserve 99.9% availability SLO. Trade 2-4× engineering effort for keeping 43min/month error budget available for unplanned failures.

4. Clock Synchronization - Hybrid Logical Clocks (HLC) in CockroachDB provide causally-consistent timestamps for financial ledgers without TrueTime hardware, ensuring regulatory compliance for audit trails.

5. Observability - SLO-based monitoring with 99.9% availability target (43min/month downtime budget). Burn rate alerting triggers paging at 10× consumption rate. Prometheus metrics, Jaeger traces (1% sampling), centralized logs.

6. Security & Compliance - Zero-trust architecture with mTLS service mesh (Istio), workload identity (SPIFFE/SPIRE), encryption at rest/transit, immutable audit logs. GDPR right-to-deletion via cascade deletes, CCPA data export on demand.

7. Production Operations - Progressive rollouts (1% → 10% → 50% → 100%) with automated gates checking error rates and latency. <5min rollback SLA from detection to restored service. Rolling updates with health checks and connection draining.

8. Resilience Validation - Tested scenarios: regional disasters (2-5min recovery with bounded overspend), malicious insiders (zero-trust prevention), business model pivots (80% infrastructure reuse for auction→guaranteed delivery transition).

Core insight: Operational excellence isn’t bolted on after launch—it must be designed into the architecture from day one. Circuit breakers, observability hooks, audit trails, multi-region replication, and progressive deployment are architectural requirements, not implementation details.

Next: Part 5 brings everything together with the complete technology stack—concrete choices, configurations, and integration patterns that transform abstract requirements into a deployable system.

