Production Operations: Fraud, Multi-Region & Operational Excellence
Introduction: From Design to Production
Architecture on paper ≠ production system. You can design the most elegant distributed architecture - perfect latency budgets, optimal caching strategies, fair auction mechanisms - and it will fail in production without addressing operational realities.
The gap between design and production:
Most architecture discussions focus on the “happy path”:
- Requests succeed and services respond quickly
- Data stays consistent and caches stay fresh
- External dependencies (DSPs) behave predictably
- Traffic patterns match expectations
- No one tries to exploit the system
Production systems face harsher realities:
- Malicious traffic: Bot farms generate 20-30% of all clicks, draining advertiser budgets and wasting RTB bandwidth (64.8PB/month of fraudulent DSP calls)
- Regional failures: Entire AWS regions go down. It’s not if, but when.
- Schema evolution: Database schemas change while the system serves 1M+ QPS with zero downtime
- Clock drift: Distributed timestamps diverge by milliseconds, breaking financial audit trails and causing budget discrepancies
- Cascading failures: One service degrades (Feature Store at 15ms instead of 10ms), triggering circuit breakers, forcing fallbacks, and creating revenue impact
- Unknown unknowns: Failure modes you never predicted - DNS issues, certificate expirations, upstream API changes
Why this matters at scale:
At 1M+ QPS serving 400M daily active users:
- 1% fraud rate = 10K fraudulent requests/second = massive bandwidth waste
- 1 minute of downtime = 60M failed requests = angry users + advertiser SLA violations
- 1ms of clock drift = financial audit failures + regulatory compliance issues
- 1 bad deployment = potential cascade to entire system
What this post covers:
This post addresses eight critical production topics that separate proof-of-concept from production-grade systems:
- Fraud Detection - Pattern-based bot detection filtering 20-30% of malicious traffic BEFORE expensive RTB fan-out. Multi-tier detection (L1 real-time, L2 behavioral, L3 ML-based) with specific patterns for click farms, SDK spoofing, and domain fraud.
- Multi-Region Deployment - Active-active architecture across 3 AWS regions with CockroachDB automatic failover (30-60s) and Route53 health-check routing (2min). Handling split-brain scenarios, regional budget pacing, and bounded overspend during failover.
- Schema Evolution - Zero-downtime migrations using dual-write patterns, backward/forward compatible schema changes, and gradual rollouts. Changing the database while serving 1M QPS without dropping a single request.
- Clock Synchronization - Why NTP (±50-100ms) isn't good enough for financial ledgers. Hybrid Logical Clocks (HLC) for distributed timestamp ordering without TrueTime hardware. Preventing clock-drift-induced budget discrepancies.
- Observability & Operations - SLO-based monitoring with error budgets (43min/month at 99.9%). RED metrics (Rate, Errors, Duration), distributed tracing for 150ms request paths, and structured logging at 1M+ QPS. Mean Time to Recovery (MTTR) as key operational metric.
- Security & Compliance - Zero-trust architecture (every request authenticated/authorized), encryption (TLS 1.3, AES-256 at rest), audit trails for financial compliance (GDPR/CCPA), and defense against insider threats.
- Production Operations - Progressive rollouts (1% → 10% → 50% → 100%), automated rollback triggers, chaos engineering validation, and incident response playbooks. Deployment safety at scale.
- Resilience & Failure Scenarios - Testing the architecture under extreme conditions: regional disasters, malicious insiders, and business model pivots. Validating theoretical resilience through controlled chaos.
The core insight:
Production-grade systems require defense in depth across all dimensions:
- Technical resilience: Multi-region, graceful degradation, zero-downtime operations
- Security rigor: Zero-trust, encryption, audit trails, compliance
- Operational discipline: Observability, deployment safety, incident response, chaos testing
- Business protection: Fraud prevention, financial accuracy, SLA compliance
Each dimension reinforces the others. Fraud detection protects the business. Multi-region protects availability. Zero-downtime migrations protect error budgets. Clock synchronization protects financial integrity. Observability protects MTTR. Security protects against insider threats. Progressive rollouts protect against bad deployments. Chaos testing validates it all actually works.
Broader applicability:
These patterns - fraud detection, multi-region failover, zero-downtime migrations, distributed time synchronization - apply beyond ad tech to high-stakes distributed systems:
- Financial platforms (transaction processing, regulatory compliance)
- E-commerce (fraud prevention, global traffic routing)
- Gaming (anti-cheat, regional failover)
- SaaS platforms (zero-downtime updates, tenant isolation)
A key insight: operational excellence isn’t bolted on after launch - it must be designed into the system from the start. Circuit breakers, observability hooks, audit trails, multi-region data replication - these aren’t implementation details, they’re architectural requirements.
Let’s explore each production topic and how they integrate into a cohesive operational strategy.
Fraud Detection: Pattern-Based Abuse Detection
Architectural Driver: Financial Accuracy - While rate limiting (covered in the architecture post) controls request volume, fraud detection identifies malicious patterns. A bot clicking 5 ads/minute might pass rate limits but shows suspicious behavioral patterns. Both mechanisms work together: rate limiting stops volume abuse, fraud detection stops sophisticated attacks.
What Fraud Detection Does (vs Rate Limiting):
Fraud detection answers: “Are you malicious?”
- Bot farm with 95% CTR, uniform timing, rotating IPs → blocked permanently
- Protects advertiser budgets from wasted spend and platform from massive RTB bandwidth costs (early filtering prevents 20-30% egress waste - one of top 3 infrastructure costs at scale)
Rate limiting answers: “Are you requesting too much?” (covered in the architecture post)
- Legitimate advertiser making 10K QPS (vs 1K limit) → throttled with 429
- Protects infrastructure capacity and enforces SLA
Problem: Detect and block fraudulent ad clicks in real-time without adding significant latency.
CRITICAL: Integrity Check Service in Request Critical Path
The Integrity Check Service (L1 fraud detection) runs in the synchronous request path immediately after User Profile lookup and BEFORE the expensive RTB fan-out to 50+ DSPs. This placement is critical for cost optimization:
Cost Impact of Early Fraud Filtering: Without early filtering: RTB requests go to 50+ DSPs for ALL traffic, including 20-30% bot traffic
- Bandwidth amplification: Each RTB request = ~2-4KB payload × 50 DSPs = 100-200KB per request
- Bot traffic bandwidth waste: At 1M QPS with 25% fraud = 250K fraudulent requests/sec
- Wasted bandwidth: 25GB/sec = 64.8PB/month on fraudulent RTB calls
- Egress cost magnitude: Cloud providers charge egress bandwidth, making this one of the largest infrastructure cost categories at scale
- DSP relationship cost: 50+ DSPs waste CPU cycles processing fraudulent bid requests → strained relationships, potential rate limiting
Solution: 5ms Integrity Check Service blocks 20-30% of fraud BEFORE RTB fan-out, eliminating this massive bandwidth waste.
Trade-off Analysis:
- Added latency: 5ms per request (still within 150ms SLO with 5-7ms buffer)
- Bandwidth savings: Eliminates 20-30% of total RTB egress (64.8PB/month prevented)
- Cost structure: At scale, egress bandwidth is one of top 3 infrastructure costs (alongside compute and storage)
- ROI: The 5ms latency investment (costing ~0.5-1% of impressions from slower response) saves 10,000-25,000× more in annual egress costs than it costs in lost opportunity
- Secondary benefit: Preserves DSP relationships by not flooding them with bot traffic
Fraud Types:
- Click Farms: Bots or paid humans generating fake clicks
- SDK Spoofing: Fake app installations reporting ad clicks
- Domain Spoofing: Fraudulent publishers misrepresenting site content
- Ad Stacking: Multiple ads layered, only top visible but all “viewed”
Detection Strategy: Multi-Tier Filtering
graph TB
REQ[Ad Request] --> UP[User Profile
10ms]
UP --> L1{L1: Integrity Check Service
5ms CRITICAL PATH
Runs BEFORE RTB}
L1 -->|Known bad IP| BLOCK1[Block Request
Bloom Filter
No RTB call made]
L1 -->|Pass| RTB[RTB Auction
100ms
50+ DSPs]
RTB --> L2{L2: Behavioral
Post-click analysis
Async}
L2 -->|Suspicious pattern| PROB[Flag for Review
50% sampled]
L2 -->|Pass| L3{L3: ML Model
10ms async
Post-impression}
L3 -->|Fraud score > 0.8| BLOCK2[Block User
Update Bloom filter]
L3 -->|Pass| ALLOW[Legitimate Traffic]
PROB -->|If sampled| BLOCK3[Add to Training Data]
PROB -->|If not sampled| ALLOW
BLOCK1 --> LOG[Log to fraud DB]
BLOCK2 --> LOG
BLOCK3 --> LOG
style BLOCK1 fill:#ff6b6b
style BLOCK2 fill:#ff6b6b
style BLOCK3 fill:#ff6b6b
style L1 fill:#ffdddd
style ALLOW fill:#51cf66
L1: Integrity Check Service (5ms critical path, 20-30% fraud caught)
Implementation: Bloom filter + IP reputation + Device fingerprinting
Component: Integrity Check Service - Runs in synchronous request path BEFORE RTB fan-out
Decision flow (5ms budget):
- Check IP against Bloom filter (~0.1ms) → if match, confirm with Redis (1ms) → BLOCK if confirmed (no RTB call)
- Check device ID against IMPOSSIBLE_DEVICES list → BLOCK if invalid
- Validate User-Agent format → BLOCK if malformed
- Check device/OS combination validity → BLOCK if impossible (SDK spoofing)
- Basic rate checks: Requests/sec from this IP/device → BLOCK if exceeds threshold
- PASS to RTB/ML paths if all checks pass
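To make the decision flow concrete, here is a minimal Python sketch of the L1 check under the 5ms budget. The set-based stand-ins for the Bloom filter, Redis reputation store, and device blocklist, as well as the rate-limit threshold, are illustrative assumptions rather than the production implementation.

```python
import re
import time
from dataclasses import dataclass

# Illustrative stand-ins: production uses a 10M-entry Bloom filter (0.1% FPR) backed by a
# Redis confirmation step; plain sets keep the sketch self-contained.
KNOWN_BAD_IPS = {"1.2.3.4"}
IMPOSSIBLE_DEVICES = {"iPhone_FRAUD_123"}
UA_PATTERN = re.compile(r"^Mozilla/5\.0 \(.+\)")
RATE_LIMIT_PER_SEC = 500                # assumed per-IP ceiling
_request_counts: dict[str, int] = {}    # local per-second counters (unbounded in a sketch)

@dataclass
class AdRequest:
    ip: str
    device_id: str
    user_agent: str
    os_version: str

def l1_integrity_check(req: AdRequest) -> str:
    """Return 'BLOCK' or 'PASS' within the ~5ms budget, before any RTB fan-out."""
    if req.ip in KNOWN_BAD_IPS:                    # Bloom filter (~0.1ms) + Redis confirm (~1ms)
        return "BLOCK"
    if req.device_id in IMPOSSIBLE_DEVICES:        # known emulator / spoofed-SDK signatures
        return "BLOCK"
    if not UA_PATTERN.match(req.user_agent) or req.os_version.startswith("99"):
        return "BLOCK"                             # malformed UA or impossible OS version
    key = f"{req.ip}:{int(time.time())}"           # basic per-IP rate check (local counter)
    _request_counts[key] = _request_counts.get(key, 0) + 1
    if _request_counts[key] > RATE_LIMIT_PER_SEC:
        return "BLOCK"
    return "PASS"
```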
Bloom filter characteristics:
- Size: 10M entries, 0.1% false positive rate
- Memory: ~18 MB for Bloom filter (14.378 bits per item); total fraud detection data ~120MB including reputation cache
- Lookup: O(1) - a few hash computations and memory reads (nanoseconds of CPU; ~0.1ms budgeted including call overhead)
- Update: Every 5 minutes from fraud database (async)
- Storage: Redis cluster (replicated for high availability)
Latency breakdown:
- Bloom filter lookup: 0.1ms
- Redis reputation check (if Bloom filter hits): 1ms
- Device fingerprint validation: 0.5ms
- User-Agent parsing: 0.3ms
- Rate check (local counter): 0.1ms
- Total budget: 5ms (includes network overhead + safety margin)
Examples caught by L1:
- IP 1.2.3.4 previously flagged for 10,000 clicks/hour → BLOCKED, no RTB call made
- Device fingerprint “iPhone_FRAUD_123” (known emulator signature) → BLOCKED immediately
- User-agent “Mozilla/5.0 (iPhone; CPU iPhone OS 99_0)” (impossible OS version) → BLOCKED
- IP making 1000 requests/sec (DDoS pattern) → BLOCKED
Critical Impact: Blocking at L1 saves 100-200KB bandwidth per blocked request (2-4KB × 50 DSPs) + DSP processing time
L2: Behavioral Analysis (Async post-click, 40-50% additional fraud caught)
Component: Fraud Analysis Service - Runs ASYNCHRONOUSLY after ad click/impression events, NOT in request critical path
When it runs: Triggered by click/impression events published to Kafka, analyzes patterns over time
Feature extraction (5ms per event):
- Redis lookup: Fetch user history (3ms) - last 100 impressions, clicks, IPs, device changes
- Calculate features (2ms):
- Time patterns: clicks/hour, avg interval between clicks, timing stddev (bots have low variance)
- CTR analysis: click rate over last 100 impressions
- Device diversity: unique IPs in last 24h, device changes in last 7 days
- Behavioral entropy: IP diversity, category spread
Rule-based thresholds → SUSPICIOUS if ANY:
- CTR > 50% (normal users: 0.5-2%)
- Timing stddev < 2.0 seconds (mechanical bot behavior)
- Unique IPs/day > 50 (IP rotation / bot farm)
- IP entropy < 2.0 (concentrated in single subnet / data center)
Actions:
- If SUSPICIOUS: Add IP/device to Bloom filter (blocks future requests at L1)
- If high confidence (multiple signals): Immediate block + refund advertiser
- If borderline: Pass to L3 ML model for deeper analysis
Processing time:
- Redis lookup (user history): 3ms
- Feature calculation: 2ms
- Total: 5ms (async, does not impact request latency)
Examples caught by L2:
- 127 clicks recorded against the last 100 impressions (CTR = 127%) → clearly fraud
- Clicks every 12.3 seconds ±0.1s for 2 hours → mechanical
- 847 unique IPs in 24 hours → bot farm with IP rotation
- IP entropy 1.2 (concentrated in /24 subnet) → data center origin
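A hedged sketch of the L2 rule evaluation described above, assuming the Redis lookup has already produced per-user aggregates; the dictionary field names and helper function are illustrative:

```python
import math
import statistics

def shannon_entropy(counts: dict) -> float:
    """Entropy of an IP→count histogram; low values mean traffic concentrated in one subnet."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def l2_is_suspicious(history: dict) -> bool:
    """Apply the rule thresholds above to per-user aggregates fetched from Redis."""
    click_ts = history["click_timestamps"]                       # last-100-impression window
    intervals = [b - a for a, b in zip(click_ts, click_ts[1:])]
    ctr = history["clicks"] / max(history["impressions"], 1)
    timing_stddev = statistics.pstdev(intervals) if len(intervals) > 1 else float("inf")

    return (
        ctr > 0.50                                      # normal users land around 0.5-2%
        or timing_stddev < 2.0                          # mechanical, bot-like click timing
        or history["unique_ips_24h"] > 50               # IP rotation / bot farm
        or shannon_entropy(history["ip_counts"]) < 2.0  # concentrated in one subnet / data center
    )
```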
L3: ML Model (10ms latency, 10-15% additional fraud caught)
Gradient Boosted Decision Tree (GBDT) model:
Feature enrichment and inference (10ms):
- Take 15 features from L2 (time patterns, CTR, device diversity, entropy)
- Add 25 computed features:
  - Temporal: hour of day, day of week, is weekend
  - Historical aggregates: lifetime clicks, account age, avg session duration
  - Reputation scores: device fraud score, IP fraud score (from lookup tables)
  - Publisher context: publisher fraud rate
- GBDT inference (200 trees, depth 6) → fraud score 0.0-1.0
Decision thresholds:
- Score > 0.8: BLOCK (high confidence fraud)
- Score 0.5-0.8: SUSPICIOUS (flag for manual review)
- Score < 0.5: ALLOW (legitimate traffic)
Model characteristics:
- Training data: 30 days of labeled fraud events (1B events, 3% fraud rate)
- Features: 40 total (15 from L2, 25 computed in L3)
- Model: GBDT with 200 trees, max depth 6
- Latency: 10ms (CPU inference)
- Accuracy: AUC 0.92 (fraud detection quality)
- Update frequency: Weekly retraining, daily incremental updates
Examples caught by L3:
- Fraud score 0.87: Sophisticated bot with randomized timing but weak device fingerprint
- Fraud score 0.82: Click farm with realistic timing but publisher reputation low
- Fraud score 0.79: SDK spoofing with valid-looking data but statistical anomalies
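As a rough illustration of the scoring step, the sketch below uses a scikit-learn-style GBDT as a stand-in for the production model (the serving framework is not specified in the text), and assumes feature assembly has already produced the 40-dimensional vector:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for the production GBDT

def l3_decision(model: GradientBoostingClassifier, features: np.ndarray) -> str:
    """Map the 40-dim feature vector to BLOCK / SUSPICIOUS / ALLOW using the thresholds above."""
    fraud_score = model.predict_proba(features.reshape(1, -1))[0, 1]
    if fraud_score > 0.8:
        return "BLOCK"         # high-confidence fraud
    if fraud_score >= 0.5:
        return "SUSPICIOUS"    # queue for manual review
    return "ALLOW"
```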
Performance Characteristics:
| Tier | Latency | Fraud Caught | False Positive Rate | Cost |
|---|---|---|---|---|
| L1 | 0ms | 20-30% | <0.1% | Negligible (Bloom filter) |
| L2 | 5ms | 60-80% cumulative | 1-2% | Low (Redis lookup + compute) |
| L3 | 10ms | 70-95% cumulative | 0.5-1% | Medium (GBDT inference) |
| Total | 0-15ms | 70-95% total | ~1-2% | Acceptable |
Signal Loss Impact on Fraud Detection:
Fraud detection becomes harder when user_id is unavailable (40-60% of mobile traffic due to ATT/Privacy Sandbox). Without stable identity:
- L2 behavioral analysis degrades: Can’t track “this user clicked 50 ads today” - limited to IP/device fingerprint patterns
- L3 historical features unavailable: Lifetime clicks, account age, session history all require persistent identity
Mitigation strategies for anonymous traffic:
- Lean harder on L1: IP reputation, device fingerprint, and request metadata still available
- Publisher-level fraud scoring: Aggregate fraud rates by publisher compensate for missing user signals
- Session-level patterns: Short-term behavioral analysis within single anonymous session
- Conservative blocking: Lower thresholds for anonymous traffic (accept slightly higher false positive rate)
Expected accuracy degradation: AUC drops from 0.92 (identified) to ~0.82-0.85 (anonymous) - still effective but less precise.
Latency impact on overall SLO:
Fraud detection runs in PARALLEL with ad selection (as shown in the architecture post’s critical path analysis):
Ad serve critical path:
gantt
title Ad Serve Critical Path (140ms Total)
dateFormat x
axisFormat %L
section Sequential 0-25ms
Network 10ms :done, 0, 10
Gateway 5ms :done, 10, 15
User Profile 10ms :done, 15, 25
section Parallel Execution
Ad Selection + ML 65ms :crit, 25, 90
Fraud Detection 0-15ms :active, 25, 40
section RTB + Final
RTB Auction 100ms :crit, 90, 190
Final Processing 10ms :done, 190, 200
Key insight: Fraud detection (0-15ms) runs in parallel with Ad Selection + ML (65ms) and completes before the ML path finishes. This means fraud detection adds ZERO latency to the critical path—the request must wait for ML anyway, so fraud checks happen “for free” during that wait time.
Monitoring and Alerting:
Key metrics to track:
- Latency metrics: p50, p95, p99 latencies (target: <15ms)
- Block rates: Percentage of requests blocked at each tier (L1, L2, L3)
- False positive rate: Ratio of complaints to blocks (indicates legitimate users being blocked)
- Model quality: AUC score for fraud detection model
- Business impact: Estimated advertiser spend protected from fraud
Critical alerts (P1):
- Fraud block rate drops below 1% for 1+ hour (suggests model failure)
- Fraud block rate spikes above 20% for 15+ minutes (suggests new attack pattern)
- False positive rate exceeds 5% (blocking too many legitimate users)
Warning alerts (P2):
- Fraud detection latency p99 exceeds 20ms (approaching budget limit)
- Model AUC drops below 0.85 (model quality degrading)
- New fraud pattern detected (requires manual rule update)
Impact Analysis: Fraud Prevention vs False Positives
Baseline scenario (typical ad platform at scale):
- Fraud rate: ~3% of total traffic (click fraud, impression fraud, bot traffic)
- Without detection: All fraudulent traffic billed to advertisers → billing disputes, trust erosion, potential legal liability
Fraud detection effectiveness:
If catching 80% of fraud:
- Fraud prevented: 80% × 3% = 2.4% of total platform traffic protected from fraudulent billing
- Impact: Prevents advertiser disputes, maintains platform trust, ensures compliance with payment processor requirements
- Magnitude: At scale, prevented fraud cost represents 5-15% of gross revenue (varies by vertical and fraud sophistication level)
False positive trade-off:
If 2% false positive rate:
- Legitimate traffic blocked: 2% × 97% legitimate = 1.94% of total traffic
- Impact: Lost impressions, reduced advertiser reach, opportunity cost on legitimate engagement
- Magnitude: Represents ~1-2% revenue loss but prevents significantly larger fraud-related losses
Net impact assessment:
| Metric | Value | Notes |
|---|---|---|
| Fraud prevented | ~2.4% of traffic | Prevents 5-15% revenue loss from fraud |
| False positives | ~1.94% of traffic | 1-2% revenue opportunity cost |
| Net benefit | 3-13% gross revenue protected | Net positive after false positive cost |
| Infrastructure overhead | <0.5% of infrastructure spend | Redis, HBase, ML training - negligible vs benefit |
| ROI multiplier | 50-100× | Benefit-to-infrastructure-cost ratio |
Decision: Fraud detection is operationally critical. The 2% false positive rate is an acceptable trade-off to prevent fraud-induced billing disputes (which would be catastrophic for advertiser trust and payment processor relationships).
Critical Testing Requirements
Architectural Driver: Financial Accuracy & Trust - Two testing aspects are non-negotiable for ads platforms: proving financial accuracy (≤1% budget overspend) and validating performance at scale (1M+ QPS). Traditional testing approaches miss both.
The testing gap: Unit tests validate individual components. Integration tests validate service interactions. End-to-end tests validate request flows. But none of these prove the two critical claims that make or break an ads platform:
- Financial accuracy under distributed contention: Can 300+ servers concurrently decrement shared budgets without violating the ≤1% overspend guarantee?
- Performance at scale under realistic load: Does the system actually handle 1M QPS sustained load with P95 < 150ms latency, or does it collapse at 800K?
Why traditional testing is insufficient:
- Unit tests can’t simulate race conditions across 300 distributed servers
- Integration tests run at low QPS (100-1K), missing performance cliffs that only appear at 800K-1M+ QPS
- Canary deployments risk significant revenue (10% of platform traffic if billing broken)
This section covers three specialized testing strategies required for financial systems at scale.
Financial Accuracy Validation: Proving the ≤1% Budget Overspend Claim
The Challenge
Part 1’s architecture claims ≤1% budget overspend despite distributed contention. How do we prove this claim before production deployment?
The problem: Multiple servers (300+) concurrently decrementing shared advertiser budgets at 1M QPS creates inevitable race conditions. Without proper atomic operations (Part 3’s Redis DECRBY), budget overspend could reach 50-200% as servers race to approve the last available impressions.
Claim to validate: Part 1 guarantees ≤1% overspend through atomic distributed cache operations. This must be proven under realistic contention, not assumed.
Testing Methodology
Setup:
- Test campaigns: 1,000 campaigns with equal budgets
- Ad servers: 300 instances (production-scale)
- Load generation: 10M ad requests distributed across all servers
- Contention strategy: Intentional hot-spotting - route 50% of traffic to top 100 campaigns (simulates popular campaigns receiving disproportionate traffic)
Validation:
- Track approved impressions per campaign (count every impression the system serves)
- Calculate actual spend: actual_spend = (approved_impressions / 1000) × CPM
- Assert: (actual_spend - budget) / budget ≤ 1% for 99.5%+ campaigns
Why this methodology works:
- Realistic contention: 300 servers competing for same budgets mirrors production conditions
- Hot-spotting: Concentrating 50% traffic on 100 campaigns creates worst-case race conditions
- Statistical significance: 1,000 campaigns provides confidence interval for overspend distribution
- Binary validation: Either overspend ≤1% (claim proven) or >1% (architecture broken, must fix before launch)
What this validates:
- Part 1’s financial accuracy claim (≤1% overspend guarantee)
- Part 3’s atomic operations (Redis DECRBY correctness under contention)
- Race condition handling (proper distributed coordination)
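A simplified sketch of the post-test assertion, assuming per-campaign budgets and approved-impression counts were collected from the 300-server run; the default CPM value is illustrative:

```python
def overspend_pass_rate(budgets: dict, approved_impressions: dict, cpm: float = 5.0) -> float:
    """Fraction of campaigns whose actual spend stayed within the ≤1% overspend threshold."""
    within = 0
    for campaign_id, budget in budgets.items():
        actual_spend = approved_impressions.get(campaign_id, 0) / 1000 * cpm  # CPM = per 1,000
        overspend = max(0.0, (actual_spend - budget) / budget)
        if overspend <= 0.01:
            within += 1
    return within / len(budgets)

# Pass criterion from the methodology: 99.5%+ of the 1,000 test campaigns within 1%.
# assert overspend_pass_rate(budgets, approved_impressions) >= 0.995
```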
Why critical: Real advertisers will sue if systematic overspend >1%. This test detects the bug before it costs millions in refunds and destroyed trust.
Historical results from similar systems:
- 99.8% campaigns within 1% threshold (target: 99.5%)
- Edge cases: 1.1-1.3% overspend traced to Redis follower lag during failover (network partition delayed replication)
- Verdict: Acceptable within margin (documented as known limitation: “During network partitions, overspend may reach 1.3% for <1 minute”)
Scale Validation: Performance Testing Beyond Unit Tests
The problem: Systems that pass unit tests at 1K QPS often collapse at 800K-1M QPS due to emergent behaviors invisible at low scale.
Examples of scale-only failures:
- Cache stampede: 1,000 requests for expired cache key overwhelm database (only appears at high QPS)
- Connection pool exhaustion: 10K concurrent connections exceed database limits (low QPS never hits limits)
- GC pressure: 250MB/sec allocation at 1M QPS triggers 50ms GC pauses (low QPS has negligible allocation)
- Performance cliffs: System handles 800K QPS fine, collapses at 1.1M (non-linear degradation)
Critical Load Scenarios
Scenario 1: Baseline - 1M QPS Sustained (1 hour)
Configuration:
- Duration: 1 hour continuous load
- Request rate: 1M QPS sustained (no ramp, immediate full load)
- Concurrent users: 50K (simulates realistic concurrency)
- Traffic distribution: Realistic mix (60% mobile, 30% web, 10% tablet)
Success criteria:
- P95 latency < 150ms (SLO from Part 1)
- Error rate < 0.5% (acceptable error budget)
- No memory leaks (heap usage stable after 10 minutes)
- No connection pool exhaustion (all services maintain available connections)
What this validates:
- Latency budget claims from Part 1 (150ms P95 holds under sustained load)
- Capacity planning (1M QPS baseline sustained without degradation)
- Memory management (no leaks, GC stable)
Scenario 2: Burst - 1.5M QPS Spike (Black Friday Simulation)
Configuration:
- Ramp: 1M → 1.5M QPS over 5 minutes (realistic traffic spike)
- Duration: 30 minutes at 1.5M QPS
- Purpose: Validate 50% headroom claim from Part 1
Success criteria:
- Auto-scaling triggers within 2 minutes (Kubernetes HPA detects load)
- Zero dropped requests (queue depth remains manageable)
- P95 latency < 175ms (acceptable 25ms degradation during burst)
What this validates:
- Part 1’s 50% headroom claim (1.5M QPS capacity)
- Auto-scaling responsiveness (2-minute scale-up)
- Queue management (no request drops despite sudden spike)
Scenario 3: Degraded Mode - Simulated Service Failures
Configuration:
- Inject failures:
- User Profile Service: +50ms latency (simulates cache miss storm)
- Feature Store: 50% error rate (simulates database outage)
- RTB Gateway: 3 DSPs timeout (simulates external partner issues)
- Duration: 15 minutes
- Purpose: Validate Part 1’s graceful degradation claims
Success criteria:
- Circuit breakers trip within 60 seconds (automatic failure detection)
- Graceful degradation activates (fallback to cached predictions, contextual ads)
- P95 latency < 200ms (degraded but not catastrophic)
- Revenue impact < 30% (Part 1’s composite degradation prediction)
What this validates:
- Part 1’s degradation ladder (circuit breakers work as designed)
- Fallback paths functional (system doesn’t crash, serves degraded ads)
- Revenue impact matches predictions (validates degradation math)
Why these scenarios matter:
Unit tests can’t simulate distributed system behavior at scale:
- Cache stampede only appears when 1,000 concurrent requests hit expired key (impossible to replicate with 10 requests)
- Performance cliffs are non-linear - system works at 800K, collapses at 1.1M (can’t extrapolate from low-QPS tests)
- Emergent behaviors like connection pool exhaustion, GC pressure, network saturation only manifest at production scale
Testing infrastructure: Dedicated load testing cluster (separate from production) with production-equivalent configuration (same instance types, same database sizing, same network topology).
Shadow Traffic: Financial System Validation Without User Impact
Why Standard Canary Insufficient for Financial Systems
Standard canary deployment:
- Route 10% real traffic to new version (v1.3.0)
- Monitor error rates, latency
- If healthy → ramp to 100%
Problem for ads platforms:
- Revenue at risk: 10% of platform traffic exposed to potential billing bugs
- Detection lag: Typical canary runs 30-60 minutes before promoting
- Financial impact: 30-60 minutes of billing errors at 10% traffic scale can represent significant financial exposure and advertiser trust damage
- Advertiser trust: Even small billing discrepancies trigger complaints, refund demands, lost contracts
Risk unacceptable: Cannot afford even 10-minute bug detection window for financial code.
Shadow Traffic Approach: Zero User Impact Validation
Pattern:
- Primary (v1.2.3): Serves 100% of user traffic (production version)
- Shadow (v1.3.0): Receives 10% sampled traffic (new version being validated)
- Comparison engine: Compares responses, latency, billing calculations
- User impact: ZERO (shadow responses discarded, only primary responses returned to users)
Implementation:
- Traffic sampling: API Gateway duplicates 10% of requests
  - Original request → Primary service (v1.2.3) → user receives response
  - Duplicated request → Shadow service (v1.3.0) → response logged but discarded
- Response comparison:
  - Billing calculation diff: Shadow charges $5.02, primary $5.00 → flag for investigation
  - Latency regression: Shadow P95 = 160ms, primary P95 = 145ms → block deployment
  - Response divergence: 0.1% of requests return different ad_id → manual review
- Validation metrics:
  - Billing accuracy: Shadow vs primary spend must match within 0.1%
  - Latency: Shadow P95 must be ≤ primary P95 + 5ms (no regression)
  - Error rate: Shadow errors must be ≤ primary errors (no new failures)
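A minimal sketch of those promotion gates, assuming primary and shadow metrics have been aggregated per validation window; the dataclass fields and thresholds mirror the list above, everything else is illustrative:

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    spend: float            # summed billing amount in the window
    p95_latency_ms: float
    error_rate: float

def promotion_allowed(primary: WindowStats, shadow: WindowStats) -> bool:
    """Gate canary promotion on the three validation metrics above."""
    billing_ok = abs(shadow.spend - primary.spend) <= 0.001 * primary.spend  # within 0.1%
    latency_ok = shadow.p95_latency_ms <= primary.p95_latency_ms + 5.0       # ≤ primary + 5ms
    errors_ok = shadow.error_rate <= primary.error_rate                      # no new failures
    return billing_ok and latency_ok and errors_ok
```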
Ramp schedule:
| Week | Shadow Traffic % | Validation | Decision |
|---|---|---|---|
| Week 1 | 1% | Stability check (memory leaks, crashes) | Proceed or abort |
| Week 2 | 10% | Full validation (billing, latency, errors) | Proceed or abort |
| Week 3 | Canary 10% | Promote to canary only if zero billing discrepancies | Proceed or abort |
| Week 4+ | Ramp to 100% | Standard progressive rollout | Full deployment |
Why this works:
- Zero financial risk: Shadow traffic doesn’t impact user responses or billing
- Early detection: Billing discrepancies found before any real traffic exposure
- High confidence: 10% shadow = millions of requests validated before canary
- Reversibility: Any issues detected → abort promotion, zero user impact
Trade-off:
- Infrastructure cost: Running shadow service doubles compute for 10% of traffic (10% overhead)
- Engineering complexity: Comparison engine adds operational complexity
- Timeline: Adds 2-3 weeks to deployment cycle (shadow → canary → full rollout)
Value: Preventing a single $100K billing error (30 min bug exposure at 10% canary) pays for years of shadow infrastructure costs.
Shadow Traffic Flow Diagram
graph TB
PROD[Production Traffic
1M QPS]
GW[API Gateway
Traffic Splitter]
PRIMARY[Primary v1.2.3
Serves response to user
100% traffic]
SHADOW[Shadow v1.3.0
Logs results, discards
10% sampled]
USER[User receives response]
COMPARE[Comparison Engine
Latency, Errors, Response Diff
Billing Calculation Validation]
PROD --> GW
GW -->|100%| PRIMARY
GW -->|10% duplicate| SHADOW
PRIMARY --> USER
PRIMARY -.->|Metrics| COMPARE
SHADOW -.->|Metrics| COMPARE
COMPARE -->|Billing diff detected| ALERT[Alert + Block Promotion]
COMPARE -->|Latency regression| ALERT
COMPARE -->|All validations pass| PROMOTE[Promote to Canary]
style SHADOW fill:#ffffcc
style COMPARE fill:#ccffff
style ALERT fill:#ffcccc
style PROMOTE fill:#ccffcc
Diagram explanation:
- Primary service handles all user traffic (production stability)
- Shadow service receives duplicated sample (validation without user impact)
- Comparison engine validates billing accuracy, latency, error rates
- Alerts trigger on any discrepancy (billing diff, latency regression, error rate increase)
- Promotion only occurs after passing all validations (high confidence deployment)
Section Conclusion
Three specialized testing strategies for financial systems:
- Financial accuracy testing: Validates Part 1's ≤1% budget overspend claim under distributed contention (300 servers, 10M requests, intentional race conditions)
- Scale testing: Validates performance claims at production scale (1M QPS sustained, 1.5M burst, degraded mode scenarios that only manifest at high QPS)
- Shadow traffic: Validates financial code changes with zero user impact (catches billing bugs before canary exposure, preventing $15K+ errors)
Why critical:
- Standard testing (unit, integration, E2E) cannot detect distributed race conditions, performance cliffs, or billing calculation errors
- Ads platforms are financial systems - billing bugs destroy trust and trigger lawsuits
- Scale-specific failures (cache stampede, connection exhaustion, GC pressure) only appear at 800K-1M+ QPS
Cross-references:
- Part 1: Financial accuracy testing validates the ≤1% budget overspend claim and 1M QPS capacity claim
- Part 3: Scale testing validates atomic Redis operations (DECRBY correctness) under 300-server contention
Trade-offs accepted:
- Shadow traffic adds 10% infrastructure cost and 2-3 weeks to deployment timeline
- Value: Prevents single $100K billing error, pays for itself many times over
With fraud detection protecting against malicious traffic and critical testing validating financial accuracy at scale, the platform is ready for multi-region deployment to ensure high availability.
Multi-Region Deployment and Failover
Architectural Driver: Availability - Multi-region deployment with 20% standby capacity ensures we survive full regional outages without complete service loss. At scale, even 1-hour regional outage represents significant revenue impact. Auto-failover within 90 seconds minimizes impact to <0.1% daily downtime.
Why Multi-Region:
Business drivers:
- Latency requirements: Sub-100ms p95 latency is physically impossible with a single region serving global traffic. Network physics: US-East to EU is roughly 65-75ms round trip, already consuming most of the budget. Regional presence required.
- Availability: Single-region architecture has a single point of failure. AWS historical data: major regional outages occur 1-2 times per year, averaging 2-4 hours. A single outage can cause significant revenue loss proportional to platform scale and hourly revenue rate.
- Regulatory compliance: GDPR requires EU user data stored in EU. Multi-region enables data locality compliance.
- User distribution: for example 60% US, 20% Europe, 15% Asia, 5% other. Serving from the nearest region reduces latency by 50-100ms.
Normal Multi-Region Operation:
Region allocation (Active-Passive Model):
| Region | User Base | Normal Traffic | Role | Capacity |
|---|---|---|---|---|
| US-East-1 | 40% | 400K QPS | Primary US | 100% + 20% standby |
| US-West-2 | 20% | 200K QPS | Secondary US | 100% + 20% standby |
| EU-West-1 | 30% | 300K QPS | EU Primary | 100% + 20% standby |
| AP-Southeast-1 | 10% | 100K QPS | Asia Primary | 100% + 20% standby |
Deployment model: Active-passive within region pairs. Each region serves local users (lowest latency), can handle overflow from neighbor region (geographic redundancy), but cannot handle full global traffic (cost prohibitive).
Trade-off accepted: 20% standby insufficient for full regional takeover, but enables graceful degradation. Full redundancy (200% capacity per region) would triple infrastructure costs.
Traffic Routing & DNS:
Global load balancing: AWS Route53 with geolocation-based routing + health checks.
Normal operation:
- User in New York → routed to US-East-1 (10-15ms latency)
- User in London → routed to EU-West-1 (5-10ms latency)
- User in Singapore → routed to AP-Southeast-1 (8-12ms latency)
Health check mechanism:
Route53 Health Check Configuration:
- Protocol: HTTPS
- Path: /health/deep (checks database connectivity, not just simple “alive” response)
- Interval: 30 seconds (Standard tier) or 10 seconds (Fast tier)
- Failure threshold: 3 consecutive failures before marking unhealthy
- Health checkers: 15+ global endpoints test each region
- Decision logic: Healthy if ≥18% of checkers report success
Failover trigger: When health checks fail for 90 seconds (3 × 30s interval), Route53 marks region unhealthy and returns secondary region’s IP for DNS queries.
DNS TTL impact: Set to 60 seconds. After failover triggered, new DNS queries immediately return healthy region, existing client DNS caches expire within 60s (50% of clients fail over in 30s, 95% within 90s).
Why 60s TTL: Balance between fast failover and DNS query load. Lower TTL (10s) = 6× more DNS queries hitting Route53 nameservers. At high query volumes, this increases costs, but the primary concern is cache efficiency - shorter TTLs mean resolvers cache records for less time, reducing effectiveness of DNS caching infrastructure.
Health check vs TTL costs: Note that health check intervals (10s vs 30s) have different pricing tiers. The 6× query multiplier applies to DNS resolution, not health checks.
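For illustration, the health-check configuration above might be provisioned with boto3 roughly as follows; the domain name and caller reference are placeholders, and the snippet assumes AWS credentials are already configured:

```python
import boto3

route53 = boto3.client("route53")
route53.create_health_check(
    CallerReference="ads-us-east-1-deep-health-v1",               # placeholder idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "ads-us-east-1.example.com",  # placeholder endpoint
        "Port": 443,
        "ResourcePath": "/health/deep",   # deep check: database connectivity, not just liveness
        "RequestInterval": 30,            # Standard tier (10 = Fast tier)
        "FailureThreshold": 3,            # 3 consecutive failures → unhealthy
        "EnableSNI": True,
    },
)
```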
Data Replication Strategy:
CockroachDB (Billing Ledger, User Profiles):
Multi-region replication strategy:
Goal: Survive regional failures while maintaining data consistency and acceptable write latency.
Approach:
- Determine survival requirements: What failure scenarios must you tolerate?
  - Single AZ failure = need 3 replicas minimum (quorum of 2)
  - Single region failure = need cross-region distribution
  - Multiple concurrent failures = need higher replication factor
- Calculate replication factor: Based on consensus quorum requirements (a quick calculator sketch follows this list)
  - Quorum size = floor(replicas / 2) + 1
  - To survive N failures and maintain quorum: replicas ≥ 2N + 1
  - Example: survive 1 region failure → need at least 3 replicas (quorum=2, can lose 1)
  - Example: survive 2 region failures → need at least 5 replicas (quorum=3, can lose 2)
- Replica placement strategy: Distribute across regions based on traffic and failure domains
  - Place more replicas in high-traffic regions (reduces read latency)
  - Ensure geographic diversity (regions should have independent failure modes)
  - Balance cost vs resilience (more replicas = higher storage cost)
- Trade-offs:
  - More replicas = better fault tolerance but higher cost and write latency
  - Fewer replicas = lower cost but reduced resilience
  - Write latency increases with geographic spread (cross-region = 60-225ms vs same-region = 5-20ms)
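A quick calculator for the quorum arithmetic in the list above (function names are illustrative):

```python
def replicas_needed(failures_to_survive: int) -> int:
    return 2 * failures_to_survive + 1

def quorum_size(replicas: int) -> int:
    return replicas // 2 + 1

assert replicas_needed(1) == 3 and quorum_size(3) == 2   # survive 1 regional failure
assert replicas_needed(2) == 5 and quorum_size(5) == 3   # survive 2 concurrent failures
```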
Write path: Writes acknowledged when quorum of replicas confirm (Raft consensus). Cross-region write latency ranges 60-225ms (dominated by network RTT).
Read path: Reads served by nearest replica with bounded staleness for eventually-consistent reads (stale reads acceptable for most use cases). Strong-consistency reads must hit the leaseholder (higher latency, but guaranteed fresh data).
Multi-Region Coordination Model
Consistency modes, read routing, and latency impacts per data type:
| Data Type | Storage | Consistency Mode | Read Routing | Write Latency Impact |
|---|---|---|---|---|
| Billing Ledger | CockroachDB | Strong (linearizable) | Leaseholder only (cross-region) | 60-225ms (quorum across regions) |
| Campaign Configs | CockroachDB | Strong (linearizable) | Leaseholder only (cross-region) | 60-225ms (quorum across regions) |
| User Profiles | CockroachDB | Eventual (bounded staleness) | Nearest replica (local region) | 60-225ms (async after quorum) |
| Budget Counters | Redis (regional) | Local strong (per-region) | Local region only (no replication) | <1ms (in-region atomic ops) |
| User Sessions | Redis (regional) | Local strong (per-region) | Local region only (no replication) | <1ms (in-region atomic ops) |
| ML Features | Redis (regional) | Eventual (cache) | Local region only (rebuilt from Kafka) | <1ms (local write, eventual consistency) |
Key insights: Financial data accepts 60-225ms writes for strong consistency. User Profiles use local replicas (5-10ms reads, seconds staleness). Budget Counters achieve <1ms via regional isolation, accepting bounded loss during failures.
Cross-Region Write Latency Matrix
Measured round-trip time (RTT) between region pairs, showing physical network constraints:
| From ↓ / To → | US-East-1 | US-West-2 | EU-West-1 | AP-Southeast-1 |
|---|---|---|---|---|
| US-East-1 | 5-10ms | 60-70ms | 65-75ms | 210-225ms |
| US-West-2 | 60-70ms | 5-10ms | 115-125ms | 160-170ms |
| EU-West-1 | 65-75ms | 115-125ms | 5-10ms | 170-180ms |
| AP-Southeast-1 | 210-225ms | 160-170ms | 170-180ms | 5-10ms |
Write latency = slowest region in quorum set. Strong consistency (Billing, Campaigns): 60-225ms quorum writes. Eventual consistency (Profiles): 5-10ms local write, async propagation. Redis: <1ms local-only, no cross-region sync.
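More precisely, a Raft write commits once the fastest quorum has acknowledged, so the cost is the RTT to the slowest member of that quorum as seen from the leaseholder. A small sketch using upper-bound values from the matrix above (region keys and the helper function are illustrative):

```python
# Upper-bound RTTs (ms) from the matrix above.
RTT_MS = {
    "us-east-1": {"us-east-1": 10, "us-west-2": 70, "eu-west-1": 75, "ap-southeast-1": 225},
    "us-west-2": {"us-east-1": 70, "us-west-2": 10, "eu-west-1": 125, "ap-southeast-1": 170},
    "eu-west-1": {"us-east-1": 75, "us-west-2": 125, "eu-west-1": 10, "ap-southeast-1": 180},
}

def quorum_write_latency(leaseholder: str, replica_regions: list[str]) -> int:
    """RTT to the slowest member of the fastest quorum, as seen from the leaseholder."""
    rtts = sorted(RTT_MS[leaseholder][region] for region in replica_regions)
    quorum = len(replica_regions) // 2 + 1
    return rtts[quorum - 1]

# 3 replicas led from us-east-1: quorum of 2 ≈ 70ms (local replica + us-west-2 ack).
print(quorum_write_latency("us-east-1", ["us-east-1", "us-west-2", "eu-west-1"]))
```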
Read Routing Strategy
| Data Type | Read Type | Routing Logic | Latency (Regional / Cross-Region) |
|---|---|---|---|
| Billing Ledger | Strong read | Route to leaseholder (may be cross-region) | 5-10ms (local) / 60-225ms (remote) |
| Campaign Configs | Strong read | Route to leaseholder (may be cross-region) | 5-10ms (local) / 60-225ms (remote) |
| User Profiles | Stale read (default) | Nearest replica (always local) | 5-10ms (local only) |
| User Profiles | Fresh read (rare) | Route to leaseholder (may be cross-region) | 5-10ms (local) / 60-225ms (remote) |
| Budget Counters | Atomic read | Local Redis only (no cross-region) | <1ms (local only) |
| ML Features | Cache read | Local Redis only (no cross-region) | <1ms (local only) |
Trade-off: User profile updates written to US-East (5-10ms) may appear stale in EU-West for seconds due to replication lag. Acceptable for targeting (outdated interests don’t impact revenue materially), but cross-region strong reads (65-75ms) would violate 10ms User Profile SLA.
Redis (Budget Pre-Allocation, User Sessions):
CRITICAL ARCHITECTURAL DECISION: Redis does NOT replicate across regions in this design.
Why no cross-region Redis replication:
- Latency: Redis replication is synchronous or asynchronous. Synchronous = 50-100ms write latency (violates our <1ms budget enforcement requirement). Asynchronous = data loss during failover.
- Complexity: Redis Cluster cross-region replication requires custom solutions (RedisLabs, custom scripts).
- Acceptable trade-off: Budget pre-allocations are already bounded loss (see below).
Each region has independent Redis:
- US-East Redis: Stores budget pre-allocations for campaigns served in US-East
- EU-West Redis: Independent budget allocations for EU campaigns
- No cross-region replication
Data Consistency During Regional Failover (CRITICAL):
The Budget Counter Problem: When US-East fails, what happens to budget allocations stored in US-East Redis?
Example scenario:
- Campaign daily budget: B_daily
- US-East pre-allocated: 0.30 × B_daily (stored in US-East Redis)
- US-West pre-allocated: 0.40 × B_daily (stored in US-West Redis)
- EU-West pre-allocated: 0.30 × B_daily (stored in EU-West Redis)
- US-East fails at 2pm, having spent half of its allocation
What happens:
- Immediate impact: Remaining allocation (0.15 × B_daily) in US-East Redis is lost (region unavailable)
- US-West takes over US-East traffic: Continues spending from its own allocation
- Bounded over-delivery: Max over-delivery = lost US-East allocation = 0.15 × B_daily
- Percentage impact: 15% over-delivery (exceeds our 1% target!)
Mitigation: CockroachDB-backed allocation tracking (implemented)
Every 60 seconds, each region writes actual spend to CockroachDB:
Heartbeat update (US-East region, every 60s while healthy):
- Update campaign budget tracking table
- Set region-specific allocated amount (e.g., 0.30 × B_daily)
- Set region-specific spent amount (e.g., 50% of allocation)
- Update last heartbeat timestamp to current time
- Filter by campaign ID
Failover recovery process:
- T+0s: US-East fails
- T+90s: Health checks trigger failover, US-West starts receiving US-East traffic
- T+120s: Atomic Pacing Service detects US-East heartbeat timeout (last write was 120s ago)
- T+120s: Atomic Pacing Service reads last known state from CockroachDB:
- US-East allocated: 0.30 × B_daily
- US-East spent: 50% of allocation (written 120s ago)
- Remaining (uncertain): ~15% of B_daily
- T+120s: Atomic Pacing Service marks US-East allocation as “failed” and removes from available budget
- Result: 15% of budget locked but not over-delivered
Bounded under-delivery: Max under-delivery = unspent allocation in failed region = 15% of budget.
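A minimal sketch of the heartbeat write and the pacing-service lock, assuming a campaign_budget_regions table in CockroachDB; the table and column names are assumptions, and `execute` stands in for whatever database client the services use:

```python
def write_heartbeat(execute, campaign_id: str, region: str, allocated: float, spent: float):
    """Each healthy region records its allocation and actual spend every 60 seconds."""
    execute(
        """UPDATE campaign_budget_regions
              SET allocated = %s, spent = %s, last_heartbeat = now()
            WHERE campaign_id = %s AND region = %s""",
        (allocated, spent, campaign_id, region),
    )

def lock_failed_allocations(execute, campaign_id: str):
    """Pacing service: mark allocations with stale heartbeats (>120s) as failed so the
    unspent remainder is locked rather than re-spent by surviving regions."""
    execute(
        """UPDATE campaign_budget_regions
              SET status = 'failed'
            WHERE campaign_id = %s
              AND status = 'active'
              AND last_heartbeat < now() - INTERVAL '120 seconds'""",
        (campaign_id,),
    )
```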
Concrete example with dollar amounts:
Campaign with $10,000 daily budget:
- US-East pre-allocated: $3,000 (30%)
- US-West pre-allocated: $4,000 (40%)
- EU-West pre-allocated: $3,000 (30%)
- US-East fails after spending $1,500 (50% of allocation)
- Lost allocation: $1,500 remaining in US-East Redis
- System response: Atomic Pacing Service locks the lost $1,500
- Result: Campaign delivers $8,500 worth of ads instead of $10,000
- Outcome: 15% under-delivery → refund advertiser $1,500
Why under-delivery is acceptable:
- Advertiser complaint: “I paid for $10K budget, only got $8.5K delivered” → refund $1,500 difference
- Better than over-delivery: “I paid for $10K budget, you charged me $11,500” → lawsuit risk, regulatory violations
Failure Scenario: US-East Regional Outage
Scenario: Primary region (US-East) fails, handling 40% of traffic. What happens?
Failover Timeline:
| Time | Event | System State |
|---|---|---|
| T+0s | Health check failures detected | DNS TTL delay (60s) |
| T+30s | 3× traffic hits US-West | CPU: 40%→85%, standby activated |
| T+60s | Auto-scaling triggered | Provisioning new capacity |
| T+90s | Cache hit degradation | Latency p95: 100ms→150ms |
| T+90s | Route53 marks US-East unhealthy | DNS failover begins |
| T+90-100s | New instances online | Capacity restored (30-40s provisioning after T+60s trigger) |
| T+120s | Atomic Pacing Service locks US-East allocations | Under-delivery protection active |
Why 20% Standby is Insufficient:
The timeline above shows a critical problem: from T+30s to T+90-100s (60-70 seconds with modern tooling), US-West is severely overloaded. To understand why, we need queueing theory.
Capacity Analysis:
Server utilization in queueing theory: $$\rho = \frac{\lambda}{c \mu}$$
where:
- \(\lambda\) = arrival rate (requests per second)
- \(c\) = number of servers
- \(\mu\) = service rate per server
- \(\rho\) = utilization (0 to 1+ scale)
Critical thresholds:
- \(\rho < 0.8\): Stable operation, reasonable queue lengths
- \(0.8 < \rho < 1.0\): Queues grow, latency increases
- \(\rho \geq 1.0\): System unstable, queues grow unbounded
Normal operation (US-West):
- Traffic: 200K QPS
- Capacity: 300K QPS (with 20% standby)
- \(\rho = 200K / 300K = 0.67\) (stable)
During US-East failure (US-West receives 40% of total traffic):
- Traffic: 200K + 400K = 600K QPS
- Capacity: 300K QPS (20% standby already activated)
- \(\rho = 600K / 300K = 2.0\) (severe overload)
Auto-scaling limitations: Kubernetes HPA triggers at T+60s, but provisioning new capacity takes 30-40 seconds for GPU-based ML inference nodes with modern tooling (pre-warmed images, model streaming, VRAM caching). Without optimization, this can extend to 90-120s (cold pulls, full model loading). During this window, the system operates at 2× over capacity, making graceful degradation essential.
Mitigation: Graceful Degradation + Load Shedding
Architectural Driver: Availability - During regional failures, graceful degradation (serving stale cache, shedding low-value traffic) maintains uptime while minimizing revenue impact. Better to serve degraded ads than no ads.
The system employs a two-layer mitigation strategy:
Layer 1: Service-Level Degradation (Circuit Breakers)
- ML Inference: Switch to cached CTR predictions (-8% revenue impact)
- User Profiles: Serve stale cache with 5-minute TTL (-5% impact)
- RTB Auction: Reduce to top 20 DSPs only (-6% impact)
Layer 2: Load Shedding (Utilization-Based)
When utilization exceeds capacity despite degradation:
| Utilization | Action | Logic |
|---|---|---|
| <70% | Accept all | Normal operation |
| 70-90% | Accept all + degrade services | Circuit breakers active, auto-scaling triggered |
| >90% | Value-based shedding | Accept high-value (>P95), reject 50% of low-value |
Combined impact during regional failover:
- Service degradation: ~27% revenue reduction (from circuit breaker activations)
- Load shedding (if needed): Reject 47.5% of lowest-value traffic, preserve 97.5% of remaining revenue
- Net result: System stays online, handles capacity constraint within 30-40s auto-scaling window (modern tooling) or 90-120s (legacy deployments)
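A sketch of the shedding policy from the table above; the utilization figure and the request-value P95 are assumed to come from the capacity controller and the bid-value distribution respectively:

```python
import random

def admit_request(utilization: float, request_value: float, value_p95: float) -> bool:
    """Utilization-based admission control from the table above."""
    if utilization < 0.70:
        return True                    # normal operation: accept everything
    if utilization <= 0.90:
        return True                    # accept all; circuit breakers already degrading services
    if request_value > value_p95:
        return True                    # always protect high-value requests
    return random.random() >= 0.5      # shed 50% of low-value traffic
```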
Failback Strategy:
After US-East recovers, gradual traffic shift back:
Automated steps:
- T+0: US-East infrastructure restored, health checks start passing
- T+5min: Route53 marks US-East healthy again, BUT weight set to 0%
- T+5min: Manual verification: Engineering team checks metrics, error rates
- T+10min: Traffic ramp begins: 5% → 10% → 25% → 50% → 100% over 30 minutes
- T+40min: Full traffic restored to US-East
Manual gates: Failback is semi-automatic. Requires manual approval at each stage to prevent cascade failures.
Data reconciliation:
CockroachDB: Already consistent (Raft consensus maintained across regions). Redis: Rebuild from scratch (Atomic Pacing Service re-allocates budgets based on CockroachDB source of truth, cold cache for 10-20 minutes).
Why gradual failback: Prevents “split-brain” scenario where both regions think they’re primary.
Cost Analysis: Multi-Region Economics
Infrastructure cost multipliers:
| Component | Single Region | Multi-Region (4 regions) | Multiplier |
|---|---|---|---|
| Compute (ad servers, ML) | Baseline | 3× baseline | 3× |
| CockroachDB (5 replicas) | Baseline | 3× baseline | 3× |
| Redis (per region) | Baseline | 3× baseline | 3× |
| Cross-region data transfer | None | 30% of baseline | Significant (new cost category) |
| Route53 (health checks) | Baseline | 3× baseline | 3× |
| Total | Baseline | 3.3× baseline | 3.3× |
Cross-region data transfer breakdown:
- CockroachDB replication: 5 replicas × request volume × average payload size
- Metric/log aggregation: Centralized monitoring across regions
- Backup replication: Cross-region redundancy
Key cost drivers:
- Linear scaling (3×): Compute, databases, cache replicate fully per region
- New cost category: Cross-region data transfer (~30% of baseline compute costs)
- Marginal costs: DNS health checks scale linearly but are negligible
Economic justification:
Single region annual risk:
- Regional outages: 1-2 per year (AWS historical average)
- Average duration: 2-4 hours
- Infrastructure availability: 99.8-99.9% (accounting for regional outages)
Multi-region infrastructure availability: 99.99%+ (survives full regional failures)
Note: Our service SLO remains 99.9% regardless of deployment strategy. Multi-region provides availability headroom - the infrastructure supports higher uptime than we commit to users, providing buffer for application-level failures.
Trade-off analysis:
- Multi-region additional cost: 2.3× baseline annual infrastructure cost
- Benefits: +0.1-0.2% infrastructure availability improvement, 50-100ms latency reduction for international users, GDPR compliance
- Break-even: Multi-region pays off if single regional outage costs exceed 2.3× annual infrastructure baseline
Intangible benefits:
- Reputation protection (uptime matters for advertiser trust)
- Regulatory compliance (GDPR data locality requirements)
- Competitive advantage (global latency consistency)
Decision: Multi-region worth the 3.3× cost multiplier for platforms where revenue rate justifies availability investment.
Note on cost multiplier breakdown: The 3.3× figure is derived from:
- 3 active regions × 100% compute = 3.0× (US-East, US-West, EU)
- Cross-region data transfer ≈ +0.3× baseline (CockroachDB Raft replication, Kafka mirroring, CDN egress)
- Shared control plane ≈ -0.2× savings (observability stack, CI/CD, model training run once)
- Passive standby region (APAC) adds +0.2× for data replication only
- Total: 3.3× (range: 2.5-4× depending on active-active vs active-passive architecture)
Industry validation: Dual-region setups cost 1.3-2× (not 2×) due to shared infrastructure. For 4-region deployments, the multiplier falls between 3-3.5× based on documented case studies. This estimate is order-of-magnitude accurate but workload-dependent.
Capacity conclusion: 20% standby insufficient for immediate regional takeover, but combined with auto-scaling (30-40s with modern tooling, 90-120s legacy) and graceful degradation, provides cost-effective resilience. Alternative (200% over-provisioning per region) would reach 8-10× baseline costs. Trade-off: Accept degraded performance and bounded under-delivery during rare regional failures rather than excessive capacity overhead.
Schema Evolution: Zero-Downtime Data Migration
The Challenge:
You’ve been running your CockroachDB-based user profile store for 18 months. It’s grown to 4TB across 60 nodes. Now the product team wants to add a complex new feature that requires fundamental schema changes:
- Add new column for user preferences (JSONB structure)
- Modify table partitioning to include region for data locality compliance (GDPR)
- Add secondary index on last_active_timestamp for better query performance
The constraint: Zero downtime. You can’t take the platform offline for migration.
Why Schema Evolution in Distributed SQL:
CockroachDB (distributed SQL) provides native schema migration support with ALTER TABLE, but large-scale changes still require careful planning:
- Online schema changes - CockroachDB supports most DDL operations without blocking (ADD COLUMN, CREATE INDEX with CONCURRENTLY)
- Strong consistency - ACID guarantees mean no dual-schema reads (unlike eventual consistency systems)
- Massive scale - 4TB migration for index backfill = 2-4 hours with proper throttling
- Version compatibility - Application code should use backward-compatible queries during rolling deployment
Zero-Downtime Migration Strategy:
Phase 1: Add Column (Non-blocking - Day 1)
CockroachDB supports online schema changes with ALTER TABLE:
Schema change (non-blocking, returns immediately):
- Add new JSONB column to user_profiles table
- Column name: preferences
- Default value: empty JSON object
- Backfill happens asynchronously
- Reads see NULL or default during backfill period
Application code updated to write to new column immediately. Reads handle both NULL (old rows) and populated (new rows) gracefully.
No dual-write complexity: ACID transactions guarantee consistency - either transaction sees new schema or old schema, never inconsistent state.
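As a sketch, the Phase 1 change might be issued from a migration script roughly as follows; the column and table names come from the scenario above, and the `execute` callable is a stand-in for your database driver:

```python
ADD_PREFERENCES_COLUMN = """
ALTER TABLE user_profiles
    ADD COLUMN IF NOT EXISTS preferences JSONB NOT NULL DEFAULT '{}';
"""

def migrate_phase1(execute):
    """Returns quickly; CockroachDB backfills the default value asynchronously."""
    execute(ADD_PREFERENCES_COLUMN)
```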
Phase 2: Add Index (Background with throttling - Week 1-2)
Create index with CONCURRENTLY to avoid blocking writes:
Index creation (concurrent, non-blocking):
- Create index on user_profiles table
- Index name: idx_last_active
- Indexed column: last_active_timestamp
- Runs in background without blocking writes
- Uses concurrent mode to avoid table locks
Index backfill rate:
CockroachDB throttles background index creation to ~25% of cluster resources to avoid impacting production traffic. For 4TB data:
$$T_{index} = \frac{4000 \text{ GB}}{1000 \text{ MB/s} \times 0.25} \approx 4.5 \text{ hours}$$
(assuming roughly 1 GB/s of aggregate backfill throughput across the cluster, throttled to 25%)
Monitor progress: SHOW JOBS displays percentage complete and estimated completion time.
Phase 3: Partition Restructuring (Complex - Week 2-4)
Modifying table partitioning (adding region to partition key) requires creating new table with desired partitioning, then migrating data. This is the only operation that requires dual-write pattern:
Create new partitioned table (user_profiles_v2):
- Columns: user_id (UUID), region (STRING), plus all existing columns
- Primary key: Composite key (region, user_id)
- US partition: Contains rows where region = ‘US’
- EU partition: Contains rows where region = ‘EU’
- ASIA partition: Contains rows where region = ‘ASIA’
Dual-write application logic (temporary, 2-4 weeks):
- Write to both
user_profilesanduser_profiles_v2 - Read from
user_profiles(authoritative) - Background job migrates historical data
- After validation, switch reads to
user_profiles_v2 - Drop
user_profiles
Why this is simpler than Cassandra:
- ACID transactions eliminate consistency issues during migration
- No token range management - just batch SELECT/INSERT
- Built-in backpressure and throttling mechanisms
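A hedged sketch of the temporary dual-write wrapper described above; the UPSERT statements target the table names from the text, while the flags, driver, and error handling are simplified assumptions:

```python
def upsert_profile(execute, profile: dict, dual_write_enabled: bool = True):
    """Write to the old table, and (while the flag is on) to the new partitioned table."""
    execute(
        "UPSERT INTO user_profiles (user_id, region, preferences) VALUES (%s, %s, %s)",
        (profile["user_id"], profile["region"], profile["preferences"]),
    )
    if dual_write_enabled:  # switched off only after user_profiles_v2 is validated and cut over
        execute(
            "UPSERT INTO user_profiles_v2 (region, user_id, preferences) VALUES (%s, %s, %s)",
            (profile["region"], profile["user_id"], profile["preferences"]),
        )

def read_profile(execute, user_id: str, reads_cut_over: bool = False):
    """Old table stays authoritative until validation completes, then reads flip to v2."""
    table = "user_profiles_v2" if reads_cut_over else "user_profiles"
    return execute(f"SELECT * FROM {table} WHERE user_id = %s", (user_id,))
```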
Rollback Strategy:
At any point during migration, rollback is possible:
| Phase | Rollback Complexity | Max Data Loss |
|---|---|---|
| Phase 1-2 (Dual-write) | Easy - flip read source back to old schema | 0 (both schemas current) |
| Phase 3-4 (Gradual cutover) | Medium - revert traffic percentage | 0 (still dual-writing) |
| Phase 5 (Cleanup started) | Hard - restore from archive | Up to 90 days if archive corrupted |
Critical lesson: Keep dual-write active for 2+ weeks after full cutover to ensure new schema stability before cleanup.
CockroachDB-Specific Advantages:
Online schema changes:
CockroachDB performs most schema changes online without blocking - adding columns, creating indexes, and modifying constraints happen in the background while applications continue to operate normally.
Partition restructuring complexity:
Changing primary key requires full rewrite - you can’t update partition key in place:
Schema change:
- Old schema: PRIMARY KEY (user_id)
- New schema: PRIMARY KEY ((region, user_id))
This requires complete data copy to new table with reshuffling across nodes. Plan for 2-4 week migration window for large datasets (estimate varies based on data volume, cluster capacity, and acceptable impact on production traffic).
Trade-off Analysis: Zero-Downtime vs Maintenance Window Migration
Context: Database schema changes (like changing primary keys or sharding strategies) require data migration. The choice is between engineering complexity (zero-downtime) vs business impact (downtime).
Option A: Zero-downtime migration (described above)
- Timeline: ~8 weeks (2 weeks dual-write setup + 4 weeks background migration + 2 weeks validation/cutover)
- Engineering investment: ~2 Senior/Staff engineers × 8 weeks (0.3-0.4 engineer-years)
- Additional overhead: Test infrastructure, dual-write complexity, extensive validation
- Risk profile: Low - gradual rollout with continuous validation and rollback capability
- Business impact: Zero downtime - platform remains fully operational throughout
Option B: Maintenance window migration
- Timeline: 12-24 hour downtime window (optimistic estimate - issues can extend this significantly)
- Engineering investment: ~1 engineer × 2 weeks preparation + execution window
- Simplicity: Direct data copy - simpler implementation, less code complexity
- Risk profile: Medium-High - single point of failure, rollback requires restoration from backup
- Business impact: 12-24 hours of complete downtime, which the decision framework below costs at 12-24 days' worth of revenue (each downtime hour valued at roughly a full day of revenue once indirect losses are included)
Decision framework:
| Factor | Zero-Downtime | Maintenance Window |
|---|---|---|
| Engineering cost | 0.3-0.4 engineer-years | ~0.05 engineer-years |
| Complexity | High (dual-write, background sync) | Low (direct copy) |
| Business impact | Zero downtime | 12-24 hours of revenue lost |
| Cost ratio | 1× (baseline) | 40-70× revenue impact vs engineering cost |
Decision: For revenue-generating platforms at scale, zero-downtime migration is economically justified by 40-70×. The engineering investment (0.3-0.4 engineer-years) is negligible compared to the revenue lost during a 12-24 hour outage at this scale.
This conclusion holds across wide parameter ranges: even if engineering costs are 2× higher or platform traffic is 5× lower, zero-downtime migration remains the optimal choice for business-critical systems.
Distributed Clock Synchronization and Time Consistency
Architectural Driver: Financial Accuracy - Clock skew across regions can cause budget double-allocation or billing disputes. HLC + bounded allocation windows guarantee deterministic ordering for financial transactions.
Problem: Multi-region systems require accurate timestamps for budget tracking and billing reconciliation. Clock drift (1-50ms/day per server) causes billing disputes, budget race conditions, and causality violations. Without synchronization, 1000 servers can diverge by 50s in one day.
Solution Spectrum: NTP → PTP → Global Clocks
| Technology | Accuracy | Cost | Use Case |
|---|---|---|---|
| NTP (Network Time Protocol) | ±50ms (public), ±10ms (local) | Free | General-purpose time sync |
| PTP (Precision Time Protocol) | ±100μs | Medium (hardware switches) | High-frequency trading, telecom |
| GPS-based Clocks | ±1μs | High (GPS receivers per rack) | Critical infrastructure |
| Google Spanner TrueTime | ±7ms (bounded uncertainty) | Very high (proprietary) | Global strong consistency |
| AWS Time Sync Service | <100μs (modern instances), ±1ms (legacy) | Free (on AWS) | Cloud deployments (Nitro system, 2021+) |
Multi-tier time synchronization:
Tier 1 - Event Timestamping: AWS Time Sync (<100μs with modern instances, ±1ms legacy, free). Network latency (20-100ms) dwarfs clock skew, making NTP sufficient for impressions/clicks.
Tier 2 - Financial Reconciliation: CockroachDB built-in HLC provides automatic globally-ordered timestamps: \(HLC = (t_{physical}, c_{logical}, id_{node})\). Guarantees causality preservation (if A→B then HLC(A) < HLC(B)) and deterministic ordering via logical counters + node ID tie-breaking.
Clock skew mitigation: Create 200ms “dead zone” at day boundaries (23:59:59.900 to 00:00:00.100) where budget allocations are forbidden. Prevents regions with skewed clocks from over-allocating across day boundaries.
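A minimal sketch of that guard, assuming allocation requests funnel through a single helper; the function and key names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEAD_ZONE_MS = 100  # +/-100ms around midnight UTC => 200ms total dead zone

def in_day_boundary_dead_zone(now: Optional[datetime] = None) -> bool:
    """True inside 23:59:59.900 - 00:00:00.100 UTC, when allocations are forbidden."""
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    next_midnight = midnight + timedelta(days=1)
    distance_s = min((now - midnight).total_seconds(),
                     (next_midnight - now).total_seconds())
    return distance_s * 1000 < DEAD_ZONE_MS

def allocate_from_redis(campaign_id: str, amount_cents: int) -> bool:
    # Stub for the real Redis DECRBY-based allocation path.
    return True

def request_budget_allocation(campaign_id: str, amount_cents: int) -> bool:
    if in_day_boundary_dead_zone():
        # Defer; the caller retries after the dead zone. This prevents two regions with
        # clocks skewed across midnight from both allocating against "today's" budget.
        return False
    return allocate_from_redis(campaign_id, amount_cents)
```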
Architecture decision: AWS Time Sync (<100μs with modern instances, ±1ms legacy, free) + CockroachDB built-in HLC. Google Spanner’s TrueTime (±7ms) not worth complexity given 20-100ms network variability.
Note on AWS Time Sync accuracy: AWS upgraded Time Sync Service in 2021. Current-generation EC2 instances (Nitro system, 2021+) achieve <100μs accuracy using PTP hardware support. Older instance types (pre-2021 AMIs) see ±1ms. For this architecture, assume modern instances (<100μs). If using legacy infrastructure, adjust HLC uncertainty interval accordingly (see CockroachDB --max-offset flag).
Advantage: Eliminates ~150 lines of custom HLC code, provides battle-tested clock synchronization.
Monitoring: Alert if clock offset >100ms, HLC logical counter growth >1000/sec sustained, or budget discrepancy >0.5% of daily budget.
Global Event Ordering for Financial Ledgers: The External Consistency Challenge
Architectural Driver: Financial Accuracy - Financial audit trails require globally consistent event ordering across regions. CockroachDB’s HLC-timestamped billing ledger provides near-external consistency, ensuring that events are ordered chronologically for regulatory compliance. S3 + Athena serves as immutable cold archive for 7-year retention.
The Problem: Global Event Ordering
Budget pre-allocation (Redis) solves fast local enforcement, but billing ledgers require globally consistent event ordering across regions. Without coordinated timestamps, audit trails can show incorrect event sequences.
Example: US-East allocates budget amount A (T1), EU-West spends A exhausting budget (T2). Without coordinated timestamps, separate regional databases using local clocks might timestamp T1 after T2 due to clock skew, showing wrong ordering in audit logs.
Solution: CockroachDB HLC-Timestamped Ledger
CockroachDB provides near-external consistency using Hybrid Logical Clocks: $$HLC = (pt, c)$$ where pt = physical time, c = logical counter.
Guarantee: Causally related transactions get correctly ordered timestamps via Raft consensus. CockroachDB’s HLC uncertainty interval is dynamically bounded - legacy deployments use the 500ms max_offset default, while modern deployments with AWS Time Sync achieve <2ms uncertainty, a reduction of more than two orders of magnitude (see CockroachDB issue #75564). Independent transactions within this uncertainty window may have ambiguous ordering, but this is acceptable - even with 2ms uncertainty, network latency (60-225ms) already dominates, and causally related events (same campaign) are correctly ordered.
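For intuition, this is roughly the HLC update rule CockroachDB applies on every send and receive - a conceptual Python sketch only, not the production implementation.

```python
import time
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class HLC:
    pt: int  # physical time component (microseconds since epoch)
    c: int   # logical counter, breaks ties when physical clocks are equal or skewed

class HLCClock:
    def __init__(self) -> None:
        self.state = HLC(pt=0, c=0)

    def _wall(self) -> int:
        return int(time.time() * 1_000_000)

    def now(self) -> HLC:
        """Local event or message send."""
        wall = self._wall()
        if wall > self.state.pt:
            self.state = HLC(pt=wall, c=0)
        else:
            # Wall clock lags the highest timestamp seen: advance the logical counter instead.
            self.state = HLC(pt=self.state.pt, c=self.state.c + 1)
        return self.state

    def update(self, remote: HLC) -> HLC:
        """Message receive: merge the remote timestamp so A -> B implies HLC(A) < HLC(B)."""
        wall = self._wall()
        pt = max(wall, self.state.pt, remote.pt)
        if pt == self.state.pt == remote.pt:
            c = max(self.state.c, remote.c) + 1
        elif pt == self.state.pt:
            c = self.state.c + 1
        elif pt == remote.pt:
            c = remote.c + 1
        else:
            c = 0
        self.state = HLC(pt=pt, c=c)
        return self.state
```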
Requirements met:
- SOX/MiFID regulatory compliance (chronologically ordered financial records, 5-7 year retention)
- Legal dispute resolution (“Did impression X happen before budget exhaustion?”)
- Audit trail correctness for billing reconciliation
Architecture Decision: Three-Tier Financial Data Storage
graph LR
ADV["Ad Server
1M QPS
Local budget: 0ms"]
REDIS[("Tier 1: Redis
Atomic DECRBY
Allocation only")]
CRDB[("Tier 2: CockroachDB
HLC Timestamps
10-15ms
90-day hot")]
S3[("Tier 3: S3 Glacier + Athena
Cold Archive
7-year retention")]
ADV -.->|"Allocation request
Every 30-60s (async)"| REDIS
REDIS -->|"Reconciliation
Every 5 min"| CRDB
CRDB -->|"Nightly archive
Parquet format"| S3
classDef fast fill:#e3f2fd,stroke:#1976d2
classDef ledger fill:#fff3e0,stroke:#f57c00
classDef archive fill:#f3e5f5,stroke:#7b1fa2
class REDIS fast
class CRDB ledger
class S3 archive
Why This Three-Tier Architecture:
| Tier | Technology | Purpose | Consistency Requirement |
|---|---|---|---|
| Local Counter | In-memory CAS | Per-request spend tracking (0ms) | Atomic in-memory operations |
| Tier 1: Allocation | Redis | Global budget allocation (async) | Atomic DECRBY/INCRBY |
| Tier 2: Billing Ledger | CockroachDB | Financial audit trail with global ordering | Serializable + HLC ordering |
| Tier 3: Cold Archive | S3 Glacier + Athena | 7-year regulatory retention | None (immutable archive) |
Workflow:
- Per-request spend (1M QPS): Local in-memory counter increment (0ms, not in critical path)
- Allocation request (every 30-60s): Ad Server requests budget chunk from Redis via DECRBY (async)
- Reconciliation (every 5min): Ad Server reports spend to CockroachDB with HLC timestamps
- Nightly archival: Export 90-day-old records to S3 Glacier in Parquet format (7-year retention, queryable via Athena for compliance audits)
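A minimal sketch of this split, assuming a redis-py client and a DB-API CockroachDB connection; the chunk size, intervals, and table names are illustrative assumptions.

```python
import threading

class BudgetPacer:
    """Per-campaign local counter (fast path) with Redis allocation and ledger reconciliation."""

    def __init__(self, redis_client, crdb_conn, campaign_id: str, chunk_cents: int = 50_000):
        self.redis = redis_client
        self.crdb = crdb_conn
        self.campaign_id = campaign_id
        self.chunk_cents = chunk_cents      # budget chunk pulled from Redis every 30-60s
        self.local_remaining = 0            # in-memory, 0ms on the 1M QPS request path
        self.spent_since_reconcile = 0
        self.lock = threading.Lock()

    def charge(self, price_cents: int) -> bool:
        """Per-request spend: purely in-memory, never touches the network."""
        with self.lock:
            if self.local_remaining < price_cents:
                return False                # allocation exhausted; skip paid serving
            self.local_remaining -= price_cents
            self.spent_since_reconcile += price_cents
            return True

    def refill_from_redis(self) -> None:
        """Every 30-60s: atomically take a chunk from the global Redis budget counter."""
        remaining_after = self.redis.decrby(f"budget:{self.campaign_id}", self.chunk_cents)
        if remaining_after >= 0:
            with self.lock:
                self.local_remaining += self.chunk_cents
        else:
            # Over-drew the global counter: return the chunk and stop refilling.
            self.redis.incrby(f"budget:{self.campaign_id}", self.chunk_cents)

    def reconcile_to_ledger(self) -> None:
        """Every 5 min: report actual spend to the HLC-timestamped billing ledger."""
        with self.lock:
            spend, self.spent_since_reconcile = self.spent_since_reconcile, 0
        with self.crdb.cursor() as cur:
            cur.execute(
                "INSERT INTO billing_ledger (campaign_id, spend_cents, reported_at) "
                "VALUES (%s, %s, now())",
                (self.campaign_id, spend),
            )
        self.crdb.commit()
```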
Cost Analysis:
| Component | Technology | Relative Cost |
|---|---|---|
| Fast path | Redis Cluster (20 nodes) | 18-22% |
| Billing ledger (90-day hot) | CockroachDB (60-80 nodes) | 77-80% |
| Cold archive (7-year) | S3 Glacier + Athena | 1-2% |
| Total financial storage | (all tiers) | 100% baseline |
Why S3 Glacier + Athena over PostgreSQL:
- Cost: S3 Glacier is 50-100× cheaper than active database storage for cold data
- Compliance queries: SOX/MiFID audits happen quarterly/annually, not daily - Athena query latency (seconds) is acceptable
- Operational complexity: No database to operate, patch, backup, or scale
- Query capability: Athena provides SQL interface for regulatory audits without maintaining a running database
- Immutability: S3 Object Lock enforces WORM (Write-Once-Read-Many) for regulatory compliance
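As an illustration of the audit query path, a compliance query against the Parquet archive might look like the following via boto3 and Athena; the database, table, bucket, and region names are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical external table over the nightly Parquet exports in S3.
QUERY = """
SELECT campaign_id,
       SUM(spend_cents) / 100.0 AS total_spend_usd
FROM financial_archive.billing_ledger
WHERE event_date BETWEEN DATE '2024-10-01' AND DATE '2024-12-31'
GROUP BY campaign_id
ORDER BY total_spend_usd DESC
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "financial_archive"},
    ResultConfiguration={"OutputLocation": "s3://audit-query-results/q4-2024/"},
)
print("Athena query started:", response["QueryExecutionId"])
```

Seconds-level latency is fine here: these queries run a handful of times per quarter, never on the serving path.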
Build vs Buy: Custom PostgreSQL + HLC implementation costs 1-1.5 engineer-years plus ongoing maintenance. CockroachDB’s premium (20-30% of financial storage baseline) eliminates upfront engineering cost and operational burden. For cold archive, S3 + Athena is the clear choice - no operational burden and 50-100× cheaper than running a database.
Financial Audit Log Reconciliation
Purpose: Verify operational ledger (CockroachDB) matches immutable audit log (ClickHouse) to detect data inconsistencies, event emission bugs, or system integrity issues before they compound into billing disputes.
Dual-Ledger Architecture:
graph TB
ADV[Budget Service
Ad Server]
ADV -->|"1 - Direct write
Transactional"| CRDB[("CockroachDB
Operational Ledger
90-day hot")]
ADV -->|"2 - Publish event
Async"| KAFKA[("Kafka
Financial Events")]
KAFKA -->|"Stream"| CH[("ClickHouse
Immutable Audit Log
7-year retention")]
RECON[Reconciliation Job
Daily 2:00 AM UTC]
CRDB -.->|"Aggregate yesterday"| RECON
CH -.->|"Aggregate yesterday"| RECON
RECON -->|"99.999% match"| OK[No action]
RECON -->|"Discrepancy detected"| ALERT[Alert Finance Team
P1 Page]
ALERT --> INVESTIGATE[Investigation:
- Kafka lag 85%
- Schema mismatch 10%
- Event bug 5%]
classDef operational fill:#fff3e0,stroke:#f57c00
classDef audit fill:#e8f5e9,stroke:#388e3c
classDef stream fill:#e3f2fd,stroke:#1976d2
classDef check fill:#f3e5f5,stroke:#7b1fa2
class CRDB operational
class CH audit
class KAFKA stream
class RECON,ALERT,INVESTIGATE check
Daily Reconciliation Job (automated, runs 2:00 AM UTC):
Step 1: Query Both Systems
Extract previous 24 hours of financial data from both ledgers:
- CockroachDB (Operational): Aggregate campaign charges by summing amounts from billing ledger for previous day, grouped by campaign
- ClickHouse (Audit): Aggregate financial events (budget deductions, impression charges) from audit trail for previous day, grouped by campaign
Step 2: Compare Aggregates
Per-campaign validation with acceptable tolerance:
- Match criteria: Absolute difference between operational and audit totals must be less than the greater of (1 cent or 0.001% of operational total)
- Rationale: Allows rounding differences and sub-millisecond timing variations between systems
- Expected result: 99.999%+ campaigns match (0-3 discrepancies out of 10,000+ active campaigns, production measurements)
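A sketch of that comparison logic; the input dicts map campaign ID to daily totals in cents, and the tolerance mirrors the criteria above.

```python
def amounts_match(operational_cents: int, audit_cents: int) -> bool:
    """Tolerance: the greater of 1 cent or 0.001% of the operational total."""
    tolerance = max(1.0, abs(operational_cents) * 0.00001)  # 0.001% = 1e-5
    return abs(operational_cents - audit_cents) < tolerance

def reconcile(operational: dict[str, int], audit: dict[str, int]) -> list[str]:
    """Return campaign IDs whose CockroachDB and ClickHouse daily totals disagree."""
    discrepancies = []
    for campaign_id in operational.keys() | audit.keys():
        if not amounts_match(operational.get(campaign_id, 0), audit.get(campaign_id, 0)):
            discrepancies.append(campaign_id)
    return discrepancies
```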
Step 3: Alert on Discrepancies
Automated notification when thresholds exceeded:
- P1 page to finance team: Campaign IDs with mismatches, delta amounts, percentage variance
- Dashboard visualization: Total campaigns affected, aggregate delta, trend analysis (increasing discrepancies indicate systemic issue)
- Automated ticket creation: Jira issue with forensic query suggestions pre-populated
Step 4: Investigation Workflow
Forensic analysis to identify root cause:
- Drill-down query: Retrieve all transactions for the affected `campaignId` from both systems, ordered by timestamp
- Event correlation: Match `requestId` between operational logs and audit trail to identify missing/duplicate events
- Common causes identified:
  - Kafka lag (85% of discrepancies): Event delayed >24 hours due to consumer backlog → resolves automatically when ClickHouse catches up
  - Schema mismatch (10%): Field name change in event emission without updating ClickHouse parser → fix parser, backfill missing events
  - Event emission bug (5%): Edge case where Budget Service fails to emit event → fix bug, manual INSERT into ClickHouse with audit trail explanation
Step 5: Resolution
Manual intervention when automated reconciliation fails:
- If CockroachDB correct: Backfill missing event to ClickHouse with audit metadata (source, reason, approver identity, ticket reference)
- If ClickHouse correct: Investigate CockroachDB data corruption (extremely rare), restore from backup if needed, update operational ledger with correction entry
Compliance Verification
Quarterly Audit Preparation:
External auditor access workflow:
- Export ClickHouse data: Generate Parquet files for audit period (e.g., Q4 2024: Oct 1 - Dec 31)
- Cryptographic verification: Run hash chain validation across the exported dataset, produce a Merkle tree root hash as a tamper-evident seal (a minimal sketch follows this list)
- Auditor query interface: Provide read-only Metabase dashboard with pre-built queries (campaign spend totals, refund analysis, dispute history)
- Documentation bundle: Reconciliation job logs, discrepancy resolution tickets, system architecture diagrams
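A minimal sketch of that tamper-evident seal: hash each exported record, then fold the digests pairwise into a single Merkle root published with the export (illustrative, not the exact production scheme).

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_digests: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise to a single root; altering any record changes the root."""
    if not leaf_digests:
        return sha256(b"")
    level = list(leaf_digests)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Usage: digest each exported row (or Parquet file), publish the root with the audit bundle.
rows = [b"campaign=123|spend_cents=105000|ts=2024-10-01T00:00:01Z",
        b"campaign=456|spend_cents=98000|ts=2024-10-01T00:00:02Z"]
print(merkle_root([sha256(r) for r in rows]).hex())
```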
SOX Control Documentation:
Segregation of Duties:
- DBAs: Cannot modify ClickHouse audit log (read-only access enforced via IAM roles)
- Finance team: Query-only access to both systems, no INSERT/UPDATE/DELETE privileges
- Engineering team: Can deploy code changes but cannot directly modify financial data
- Audit trail: All ClickHouse schema changes logged in separate audit table with approver identity and business justification
Change Audit:
Administrative operations on financial data systems logged separately:
- CockroachDB schema changes: Table alterations logged with timestamp, user, justification, approval ticket
- ClickHouse partition operations: Partition drops (only operation allowing data removal) require two-person approval and logged with business justification
- Access control changes: IAM role modifications logged and reviewed quarterly
Access Control Matrix:
| Role | CockroachDB | ClickHouse | Kafka | Permitted Operations |
|---|---|---|---|---|
| Budget Service | Write-only | No access | Publish events | INSERT billing records |
| Finance Team | Read-only | Read-only | No access | Query, export, reporting |
| DBA Team | Admin | Read-only | Admin | Schema changes, performance tuning |
| Audit Team | Read-only | Read-only | Read-only | Compliance verification |
| Engineering | Read-only (production) | Read-only | Read-only | Debugging, monitoring |
Retention Policy Enforcement:
Automated Archival (runs monthly):
Data lifecycle management ensuring compliance while optimizing costs:
- Age detection: Identify partitions older than 7 years based on timestamp conversion to year-month format
- Export to cold storage: Write partition data to S3 Glacier in Parquet format with WORM (Write-Once-Read-Many) configuration
- External table creation: Create ClickHouse external table pointing to S3 location (data remains queryable via standard SQL but stored at 1/50th cost)
- Partition drop: Remove from ClickHouse hot storage after S3 export verified (logged as administrative action)
- Verification: Monthly job validates S3 object count matches dropped partitions, alerts if mismatch detected
Cost Impact:
Retention policy reduces storage costs while maintaining compliance accessibility:
- Active ClickHouse storage (0-7 years): 180TB at standard ClickHouse rates
- Cold storage (>7 years): S3 Glacier at ~2% of active storage cost
- Query capability: Athena or ClickHouse external tables provide SQL interface to cold data (seconds latency acceptable for historical compliance queries)
Budget Reconciliation & Advertiser Compensation Workflow
Architectural Driver: Financial Accuracy - Automated discrepancy detection, retroactive correction, and advertiser compensation workflows ensure billing accuracy ≤1% while maintaining trust. Manual intervention only for exceptions >2% of budget.
The Problem: Budget Overspend & Underspend Edge Cases
Despite bounded micro-ledger (BML) architecture with 0.5-1% inaccuracy bounds, edge cases cause billing discrepancies:
Root causes:
- Redis failover: Regional failure with unsynced counter state (15% under-delivery risk per Part 4 multi-region section)
- Network partitions: Split-brain scenario where regions can’t sync budget state (bounded by allocation window: max 5min × allocation rate)
- Clock skew beyond bounds: HLC uncertainty >2ms in legacy deployments (rare with AWS Time Sync, but possible during NTP failures)
- Race conditions at day boundary: Multiple regions allocating final budget chunks simultaneously (mitigated by 200ms dead zone, but not eliminated)
- Software bugs: Event emission failures, counter drift, schema evolution issues
Financial trust requirement: This platform targets ≤1% budget variance (stricter than industry-wide ad discrepancy standards of 1-10%, per IAB guidelines). Enterprise advertisers set hard daily budgets and expect strict enforcement. Advertisers tolerate 1-2% variance without complaint, escalate at 2-5%, and demand refunds/credits >5%. Automation required to handle 95%+ of cases without manual intervention.
Preventive Measures During Edge Cases
Real-time throttling bounds overspend during active failures (reconciliation handles post-hoc correction):
Network partition throttling: Detect sync failure (CockroachDB heartbeat >120s, Redis lag >5s) → reduce allocation to 50% rate per region. With throttling: 3 regions at 50% = 0.175% overspend ($17.50 on $10K daily budget for 5-min window). Without throttling: 0.7% overspend ($70). Throttling reduces exposure by 75%.
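One way to derive these figures, under the stated assumptions ($N$ regions each allocating at a fraction $r$ of the normal rate for a window $W$ against daily budget $B$):

$$\text{Overspend}_{\max} \le (N \cdot r - 1) \times \frac{W}{24\,\text{h}} \times B$$

Without throttling ($r = 1$): $(3 - 1) \times \frac{5}{1440} \times \$10{,}000 \approx \$69$ (~0.7%). With 50% throttling ($r = 0.5$): $(1.5 - 1) \times \frac{5}{1440} \times \$10{,}000 \approx \$17$ (~0.175%).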
Clock skew protection: 200ms dead zone at day boundaries (23:59:59.900 to 00:00:00.100) prevents double-allocation when region clocks differ by ±200ms.
Race condition mitigation (low budget <5%): Pessimistic locking (CockroachDB SELECT FOR UPDATE) serializes allocation requests. Failed regions retry with 50% allocation size, accepting uneven distribution over overspend.
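A sketch of that serialized allocation path using SELECT FOR UPDATE through a standard driver; table and column names are assumptions.

```python
def allocate_final_chunk(conn, campaign_id: str, requested_cents: int) -> int:
    """Low-budget (<5%) path: lock the campaign row so concurrent regions allocate serially.

    Callers that lose the race retry with 50% of the requested chunk, accepting uneven
    regional distribution over overspend.
    """
    with conn.cursor() as cur:
        cur.execute(
            "SELECT remaining_cents FROM campaign_budgets "
            "WHERE campaign_id = %s FOR UPDATE",
            (campaign_id,),
        )
        row = cur.fetchone()
        remaining = row[0] if row else 0
        granted = min(requested_cents, remaining)
        if granted > 0:
            cur.execute(
                "UPDATE campaign_budgets SET remaining_cents = remaining_cents - %s "
                "WHERE campaign_id = %s",
                (granted, campaign_id),
            )
    conn.commit()
    return granted
```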
Reconciliation Architecture
The system uses a four-stage pipeline to detect, classify, correct, and compensate for budget discrepancies:
graph TB
subgraph "Stage 1: Detection (Every 5 min)"
REDIS[("Redis Counters
Live Spend")]
CRDB[("CockroachDB
Billing Ledger")]
RECON_LIVE[Live Reconciliation Job
Compare Redis vs CockroachDB]
REDIS --> RECON_LIVE
CRDB --> RECON_LIVE
RECON_LIVE -->|Δ ≤ 1%| OK1[No action]
RECON_LIVE -->|1% < Δ ≤ 2%| WARN[Log warning]
RECON_LIVE -->|Δ > 2%| ALERT[P2 Alert]
end
subgraph "Stage 2: Classification (Daily 2 AM UTC)"
DAILY[Daily Reconciliation
Final spend vs budget]
CRDB --> DAILY
DAILY -->|Exact match| OK2[No action]
DAILY -->|Underspend| UNDER{Amount?}
DAILY -->|Overspend| OVER{Amount?}
UNDER -->|< 1%| ACCEPT_U[Accept
Log only]
UNDER -->|≥ 1%| CREDIT[Issue Credit]
OVER -->|≤ 1%| ACCEPT_O[Accept
Bounded by design]
OVER -->|> 1%| REFUND[Issue Refund]
end
subgraph "Stage 3: Correction"
CREDIT --> LEDGER_ADJ[Ledger Adjustment
CockroachDB]
REFUND --> LEDGER_ADJ
LEDGER_ADJ --> AUDIT[Audit Log Entry
Immutable ClickHouse]
end
subgraph "Stage 4: Compensation"
AUDIT -->|Underspend ≥ 1%| AUTO_CREDIT[Automated Credit
Advertiser Account]
AUDIT -->|Overspend > 1%| AUTO_REFUND[Automated Refund
Payment Gateway]
AUTO_CREDIT --> NOTIFY[Email Notification
+ Dashboard Update]
AUTO_REFUND --> NOTIFY
NOTIFY -->|Any > 2%| MANUAL_REVIEW[Manual Review
Finance Team]
end
classDef detection fill:#e3f2fd,stroke:#1976d2
classDef classification fill:#fff3e0,stroke:#f57c00
classDef correction fill:#e8f5e9,stroke:#388e3c
classDef compensation fill:#f3e5f5,stroke:#7b1fa2
class RECON_LIVE,REDIS,CRDB detection
class DAILY,UNDER,OVER classification
class LEDGER_ADJ,AUDIT correction
class AUTO_CREDIT,AUTO_REFUND,NOTIFY,MANUAL_REVIEW compensation
Stage 1: Discrepancy Detection (Live + Daily)
Live (every 5 min): Compare Redis counters vs CockroachDB ledger. Thresholds: ≤1% no action, 1-2% log warning, >2% P2 alert.
Daily (2 AM UTC): End-of-day reconciliation of final spend vs budget. Classify as UNDERSPEND/OVERSPEND/EXACT. Process only variances >0.5%.
Stage 2: Classification & Decision Logic
Underspend scenarios:
| Variance | Root Cause | Action | Justification |
|---|---|---|---|
| <0.5% | Normal allocation granularity | Accept | Advertiser unlikely to notice |
| 0.5-1% | Redis sync lag, allocation rounding | Accept + log | Within industry tolerance |
| 1-5% | Regional failover (bounded loss) | Issue credit | Advertiser paid for undelivered impressions |
| >5% | Software bug or manual pause | Issue credit + investigate | Significant revenue loss to advertiser |
Overspend scenarios:
| Variance | Root Cause | Action | Justification |
|---|---|---|---|
| ≤0.5% | BML inaccuracy bound (by design) | Accept | Within contractual SLA (≤1%) |
| 0.5-1% | Day boundary race, clock skew | Accept + log | Within industry tolerance |
| 1-2% | Network partition, extended sync failure | Issue refund | Advertiser charged for unauthorized spend |
| >2% | Software bug (counter drift, event loss) | Issue refund + P1 incident | Contractual breach, potential legal risk |
Key principle: Conservative advertiser protection. Under-delivery requires credit. Over-delivery requires refund even within 1% bound.
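A compact sketch of the Stage 2 decision logic above (the sub-1% accept tiers are collapsed into one outcome for brevity):

```python
def classify_variance(budget_cents: int, actual_cents: int) -> str:
    """Map end-of-day variance to the actions in the tables above."""
    if budget_cents <= 0:
        return "EXACT_MATCH"
    variance = actual_cents - budget_cents
    pct = abs(variance) / budget_cents * 100

    if variance < 0:  # underspend: advertiser received less delivery than committed
        if pct < 1.0:
            return "ACCEPT_AND_LOG"
        return "ISSUE_CREDIT" if pct <= 5.0 else "ISSUE_CREDIT_AND_INVESTIGATE"
    if variance > 0:  # overspend: advertiser charged beyond the daily budget
        if pct <= 1.0:
            return "ACCEPT_AND_LOG"
        return "ISSUE_REFUND" if pct <= 2.0 else "ISSUE_REFUND_AND_P1_INCIDENT"
    return "EXACT_MATCH"
```

Any outcome above 2% variance additionally routes to manual finance review, as described in Stage 4.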
Stage 3: Retroactive Correction
All corrections recorded as immutable audit trail. Atomic transaction: (1) Insert adjustment entry (type, amount, reason_code, timestamps, audit_reference), (2) Update campaign summary (corrected_spend, correction_count, timestamp), (3) Emit to ClickHouse via Kafka.
ClickHouse audit trail: Permanent record with correction_id, campaign/advertiser IDs, financial data (budget, actual, variance), classification (OVERSPEND/UNDERSPEND), root_cause, compensation_status, timestamps, metadata. Partitioned by month, append-only.
Stage 4: Advertiser Compensation Automation
Credits (underspend ≥1%): Calculate → apply to account balance → record transaction → email notification → dashboard update.
Refunds (overspend >1%): Calculate → submit to payment gateway (Stripe/Braintree) → record transaction → email notification → dashboard shows 5-10 day ETA.
Manual review (>2% variance): Create Jira ticket, Slack #finance-alerts, flag account “Under Review”, hold new campaigns. Finance team verifies root cause, confirms calculation, reviews payment history, approves/rejects compensation, documents resolution.
Advertiser Dashboard Impact
Campaign detail view shows: campaign ID/name/date, budget commitment, actual delivery, variance ($ and %), status flag (Exact/Under-delivered/Over-delivered with color coding).
Correction display (≥1% variance): Compensation type/amount, timestamp (local + UTC), status (Applied/Pending/Under Review), plain-language explanation (e.g., “infrastructure maintenance” not “Redis failover”), resolution action, next steps.
Example: $10K budget, $8.5K actual → amber “Under-delivered” flag, $1.5K credit applied, message: “Delivery interrupted due to infrastructure maintenance. We’ve automatically credited $1,500 to your account balance. This credit is available immediately.”
Platform metrics displayed: “98.2% campaigns within ±1% (target 98%), 99.6% within ±2%, avg correction time 6 hours, 97% automated”.
Error Budget & Monitoring
Financial accuracy SLO: 98% campaigns ±1% (target, aligns with Fluency benchmark), 99.5% ±2% (tolerance), 99.9% ±5% (escalation), <0.1% exceed ±5% (breach).
Monitoring: Daily variance distribution (30-day window): campaign counts by variance tier (1%, 2%, 5%), avg variance, P95/P99.
Compensation metrics: Credits 2-3% daily (median $15-50), refunds 0.1-0.5%, manual review <0.05%, total cost 0.2-0.3% gross revenue.
ROI: $50K annual cost prevents $500K+ legal risk, saves $100K finance overhead, reduces 2-5% churn → 5-10× return.
Observability and Operations
Service Level Indicators and Objectives
Key SLIs:
| Service | SLI | Target | Why |
|---|---|---|---|
| Ad API | Availability | 99.9% | Revenue tied to successful serves |
| Ad API | Latency | p95 <150ms, p99 <200ms | Mobile timeouts above 200ms |
| ML | Accuracy | AUC >0.78 | Below 0.75 = 15%+ revenue drop |
| RTB | Response Rate | >80% DSPs within 100ms | <80% = remove from rotation |
| Budget | Consistency | Over-delivery <1% | >2% = refunds, >5% = lawsuits |
Error Budget Policy (99.9% = 43 min/month):
When budget exhausted:
- Freeze feature launches (critical fixes only)
- Focus on reliability work
- Mandatory root cause analysis
- Next month: 99.95% target to rebuild trust
Incident Response Dashboard
Effective incident response requires immediate access to:
SLO deviation metrics - Latency (p95, p99) and error rate vs targets to determine severity
Resource utilization - CPU/GPU/memory metrics plus active configuration (model versions, feature flags) to distinguish capacity from configuration issues
Dependency breakdown - Per-service latency (cache, database, ML, external APIs) to isolate the actual bottleneck
Historical patterns - Similar past incidents and time-series showing when degradation began
Distributed Tracing
Single user reports “ad not loading” among 1M+ req/sec:
Request ID: 7f3a8b2c… Total latency: 287ms (VIOLATED SLO)
Trace breakdown:
- API Gateway: 2ms (normal)
- User Profile: 45ms (normally 10ms - 4.5× slower)
  - Redis: 43ms (normally 5ms), including a 38ms TCP timeout
  - Cause: node failure, awaiting replica promotion
- ML Inference: 156ms (normally 40ms - 3.9× slower)
  - Batch incomplete: 8/32 requests
  - Cause: low traffic (Redis failure reduced overall QPS)
- RTB: 84ms (normally 70ms - slightly elevated)
Root cause: Redis node failure → cascading slowdown. Trace shows exactly why.
Security and Compliance
Service-to-Service Authentication: Zero Trust with mTLS
In distributed systems with 50+ microservices, network perimeters are insufficient. Solution: mutual TLS (mTLS) via Istio service mesh.
Every service receives a unique X.509 certificate (24-hour TTL) from Istio CA via SPIFFE/SPIRE. Envoy sidecar proxies automatically handle certificate rotation, mutual authentication, and traffic encryption - transparent to application code. All plaintext connections are rejected.
Authorization policies enforce least-privilege access:
- Ad Server → ML Inference: Allowed
- Ad Server → Budget Database: Blocked (must use Atomic Pacing Service)
- External DSPs → Internal Services: Blocked (terminate at gateway)
Defense in depth: Even if network segmentation fails, attackers cannot decrypt inter-service traffic, impersonate services, or call unauthorized endpoints.
PII Protection:
- Encryption at rest: KMS-encrypted CockroachDB storage
- Column-level encryption: Only ML pipeline has decrypt permission
- Data minimization: Hashed user IDs, no email/name in ad requests
- Log scrubbing:
user_id=[REDACTED]
Secrets: Vault with Dynamic Credentials
- Lease credentials auto-rotated every 24h
- Audit log: which service accessed what when
- Revoke access instantly if compromised
ML Data Poisoning Protection:
Training pipeline validates incoming events before model training:
- CTR anomaly detection: Quarantine events with >3σ CTR spikes (e.g., 2%→8%)
- IP entropy check: Flag low-diversity IP clusters (<2.0 entropy = botnet)
- Temporal patterns: Detect uniform timing intervals (human=bursty, bot=mechanical)
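A sketch of the three checks, assuming batched event dicts with `type`, `ip`, and epoch-seconds `timestamp` fields; the thresholds mirror the bullets above.

```python
import math
from collections import Counter

def ctr_spike(events: list[dict], baseline_ctr: float, baseline_std: float) -> bool:
    """Quarantine the batch if observed CTR deviates more than 3 sigma from baseline."""
    clicks = sum(1 for e in events if e.get("type") == "click")
    impressions = sum(1 for e in events if e.get("type") == "impression") or 1
    return abs(clicks / impressions - baseline_ctr) > 3 * baseline_std

def ip_entropy(events: list[dict]) -> float:
    """Shannon entropy of source IPs; values below ~2.0 bits suggest a concentrated botnet."""
    counts = Counter(e["ip"] for e in events if "ip" in e)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def uniform_timing(events: list[dict], cv_threshold: float = 0.1) -> bool:
    """Mechanical bots fire at near-constant intervals (low coefficient of variation)."""
    ts = sorted(e["timestamp"] for e in events if "timestamp" in e)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return False
    mean = sum(gaps) / len(gaps)
    std = (sum((g - mean) ** 2 for g in gaps) / len(gaps)) ** 0.5
    return mean > 0 and (std / mean) < cv_threshold
```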
Model integrity: GPG-signed models prevent loading tampered artifacts. Inference servers verify signatures before loading models, rejecting invalid signatures with immediate alerting.
Data Lifecycle and GDPR
Retention policies:
| Data | Retention | Rationale |
|---|---|---|
| Raw events | 7 days | Real-time only; archive to S3 |
| Aggregated metrics | 90 days | Dashboard queries |
| Model training data | 30 days | Older data less predictive |
| User profiles | 365 days | GDPR; inactive purged |
| Audit logs | 7 years | Legal compliance |
GDPR “Right to be Forgotten”:
Per GDPR Article 12, the platform must respond to erasure requests within one month (can be extended to three months for complex cases). Deletion is executed across 10+ systems in parallel:
- CockroachDB: Delete user profile records
- Redis/Valkey: Flush all user cache keys
- Kafka: Publish tombstone events (triggers log compaction)
- ML training: Mark user data as deleted
- S3 cold archive: Mark for deletion (note: 7-year financial audit trails may be retained per legal basis override)
- Backups: Crypto erasure (delete encryption key)
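A minimal sketch of the parallel fan-out and confirmation collection; the handler names are stubs standing in for the real CockroachDB, Redis, Kafka, ML, S3, and backup integrations.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def erase_user(user_id: str, handlers: dict) -> dict:
    """Fan out the erasure request to every system in parallel and collect confirmations.

    `handlers` maps system name -> callable(user_id) -> bool (True = confirmed deleted).
    The deletion certificate is issued only once every system reports True.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(handlers)) as pool:
        futures = {pool.submit(fn, user_id): name for name, fn in handlers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = bool(future.result())
            except Exception:
                results[name] = False  # retried by a follow-up job; certificate is blocked
    return results

# Example wiring with stub handlers.
if __name__ == "__main__":
    systems = ("cockroachdb", "redis", "kafka", "ml_training", "s3_archive", "backups")
    stubs = {name: (lambda uid: True) for name in systems}
    print(erase_user("hashed-user-123", stubs))
```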
Verification: All systems confirm deletion completion → send deletion certificate to user within one month of request (target: 48-72 hours for standard cases).
Note on financial records: GDPR allows retention of financial transaction records beyond deletion requests when required by law (SOX, MiFID). User PII (name, email, demographics) is deleted, but anonymized transaction records ($X spent on date Y) are retained in S3 cold archive for regulatory compliance.
Production Operations at Scale
Deployment Safety and Zero-Downtime Operations
The availability imperative: With 99.9% SLO providing only 43 minutes/month error budget, we cannot afford to waste any portion on planned downtime. All deployments and schema changes must be zero-downtime operations.
Progressive deployment strategy:
Rolling deployments (canary → 10% → 50% → 100%) with automated gates on error rate, latency p99, and revenue metrics. Each phase must pass health checks before proceeding. Feature flags provide blast radius control - new features start dark, gradually enabled per user cohort.
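As a sketch, the automated gate between phases might compare canary metrics against the stable baseline before widening traffic; the metric names and thresholds here are illustrative, not the actual policy.

```python
def canary_gate_passes(baseline: dict, canary: dict,
                       max_error_ratio: float = 1.5,
                       max_p99_regression_ms: float = 20.0,
                       max_revenue_drop_pct: float = 1.0) -> bool:
    """Return True if the canary cohort may be promoted to the next traffic percentage."""
    if canary["error_rate"] > baseline["error_rate"] * max_error_ratio:
        return False  # error-rate regression
    if canary["latency_p99_ms"] - baseline["latency_p99_ms"] > max_p99_regression_ms:
        return False  # tail-latency regression
    revenue_drop_pct = (baseline["revenue_per_request"] - canary["revenue_per_request"]) \
        / baseline["revenue_per_request"] * 100
    if revenue_drop_pct > max_revenue_drop_pct:
        return False  # revenue regression
    return True

# Phases: 1% -> 10% -> 50% -> 100%; any failed gate triggers automated rollback.
```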
Zero-downtime schema migrations:
Database schema changes consume zero availability budget through online DDL operations:
- Simple changes (ADD COLUMN, CREATE INDEX): CockroachDB’s online schema changes with background backfill
- Complex restructuring (partition changes): Dual-write pattern with gradual cutover (detailed in the Schema Evolution section below)
- Validation: Shadow reads verify new schema correctness before cutover
The cost trade-off is clear: zero-downtime migrations require 2-4× more engineering effort than “take the system down” approaches, but protect against wasting the precious 43-minute availability budget on planned maintenance.
Key insight: The 43 minutes/month error budget is reserved for unplanned failures (infrastructure outages, cascading failures, external dependency failures). Planned operations (deployments, migrations, configuration changes) must never consume this budget.
Error Budgets: Balancing Velocity and Reliability
Error budgets formalize the trade-off between reliability and feature velocity. For a 99.9% availability SLO, the error budget is 43.2 minutes/month of unplanned downtime.
$$\text{Error Budget} = (1 - 0.999) \times 30 \times 24 \times 60 = 43.2 \text{ minutes/month}$$
Budget allocation strategy (unplanned failures only):
| Source | Allocation | Rationale |
|---|---|---|
| Infrastructure failures | 15 min (35%) | Cloud provider incidents, hardware failures, regional outages |
| Dependency failures | 12 min (28%) | External DSP timeouts, third-party API issues |
| Code defects | 8 min (19%) | Bugs escaping progressive rollout gates |
| Unknown/buffer | 8 min (18%) | Unexpected failure modes, cascading failures |
Note: Planned deployments and schema migrations target zero downtime through progressive rollouts and online DDL operations. When deployment-related issues occur (e.g., bad code pushed past canary gates), they count against “Code defects” budget.
Burn rate alerting:
Monitor how quickly the budget is consumed. Burn rate = current error rate / target error rate. A sustained 10× burn rate exhausts the monthly budget in roughly 3 days, which triggers immediate on-call escalation.
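In formula form:

$$\text{burn rate} = \frac{\text{observed error rate}}{\text{SLO error rate}}, \qquad t_{\text{exhaust}} = \frac{30\ \text{days}}{\text{burn rate}}$$

so a sustained 10× burn rate exhausts the 43.2-minute budget in 3 days, and a 100× burn rate in about 7 hours.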
Policy-driven decision making:
Error budget remaining drives release velocity:
- >75% remaining: Ship aggressively, run experiments, test risky features
- 25-75% remaining: Normal operations, standard release cadence
- <25% remaining: Freeze non-critical releases, focus on reliability
- Exhausted: Code freeze except critical fixes, mandatory postmortems
Why 99.9% not 99.99%?
With zero-downtime deployments and migrations eliminating planned downtime, the 99.9% SLO (43 minutes/month) is entirely allocated to unplanned failures. Moving to 99.99% (4.3 minutes/month) would reduce our tolerance for unplanned failures from 43 to 4.3 minutes - a 10× tighter constraint.
Achieving that would require fully automated multi-region failover with sub-minute detection and recovery from regional outages, approximately doubling infrastructure and operational spend. The economic question: is tolerating 39 fewer minutes of unplanned failures worth that cost?
For advertising platforms with client-side retries and geographic distribution, the answer is usually no. Brief regional outages have limited revenue impact because automatic retries and traffic redistribution absorb most of the loss. Better ROI comes from reducing MTTR (faster detection and recovery) than from preventing all failures.
The tolerance for unplanned failures varies by domain - payment processing or healthcare systems require 99.99%+ because every transaction matters. Ad platforms operate at higher request volumes where statistical averaging and retries provide natural resilience.
Cost Management at Scale
Resource attribution with chargeback models (vCPU-hours, GPU-hours, storage IOPS per team). Standard optimizations: spot instances for training (70% cheaper), tiered storage, reserved capacity for baseline load. Track efficiency via vCPU-ms per request and investigate >15% month-over-month increases.
Resilience and Failure Scenarios
A robust architecture must survive catastrophic failures, security breaches, and business model pivots. This section addresses three critical scenarios:
Catastrophic Regional Failure: When an entire AWS region fails, our semi-automatic failover mechanism combines Route53 health checks (2-minute detection) with manual runbook execution to promote secondary regions. The critical challenge is budget counter consistency—asynchronous Redis replication creates potential overspend windows during failover. We mitigate this through pre-allocation patterns that limit blast radius to allocated quotas per ad server, bounded by replication lag multiplied by allocation size.
Malicious Insider Attack: Defense-in-depth through zero-trust architecture (SPIFFE/SPIRE for workload identity), mutual TLS between all services, and behavioral anomaly detection on budget operations. Critical financial operations like budget allocations require cryptographic signing with Kafka message authentication, creating an immutable audit trail. Lateral movement is constrained through Istio authorization policies enforcing least-privilege service mesh access.
Business Model Pivot to Guaranteed Inventory: Transitioning from auction-based to guaranteed delivery requires strong consistency for impression quotas. Rather than replacing our stack, we extend the existing pre-allocation pattern—CockroachDB maintains source-of-truth impression counters (leveraging the same HLC-based billing ledger) while Redis provides fast-path allocation with periodic reconciliation. This hybrid approach adds only 10-15ms to the critical path for guaranteed campaigns while preserving sub-millisecond performance for auction traffic. The 12-month evolution path reuses 80% of existing infrastructure (ML pipeline, feature store, Kafka, billing ledger) while adding campaign management and SLA tracking layers.
These scenarios validate that the architecture is not merely elegant on paper, but battle-hardened for production realities: regional disasters, adversarial threats, and fundamental business transformations.
Summary: Production Readiness Across All Dimensions
Production-grade distributed systems require more than elegant design—they demand operational rigor across eight critical dimensions. This post bridged the gap between architecture and reality by addressing how systems survive at 1M+ QPS under real-world conditions.
The eight pillars:
1. Fraud Detection - Multi-tier pattern detection (L1 Bloom filters at 0.5ms, L2 behavioral rules, L3 ML batch) catches 20-30% of bot traffic before expensive RTB calls, saving significant external DSP bandwidth costs.
2. Multi-Region Deployment - Active-active architecture across 3 AWS regions with semi-automatic failover (2min Route53 detection + manual runbook execution). Handles split-brain through pre-allocation patterns limiting overspend to <1% during replication lag windows.
3. Schema Evolution - Zero-downtime migrations using dual-write patterns and gradual cutover preserve 99.9% availability SLO. Trade 2-4× engineering effort for keeping 43min/month error budget available for unplanned failures.
4. Clock Synchronization - Hybrid Logical Clocks (HLC) in CockroachDB provide causally-consistent timestamps for financial ledgers without TrueTime hardware, ensuring regulatory compliance for audit trails.
5. Observability - SLO-based monitoring with 99.9% availability target (43min/month downtime budget). Burn rate alerting triggers paging at 10× consumption rate. Prometheus metrics, Jaeger traces (1% sampling), centralized logs.
6. Security & Compliance - Zero-trust architecture with mTLS service mesh (Istio), workload identity (SPIFFE/SPIRE), encryption at rest/transit, immutable audit logs. GDPR right-to-deletion via cascade deletes, CCPA data export on demand.
7. Production Operations - Progressive rollouts (1% → 10% → 50% → 100%) with automated gates checking error rates and latency. <5min rollback SLA from detection to restored service. Rolling updates with health checks and connection draining.
8. Resilience Validation - Tested scenarios: regional disasters (2-5min recovery with bounded overspend), malicious insiders (zero-trust prevention), business model pivots (80% infrastructure reuse for auction→guaranteed delivery transition).
Core insight: Operational excellence isn’t bolted on after launch—it must be designed into the architecture from day one. Circuit breakers, observability hooks, audit trails, multi-region replication, and progressive deployment are architectural requirements, not implementation details.
Next: Part 5 brings everything together with the complete technology stack—concrete choices, configurations, and integration patterns that transform abstract requirements into a deployable system.