System Design Masterclass (Golang)

Answer-first: Optimal system design requires continuously balancing latency, throughput, consistency, and availability — each technical decision carries trade-offs. This series delivers deep architectural analysis, rigorous trade-off evaluation, and production-grade Go implementations for engineers building high-scale distributed systems.


[!NOTE] This series is designed for Senior Backend Engineers & Architects. We skip definitions and go straight to the technical core: formal theorem proofs, production case studies, and compilable Go code patterns used at companies like Shopee, Alipay, and PayPay.


📚 Series Syllabus

Tier 1: Core Patterns & Production Readiness

Master the foundational design patterns for optimizing individual services and storage layers.

  1. System Design Thinking & Trade-offs — CAP, PACELC & Clean Architecture

    • Formal CAP theorem proof (Gilbert & Lynch), PACELC database classification matrix, composite availability math.
    • Clean Architecture with Dependency Inversion in Go: Port/Adapter pattern with interface-driven testing.
  2. Load Balancing L4/L7 & Rate Limiting — DSR, API Gateway & Token Bucket

    • L4 vs L7 routing internals, Direct Server Return with HAProxy + Linux sysctl configuration.
    • Token Bucket rate limiting middleware in Go using golang.org/x/time/rate with per-client limiters.
  3. Caching Strategies & Cache Stampede — Singleflight, XFetch & Redis LFU

    • Write-Through vs Write-Behind vs Cache-Aside trade-off matrix with latency and data-loss analysis.
    • XFetch probabilistic early expiration (math + Go implementation), singleflight deduplication, tiered cache.
  4. Database Scaling & Connection Pool Tuning — Sharding, TiDB & PostgreSQL

    • B-Tree vs LSM-Tree storage engine internals, Range/Hash/Directory sharding strategies.
    • TiDB Percolator distributed 2PC, PostgreSQL 5–10 MB/connection overhead, database/sql pool tuning.
  5. Event-Driven Architecture & Kafka — Worker Pool, Backpressure & Exactly-Once

    • Kafka zero-copy sendfile() internals, sparse index lookup mechanism, Kafka vs RabbitMQ decision matrix.
    • Bounded Worker Pool with natural backpressure via channels, partition-aware ordered processing.

Tier 2: Advanced Reliability & Distributed Systems

Solve the hard problems that emerge when operating multi-service distributed systems at scale.

  1. Distributed Locks — Redlock Math, etcd Raft & Split-Brain Prevention

    • Redlock MIN_VALIDITY formula with clock drift math, step-by-step algorithm with mermaid flowchart.
    • Redis (AP) vs etcd (CP/Raft) decision matrix, redsync and etcd lease-based Go implementations.
  2. Idempotent API Design — Idempotency Key, SetNX Middleware & Stripe Pattern

    • Full HTTP response recorder middleware, payload hash for key-reuse detection, DB fallback schema.
    • 100-goroutine concurrent test proving mutual exclusion, exponential backoff with jitter formula.
  3. Saga Pattern & Distributed Transactions — Temporal, Outbox & Debezium

    • 2PC failure modes, Saga vs 2PC comparison, Orchestration vs Choreography trade-offs.
    • Temporal Go SDK with LIFO compensating transactions, Transactional Outbox, Debezium EventRouter config.
  4. Consistent Hashing — Virtual Nodes, Load Variance & CRC32 Ring in Go

    • Why modulo hashing fails at scale, virtual node standard deviation analysis (V=1 to V=1000 table).
    • Thread-safe CRC32 hash ring with sync.RWMutex, GetN replication, Redis Cluster hash slot routing.
  5. Observability & pprof — Memory Leak Diagnosis, CPU Profiling & GODEBUG

    • Six pprof endpoint grid with overhead percentages, inuse_space vs alloc_space decision guide.
    • 5-step heap diff memory leak diagnosis, goroutine leak detection, GODEBUG=gctrace=1 parsing.
  6. Security & API Rate Limiting — Token Bucket, Leaky Bucket & Redis Lua

    • WAF vs L7 API Gateway vs Application rate limiting, preventing client IP spoofing via PROXY protocol.
    • Local rate limiter lock contention mitigations, and production-ready Redis Lua sliding window script.
  7. Communication Protocols — gRPC vs REST vs GraphQL in Go Microservices

    • Serialization benchmarks (JSON vs Protobuf), Protobuf wire format encoding, and HTTP/3 QUIC stream transport.
    • GraphQL gateway complexity control formulas, ConnectRPC cleartext integration, and in-memory bufconn testing.

🏛️ Tier 3: Real-World Case Studies

Learn from the world’s most demanding distributed systems to understand how theory applies at extreme scale.



gRPC vs REST vs GraphQL: Communication Protocols in Go

Prerequisite: This is Part 12 of the System Design Masterclass. Previous parts built the reliability patterns; this part compares communication protocols and data formats for microservice communication.

Answer-first: gRPC is optimized for internal microservices, using binary Protobuf serialization over multiplexed HTTP/2 or HTTP/3 streams. REST uses standard JSON over HTTP/1.1 or HTTP/2 and serves as the default for public APIs. GraphQL operates as an aggregator at the API gateway or Backend-for-Frontend (BFF) layer, letting clients request only the fields they need, but it requires complexity limits and DataLoader batching to prevent server degradation. ...

June 18, 2026 · 10 min · Tanh
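To make the "binary Protobuf serialization" point in the excerpt above concrete, the sketch below hand-encodes a single Protobuf varint field and compares its size with the JSON form. The message shape (one integer field, number 1, named `a`) and the helper names are assumptions for illustration; the code uses only the standard library, not generated Protobuf types, so it illustrates the wire format rather than a real gRPC code path.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// wireVarint is the Protobuf wire type for varint-encoded integers.
const wireVarint = 0

// encodeVarintField hand-encodes one Protobuf varint field: a tag byte
// (fieldNumber<<3 | wireType) followed by the base-128 varint value.
// This sketch only supports field numbers 1..15, which fit in one tag byte.
func encodeVarintField(fieldNumber int, value uint64) []byte {
	buf := make([]byte, 1+binary.MaxVarintLen64)
	buf[0] = byte(fieldNumber<<3 | wireVarint)
	n := binary.PutUvarint(buf[1:], value)
	return buf[:1+n]
}

// encodeJSON marshals the same logical message as JSON.
func encodeJSON(value uint64) []byte {
	out, err := json.Marshal(struct {
		A uint64 `json:"a"`
	}{A: value})
	if err != nil {
		panic(err) // cannot happen for this fixed struct
	}
	return out
}

func main() {
	pb := encodeVarintField(1, 150)
	js := encodeJSON(150)
	fmt.Printf("protobuf: % x (%d bytes)\n", pb, len(pb))
	fmt.Printf("json:     %s (%d bytes)\n", js, len(js))
}
```

For the value 150 this produces the three bytes `08 96 01`, against nine bytes for `{"a":150}`. The gap on real payloads depends heavily on field types and string content, so treat this as an illustration, not a benchmark.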

Go API Rate Limiting: Token Bucket & Redis Lua

Prerequisite: This is Part 11 of the System Design Masterclass. Previous parts built the core components; this part covers securing APIs and managing client traffic spikes at scale.

Answer-first: API rate limiting defends backend services by restricting request volume. Security requires a layered defense: Web Application Firewalls (WAF) block edge-level volumetric spikes, API Gateways manage L7 credentials and quotas, and application middleware enforces fine-grained business limits. Client identification must rely on validated, secure IP parsing (using the PROXY protocol or rightmost X-Forwarded-For checks). ...

June 18, 2026 · 9 min · Tanh

Go Observability & pprof — Memory Leaks, CPU Profiling & GODEBUG

Prerequisite: This is Part 10 of the System Design Masterclass. Previous parts built the architecture; this part teaches you how to see inside a running system and diagnose production performance issues.

Answer-first: Go's built-in pprof profiler provides CPU sampling, heap allocation analysis, goroutine stack inspection, and blocking profiles, all available as HTTP endpoints in running production services with minimal overhead. Diffing two heap snapshots is usually the fastest way to identify memory leaks. ...

June 18, 2026 · 9 min · Tanh
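As a minimal sketch of the "HTTP endpoints" mentioned in the pprof excerpt above: importing `net/http/pprof` for its side effect registers the `/debug/pprof/` handlers on `http.DefaultServeMux`. To keep the example self-terminating, it serves the default mux from a throwaway `httptest` server; the `fetchProfile` helper is an assumption for illustration, and a real service would more likely expose the mux on a private admin port.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// fetchProfile serves the default mux from a temporary test server and fetches
// one pprof endpoint, returning the HTTP status code and the body size.
func fetchProfile(path string) (status int, size int, err error) {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + path)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, len(body), err
}

func main() {
	for _, path := range []string{"/debug/pprof/heap", "/debug/pprof/goroutine?debug=1"} {
		status, size, err := fetchProfile(path)
		fmt.Printf("%s -> status=%d bytes=%d err=%v\n", path, status, size, err)
	}
}
```

Against a long-running service, the same heap endpoint can be fetched twice and the two snapshots compared with `go tool pprof -base`.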

Consistent Hashing in Go — Virtual Nodes & CRC32 Ring

Prerequisite: Part 9 of the System Design Masterclass. Read Part 4: Database Scaling for context on horizontal partitioning strategies.

Answer-first: Consistent Hashing minimizes key remapping when cluster membership changes. Adding or removing one node from a modulo-hash cluster remaps nearly all keys, which triggers a catastrophic cache miss storm. Consistent Hashing remaps only about $K/N$ keys on average (for $K$ keys and $N$ nodes), the theoretical minimum necessary.

Why Modulo Hashing Fails When Scaling

Answer-first: hash(key) % N changes to hash(key) % (N+1) when a node is added, causing nearly all key-to-node mappings to change. This creates a massive cache miss storm as nearly the entire working set must be reloaded from the database simultaneously. ...

June 18, 2026 · 8 min · Tanh
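The claim in the consistent hashing excerpt above, that modulo hashing remaps nearly all keys when a node is added, can be checked numerically. The sketch below is a rough measurement under assumed inputs: the `user:%d` key format, the key count, and the `remapFraction` helper are hypothetical, with CRC32 chosen because the series uses it for its ring.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// remapFraction returns the share of keys whose node assignment changes when
// a modulo-hashed cluster is resized from oldN to newN nodes.
func remapFraction(oldN, newN uint32, keys int) float64 {
	moved := 0
	for i := 0; i < keys; i++ {
		h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("user:%d", i)))
		if h%oldN != h%newN {
			moved++
		}
	}
	return float64(moved) / float64(keys)
}

func main() {
	fmt.Printf("modulo hashing, 10 -> 11 nodes: %.1f%% of keys move\n",
		100*remapFraction(10, 11, 100000))
	// With an ideal consistent hash, only the keys taken over by the new node move.
	fmt.Printf("ideal share for 11 nodes: %.1f%%\n", 100.0/11)
}
```

For a uniform hash, a key keeps its node under both `% N` and `% (N+1)` with probability about $1/(N+1)$, so roughly 10 in 11 keys move when growing from 10 to 11 nodes.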

Saga Pattern in Go — Temporal, Outbox Pattern & Debezium

Prerequisite: Part 8 of the System Design Masterclass. Read Part 7: Idempotent API Design first, because compensating transactions in a Saga must be idempotent.

Answer-first: The Saga Pattern coordinates distributed transactions across microservices by decomposing a large transaction into a sequence of local transactions. If any step fails, the system automatically executes compensating transactions in reverse order to undo completed steps. Each local transaction must be idempotent.

What Are the Problems with 2PC in Microservices?

Answer-first: Two-Phase Commit (2PC) is a blocking protocol with a coordinator single point of failure. If the coordinator crashes between the Prepare and Commit phases, participants that have already voted to commit stay blocked with locks held until the coordinator recovers, a catastrophic failure mode in microservices. These are the same distributed transaction challenges seen in legacy core banking systems. ...

June 18, 2026 · 8 min · Tanh

Idempotent API Design in Go — Idempotency Key & Redis SetNX

Prerequisite: Part 7 of the System Design Masterclass. Read Part 6: Distributed Locks first, because blocking concurrent duplicate requests relies on the same mutual exclusion primitives.

Answer-first: API idempotency ensures that retrying an identical request (same Idempotency-Key) never produces additional side effects beyond the first execution. This is foundational for payment APIs, where network timeouts force client retries and a duplicate execution would mean a double charge.

What Is an Idempotency Key?

Answer-first: An Idempotency Key is a unique token, typically a UUID v4, generated by the client and attached as an Idempotency-Key HTTP header. The server uses this key to detect duplicate requests: if the key has been seen before, the server returns the cached response from the first execution without re-executing the handler. ...

June 18, 2026 · 8 min · Tanh
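The replay behaviour described in the idempotency excerpt above can be sketched with an in-memory store. This stands in for the Redis SetNX approach the article names and is not that implementation: `Store`, `Do`, and the key `key-123` are hypothetical, and the sketch omits TTLs, payload hashing, and persistence.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// entry holds the recorded response for one idempotency key.
type entry struct {
	once sync.Once
	resp string
}

// Store is an in-memory stand-in for a shared idempotency key store.
type Store struct {
	mu      sync.Mutex
	entries map[string]*entry
}

// NewStore returns an empty store.
func NewStore() *Store { return &Store{entries: make(map[string]*entry)} }

// Do runs handler at most once per key. Concurrent and later callers with the
// same key wait for, then receive, the response recorded by the first execution.
func (s *Store) Do(key string, handler func() string) string {
	s.mu.Lock()
	e, ok := s.entries[key]
	if !ok {
		e = &entry{}
		s.entries[key] = e
	}
	s.mu.Unlock()

	e.once.Do(func() { e.resp = handler() })
	return e.resp
}

// chargeConcurrently fires n concurrent retries carrying the same key and
// returns how many times the handler actually executed.
func chargeConcurrently(n int) int64 {
	store := NewStore()
	var charges int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			store.Do("key-123", func() string {
				atomic.AddInt64(&charges, 1) // the side effect that must not repeat
				return "201 Created"
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&charges)
}

func main() {
	fmt.Println("handler executions for 100 retries:", chargeConcurrently(100))
}
```

`sync.Once` provides the mutual exclusion within one process; across multiple servers that role falls to a shared store, which is where an atomic set-if-absent operation comes in.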

Distributed Locks in Go — Redlock Math, etcd & Split-Brain

Prerequisite: Part 6 of the System Design Masterclass. Read Part 5: Kafka & Event-Driven to understand event sourcing patterns before tackling lock coordination.

Answer-first: Distributed locks solve the mutual exclusion problem across independent servers, ensuring only one server can modify a shared resource at a time. Redis Redlock provides high-performance locking using majority quorum across multiple master nodes; etcd provides stronger guarantees via Raft consensus at the cost of higher latency. ...

June 18, 2026 · 8 min · Tanh

Kafka Worker Pool in Go — Backpressure & Exactly-Once

Prerequisite: Part 5 of the System Design Masterclass. Read Part 4: Database Scaling to understand the storage tier that persisted events are written to.

Answer-first: Event-Driven Architecture decouples services through asynchronous communication via a durable message log. In Go, goroutines and buffered channels implement natural backpressure: when consumers fall behind producers, the channel fills up and blocks the producer, throttling the ingest rate automatically.

Kafka vs RabbitMQ: When to Use Each?

Answer-first: Kafka is a distributed commit log: messages are retained for a configurable retention period (indefinitely, if so configured) regardless of whether they have been consumed, consumers manage their own offsets, and replay is possible. RabbitMQ is a message broker: messages are deleted after acknowledgment, the broker handles routing complexity, and delivery is push-based. They solve different problems. ...

June 18, 2026 · 8 min · Tanh
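The channel-based backpressure described in the Kafka excerpt above can be sketched without a Kafka client. In this minimal example a slice of integers stands in for consumed messages; `processAll` and its parameters are hypothetical names, and summing the items stands in for real message handling.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll pushes items through a bounded channel to a fixed pool of workers
// and returns the sum of all handled items. When the buffer is full, the send
// blocks, which slows the producer to the pace of the workers.
func processAll(items []int, workers, buffer int) int64 {
	jobs := make(chan int, buffer)
	var sum int64
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range jobs {
				atomic.AddInt64(&sum, int64(item)) // stand-in for real message handling
			}
		}()
	}

	for _, item := range items {
		jobs <- item // blocks while the buffer is full: this is the backpressure
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return atomic.LoadInt64(&sum)
}

func main() {
	items := make([]int, 1000)
	for i := range items {
		items[i] = i + 1
	}
	fmt.Println("sum of processed items:", processAll(items, 4, 8))
}
```

The buffer size bounds how far the producer can run ahead of the workers, so memory use stays fixed no matter how large the backlog grows upstream.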

Database Sharding in Go — TiDB, PostgreSQL & Connection Pools

Prerequisite: Part 4 of the System Design Masterclass. Read Part 3: Caching Strategies to understand the cache layer before examining storage.

Answer-first: Database sharding distributes data horizontally across independent partitions (shards) based on a shard key, reducing write contention and enabling linear storage growth. Choosing the wrong shard key leads to hot spots that can be worse than no sharding at all.

Vertical vs Horizontal Scaling: When to Switch?

Answer-first: Vertical scaling (scale-up) increases resources on a single server; it is simple but has a hard physical ceiling and non-linear cost growth. Horizontal scaling (scale-out) adds more servers; it has no theoretical ceiling and linear cost, but significantly higher operational complexity. ...

June 18, 2026 · 8 min · Tanh

Caching Strategies in Go — Cache Stampede, XFetch & Redis LFU

Prerequisite: Part 3 of the System Design Masterclass. Read Part 2: Load Balancing L4/L7 to understand the traffic layer before diving into the caching tier.

Answer-first: Effective caching strategy selection hinges on the acceptable consistency window and the read/write access pattern of the workload. Write-Through suits financial records; Write-Behind suits analytics and event counters; Cache-Aside is the default for read-heavy API responses.

How Does Cache Stampede Happen?

Answer-first: Cache Stampede (thundering herd) occurs when a popular cached key expires and many concurrent goroutines detect the cache miss at the same moment, then all query the database at once. The burst of duplicate DB queries can exceed connection pool capacity and cause cascading failure. ...

June 18, 2026 · 9 min · Tanh
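The stampede described in the caching excerpt above is commonly tamed by collapsing concurrent misses for the same key into a single load. The sketch below is a simplified, standard-library imitation of that idea, not the golang.org/x/sync/singleflight API: `Group`, `Do`, `simulateMiss`, the key, and the simulated query delay are all assumptions for illustration, and the sketch does not handle a panicking loader.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight load and its result.
type call struct {
	done chan struct{}
	val  string
}

// Group collapses concurrent loads of the same key into one execution.
type Group struct {
	mu       sync.Mutex
	inflight map[string]*call
}

// Do runs load for key unless a load for that key is already in flight,
// in which case it waits for that load and returns its result.
func (g *Group) Do(key string, load func() string) string {
	g.mu.Lock()
	if g.inflight == nil {
		g.inflight = make(map[string]*call)
	}
	if c, ok := g.inflight[key]; ok {
		g.mu.Unlock()
		<-c.done // another goroutine is already loading this key
		return c.val
	}
	c := &call{done: make(chan struct{})}
	g.inflight[key] = c
	g.mu.Unlock()

	c.val = load()
	g.mu.Lock()
	delete(g.inflight, key)
	g.mu.Unlock()
	close(c.done) // publish the result to all waiters
	return c.val
}

// simulateMiss has n goroutines miss the same key at once and returns how many
// database loads actually ran, plus the value each goroutine received.
func simulateMiss(n int) (int64, []string) {
	var g Group
	var loads int64
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = g.Do("user:42", func() string {
				atomic.AddInt64(&loads, 1)
				time.Sleep(50 * time.Millisecond) // simulated slow database query
				return "profile-42"
			})
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&loads), results
}

func main() {
	loads, results := simulateMiss(50)
	fmt.Printf("%d callers, %d database loads\n", len(results), loads)
}
```

The number of loads depends on goroutine scheduling, but callers that arrive while a load is in flight share its result instead of issuing their own query.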

Load Balancing L4/L7 in Go — DSR, Rate Limiting & API Gateway

Prerequisite: Part 2 of the System Design Masterclass. Read Part 1: System Design Thinking first to understand foundational trade-off frameworks.

Answer-first: L4 load balancing routes traffic by transport-layer (IP/TCP/UDP) metadata, with minimal CPU overhead but limited intelligence. L7 load balancing inspects HTTP headers, paths, and cookies, enabling content-based routing and advanced health checks at the cost of higher processing overhead per request.

L4 vs L7 Load Balancing: The Definitive Comparison

Answer-first: The fundamental difference is where in the network stack the routing decision is made. L4 (Transport Layer) routes at TCP/UDP level using IP+port tuples. L7 (Application Layer) routes at HTTP level using headers, URLs, and payloads. ...

June 18, 2026 · 9 min · Tanh

Go System Design: CAP, PACELC & Clean Architecture Primer

Prerequisite: This is Part 1 of the System Design Masterclass series. Familiarity with basic distributed systems concepts and Go syntax is assumed.

Answer-first: Sound system design thinking is fundamentally about evaluating and selecting trade-offs across performance, reliability, and cost. No system is perfect; architects optimize for the constraints imposed by real business requirements and technical realities.

How Do You Build System Design Thinking?

Answer-first: System design mastery is built on three pillars: mastering foundational theorems (CAP, PACELC), practicing trade-off analysis on real-world case studies, and repeatedly decomposing large problems into measurable, independently scalable components. ...

June 18, 2026 · 9 min · Tanh