Free lesson · 1 of 98 in the full path

Load Balancing 101

25 min read

The Sunday Night Zomato Incident

8:47 PM, Sunday. Half of India is ordering dinner. Zomato's traffic dashboard shows the usual Sunday surge.

Then an alert fires.

Server 1: CPU 97%, returning HTTP 503. Queue depth: 2,800 requests. Server 2: CPU 4%. Queue depth: 0 requests. Server 3: CPU 5%. Queue depth: 0 requests. Server 4: CPU 3%. Queue depth: 0 requests.

One server dying under load while three identical servers sit completely idle. The load balancer was misconfigured; it was sending every request to Server 1 because the health check weighting hadn't been updated after the last deployment.

This is the load balancing problem in its rawest form: it's not enough to have multiple servers. You need intelligence routing traffic between them.

Why This Matters

Every horizontally scaled application needs a load balancer. It's the mandatory infrastructure component the moment you run more than one server instance.

In system design interviews, load balancing questions surface constantly: "How does your API tier handle 10x traffic?" "What happens when one instance crashes?" "How do you do a zero-downtime deployment?" All of these answers go through load balancing.

🟢 The Simple Version (Start Here)

The Connaught Place Traffic Cop

Picture a traffic intersection in Connaught Place, Delhi during peak hours. Six lanes converge. Without a traffic cop, every driver pushes for the same gap, causing gridlock within minutes.

The traffic cop doesn't create new roads. They don't speed up cars. They simply observe which lanes are moving and direct cars accordingly. "Left lane is clear, go there. Centre lane is backed up, wait."

A load balancer is exactly this. Incoming requests are the cars. Server instances are the lanes. The load balancer observes server health and distributes requests so no one server gets overwhelmed while others sit idle.

The load balancer doesn't process requests itself; it just routes them. This means it must be extremely fast and highly available. If the load balancer goes down, nothing works. Managed load balancers (AWS ALB, GCP Load Balancer) solve the availability problem by running redundantly across multiple availability zones.

[Diagram: Clients (millions of requests/sec) → Load Balancer (health checks + routing) → Server 1 (● healthy, 32% CPU), Server 2 (● healthy, 28% CPU), Server 3 (● healthy, 35% CPU), Server 4 (✗ failed, removed) → Database (shared state)]

A load balancer sits between clients and servers. It performs health checks: Server 4 failed and is removed from rotation. Traffic only routes to healthy servers (1, 2, 3). Servers are stateless; shared state lives in the database.

The Four Core Algorithms

How does a load balancer decide which server gets the next request? There are four main algorithms, each suited to different scenarios:

Round Robin: requests rotate evenly. Req 1 → Server A, Req 2 → Server B, Req 3 → Server C, Req 4 → Server A.
✓ Simple, equal hardware
✗ Ignores server load

Weighted Round Robin: requests are split by server capacity. Server A (w=5) → 50%, Server B (w=3) → 30%, Server C (w=2) → 20%.
✓ Heterogeneous servers
✗ Static, manual weights

Least Connections: route to the server with the fewest active connections. Server A: 120 conns, Server B: 45 conns, Server C: 80 conns → routes to Server B.
✓ Variable request times
✗ Slight overhead

IP Hash: the same IP always maps to the same server (deterministic routing). hash(1.2.3.4) → Srv A, hash(5.6.7.8) → Srv B, 1.2.3.4 again → Srv A.
✓ Session stickiness
✗ Uneven if few IPs

The four standard load balancing algorithms. Round Robin suits homogeneous servers with uniform requests. Least Connections is best for variable-duration requests. IP Hash provides client affinity without session storage.
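
As a rough illustration of the four selection rules, here is a minimal Python sketch. The server names and numbers mirror the examples above; the function names are hypothetical, and the weighted version is the naive form that simply repeats each server by its weight (real balancers usually interleave the picks more smoothly).

```python
import hashlib
from itertools import cycle

servers = ["A", "B", "C"]

# Round robin: rotate through the servers in a fixed order.
round_robin = cycle(servers)

# Weighted round robin (naive form): repeat each server `weight` times per cycle.
weights = {"A": 5, "B": 3, "C": 2}
weighted = cycle([name for name, w in weights.items() for _ in range(w)])

def least_connections(active):
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)

def ip_hash(client_ip, pool):
    """Map a client IP to a server using a stable hash (same IP, same server)."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]

rr_picks = [next(round_robin) for _ in range(4)]            # A, B, C, A
wrr_picks = [next(weighted) for _ in range(10)]             # 5x A, 3x B, 2x C
lc_pick = least_connections({"A": 120, "B": 45, "C": 80})   # B
```

Note that `ip_hash` uses `hashlib` rather than Python's built-in `hash()`, which is randomized per process for strings and would break the "same IP, same server" guarantee across restarts.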

🟡 Going Deeper: L4 vs L7, Health Checks, and SSL

Layer 4 vs Layer 7 Load Balancing

Load balancers operate at different layers of the network stack, which fundamentally changes what routing decisions they can make:

Dimension | L4 (Transport) | L7 (Application)
Operates on | IP + TCP/UDP | HTTP headers, URL paths, cookies
Routing decisions | IP address, port number | /api/search → Search servers, /api/feed → Feed servers
Speed | Faster (no payload inspection) | Slightly slower (must parse HTTP)
SSL termination | Usually passthrough (some, such as AWS NLB with TLS listeners, can terminate TLS) | Yes: decrypts, inspects, can re-encrypt
Sticky sessions | IP-based only | Cookie-based (more reliable)
Example | AWS NLB, HAProxy TCP mode | AWS ALB, Nginx, Cloudflare
When to use | Raw TCP, non-HTTP, max performance | REST APIs, microservices, path-based routing

For most web APIs, L7 is the right choice. Path-based routing (/payments โ†’ payment service, /orders โ†’ order service) is what makes microservices work cleanly.
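
To make path-based routing concrete, here is a small sketch of a prefix-to-pool lookup. The pool names and the `DEFAULT_POOL` fallback are assumptions for illustration; a real L7 balancer expresses the same idea as listener rules or location blocks rather than application code.

```python
# Hypothetical routing table: URL path prefix -> backend pool name.
ROUTES = [
    ("/api/search", "search-servers"),
    ("/api/feed", "feed-servers"),
    ("/payments", "payment-service"),
    ("/orders", "order-service"),
]
DEFAULT_POOL = "web-servers"  # assumed fallback for unmatched paths

def pick_pool(path):
    """Return the pool for the longest prefix that matches on a path boundary."""
    matches = [
        (prefix, pool)
        for prefix, pool in ROUTES
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        return DEFAULT_POOL
    return max(matches, key=lambda m: len(m[0]))[1]
```

Matching on a path boundary (the prefix itself, or the prefix followed by `/`) keeps a path such as `/paymentsx` from being routed to the payment service by accident.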

Health Checks: How the Load Balancer Knows a Server is Dead

Every production load balancer runs continuous health checks. AWS ALB, for example, sends an HTTP GET to a configured path (such as /health) on every registered server, by default every 30 seconds. If a server fails 2 consecutive checks (the default unhealthy threshold, also configurable), it's removed from rotation. No human intervention.
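
The "fail N consecutive checks, then remove" rule can be sketched as a small state tracker. This is an illustrative model, not any vendor's implementation; the class name is hypothetical, and the `healthy_threshold` for re-admitting a server is an assumption added here (the text above only describes removal).

```python
class HealthTracker:
    """Tracks consecutive health check results and decides who is in rotation."""

    def __init__(self, servers, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold  # assumption: checks needed to rejoin
        self.fail_streak = {s: 0 for s in servers}
        self.ok_streak = {s: 0 for s in servers}
        self.in_rotation = set(servers)

    def record(self, server, ok):
        """Record one health check result for a server."""
        if ok:
            self.fail_streak[server] = 0
            self.ok_streak[server] += 1
            if self.ok_streak[server] >= self.healthy_threshold:
                self.in_rotation.add(server)
        else:
            self.ok_streak[server] = 0
            self.fail_streak[server] += 1
            if self.fail_streak[server] >= self.unhealthy_threshold:
                self.in_rotation.discard(server)

tracker = HealthTracker(["A", "B", "C"])
tracker.record("B", ok=False)   # one failure: B stays in rotation
tracker.record("B", ok=False)   # second consecutive failure: B is removed
```

Requiring consecutive failures avoids ejecting a server over a single dropped probe, at the cost of a slower reaction to a real outage.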

A good health check endpoint returns HTTP 200 only when the server is truly ready to accept requests, meaning it's connected to the database, the cache is warm, and the application has finished initializing. Many engineers return 200 immediately on startup, before the server is actually ready. This causes load balancers to send traffic to instances that aren't ready, resulting in a wave of errors.
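
A readiness-style health handler might look like the sketch below. The function name and the shape of `checks` are assumptions; the point is only that the status is 200 when every dependency check passes and 503 otherwise, including when a check raises.

```python
def health_response(checks):
    """Build a health response.

    `checks` maps a dependency name to a zero-argument callable that returns
    True when that dependency is ready (hypothetical interface).
    Returns (http_status, per_dependency_results).
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that blows up counts as "not ready".
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

def broken_database_check():
    raise ConnectionError("pool not initialized")

ready = health_response({"database": lambda: True, "cache": lambda: True})
starting = health_response({"database": broken_database_check, "cache": lambda: True})
```

Here `ready` carries status 200 and `starting` carries 503, so a load balancer polling this endpoint would hold traffic back until the database check starts passing.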

Here is the whole story in motion: round-robin distributing traffic, Server B failing its health check, and the load balancer rerouting around the corpse without a human touching anything:

[Animation: Load Balancing: Algorithms + Health Checks. Round-robin across Server A, Server B, and Server C until Server B fails its health check (💥 DOWN); the load balancer then stops sending it traffic and rebalances to A and C. Algorithms listed: Round Robin, Least Connections, IP Hash, Weighted.]

Round robin assumes all requests are equal. They never are. One heavy report request and round robin keeps stuffing that poor server.
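
The round-robin caveat can be seen in a tiny deterministic simulation. Everything here is an illustrative assumption: three servers, one request every 100 ms, and a deliberately unlucky pattern in which every third request is a slow 2,000 ms call while the rest take 50 ms. The sketch compares the peak number of concurrent requests on any single server under round robin and under least connections.

```python
def simulate(choose, n_requests=60, n_servers=3, gap_ms=100):
    """Return the peak number of concurrent requests seen on any one server."""
    end_times = [[] for _ in range(n_servers)]  # finish times of in-flight requests
    peak = 0
    for i in range(n_requests):
        now = i * gap_ms
        duration = 2000 if i % 3 == 0 else 50  # every third request is slow
        # Drop requests that have already finished by `now`.
        end_times = [[t for t in ends if t > now] for ends in end_times]
        active = [len(ends) for ends in end_times]
        server = choose(i, active)
        end_times[server].append(now + duration)
        peak = max(peak, len(end_times[server]))
    return peak

def round_robin(i, active):
    return i % len(active)

def least_connections(i, active):
    return active.index(min(active))

peak_rr = simulate(round_robin)
peak_lc = simulate(least_connections)
```

With this pattern, round robin lands every slow request on the same server, which peaks at 7 concurrent requests, while least connections never lets any server exceed 3. Real traffic is not this regular, so treat the numbers as a demonstration of the mechanism rather than a benchmark.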

Sticky Sessions: When You Need Client Affinity

Sometimes you want the same client to always hit the same server, for example when the server holds in-memory state for a WebSocket session or a multi-step upload. This is called sticky sessions or session affinity.

AWS ALB implements this via an "AWSALB" cookie it sets on the first response. Subsequent requests from the same client include this cookie, and the load balancer routes to the same server.

The problem: sticky sessions partially defeat fault tolerance. If the server a user is stuck to goes down, that user's session is lost. The solution is to store session state externally (Redis) and use sticky sessions only when truly necessary (WebSockets, file uploads).
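
A simplified model of cookie-based stickiness, including what happens when the pinned server disappears, is sketched below. The cookie name `lb_server` and its plain-text value are hypothetical; real balancers such as ALB use their own opaque cookie format.

```python
import random

def route(cookies, healthy, rng=random):
    """Return (server, cookie_to_set) for one request.

    `cookies` is the request's cookie dict, `healthy` the set of servers
    currently in rotation. `cookie_to_set` is None when the pin is still valid.
    """
    pinned = cookies.get("lb_server")  # hypothetical cookie name
    if pinned in healthy:
        return pinned, None
    # No pin, or the pinned server is gone: choose again and re-pin.
    chosen = rng.choice(sorted(healthy))
    return chosen, {"lb_server": chosen}

still_pinned = route({"lb_server": "B"}, {"A", "B", "C"})
repinned_server, new_cookie = route({"lb_server": "B"}, {"A", "C"})
```

The second call shows the failure mode from the paragraph above: the client is silently moved to another server, so any session data that lived only in Server B's memory is gone unless it was also stored externally.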

SSL Termination

A key feature of L7 load balancers: they can terminate SSL/TLS. The HTTPS connection from the client is decrypted at the load balancer. Traffic between the load balancer and backend servers can then use plain HTTP (within your private network).

Benefits: SSL certificates are managed in one place (the load balancer), not on every server. Backend servers do less cryptographic work. The load balancer can inspect HTTP headers and make routing decisions on the decrypted payload.

🔴 Architect's Corner: Global Load Balancing

Application load balancers handle traffic within a single region. At the global scale, when you have infrastructure in Mumbai, Singapore, and Frankfurt, you need a different layer: Global Load Balancing via DNS.

DNS-Based Load Balancing: GeoDNS

When a user in Chennai resolves api.swiggy.com, the DNS server can return the IP address of the Mumbai data center (closest). A user in the UK gets the Frankfurt data center IP. This is GeoDNS: routing at the DNS layer based on the requestor's geography.

AWS Route 53 implements this with Geolocation routing and Latency routing policies. When your api.example.com record is configured for latency-based routing, Route 53 uses latency measurements collected over time between user networks and AWS Regions, and answers with the endpoint in the Region that has the lowest latency for that user.
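
Conceptually, latency-based routing is a lookup of the lowest-latency region for the caller. The sketch below uses a made-up latency table (the millisecond values and location keys are assumptions, not measurements) to show the decision a DNS service makes when it answers.

```python
# Hypothetical latency table in milliseconds: user location -> region -> latency.
LATENCY_MS = {
    "chennai": {"mumbai": 25, "singapore": 60, "frankfurt": 150},
    "london": {"mumbai": 120, "singapore": 170, "frankfurt": 15},
}

def resolve(user_location):
    """Return the region with the lowest recorded latency for this location."""
    latencies = LATENCY_MS[user_location]
    return min(latencies, key=latencies.get)

chennai_region = resolve("chennai")   # mumbai
london_region = resolve("london")     # frankfurt
```

Geolocation routing would replace the latency table with a fixed location-to-region mapping; the two usually agree, but latency data can pick a farther region when the network path to the nearest one is slow.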

Multi-Region Failover

GeoDNS also enables cross-region failover. Route 53 health checks continuously probe your regional endpoints. If the Mumbai API goes down, Route 53 typically detects the failure within 30-60 seconds (depending on the configured check interval and failure threshold) and automatically reroutes Indian users to Singapore.

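This is the pattern services at the scale of Swiggy, Razorpay, and CRED can use to maintain availability during regional outages. No manual intervention: once the cached DNS record's TTL expires, clients resolve the new IP and traffic flows to the healthy region. Keep the TTL short, because clients holding a cached answer keep hitting the dead region until it expires.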
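
A back-of-the-envelope model of DNS failover time: the outage has to be detected (check interval times the number of failed checks required), and then cached answers have to expire (the TTL). The function names, IP addresses (from documentation ranges), and parameter values below are illustrative assumptions.

```python
def dns_answer(primary_ip, secondary_ip, primary_healthy):
    """Answer with the primary region unless its health check is failing."""
    return primary_ip if primary_healthy else secondary_ip

def worst_case_failover_seconds(check_interval_s, failure_threshold, dns_ttl_s):
    """Rough upper bound: time to detect the outage plus time for caches to expire."""
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s

# Assumed settings: 10 s checks, 3 failures to mark unhealthy, 60 s TTL.
example_seconds = worst_case_failover_seconds(10, 3, 60)  # 30 + 60 = 90
answer_during_outage = dns_answer("203.0.113.10", "198.51.100.10", primary_healthy=False)
```

This ignores resolvers that cache longer than the TTL allows, so real-world failover can take longer than the formula suggests.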

Consistent Hashing for Session Affinity

In large distributed systems, IP Hash breaks down when you add or remove servers: with a plain hash(ip) mod N scheme, changing N remaps most IPs to a different server. Consistent hashing solves this: servers are arranged on a hash ring. Adding a server moves only a fraction of keys to the new server; removing a server moves only the dead server's keys to the next server on the ring.
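
The difference can be measured with a minimal hash ring. This is a sketch under assumptions: MD5 as the hash, 100 virtual nodes per server to even out the ring (a common refinement not described above), and 2,000 synthetic client IPs. It counts how many keys change server when a fourth server is added, for the ring and for plain modulo hashing.

```python
import bisect
import hashlib

def h(key):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server is placed on the ring at `vnodes` pseudo-random points.
        self.ring = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.points = [point for point, _ in self.ring]

    def lookup(self, key):
        """Walk clockwise from the key's hash to the next server point."""
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]
before = HashRing(["A", "B", "C"])
after = HashRing(["A", "B", "C", "D"])

moved_ring = sum(before.lookup(k) != after.lookup(k) for k in keys)
moved_mod = sum(h(k) % 3 != h(k) % 4 for k in keys)
```

On the ring, every key that moves lands on the new server D and the rest stay put, so roughly a quarter of the keys move. With modulo hashing, roughly three quarters of the keys change server when N goes from 3 to 4.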

Consistent hashing is used in distributed caches (for example, Memcached client libraries), Cassandra data distribution, and Varnish cache load balancing. Redis Cluster uses a related but different scheme based on fixed hash slots. In an interview, mentioning consistent hashing as an alternative to naive IP hash signals senior-level thinking.

Common Mistakes

"Round robin is good enough for production." For uniform requests on identical servers, yes. But if your API has a mix of /search (50ms average) and /upload (2,000ms average), round robin will pile up long requests on a few servers. Use Least Connections for heterogeneous workloads.

"The load balancer is infinitely scalable." Even AWS ALB has limits. Connections per second, bandwidth, and concurrent connections all have soft limits. For extreme traffic (IPL scale), load balancers need pre-warming or scaling announcements to AWS.

"Health checks at / are sufficient." A 200 on the root path means Nginx is running. It doesn't mean your app can connect to the database. Implement a /health endpoint that validates all critical dependencies.

"We don't need sticky sessions; we're using JWT." True for HTTP APIs. Not true for WebSocket connections, server-sent events, or stateful streaming; these require the connection to be maintained to the same server.

🧠 Key Takeaways

  • A load balancer distributes incoming requests across multiple servers, enabling horizontal scaling and fault tolerance.
  • Core algorithms: Round Robin (simple, equal servers), Weighted Round Robin (unequal capacity), Least Connections (variable request duration), IP Hash (client affinity).
  • L4 vs L7: L4 is faster but blind to HTTP. L7 enables path-based routing, cookie-based stickiness, and SSL termination.
  • Health checks: load balancers continuously probe servers and automatically remove unhealthy ones from rotation.
  • SSL termination at the load balancer reduces certificate management overhead and backend compute cost.
  • Global load balancing: DNS-based routing (GeoDNS) for multi-region deployments; consistent hashing for cache-aware routing.

Think About It

  1. Flipkart's sale day: 1,000 RPS normally, 12,000 RPS during the sale. You're using Round Robin across 3 servers. Some requests are product searches (20ms), some are order placement (800ms). What problem will you see, and what algorithm fixes it?

  2. Your health check endpoint returns 200 immediately after the process starts, before the database connection pool is initialized. Walk through exactly what goes wrong during a rolling deployment when the load balancer sends traffic too early.

  3. Paytm processes a payment in 3 steps over 1.5 seconds (auth → charge → confirm), each as a separate HTTP request. Why does this matter for load balancer configuration, and what's the safest option?
