Load Balancing 101
The Sunday Night Zomato Incident
8:47 PM, Sunday. Half of India is ordering dinner. Zomato's traffic dashboard shows the usual Sunday surge.
Then an alert fires.
- Server 1: CPU 97%, returning HTTP 503. Queue depth: 2,800 requests.
- Server 2: CPU 4%. Queue depth: 0 requests.
- Server 3: CPU 5%. Queue depth: 0 requests.
- Server 4: CPU 3%. Queue depth: 0 requests.
One server dying under load while three identical servers sit completely idle. The load balancer was misconfigured; it was sending every request to Server 1 because the health check weighting hadn't been updated after the last deployment.
This is the load balancing problem in its rawest form: it's not enough to have multiple servers. You need intelligence routing traffic between them.
Why This Matters
Every horizontally scaled application needs a load balancer. It becomes mandatory infrastructure the moment you run more than one server instance.
In system design interviews, load balancing questions surface constantly: "How does your API tier handle 10x traffic?" "What happens when one instance crashes?" "How do you do a zero-downtime deployment?" All of these answers go through load balancing.
🟢 The Simple Version (Start Here)
The Connaught Place Traffic Cop
Picture a traffic intersection in Connaught Place, Delhi during peak hours. Six lanes converge. Without a traffic cop, every driver pushes for the same gap, causing gridlock within minutes.
The traffic cop doesn't create new roads. They don't speed up cars. They simply observe which lanes are moving and direct cars accordingly. "Left lane is clear, go there. Centre lane is backed up, wait."
A load balancer is exactly this. Incoming requests are the cars. Server instances are the lanes. The load balancer observes server health and distributes requests so that no single server is overwhelmed while others sit idle.
The load balancer doesn't process requests itself; it just routes them. This means it must be extremely fast and highly available. If the load balancer goes down, nothing works. Managed load balancers (AWS ALB, Google Cloud Load Balancing) solve the availability problem by running redundantly across multiple availability zones.
A load balancer sits between clients and servers and health-checks each server. In this example, Server 4 has failed its health check and is removed from rotation, so traffic routes only to the healthy servers (1, 2, 3). The servers are stateless; shared state lives in the database.
The Four Core Algorithms
How does a load balancer decide which server gets the next request? There are four main algorithms, each suited to different scenarios:
The four standard load balancing algorithms. Round Robin suits homogeneous servers with uniform requests. Weighted Round Robin suits servers of unequal capacity. Least Connections is best for variable-duration requests. IP Hash provides client affinity without session storage.
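To make the four algorithms concrete, here is a minimal Python sketch of each selection rule. The server names, weights, and connection counts are invented for illustration, and the weighted variant uses a naive "repeat the name" expansion rather than the smoother interleaving that production balancers typically use.

```python
import hashlib
import itertools


def round_robin(servers):
    """Yield servers in a fixed repeating order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Yield servers in proportion to integer weights (naive expansion)."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server with the fewest in-flight requests."""
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to a server using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used instead.
    """
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["s1", "s2", "s3"]  # hypothetical server names

rr = round_robin(servers)
first_four = [next(rr) for _ in range(4)]  # wraps back to s1 on the 4th pick

wrr = weighted_round_robin({"big": 2, "small": 1})
first_six = [next(wrr) for _ in range(6)]  # "big" is picked twice as often

chosen = least_connections({"s1": 12, "s2": 3, "s3": 7})  # fewest in flight
same_server = ip_hash("203.0.113.9", servers) == ip_hash("203.0.113.9", servers)
```

Note that `ip_hash` uses a plain modulo, which is exactly the scheme that reshuffles clients when the server list changes size.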
🟡 Going Deeper: L4 vs L7, Health Checks, and SSL
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the network stack, which fundamentally changes what routing decisions they can make:
| Dimension | L4 (Transport) | L7 (Application) |
|---|---|---|
| Operates on | IP + TCP/UDP | HTTP headers, URL paths, cookies |
| Routing decisions | IP address, port number | /api/search → Search servers, /api/feed → Feed servers |
| Speed | Faster (no payload inspection) | Slightly slower (must parse HTTP) |
| SSL termination | Typically passthrough (some, such as AWS NLB, also offer TLS listeners) | Yes, decrypts, inspects, re-encrypts |
| Sticky sessions | IP-based only | Cookie-based (more reliable) |
| Example | AWS NLB, HAProxy TCP mode | AWS ALB, Nginx, Cloudflare |
| When to use | Raw TCP, non-HTTP, max performance | REST APIs, microservices, path-based routing |
For most web APIs, L7 is the right choice. Path-based routing (/payments → payment service, /orders → order service) is what makes microservices work cleanly.
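As a rough illustration of path-based routing, the sketch below picks a backend pool by longest matching path prefix. The route table, pool names, and the `pick_pool` helper are hypothetical; real L7 balancers express the same idea as listener rules or `location` blocks.

```python
# Hypothetical route table: path prefix -> backend pool.
ROUTES = {
    "/payments": ["payments-1", "payments-2"],
    "/orders": ["orders-1"],
}
DEFAULT_POOL = ["web-1"]


def pick_pool(path, routes=ROUTES, default=DEFAULT_POOL):
    """Return the backend pool for the longest matching path prefix."""
    # Match whole path segments only, so "/paymentsx" does not hit "/payments".
    matches = [p for p in routes if path == p or path.startswith(p + "/")]
    if not matches:
        return default
    return routes[max(matches, key=len)]
```

An L4 balancer cannot do this at all, because it never sees the URL path.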
Health Checks: How the Load Balancer Knows a Server is Dead
Every production load balancer runs continuous health checks. AWS ALB, for example, sends an HTTP GET to a configured path (such as /health) on every registered server every 30 seconds by default. If a server fails 2 consecutive checks (also configurable), it's removed from rotation. No human intervention.
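The "N consecutive failures" rule can be sketched as a small state machine. This is an illustrative model, not ALB's actual implementation; the `HealthTracker` class and its default thresholds are assumptions chosen to mirror the numbers above.

```python
class HealthTracker:
    """Tracks consecutive probe results for one server."""

    def __init__(self, unhealthy_threshold=2, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_rotation = True
        self._streak = 0  # consecutive probes that disagree with current state

    def record(self, probe_ok):
        """Feed one probe result; return whether the server is in rotation."""
        disagrees = probe_ok != self.in_rotation
        self._streak = self._streak + 1 if disagrees else 0
        limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if disagrees and self._streak >= limit:
            self.in_rotation = probe_ok  # flip state once the streak is long enough
            self._streak = 0
        return self.in_rotation
```

Requiring a streak in both directions means a single dropped probe does not eject a server, and a single lucky probe does not bring a broken one back.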
A good health check endpoint returns HTTP 200 only when the server is truly ready to accept requests, meaning it's connected to the database, the cache is warm, and the application has finished initializing. Many engineers return 200 immediately on startup, before the server is actually ready. This causes load balancers to send traffic to instances that aren't ready, resulting in a wave of errors.
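A readiness-style health check might look something like the sketch below. The `health_response` helper and the dependency names are hypothetical; the point is only that the status code is derived from real dependency checks rather than from "the process is up".

```python
def health_response(checks):
    """Run each dependency check and return (status_code, details).

    `checks` maps a dependency name to a zero-argument callable that returns
    True when the dependency is usable. Any exception counts as a failure.
    """
    details = {}
    for name, check in checks.items():
        try:
            details[name] = bool(check())
        except Exception:
            details[name] = False
    # 200 only when every dependency is ready; 503 tells the balancer to wait.
    status = 200 if all(details.values()) else 503
    return status, details
```

In a real service the callables would do cheap work, such as a `SELECT 1` against the database or a cache ping, and the web framework's /health handler would return this status code.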
[Animation: round-robin distributing traffic, Server B failing its health check, and the load balancer rerouting around the corpse without a human touching anything.]
Sticky Sessions: When You Need Client Affinity
Sometimes you want the same client to always hit the same server, for example when the server is maintaining a WebSocket connection or handling a multi-step upload. This is called sticky sessions or session affinity.
AWS ALB implements this via an "AWSALB" cookie it sets on the first response. Subsequent requests from the same client include this cookie, and the load balancer routes to the same server.
The problem: sticky sessions partially defeat fault tolerance. If the server a user is stuck to goes down, that user's session is lost. The solution is to store session state externally (Redis) and use sticky sessions only when truly necessary (WebSockets, file uploads).
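Here is a simplified sketch of cookie-based affinity, including what happens when the pinned server disappears. The `StickyBalancer` class and the `lb_affinity` cookie name are made up, and the cookie stores the server name in the clear purely for readability; a real balancer issues an opaque cookie value.

```python
import itertools


class StickyBalancer:
    """Round robin for new clients, cookie affinity for returning ones."""

    def __init__(self, servers, cookie_name="lb_affinity"):
        self.healthy = set(servers)
        self._rotation = itertools.cycle(servers)
        self.cookie_name = cookie_name

    def _next_healthy(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers")
        for server in self._rotation:  # skips servers that left rotation
            if server in self.healthy:
                return server

    def route(self, cookies):
        """Return (server, cookies_to_set) for one request."""
        pinned = cookies.get(self.cookie_name)
        if pinned in self.healthy:
            return pinned, {}
        # New client, or the pinned server is gone: re-pin to another server.
        server = self._next_healthy()
        return server, {self.cookie_name: server}
```

When the pinned server fails, the client is silently re-pinned, but anything that server held in memory is gone, which is why the session state itself belongs in an external store.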
SSL Termination
A key feature of L7 load balancers: they can terminate SSL/TLS. The HTTPS connection from the client is decrypted at the load balancer. Traffic between the load balancer and backend servers can then use plain HTTP (within your private network).
Benefits: SSL certificates are managed in one place (the load balancer), not on every server. Backend servers do less cryptographic work. The load balancer can inspect HTTP headers and make routing decisions on the decrypted payload.
🔴 Architect's Corner: Global Load Balancing
Application load balancers handle traffic within a single region. At the global scale, when you have infrastructure in Mumbai, Singapore, and Frankfurt, you need a different layer: Global Load Balancing via DNS.
DNS-Based Load Balancing: GeoDNS
When a user in Chennai resolves api.swiggy.com, the DNS server can return the IP address of the Mumbai data center (closest). A user in the UK gets the Frankfurt data center IP. This is GeoDNS: routing at the DNS layer based on the requestor's geography.
AWS Route 53 implements this with Geolocation routing and Latency routing policies. When your api.example.com record is configured for latency-based routing, Route 53 uses latency measurements it has collected over time between the user's network and each AWS Region, and answers with the endpoint in the Region that has the lowest latency for that user.
Multi-Region Failover
GeoDNS also enables cross-region failover. Route 53 health checks continuously probe your regional endpoints. If the Mumbai API goes down, Route 53 typically detects the failure within 30-60 seconds (depending on the configured check interval and failure threshold) and automatically reroutes Indian users to Singapore.
This is how services like Swiggy, Razorpay, and CRED can maintain availability during regional outages. No manual intervention is needed: once the DNS TTL expires, resolvers pick up the new IP and traffic flows to the healthy region.
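The routing and failover decision can be modelled in a few lines. This is a toy sketch, not how Route 53 is implemented: the latency table holds made-up numbers, and the `resolve` function stands in for the DNS answer.

```python
# Made-up latency figures (ms) from a user's region to each endpoint.
LATENCY_MS = {
    "IN": {"mumbai": 20, "singapore": 60, "frankfurt": 130},
    "UK": {"mumbai": 120, "singapore": 170, "frankfurt": 25},
}


def resolve(user_region, latency_ms, healthy):
    """Return the healthy endpoint with the lowest recorded latency."""
    candidates = {
        endpoint: ms
        for endpoint, ms in latency_ms[user_region].items()
        if endpoint in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy endpoint")
    return min(candidates, key=candidates.get)


all_up = {"mumbai", "singapore", "frankfurt"}
normal = resolve("IN", LATENCY_MS, all_up)
during_outage = resolve("IN", LATENCY_MS, all_up - {"mumbai"})
```

With every region healthy, Indian users resolve to Mumbai; with Mumbai marked unhealthy, the same lookup falls through to Singapore. Clients only see the change once their cached DNS answer expires, so the record's TTL bounds how fast failover takes effect.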
Consistent Hashing for Session Affinity
In large distributed systems, IP Hash breaks down when you add or remove servers: the hash is taken modulo the server count, so changing that count remaps most client IPs. Consistent hashing solves this: servers are arranged on a hash ring. Adding a server moves only a fraction of keys to the new server; removing a server moves only the dead server's keys to the next server on the ring.
Consistent hashing is used in Cassandra's data distribution, in Memcached client libraries (the ketama scheme), and in Varnish cache load balancing. Redis Cluster uses a related but different scheme based on fixed hash slots. In an interview, mentioning consistent hashing as an alternative to naive IP hash signals senior-level thinking.
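The difference is easy to see in a small experiment. The sketch below builds a hash ring with virtual nodes (an assumption; the text above describes only the basic ring) and counts how many of 2,000 synthetic client IPs change servers when a fifth server is added, compared with plain modulo hashing.

```python
import bisect
import hashlib


def stable_hash(key):
    """64-bit hash that is stable across processes."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server appears at many points so load spreads evenly.
        self._points = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key):
        """Return the first server clockwise from the key's position."""
        index = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[index][1]


def modulo_lookup(key, servers):
    return servers[stable_hash(key) % len(servers)]


clients = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]
before = ["s1", "s2", "s3", "s4"]
after = before + ["s5"]

ring_before, ring_after = HashRing(before), HashRing(after)
moved_ring = sum(ring_before.lookup(c) != ring_after.lookup(c) for c in clients)
moved_modulo = sum(modulo_lookup(c, before) != modulo_lookup(c, after) for c in clients)
```

On the ring, roughly a fifth of the clients move, and every one of them moves to the new server. With modulo hashing, roughly four in five clients land somewhere different.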
Common Mistakes
"Round robin is good enough for production."
For uniform requests on identical servers, yes. But round robin ignores how long each request takes, so if your API has a mix of /search (50ms average) and /upload (2,000ms average), it can pile up long requests on a few servers. Use Least Connections for heterogeneous workloads.
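A tiny replay makes the pile-up visible. The trace below is deliberately contrived: one request arrives every 100 ms and every fourth one is a slow /upload, so with four servers round robin sends every upload to the same machine. Real traffic is less neatly aligned, so treat this as a worst-case illustration rather than a benchmark.

```python
import itertools


def peak_in_flight(requests, num_servers, choose):
    """Replay (arrival_ms, duration_ms) pairs; return the worst per-server concurrency."""
    finish_times = [[] for _ in range(num_servers)]
    peak = 0
    for arrival, duration in requests:
        # Drop requests that have already completed by this arrival time.
        for i in range(num_servers):
            finish_times[i] = [t for t in finish_times[i] if t > arrival]
        target = choose([len(f) for f in finish_times])
        finish_times[target].append(arrival + duration)
        peak = max(peak, len(finish_times[target]))
    return peak


def make_round_robin(num_servers):
    counter = itertools.count()
    return lambda in_flight: next(counter) % num_servers  # ignores current load


def least_connections(in_flight):
    return in_flight.index(min(in_flight))


# Every 4th request is a slow /upload (2,000 ms); the rest are /search (50 ms).
requests = [(i * 100, 2000 if i % 4 == 3 else 50) for i in range(200)]

rr_peak = peak_in_flight(requests, 4, make_round_robin(4))
lc_peak = peak_in_flight(requests, 4, least_connections)
```

In this trace round robin stacks several concurrent uploads on one server while the other three stay nearly idle; least connections spreads the same uploads across all four.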
"The load balancer is infinitely scalable."
Even AWS ALB has limits. Connections per second, bandwidth, and concurrent connections all have soft limits. For extreme traffic (IPL scale), load balancers need pre-warming or scaling announcements to AWS.
"Health checks at / are sufficient."
A 200 on the root path means Nginx is running. It doesn't mean your app can connect to the database. Implement a /health endpoint that validates all critical dependencies.
"We don't need sticky sessions; we're using JWT."
True for HTTP APIs. Not true for WebSocket connections, server-sent events, or stateful streaming; these require the connection to be maintained to the same server.
🧠 Key Takeaways
- A load balancer distributes incoming requests across multiple servers, enabling horizontal scaling and fault tolerance.
- Core algorithms: Round Robin (simple, equal servers), Weighted Round Robin (unequal capacity), Least Connections (variable request duration), IP Hash (client affinity).
- L4 vs L7: L4 is faster but blind to HTTP. L7 enables path-based routing, cookie-based stickiness, and SSL termination.
- Health checks: load balancers continuously probe servers and automatically remove unhealthy ones from rotation.
- SSL termination at the load balancer reduces certificate management overhead and backend compute cost.
- Global load balancing: DNS-based routing (GeoDNS) for multi-region deployments; consistent hashing for cache-aware routing.
Think About It
Flipkart's sale day: 1,000 RPS normally, 12,000 RPS during the sale. You're using Round Robin across 3 servers. Some requests are product searches (20ms), some are order placement (800ms). What problem will you see, and what algorithm fixes it?
Your health check endpoint returns 200 immediately after the process starts, before the database connection pool is initialized. Walk through exactly what goes wrong during a rolling deployment when the load balancer sends traffic too early.
Paytm processes a payment in 3 steps over 1.5 seconds (auth → charge → confirm), each as a separate HTTP request. Why does this matter for load balancer configuration, and what's the safest option?
Further Reading
- AWS ALB Documentation: How Request Routing Works: The definitive reference for L7 load balancing on AWS
- HAProxy Configuration Guide: Deep dive into algorithm configuration, health checks, and tuning
- Nginx Load Balancing: Practical configuration examples for all algorithms
- Consistent Hashing: Algorithmic Foundations: Why consistent hashing beats modular hashing in distributed systems