Free lesson · 1 of 98 in the full path

Latency vs Throughput vs Bandwidth

25 min read

"Just Buy More Speed"

A SaaS startup in Pune. Users in the US are complaining that the dashboard feels slow. Every click takes a second to respond. The engineering manager does the obvious thing: upgrades the servers to double the cores, doubles the network plan, and proudly announces the fix.

Nothing changes. Clicks still take a second.

Two weeks and several lakhs later, a junior engineer asks one question in standup: "Where are the servers?" Mumbai. "Where are the users?" Boston. The round trip between Boston and Mumbai is around 200 milliseconds, and the app makes four sequential API calls per click. Four times 200ms is 800ms of pure travel time. No amount of CPU or bandwidth was ever going to fix that. The fix was moving static assets to a CDN and collapsing four calls into one. Total infra cost: less than the celebration pizza from the failed upgrade.

When a junior developer says a system is slow, they reach for the word "speed". In system design, "speed" is a useless blanket word. It hides the actual physics. An architect breaks it into three separate metrics: latency, throughput, and bandwidth. Fix the wrong one and you burn money while the system stays slow.

Why Should You Care?

  1. Misdiagnosis is expensive. Buying bandwidth to fix a latency problem is like adding lanes to fix a traffic light. I have watched companies do exactly this. The bill arrives, the slowness stays.
  2. Interviews test the vocabulary directly. "Your API is slow, walk me through your investigation" is a classic prompt. The interviewer is listening for whether you separate the three metrics or wave the word "performance" around.
  3. Every SLA you will ever sign is written in these units. p99 latency, requests per second, gigabits per second. If you cannot read these numbers fluently, you cannot negotiate them.

🟢 The Simple Version

[Figure: Latency vs throughput vs bandwidth explained with the highway analogy. Bandwidth = lanes (capacity); throughput = cars passing per hour; the toll booth is the processing choke point; latency = time from Pune to Mumbai. Tolls (processing time) matter more than lanes after a certain point.]

We will use the Mumbai-Pune Expressway. It maps perfectly.

Bandwidth: The Lanes

Bandwidth is the maximum theoretical capacity of a connection: how much data can physically travel through the pipe per second.

On the expressway, bandwidth is the number of lanes. A 6-lane highway holds more cars side by side than a 2-lane road. When your ISP sells you a "100 Mbps connection", they are selling you the width of the highway. Nothing more.

Measured in: Mbps, Gbps.

Throughput: The Cars Actually Passing

Throughput is reality. The rate of successful work your system actually achieves in production.

Picture a glorious 10-lane highway with one toll booth in the middle, staffed by one sleepy operator. The highway could carry 10,000 cars an hour. The toll booth processes 100. Traffic backs up for kilometres. Your throughput is 100 cars an hour, and the 10 lanes are decoration.

In software, the toll booth is your CPU struggling with SSL, a database holding a row lock, a thread pool that ran out of threads. The golden rule: throughput is always less than or equal to bandwidth. You can have a 10 Gbps pipe and 10 Mbps of throughput if your server is choking.

Measured in: requests per second (RPS), MB/s.
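The toll booth rule can be sketched in a few lines: the sustainable throughput of a request path is the capacity of its slowest stage. This is a minimal sketch; the stage names and capacities below are hypothetical numbers chosen for illustration, not measurements.

```python
def effective_throughput(stage_capacities: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and its capacity; the path can go no faster."""
    bottleneck = min(stage_capacities, key=stage_capacities.get)
    return bottleneck, stage_capacities[bottleneck]


# Hypothetical capacities, in requests per second, for each stage of one path.
stages = {
    "network_link": 50_000.0,
    "tls_termination": 8_000.0,
    "app_thread_pool": 1_200.0,
    "database": 3_000.0,
}

name, rps = effective_throughput(stages)
print(f"bottleneck: {name}, sustainable throughput: {rps:.0f} RPS")
```

Widening the network link in this sketch changes nothing, because the thread pool is the toll booth.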

Latency: The Travel Time

Latency is how long one single request takes to go from the client to the server and back. The silent killer, because no amount of capacity fixes it.

On the expressway, latency is the time one specific car takes from Pune to Mumbai. Build a 100-lane highway with zero toll booths and that car still takes its two and a half hours, because the cities are physically apart and there is a speed limit.

In networking, the speed limit is the speed of light in fiber. Bangalore to Virginia and back is roughly 200ms. You cannot upgrade your way past physics. The only fix is reducing the distance (CDN, edge servers, regional deployment) or making fewer trips (batching, caching).

Measured in: milliseconds.
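To see why distance sets a floor, here is a rough sketch of the minimum round trip that physics allows. It assumes light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum), and the 13,000 km one-way distance is an illustrative figure, not a measured cable route.

```python
SPEED_OF_LIGHT_IN_FIBER_KM_PER_S = 200_000  # assumption: about 2/3 of c in vacuum


def min_round_trip_ms(one_way_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and processing."""
    return 2 * one_way_km * 1000 / SPEED_OF_LIGHT_IN_FIBER_KM_PER_S


# Illustrative one-way distance between two far-apart regions (assumed).
distance_km = 13_000
print(f"{min_round_trip_ms(distance_km):.0f} ms minimum round trip")
```

Real round trips come out higher than this floor because cables do not follow straight lines and every router on the path adds delay. No server upgrade touches either term.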

The One-Line Summary

Bandwidth is what you could carry. Throughput is what you actually carry. Latency is how long one item takes. Three dials, three different problems, three different fixes.

🟡 Going Deeper

Averages Lie. Percentiles Tell the Truth.

Here is a mistake I see even from senior engineers: reporting average latency. Your average is 95ms, the dashboard is green, and meanwhile a chunk of your users are suffering.

[Figure: Why average latency lies: the p50, p90 and p99 of the same system tell very different stories. Same service, same hour: p50 = 80ms, p90 = 200ms, p99 = 2,000ms, while the dashboard average reads 95ms.]

Latency is never one number. It is a distribution, and the distribution has a long ugly tail:

  • p50 (median): half your requests are faster than this. Say 80ms. Feels great.
  • p90: 1 in 10 requests is slower than this. Say 200ms. Hmm.
  • p99: 1 in 100 requests is slower than this. Say 2 seconds. There is your angry-user thread on Twitter.
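A small sketch makes the gap between the average and the percentiles concrete. The sample below is synthetic, invented so that it reproduces the example numbers above. The `percentile` helper uses the simple nearest-rank definition, which is one of several conventions monitoring tools use.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms (assumed data): 100 requests with a long tail.
latencies_ms = [10] * 49 + [80] * 40 + [200] * 9 + [2000] * 2

print("mean:", statistics.mean(latencies_ms))
for p in (50, 90, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
```

The mean of this sample is just under 95ms, yet one request in a hundred waits two seconds. The average gives no hint of that.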

One slow request in a hundred sounds rare. Now do the math for a page that makes 30 backend calls. If the calls are independent, the probability that at least one of them lands in the slowest 1% is 1 − 0.99^30, about 26%. A quarter of your page loads eat the tail. This is why serious teams obsess over p99 and p99.9, and why Amazon and Google write entire papers about tail latency.
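The fan-out arithmetic is short enough to check yourself. This sketch assumes the backend calls are independent of each other, which real systems only approximate.

```python
def chance_of_hitting_tail(calls: int, tail_fraction: float = 0.01) -> float:
    """Probability that at least one of `calls` independent requests lands in the slow tail."""
    return 1 - (1 - tail_fraction) ** calls


for calls in (1, 12, 30, 100):
    print(f"{calls:>3} calls -> {chance_of_hitting_tail(calls):.1%} of pages see a p99-slow call")
```

With 30 calls the result is about 26%. With 100 calls it passes 60%, which is why fan-out and tail latency are always discussed together.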

The practical rule: set SLOs on percentiles, never on averages. "p99 under 300ms" is a real promise. "Average under 100ms" is a number that stays green while users suffer.

The Toll Booth Saturates: How Throughput Collapses

Throughput problems have a personality. Everything is fine, fine, fine, and then suddenly it is very much not fine.

[Figure: Throughput collapse at the choke point: when arrival rate crosses service rate, the queue grows without limit. Arrivals: 101 cars/hour; toll booth serves 100 cars/hour.]

The toll booth processes 100 cars an hour. At 80 cars an hour arriving, no queue, everyone is happy. At 99, small queue, slightly slow. At 101? The queue grows forever. Every car that arrives waits longer than the one before it. Latency shoots from seconds to hours, not because the booth got slower, but because arrivals crossed the service rate.
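The toll booth arithmetic can be sketched as a simple hour-by-hour tally. It assumes perfectly steady arrivals and service, which is a simplification: real traffic is bursty and queues form earlier than this model suggests.

```python
def backlog_after(hours: int, arrivals_per_hour: int, served_per_hour: int) -> int:
    """Cars still waiting after `hours`, assuming steady arrivals and steady service."""
    waiting = 0
    for _ in range(hours):
        waiting = max(0, waiting + arrivals_per_hour - served_per_hour)
    return waiting


def wait_hours(backlog: int, served_per_hour: int) -> float:
    """How long a newly arriving car waits behind the current backlog."""
    return backlog / served_per_hour


for arrivals in (80, 101, 150):
    queue = backlog_after(8, arrivals, 100)
    print(f"{arrivals}/hour for 8 hours -> {queue} waiting, "
          f"new car waits {wait_hours(queue, 100):.2f} h")
```

Below the service rate the backlog stays at zero. Above it, the backlog never stops growing, and the wait grows with it.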

Queueing theory calls this saturation, and the lesson for architects is blunt: systems do not degrade linearly near their limit. A server at 70% utilization responds nicely. At 95%, the queues are building and latency is already climbing. Past 100%, response times go vertical and timeouts cascade. This is why capacity planning targets 60 to 70% utilization, not 95%. The headroom is not waste. It is the difference between a spike being absorbed and a spike becoming an outage.
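One way to put numbers on the non-linear curve is the textbook M/M/1 queue: a single server with random arrivals and random service times. Real services rarely match that model exactly, so treat this as a sketch of the curve's shape rather than a prediction. The 10ms service time is an assumed figure.

```python
def mm1_response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: service time divided by (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("M/M/1 has no steady state at or above 100% utilization")
    return service_time_ms / (1 - utilization)


# Assumed 10 ms of actual work per request.
for utilization in (0.50, 0.70, 0.95, 0.99):
    ms = mm1_response_time_ms(10, utilization)
    print(f"{utilization:.0%} utilized -> about {ms:.0f} ms per request")
```

Under this model the same 10ms of work costs roughly 33ms at 70% utilization, 200ms at 95%, and 1,000ms at 99%. The server did not get slower. The queue got longer.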

This is also the link between the two metrics people confuse: when throughput saturates, latency explodes. The queue IS the latency. So a latency spike at 6 PM daily is usually not a latency problem at all. It is a throughput problem wearing latency's clothes.

The Numbers an Architect Carries in Their Head

You do not need precision. You need the orders of magnitude, so that when someone proposes a design you can smell whether it is physically possible.

Operation | Rough time
L1 cache reference | ~1 ns
RAM read | ~100 ns
SSD random read | ~100 µs
Read 1 MB sequentially from SSD | ~1 ms
Same data center round trip | ~0.5 ms
Mumbai to Delhi round trip | ~30 ms
Mumbai to Singapore round trip | ~70 ms
Mumbai to US East round trip | ~200 ms
One database query (indexed, warm) | ~1 to 5 ms
One database query (unindexed scan, large table) | seconds

Two takeaways from this table. First, RAM is roughly a thousand times faster than a random SSD read, and that SSD read is in turn roughly two thousand times faster than a round trip across an ocean. That gradient is why caching works at every layer. Second, your latency budget gets eaten by round trips, not computation. Four sequential API calls across regions cost more than almost any code you could write.
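The gradient is easy to verify from the table's own figures. The sketch below copies four rows into a dictionary and compares them. The key names are made up for this example.

```python
NS = 1
US = 1_000 * NS
MS = 1_000 * US

# Rough figures copied from the table above, in nanoseconds.
latency_ns = {
    "ram_read": 100 * NS,
    "ssd_random_read": 100 * US,
    "same_dc_round_trip": 500 * US,
    "mumbai_us_east_round_trip": 200 * MS,
}


def times_slower(slow: str, fast: str) -> int:
    """How many times longer the `slow` operation takes than the `fast` one."""
    return latency_ns[slow] // latency_ns[fast]


print("SSD vs RAM:", times_slower("ssd_random_read", "ram_read"))
print("Ocean vs SSD:", times_slower("mumbai_us_east_round_trip", "ssd_random_read"))
print("Ocean vs same DC:", times_slower("mumbai_us_east_round_trip", "same_dc_round_trip"))
```

Each step down the list costs about three orders of magnitude, so a cache hit at any layer saves far more than tuning the code at that layer.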

// The same feature, two designs, one ocean apart.
//
// Design A: chatty. 4 sequential calls, client in Boston, API in Mumbai
//   4 x 200ms round trips = 800ms before any work happens
//
// Design B: one batched call + CDN for static assets
//   1 x 200ms round trip = 200ms
//
// No server upgrade can close a 600ms gap created by chattiness.
// Fewer trips beats faster servers, every single time.
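The same comparison as a small calculation, with server work included so the effect of a faster server is visible. The 5ms of work per call is an assumed figure, and the model ignores everything except round trips and compute.

```python
def chatty_ms(calls: int, rtt_ms: float, work_ms: float) -> float:
    """Sequential calls: every call pays a full round trip plus its own server work."""
    return calls * (rtt_ms + work_ms)


def batched_ms(calls: int, rtt_ms: float, work_ms: float) -> float:
    """One batched call: a single round trip, with the same total server work."""
    return rtt_ms + calls * work_ms


RTT_MS = 200   # Boston to Mumbai round trip, from the story above
WORK_MS = 5    # assumed server time per call

print("Design A (chatty):", chatty_ms(4, RTT_MS, WORK_MS), "ms")
print("Design A, server twice as fast:", chatty_ms(4, RTT_MS, WORK_MS / 2), "ms")
print("Design B (batched):", batched_ms(4, RTT_MS, WORK_MS), "ms")
```

Doubling server speed takes the chatty design from 820ms to 810ms. Batching takes it to 220ms.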

🔴 Architect's Corner

The Tail at Scale Problem

The percentile math gets worse with fan-out, and fan-out is how modern systems are built. A Swiggy home screen calls maybe a dozen services: restaurants, offers, ads, ETA, past orders. The page is only as fast as its slowest call. Even if every single service has a beautiful p99, the page's p99 is dramatically worse, because the more dice you roll, the more often one comes up bad.

The standard defenses, which you will meet properly in later phases:

  • Timeouts with budgets. Give the offers service 80ms. If it does not answer, render the page without offers. A page missing one rail beats a page that is slow.
  • Hedged requests. Send the same read to two replicas and take the first answer. It costs extra load and buys a shorter tail. Google does this for search.
  • Caching the slow path. The tail usually comes from cold caches and GC pauses. Pre-computing the expensive parts pulls the tail in.
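The first two defenses can be sketched with Python's asyncio. This is a minimal illustration, not production code: `fetch` is a hypothetical stand-in that sleeps instead of calling a real service, and the delays and budget are invented values.

```python
import asyncio


async def fetch(name: str, delay_s: float) -> str:
    """Hypothetical stand-in for a network call; sleeps instead of doing real I/O."""
    await asyncio.sleep(delay_s)
    return f"{name}-data"


async def with_budget(call, budget_s: float, fallback=None):
    """Return the call's result, or `fallback` if it blows its latency budget."""
    try:
        return await asyncio.wait_for(call, timeout=budget_s)
    except asyncio.TimeoutError:
        return fallback


async def hedged(calls):
    """Start the same read on several replicas and return the first answer."""
    tasks = [asyncio.create_task(call) for call in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop paying for the slower replica
    return next(iter(done)).result()


async def render_home_screen() -> dict:
    # The offers service is slow today; render without it after 50 ms.
    offers = await with_budget(fetch("offers", 0.2), budget_s=0.05)
    # Replica A is having a bad moment; replica B answers first.
    restaurants = await hedged([fetch("replica-a", 0.2), fetch("replica-b", 0.01)])
    return {"offers": offers, "restaurants": restaurants}


page = asyncio.run(render_home_screen())
print(page)
```

This sketch fires both replica reads at once, as described above. A common refinement is to send the second request only if the first has not answered after a short delay, which keeps the extra load small.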

Latency Is a Product Decision, Not Just an Engineering One

Different operations deserve different latency budgets, and pretending otherwise wastes money:

  • Search-as-you-type: under 100ms, or it feels broken. Users notice every keystroke.
  • Page load: under 1 second feels instant, under 3 seconds is tolerable. Amazon famously measured revenue loss per 100ms of delay.
  • Payment confirmation: 2 to 3 seconds is fine. Users expect money to take a moment. Honestly, an instant payment confirmation makes some users nervous.
  • Report generation: nobody cares if it takes 30 seconds, as long as you show progress and deliver it reliably.

The architect's move is to put numbers on these budgets during requirements, not after launch. "How fast does this need to be?" is a cheaper question than a redesign.

Bandwidth Problems Are Rare. Except When They Are Everything.

For typical API traffic (small JSON payloads), bandwidth is almost never the bottleneck. A 1 Gbps link carries a comical number of 2KB responses per second. So when an engineer blames bandwidth for a slow CRUD app, be suspicious.
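A back-of-the-envelope check of that claim, as a sketch. It treats 2KB as 2,000 bytes and ignores headers, TLS and TCP overhead, so the result is an upper bound rather than a figure you would see in production.

```python
def max_responses_per_second(link_bits_per_s: float, payload_bytes: int) -> float:
    """Upper bound on responses per second if the link carried nothing but payload."""
    return link_bits_per_s / (payload_bytes * 8)


ONE_GBPS = 1_000_000_000
print(f"{max_responses_per_second(ONE_GBPS, 2_000):,.0f} responses/second")
```

That is tens of thousands of responses per second on a single link, which is far more than most CRUD backends can generate.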

But three workloads are genuinely bandwidth-bound, and India has world-class examples of each: video streaming (Hotstar shifting terabits during an IPL final, which is why CDNs and adaptive bitrate exist), backups and data migration (moving a 50TB database across regions is a bandwidth and time problem, plan it in days), and ML training pipelines (shuffling training data between storage and GPUs). For these, bandwidth math comes first in the design, not last.
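For the migration case, the bandwidth math is a one-line division. The sketch below uses decimal units (1 TB = 10^12 bytes). The `efficiency` parameter is a hypothetical knob for protocol overhead and link sharing, and the 0.7 used in the example is an assumption.

```python
def transfer_days(data_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move `data_bytes` over a link, at a given fraction of line rate."""
    seconds = data_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400


FIFTY_TB = 50e12
print(f"1 Gbps, ideal:      {transfer_days(FIFTY_TB, 1e9):.1f} days")
print(f"1 Gbps, 70% usable: {transfer_days(FIFTY_TB, 1e9, efficiency=0.7):.1f} days")
print(f"10 Gbps, ideal:     {transfer_days(FIFTY_TB, 10e9) * 24:.1f} hours")
```

Even at full line rate, 50TB over 1 Gbps takes more than four and a half days, which is why the text says to plan such moves in days.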

Where the Milliseconds Actually Go

When someone says "the API is slow", the time is hiding in one of five places. Check them in this order, cheapest first:

  1. The network path: how many round trips, across what distance? (Fix: CDN, batching, regional deployment, connection reuse.)
  2. The queue: is the service saturated and queueing requests? (Fix: more capacity or less work per request. Look at utilization, not code.)
  3. The database: missing index, lock contention, N+1 queries? (Fix: EXPLAIN plan, in Phase 5.)
  4. The dependencies: which downstream call is dragging the tail? (Fix: timeouts, caching, parallel calls instead of sequential.)
  5. The code: actual computation. (Genuinely the bottleneck less often than developers hope.)

Most "we need a faster server" conversations end at item 1 or 2.

The Decision Matrix

Symptom | Likely metric | First fix to try
Large file downloads crawl, small requests are fine | Bandwidth | Bigger pipes, CDN for static content, compression
System chokes when 5,000 users arrive at once | Throughput | Find the toll booth: locks, pools, CPU. Add capacity or cut work per request
Every click feels sluggish, even for one user | Latency | Count round trips and distance. CDN, batching, move compute closer
Fine all day, dies at 6 PM | Throughput (saturation) | Measure utilization at peak; add headroom before 70% becomes 95%
Average looks fine, users still complain | Tail latency | Stop looking at averages. Pull p99, find the slow dependency
Fast in Mumbai, slow in Boston | Latency (distance) | Regional deployment or edge caching. Physics will not negotiate

Common Mistakes

1. "Slow? Upgrade the server." The Pune startup story. If the time is spent in round trips or queues, a faster CPU changes nothing. Diagnose which of the three metrics is the problem before spending.

2. Reporting average latency. Averages hide the tail, and the tail is where users suffer. A green average dashboard with an ignored p99 is how "the app is fast" and "users say the app is slow" stay true at the same time.

3. Running hot and calling it efficiency. A service at 95% utilization looks cost-efficient on a finance slide and is one traffic blip from queue collapse. Headroom is insurance, and near the saturation point, latency rises long before utilization hits 100%.

4. Confusing bandwidth with throughput. "We have a 10 Gbps link, the network is not the problem" tells me the pipe is wide, not that anything is flowing through it. Measure actual throughput at the choke point.

5. Chatty service design. Forty sequential micro-calls to render one screen works in localhost testing, where round trips cost 0.1ms. In production across regions, the same design costs seconds. Count your round trips like you count your money.

🧠 Key Takeaways

  • Bandwidth = lanes (capacity). Throughput = cars per hour (reality). Latency = one car's travel time. Three dials, three different fixes.
  • Throughput is always ≤ bandwidth. The toll booth, not the highway, sets the limit.
  • Latency is bounded by physics. Distance and round trips dominate. Fewer trips beats faster servers.
  • Averages lie, percentiles do not. Set SLOs on p99. With a 30-call fan-out, the tail hits about a quarter of your pages.
  • Systems collapse near saturation, not at it. Target 60 to 70% utilization. When throughput saturates, latency explodes, because the queue IS the latency.
  • Carry the latency numbers table in your head. It lets you smell impossible designs before they are built.

Think About It

  1. Hotstar during an IPL final serves 25 million concurrent viewers. Which of the three metrics dominates the video delivery path, which dominates the "live score ticker" path, and which dominates the "place a fantasy team bet" path? Three different answers expected.

  2. Your p50 is 60ms and your p99 is 4 seconds. The same code serves both requests. List four physically different reasons the p99 requests could be over 60x slower, and the measurement you would use to confirm each.

  3. A teammate proposes moving your Mumbai-hosted API to a US data center because compute is 30% cheaper there. Your users are 90% Indian. Walk through what happens to each of the three metrics, and estimate the latency cost per request using the numbers table.
