Monday Morning. The CEO Has Seen a Demo.

You are the senior-most engineer at a 12-person Bangalore startup. Monday standup is about to begin when the CEO walks in with his laptop open and that dangerous shine in his eyes.

"I saw a demo of a video platform yesterday. We need to build a streaming product that rivals Hotstar. Investors are asking. We need it live in three weeks, our total cloud budget is $500 a month, and the team is what it is." He points at the three fresh graduates who joined last month.

Your junior dev has already opened a tab. "Kubernetes multi-region setup with microservices and gRPC." He starts sketching seventeen boxes.

He has already failed. Not because Kubernetes is bad. Because he answered before understanding the question. Three weeks kills microservices. $500 a month kills the multi-region cluster. And three fresh graduates cannot debug a service mesh when it falls over at 3 AM. He ignored every constraint in the first thirty seconds.

Let me say the thing this whole lesson exists to say. System design is not about knowing every technology in the world. It is not about memorizing Kafka internals or how MongoDB indexes documents. If you learn system design by memorizing tools, you will fail interviews. Worse, you will build terrible systems in production.

System design is one thing: decision-making under constraints.

Why Should You Care?

Every interview question is secretly a constraints question. "Design Twitter" has a different right answer at 1,000 users and at 100 million users. Interviewers reject people who start designing without asking what the constraints are.
This is the actual job at senior levels. A junior engineer asks "what is the best database?" A senior architect asks "what are we willing to sacrifice?" The promotion you want is on the other side of that question.
Over-engineering kills companies. I have watched startups burn their funding building for a scale they never reached. Knowing what NOT to build is the most valuable skill in this field.

🟢 The Simple Version (Start Here If You're New)

The Shaadi Planning Analogy

Forget servers for a minute. You have been put in charge of your cousin's wedding.

Four things squeeze every decision you make:

Budget. ₹12 lakh, total. Most of it already promised to the caterer.
Date. The muhurat is in six weeks. You cannot move it. The venue books up.
Guest count. Could be 400. Could be 900, depending on whether the Delhi side actually shows up. You have to plan for both without paying for both.
The team. Your event staff is whichever uncles and cousins volunteer. Sharma uncle says he can handle the sound system. He cannot handle the sound system.

Now, should you book the five-star banquet hall or the community ground? There is no universal answer. The five-star is better on every quality metric. It is also the wrong answer if it eats 80% of your budget. The community ground with great food might be the masterstroke.

That is system design. Money, time, traffic (the guests), and team skill, all pulling against each other. The architect is not the person who knows the fanciest venue. The architect is the person who makes the right call for these constraints, and can explain what they gave up.

The Four Forces

Back to software. Every time you draw a box on a whiteboard, you are pulling at four invisible strings. Pull one and another snaps.

1. Money / Budget. Hardware costs, cloud bills, managed service fees. You can solve almost any scaling problem by throwing a million dollars at AWS. If your company is not making a million dollars, you are out of a job. The reverse mistake is also real: spending engineer-months to save $50 a month on infra is money set on fire.

2. Time to Market. How fast can your team actually deliver? If a competitor launches next month, the "worse" architecture that ships in two weeks beats the elegant one that ships in six. Using Firebase instead of a custom backend is not laziness. It is buying time with money, and that is often the correct trade.

3. Traffic / Scale. Are you building an internal dashboard for 50 employees, or the voting backend for Bigg Boss where 50 million people hit submit inside a 10-second window? These are not the same problem. Pretending the dashboard needs the voting system's architecture is how budgets die.

4. Team Skill. The constraint engineers love to ignore. If you deploy Cassandra because a blog said Apple uses it, but nobody on your team can tune its compaction or debug a split-brain at 3 AM, you have not adopted a database. You have armed a time bomb. Operational complexity is a real cost, and it gets paid in on-call tears.
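
To put numbers on the traffic force, here is a back-of-envelope sketch. The Bigg Boss figures come from the text above. The dashboard usage rate (20 requests per user per minute) is an assumption invented for illustration, as is the helper name `average_rate`.

```python
def average_rate(requests: int, window_seconds: float) -> float:
    """Average requests per second over a time window."""
    return requests / window_seconds


# From the text: 50 million submits inside a 10-second window.
voting_rps = average_rate(50_000_000, 10)

# Assumption (not from the text): each of the 50 dashboard users
# makes about 20 requests per minute while working.
dashboard_rps = average_rate(50 * 20, 60)

ratio = voting_rps / dashboard_rps

print(f"voting:    {voting_rps:,.0f} req/s")    # → voting:    5,000,000 req/s
print(f"dashboard: {dashboard_rps:.1f} req/s")  # → dashboard: 16.7 req/s
print(f"ratio:     {ratio:,.0f}x")              # → ratio:     300,000x
```

Under these assumptions the two systems sit roughly five orders of magnitude apart. That gap is why one architecture cannot be the default for both.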

What Changed When You Became "The Architect"

When you write code, your job is making it work. When you design systems, your job becomes making it survive. Survive traffic spikes. Survive a region outage. Survive the original developer quitting. Survive the company growing 10x.

That shift changes the questions you ask:

The coder asks	The architect asks
What's the best database?	What does this database cost us, and what are we sacrificing?
How do I make this fast?	How fast does this need to be, and what does each millisecond cost?
What does Netflix use?	What do Netflix's constraints have in common with ours? (Usually: nothing.)
Does it work?	How does it fail, and what do users see when it does?

🟡 Going Deeper: The Architect's Decision Loop

Boxes and arrows are what system design looks like from the outside. From the inside, it is a loop you run for every significant decision:

Question. What are we actually building? Who consumes it? What breaks if it is down for an hour?
Constraints. Money, time, traffic, team. Write them down with real numbers, not vibes.
Options. Always at least two. If you can only think of one design, you do not understand the problem yet. You are pattern-matching to the last blog post you read.
Trade-offs. What does each option cost in the four currencies? What is its failure mode?
Decide and document. Pick one and write down why, so future-you (or your replacement) knows which assumptions to revisit.
Revisit. Constraints change. The decision that was right at 10,000 users may be wrong at 10 million. Architecture is a loop, not a one-time ceremony.
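
One way to make the loop hard to skip is to write each decision down in a fixed shape. The sketch below is a minimal, hypothetical record format, not a standard: the class names, field names, and the example values are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    name: str
    cost: str          # what it costs in money, time, traffic headroom, team skill
    failure_mode: str  # how it breaks


@dataclass
class DecisionRecord:
    question: str
    constraints: dict[str, str]
    options: list[Option]
    chosen: str
    reason: str
    revisit_when: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the loop steps this record has skipped."""
        issues = []
        missing = {"money", "time", "traffic", "team"} - set(self.constraints)
        if missing:
            issues.append(f"constraints missing: {sorted(missing)}")
        if len(self.options) < 2:
            issues.append("fewer than two options considered")
        if self.chosen not in {o.name for o in self.options}:
            issues.append("chosen option is not among the listed options")
        if not self.revisit_when:
            issues.append("no revisit trigger written down")
        return issues


# Example values are made up for illustration.
record = DecisionRecord(
    question="How do we process background events?",
    constraints={
        "money": "no budget for a new managed service",
        "time": "needed this sprint",
        "traffic": "about 50 events/minute",
        "team": "everyone can debug Postgres",
    },
    options=[
        Option("postgres jobs table", "polling load on the main database",
               "slows down if the table grows unbounded"),
        Option("kafka", "a new cluster to run and learn",
               "nobody on call can debug it"),
    ],
    chosen="postgres jobs table",
    reason="Meets current traffic with tools the team already operates.",
    revisit_when=["sustained event rate outgrows a single table"],
)

print(record.problems())  # → []
```

The useful part is `problems()`: a record with one option, or with no revisit trigger, fails the check before anyone argues about the diagram.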

The 4-Question Method

For any technology you are evaluating, in an interview or a design review, force yourself to answer four questions out loud. If you stumble on any one of them, you are not ready to use it.

1. What problem does it actually solve, and is it a problem you currently have? A distributed Redis cache relieves database read bottlenecks. If your database serves ten reads a second, Redis solves a problem you do not have, and gifts you several you did not ask for: invalidation bugs, one more thing to operate, one more thing that pages you.

2. How does it work under the hood? Not the C++ source. A working mental model. Knowing that Kafka writes sequential append-only logs to disk explains why it gets massive throughput without holding everything in RAM. If you do not know why it is fast, you will use it in a way that makes it slow.

3. When does it break catastrophically? Every system has a cliff. A relational database hits its cliff when concurrent writes fight over the same row locks. A NoSQL store designed around single-document writes hits its cliff when you suddenly need multi-document transactions. Find the cliff before production finds it for you.

4. What is the brutal trade-off? Everything you gain costs something. Faster time to market? You paid in money or quality. Global scale? You probably paid in consistency, and some users will see stale data. Say the price out loud. If a design seems to have no downside, you have not found it yet.

// The 4-question method as a code review comment.
// A junior proposes adding Kafka for "scalability":

// Q1: What problem does it solve?
//     "We might need to handle more events later"  <- not a current problem
// Q2: How does it work?
//     "It's like a queue but better"               <- no mental model
// Q3: When does it break?
//     "...it doesn't? LinkedIn uses it"            <- cliff unknown
// Q4: What's the trade-off?
//     "None really"                                <- price not found
//
// Verdict: not yet. A Postgres table with a `processed_at` column
// handles 50 events/minute just fine, and the whole team can debug it.
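
For the curious, the jobs-table idea in that verdict can be sketched in a few lines. This is a minimal illustration, not a production queue: it uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere, and the table name `events` and the functions `enqueue` and `process_pending` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for Postgres in this sketch
conn.execute(
    """
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        processed_at TEXT  -- NULL means "not processed yet"
    )
    """
)


def enqueue(payload: str) -> None:
    """Add one event to the queue table."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))


def process_pending(handler, batch_size: int = 10) -> int:
    """Handle up to batch_size unprocessed events, oldest first."""
    rows = conn.execute(
        "SELECT id, payload FROM events "
        "WHERE processed_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, payload in rows:
        handler(payload)  # if this raises, the row stays pending for a retry
        now = datetime.now(timezone.utc).isoformat()
        with conn:
            conn.execute(
                "UPDATE events SET processed_at = ? WHERE id = ?",
                (now, event_id),
            )
    return len(rows)


enqueue("order-created:101")
enqueue("order-created:102")

seen = []
handled = process_pending(seen.append)
print(handled, seen)  # → 2 ['order-created:101', 'order-created:102']
```

This sketch assumes a single worker. With several workers on Postgres you would need row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED`) so two workers do not pick the same row. That is still far less to operate than a Kafka cluster at 50 events a minute.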

Scale Changes the Right Answer

This is the single most important mechanic in system design. The same product has different correct architectures at different scales.

Take "users can upload a profile photo":

At 100 users: store it on the app server's disk. Done in an hour. Correct.
At 100,000 users: object storage (S3), images resized on upload, served through a CDN. Correct.
At 100 million users: S3 plus multi-size pre-generation plus CDN plus a content moderation pipeline plus dedup plus per-region buckets for data-residency laws. Also correct.

All three are correct at their scale. The third design at the first scale is not "future-proof". It is three wasted months that can kill a startup. The first design at the third scale is an outage. So when an interviewer says "Design Instagram", the first thing out of your mouth should be a question about scale, because scale decides everything downstream.
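
A rough storage estimate shows why the three tiers differ. The numbers below are a sketch under stated assumptions: an average photo of 200 KB, one stored size at the two smaller scales, and four pre-generated sizes at the largest. None of these figures come from the text, and treating every size as 200 KB overstates the real total.

```python
AVG_PHOTO_KB = 200  # assumption: average stored size per image


def storage_gb(users: int, sizes_per_photo: int = 1) -> float:
    """Total profile-photo storage in GB (decimal units)."""
    return users * sizes_per_photo * AVG_PHOTO_KB / 1_000_000


for label, users, sizes in [
    ("100 users", 100, 1),
    ("100,000 users", 100_000, 1),
    ("100 million users", 100_000_000, 4),
]:
    print(f"{label}: {storage_gb(users, sizes):,.2f} GB")

# → 100 users: 0.02 GB
# → 100,000 users: 20.00 GB
# → 100 million users: 80,000.00 GB
```

Twenty megabytes fits on any app server's disk. Eighty terabytes does not, and that is before moderation, dedup, or residency rules enter the picture.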

🔴 Architect's Corner: What the Job Actually Looks Like

Same Product, Three Right Answers

Watch how the four forces produce different correct designs for the same prompt, "build a food-ordering platform":

The 5-person startup (zero revenue, 3 months of runway): a Django monolith on two VMs, Postgres, Razorpay's hosted checkout, managed services everywhere they can get away with it. Their biggest risk is not scale. It is running out of money before product-market fit. Boring tech is a survival strategy.
The funded scaleup (1 lakh orders a day, 40 engineers): the monolith is now split where it actually hurt. Payments extracted for fault isolation, a read replica for reports, Redis for the menu cache. Kafka only where async genuinely won. Notice what they did NOT do: rewrite everything.
The enterprise (10 million orders a day, 800 engineers, compliance): full service decomposition. But driven by team boundaries (Conway's law) and regulatory isolation as much as by traffic. At this size, architecture is mostly an org-chart and risk-management problem wearing a technical costume.

So when someone asks "is microservices better than a monolith?", the only honest answer is: for whom, and under what constraints?

War Story: The Two-Pizza Graveyard

A few years ago I reviewed the architecture of an Indian B2B startup. Six engineers, around 200 paying customers, maybe 30 requests per minute at peak. They had 14 microservices, a Kafka cluster, a service mesh, and a Kubernetes setup that exactly one engineer understood.

Why? The founding engineer had read about how big tech does it. Resume-driven development did the rest.

The damage was not theoretical. Every feature touched three or four services, so shipping slowed to a crawl. Debugging a failed order meant hunting through logs across services with no tracing. The one Kubernetes-literate engineer became the single point of failure for the whole company, and when he quit, deployments froze for a month. They were paying big-tech operational costs on a chai-stall traffic profile.

The fix took a quarter. Collapse 14 services into a modular monolith with clean internal boundaries. Keep Postgres. Delete Kafka (a jobs table did the same work). And write down the triggers for when extraction would actually be earned: a team too large to share one codebase, or a component with a provably different scaling profile.

Keep some counter-examples ready for interviews. WhatsApp served around 450 million users with roughly 35 engineers on a famously boring Erlang stack. Instagram was acquired at about 30 million users with 13 employees running Django and Postgres. Shopify still runs one of the world's largest Rails monoliths while processing billions in sales. Scale does not require complexity worship. It requires understanding your bottleneck.

The Questions Seniors Ask That Juniors Don't

In a real design review, the diagram is the least interesting artifact. These questions decide whether the design survives:

"What's the failure story?" Not if the payment gateway goes down, but when. What do users see? What does the on-call engineer see? Is there a runbook?
"What does this cost at 10x traffic?" Some designs scale linearly in cost. Some have cliffs, like the managed service that is cheap until you cross the free tier.
"Who operates this?" Every component you add is something somebody gets paged for. A design's true cost includes its 3 AM cost.
"How reversible is this?" Choosing Postgres is reversible-ish. Choosing your shard key is close to permanent. Spend your deliberation budget in proportion to how hard the decision is to undo.
"What did we decide NOT to do?" A design doc that lists rejected options with reasons is worth ten that don't. It proves the decision was made, not defaulted into.

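The "cost at 10x" question is easy to answer with a few lines of arithmetic before the review, not after the invoice. The sketch below compares a linear pricing model with a free-tier cliff. Every number in it (4 million requests a month today, a 5 million free tier, the per-million prices) is hypothetical, as are the function names.

```python
def linear_cost(requests_millions: float, price_per_million: float) -> float:
    """Cost grows in proportion to traffic."""
    return requests_millions * price_per_million


def free_tier_cost(
    requests_millions: float, free_millions: float, price_per_million: float
) -> float:
    """Free up to a threshold, then billed per million beyond it."""
    billable = max(0.0, requests_millions - free_millions)
    return billable * price_per_million


today = 4.0          # million requests per month (made-up figure)
at_10x = today * 10

for label, volume in [("today", today), ("10x", at_10x)]:
    managed = free_tier_cost(volume, free_millions=5.0, price_per_million=20.0)
    self_run = linear_cost(volume, price_per_million=5.0)
    print(f"{label}: managed ${managed:,.0f}, self-run ${self_run:,.0f}")

# → today: managed $0, self-run $20
# → 10x: managed $700, self-run $200
```

With these made-up prices the managed option wins today and loses at 10x. Neither is wrong. The point is to know where the lines cross before you get there.
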
The Decision Matrix

How constraints map to defaults. Defaults, not laws. The whole point is to re-derive these for your own situation.

Dominant constraint	Bias toward	Indian example
Time to market (competitor breathing on you)	Managed services, monolith, boring stack	Early Dunzo: ship the flow, not the platform
Money (pre-revenue, short runway)	One server tier, Postgres for everything, no Kafka	Most pre-seed startups; your "queue" is a DB table
Traffic (predictable spikes)	Horizontal scaling, caching, pre-scaling for events	Hotstar scaling up before an India-Pakistan match
Traffic (extreme, sustained)	Sharding, async pipelines, multi-region	UPI's transaction backbone (10+ billion txns/month)
Team skill (3 fresh grads)	Tools with huge communities and managed ops	Firebase or Supabase over self-hosted anything
Correctness (money moves)	Strong consistency, idempotency, audit trails	Razorpay's ledger; slow and right beats fast and wrong
Compliance (data residency, audits)	Service isolation, per-region storage, access control	Indian fintechs keeping payment data in-country per RBI
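
The "Correctness" row mentions idempotency and audit trails. As a minimal sketch of what those words mean in code, here is an in-memory ledger that applies each credit at most once per key. The `Ledger` class and its method names are hypothetical. A real ledger would persist this durably, typically with a unique constraint on the key, and this says nothing about how Razorpay actually does it.

```python
class Ledger:
    """In-memory sketch of an idempotent credit operation with an audit log."""

    def __init__(self) -> None:
        self.balance_paise = 0
        self._applied: dict[str, int] = {}  # idempotency key -> amount
        self.audit_log: list[tuple[str, int, str]] = []

    def credit(self, idempotency_key: str, amount_paise: int) -> bool:
        """Apply a credit once per key. Returns False for a harmless replay."""
        if amount_paise <= 0:
            raise ValueError("amount must be positive")
        if idempotency_key in self._applied:
            if self._applied[idempotency_key] != amount_paise:
                raise ValueError("same key reused with a different amount")
            self.audit_log.append((idempotency_key, amount_paise, "replay-ignored"))
            return False
        self._applied[idempotency_key] = amount_paise
        self.balance_paise += amount_paise
        self.audit_log.append((idempotency_key, amount_paise, "applied"))
        return True


ledger = Ledger()
first = ledger.credit("order-101", 49_900)  # first attempt
retry = ledger.credit("order-101", 49_900)  # client retried after a timeout

print(first, retry, ledger.balance_paise)  # → True False 49900
```

The retry changes nothing except the audit log, which records that it happened. That is the "slow and right" trade in miniature: an extra lookup on every write, in exchange for never crediting twice.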

Common Mistakes

1. "What's the best tech stack for X?" There is no best. There is best under these constraints. Anyone who answers a stack question without first asking about your constraints is selling something, usually their own resume.

2. "We should build it scalable from day one." Build it adaptable from day one: clean module boundaries, decisions written down. Actual scale infrastructure (sharding, multi-region, service meshes) built before you need it is inventory that rots. It slows every feature while waiting for traffic that may never come.

3. "Netflix does it this way." Netflix has thousands of engineers, and your traffic is a rounding error next to theirs. Their constraints produced their architecture. Copying the output without sharing the inputs is cargo-culting. Better question: what would Netflix do with my team and my budget?

4. "The architecture review is about the diagram." The diagram is maybe 20% of it. The review is about failure modes, costs, operational burden, and what was deliberately rejected. A pretty diagram with no trade-off analysis is decoration.

5. "Asking clarifying questions makes me look unsure." In interviews it is the opposite. Diving straight into boxes signals junior. Asking "what's the scale, what's the latency budget, what's the team size?" signals you have done this for real.

🧠 Key Takeaways

System design = decision-making under constraints. Money, time to market, traffic, team skill. Every decision pulls all four strings.
The same product has different correct architectures at different scales. Ask about scale before drawing anything.
Run the 4-question method on every tool: what problem does it solve, how does it work, when does it break, what is the trade-off. Stumble on any one and you are not ready to use it.
Operational complexity is a real cost, paid by your team at 3 AM. "Who gets paged for this?" is an architecture question.
The simplest architecture that meets current constraints wins. Over-engineering for imaginary scale is the most expensive mistake in our industry.
Document what you rejected and why. Constraints change, and written reasoning lets you revisit decisions instead of re-fighting them.

Think About It

The Bigg Boss voting system takes around 50 million votes in a 10-second window after the host says "lines open", and then traffic drops to nearly zero. Designing for the peak means paying for idle capacity 99.9% of the time. What are three fundamentally different ways to handle this traffic shape, and what does each one give up?
You join a startup as the first senior engineer. The existing system is a messy PHP monolith, but it works and ships features fast. The CTO wants a rewrite to "modern microservices". Revenue is growing 20% month over month. Using the four forces, argue both sides. Which force dominates?
IRCTC's tatkal booking opens at 10 AM sharp. Millions of users hammer it at once for a small, fixed inventory of seats. Which of the four constraint forces matters LEAST in that design, and why? (Hint: it is not the obvious one.)

What System Design Actually Is