The Green Checkmark That Lied

It is 11:58 PM on the last day of the month. A migrant worker in Pune opens his UPI app to send ₹15,000 home before his brother's school fees are due. He taps Send. The app responds in 300 milliseconds with a green checkmark. Transfer successful.

The app was up. The app was fast. The app survived month-end traffic that would melt most systems.

The money never left his account, and it never reached his brother. A background worker had crashed mid-transaction, and the system lost the transfer while telling him it succeeded.

Ask yourself: was this system available? Absolutely, it responded instantly. Was it scalable? Clearly, month-end peak did not faze it. Was it reliable? Catastrophically, no.

The lesson buried in this story: a fast, always-up system that lies is worse than a system that is down. Down, the user retries later. Lying, the user walks away certain his money is safe.

Engineers throw around "scalable", "available" and "reliable" as if they all mean "doesn't suck". To an architect they are three different forces, frequently at war with each other. Confuse them and you will optimize the wrong one while the business burns.

Why Should You Care?

Every NFR conversation is built on these three words. When a PM says "it needs to be robust", your first job is translating that into which of the three they actually mean, because each one costs differently.
Interviews probe the distinction directly. "Your system shows 99.99% uptime but users report lost orders. What's going on?" is an availability-vs-reliability question. Candidates who treat them as synonyms fail it.
You cannot buy all three at maximum. Each extra nine of availability can double infra and on-call cost. Knowing which property your product actually needs is a budget decision worth crores.

🟢 The Simple Version

The Wedding Caterer Analogy

You are catering a wedding.

Scalability: the baraat shows up with 300 uninvited guests. Can your kitchen expand, with more burners and more cooks, without the food quality collapsing?
Availability: when a guest walks up to the counter at any point in the evening, is someone there to serve them? Even if the paneer is finished and they get only dal, they got a response.
Reliability: the guest who asked for the Jain thali. Did they get the actual Jain thali, every single time? Serving fast and serving wrong (onion in the Jain plate) is the worst outcome in the room.

Scalability: Can It Handle More?

Scalability is about capacity under growth. 1,000 visitors today. An influencer posts your link, and tomorrow it is 1,000,000. Does the system expand gracefully or burst into flames?

Two ways to grow:

Vertical scaling (scale up): buy a bigger server. 128 cores, terabytes of RAM. Simple, zero code changes. But it hits a hard ceiling. Eventually there is no bigger computer to buy, and the price curve goes exponential long before that.
Horizontal scaling (scale out): add more cheap, identical servers. No theoretical limit. But now you need load balancers, your app must become stateless so any server can handle any request, and a whole family of distributed-systems problems has just RSVP'd to your party.

The metric: requests per second the system sustains while latency stays in budget. Scalability without a latency condition is meaningless. Anything "handles" a million requests if you let them queue forever.

Availability: Is It Responding?

If a user opens your app right now, do they get a response, or a spinner that times out? Availability is measured in nines:

Nines	Downtime per year	What it really means
99% ("two nines")	~3.65 days	Hobby project
99.9%	~8.8 hours	Standard SaaS promise
99.99%	~52 minutes	Serious money: redundancy everywhere
99.999%	~5.3 minutes	You will not be deploying on Fridays. Or most other days.

High availability is bought with redundancy. Assume servers die, cables get cut, data centers flood. When Server A dies, Server B takes over before the user notices.

The crucial subtlety: a system is "available" even when it serves slightly stale data or a graceful error. Available means it answers. It says nothing about the answer being right.

Reliability: Is It Correct?

Reliability is the promise-keeping property. The system does exactly what it said it did. Money debited means money credited. Order confirmed means the restaurant actually received it. No silent data loss, no duplicate charges, no "successful" writes that vanish.

This is why the UPI story stings. Reliability failures wear a success costume. An outage announces itself. A correctness bug gets discovered weeks later in a reconciliation report, or in a customer-support firestorm.

The metric: correctness rate of completed operations. A system can have 100% uptime and a 0.1% silent-corruption rate, and the second number is the one that kills the company.

🟡 Going Deeper: How the Three Fight Each Other

These three are not a checklist where you tick all the boxes. They are a budget, and over-spending on one starves the others.

Availability vs Reliability, the fundamental tension. When a component fails mid-operation, you face a small version of the CAP choice: respond anyway with possibly wrong or stale data (availability wins), or refuse to respond (reliability wins). A stock-trading platform during a network blip rejects your trade with "service temporarily unavailable", because executing the wrong trade is lawsuit territory. A meme feed during the same blip serves you slightly stale memes, because who cares.

Scalability vs Reliability. Scaling out usually means replicating data, and replicas can disagree (you will meet replication lag properly in Phase 5). The most scalable architectures relax correctness guarantees to keep nodes independent. That relaxation is a reliability concession, deliberately scoped to data that tolerates it.

Scalability vs Availability, the sneaky one. Auto-scaling protects availability under load, until the scale-out event itself takes the system down. Fifty new servers stampede the database with cold-cache traffic, connection pools exhaust, and the cure causes the outage. More moving parts means more failure modes. The scaling machinery itself has to be made reliable.

What an Extra Nine Actually Costs

Moving from 99.9% to 99.99% is not "10x better for 10% more money". It typically means multi-AZ redundancy for every tier (roughly 2x infra), automated failover that is actually tested (engineering quarters), a real 24/7 on-call rotation (people's weekends), and zero-downtime deploy machinery. Moving to 99.999% adds multi-region active-active and a change-management culture that slows every feature.

The architect's question is never "how available can we be?". It is "what does each extra nine earn the business?". A B2B dashboard used 9-to-6 on weekdays needs 99.9% during business hours. Paying for five nines there is burning money to impress a status page.

The Green-Checkmark Anatomy

The UPI story's failure pattern deserves dissection, because it is the most common serious bug class in distributed systems:

The API server receives the transfer, writes "initiated" to its DB, returns 200 OK. Availability metrics look perfect.
A background worker picks up the transfer and crashes halfway.
Nothing retries. The transfer stays half-done forever. Reliability is silently broken.
Dashboards: all green. Uptime: 100%. Customer trust: gone.

The fixes belong to later weeks (transactional outbox, idempotent retries, reconciliation jobs, in Phases 3 and 7). The diagnostic skill belongs here: uptime dashboards measure availability; only end-to-end checks measure reliability. If your only alert is "is the service up", you will be the last person to know it is lying.

// Availability metric: did we respond?
//   200 OK rate = 100%   <- looks perfect
//
// Reliability metric: did the money actually move?
//   SELECT count(*) FROM transfers
//   WHERE status = 'initiated'
//     AND created_at < now() - interval '15 minutes';
//   -> 4,217 rows   <- four thousand lies
//
// The second query is a reconciliation check. It is reliability's
// answer to the uptime dashboard. Every money system needs one.

🔴 Architect's Corner

SLI, SLO, SLA: Making the Words Contractual

In production organizations these properties stop being adjectives and become numbers with consequences:

SLI (indicator): the measured thing. "p99 latency of /checkout", or "fraction of orders that complete end-to-end".
SLO (objective): your internal target. "99.95% of checkouts complete within 2 seconds, measured weekly."
SLA (agreement): the external promise with penalties. "Below 99.9% monthly, customers get service credits."

The senior insight: define reliability SLIs, not just availability SLIs. "Service returns 200" is easy to measure and easy to game (the green checkmark again). "Order reached the restaurant within 60 seconds of confirmation" is the SLI that matches what the business actually sells. Google's SRE practice calls this the difference between measuring your servers and measuring your users' experience.

Error Budgets: The Peace Treaty

A 99.9% SLO gives you a built-in allowance of about 43 minutes of failure per month. That is an error budget. Spend it deliberately on risky deploys, chaos experiments, infrastructure migrations. When the budget is exhausted, feature work pauses and stability work takes over. This turns the eternal dev-vs-ops war ("ship faster" vs "break nothing") into a number both sides can negotiate with.

Graceful Degradation: Availability's Senior Form

Binary up/down thinking is the junior version. Senior systems degrade in layers. When Hotstar's recommendation engine struggles during an India-Pakistan final, they do not drop the stream. They serve a cached "popular now" rail instead. When Swiggy's ETA service is overloaded, you can still order food. The ETA just says "45-60 min" instead of "47 min".

The design move: classify every dependency as critical (the payment ledger, fail loudly if it is down) or enhancing (recommendations, fall back silently). Availability engineering is mostly deciding what is allowed to break quietly.

India-Scale Reference Points

UPI processes 10+ billion transactions a month with published success-rate targets per transaction. An explicitly reliability-first posture. NPCI publishes per-bank failure rates, naming and shaming the unreliable banks.
Hotstar's 25 million+ concurrent cricket viewers is a scalability story, but their famous trick is pre-scaling before the match instead of trusting reactive auto-scaling. They buy availability with prediction, not reflexes.
IRCTC tatkal chooses reliability over availability every single morning. When contention peaks you get errors and queues, but it (mostly) never sells the same berth twice. Infuriating UX. Correct priority.

The Decision Matrix

System	Optimize first	Acceptable sacrifice	Why
UPI / banking ledger	Reliability	Availability (error out under doubt)	Wrong money movement is unrecoverable trust damage
Hotstar live stream	Availability + scalability	Reliability of metadata (counts, recs)	A dropped stream is the product failing; a wrong view-count is not
Zomato order placement	Reliability of the order, availability of browse	Freshness of menus and ratings	Browse can be stale; the order itself must be exactly-once
Instagram feed	Availability	Consistency/freshness	A 10-second-old feed is invisible; downtime is churn
Flash-sale inventory	Reliability (no overselling)	Availability for some users (queue/429)	Overselling 50k phones is a logistics catastrophe
Internal analytics dashboard	Scalability of ingestion	Availability (nobody dies if it is down at 3 AM)	Data volume grows; consumers are tolerant

Common Mistakes

1. "We need five nines." Almost nobody needs five nines. People need the consequences of downtime to be small, and that is often cheaper to buy with graceful degradation and fast recovery than with prevention. Ask what the business loses per minute of downtime, then price the nine against it.

2. Treating uptime as the only health metric. The green-checkmark failure: 100% uptime, silently corrupted operations. Without end-to-end reliability checks (reconciliation queries, invariant monitors), your dashboards measure your servers, not your promises.

3. "Horizontal scaling = available." Fifty app servers behind one database is fifty ways to reach the same single point of failure (the next lesson's whole topic). Scaling the stateless tier without scaling the stateful tier mostly scales your ability to generate load on the bottleneck.

4. Retrofitting reliability. Availability you can often bolt on later: add replicas, add a load balancer. Reliability lives in the application's logic (idempotency, transactional boundaries, exactly-once effects) and is brutally expensive to retrofit. Decide the reliability requirements before writing the payment path, not after the first double-charge.

5. Uniform guarantees across the whole product. The same app can and should be reliability-first on payments and availability-first on browsing. Blanket guarantees mean you are over-paying somewhere and under-protecting somewhere else.

🧠 Key Takeaways

Scalable = handles more. Available = responds. Reliable = responds correctly. Three different properties, three different price tags.
Available-but-unreliable is the most dangerous state. The system lies with a green checkmark. Down is recoverable. Wrong often is not.
Each extra nine roughly doubles cost in infra, people and process. Price nines against business loss, not pride.
Measure reliability with end-to-end SLIs, not just uptime. Reconciliation queries catch what dashboards cannot see.
Degrade in layers. Classify dependencies as critical or enhancing, and let the enhancing ones break quietly.
Different features deserve different guarantees. Reliability-first money paths and availability-first browse paths coexist in every good design.

Think About It

Swiggy on New Year's Eve. Order volume spikes 5x and the restaurant-side tablet network gets flaky. You must choose: accept orders optimistically (some will fail at the restaurant after payment), or show "restaurant unavailable" aggressively (lose orders that would have succeeded). Which property are you trading against which, and what middle path could you design?
Your CEO read about five nines and wants them "across the platform". Budget approved. Which parts of a food-delivery platform would you actually take to 99.99%, which would you leave at 99.9%, and which single metric would you put on the CEO's dashboard instead of uptime?
An e-commerce site's monitoring shows 100% uptime and normal latency, but weekly revenue is quietly down 8%. Sketch three reliability failure modes (not availability ones) that could produce exactly this signature, and the end-to-end check that would have caught each.

Scalability vs Availability vs Reliability