Preparing for Scale


What We Did to Make Sure Our Systems Didn’t Fall Over During an IPL Ad

A few years back, we were about to run an IPL advertisement that would send a very large amount of traffic to our platform in a very short time.

This wasn’t one of those “traffic might increase” situations. We knew fairly well that once the ad aired, there would be a sudden spike, and if things went wrong, they would go wrong very publicly.

So instead of scrambling on the day itself, we put together a fairly exhaustive readiness plan and worked through it step by step in the weeks leading up to the event. Internally, it ended up becoming a 28-point agenda that we followed almost religiously, and we reused the same checklist every year to make sure we had things under control before airing an ad during the IPL.

This post is not about the checklist itself, but about the thought process behind creating it and how it helped us stay prepared.


Feature flags for everything

The first thing we enforced was simple: every feature had to be behind a feature flag.

Not just new features — everything that could reasonably be toggled.

The goal wasn’t experimentation. The goal was control. If something started misbehaving under load, we wanted the ability to turn it off without redeploying or making code changes under pressure.
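
As a rough illustration (not our actual implementation), a kill switch can be as small as a flag lookup wrapped around the non-essential code path. In the sketch below, the flag store, flag names, and helpers are all hypothetical; an environment variable stands in for whatever config service or feature-flag platform you use:

    import os

    # Minimal kill-switch sketch. In practice the flag would live in a config
    # service or feature-flag platform that can be flipped without a deploy;
    # an environment variable stands in for it here.
    def flag_enabled(name, default=True):
        value = os.environ.get(f"FLAG_{name.upper()}", str(default))
        return value.strip().lower() in ("1", "true", "yes")

    def fetch_offers(user_id):
        # Stand-in for an expensive downstream call we may want to switch off.
        return [f"offer-for-{user_id}"]

    def build_home_response(user_id):
        response = {"user_id": user_id, "offers": []}
        # Non-essential work sits behind a flag so it can be turned off
        # under load without a redeploy.
        if flag_enabled("personalised_offers"):
            response["offers"] = fetch_offers(user_id)
        return response

    if __name__ == "__main__":
        print(build_home_response(42))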


Reducing scope before reducing capacity

For every major user flow, we asked a slightly uncomfortable question:

“What is the absolute minimum version of this that still works?”

This led to things like:

  • Temporarily disabling non-essential validations
  • Removing optional API calls
  • Simplifying response payloads
  • Turning some flows read-only

None of this was permanent, but all of it reduced load and complexity during the spike.
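
To make “absolute minimum version” concrete, here is a hypothetical sketch of a flow shedding optional work behind a degraded-mode flag. The flow, fields, and the work being skipped are made up purely for illustration:

    # Hypothetical degraded-mode sketch: when the flag is on, the flow skips
    # optional validation and enrichment calls and serves a leaner payload.
    DEGRADED_MODE = True  # in reality this would come from the flag store

    def validate_address_strictly(cart):
        # Placeholder for a heavier validation we can live without during the spike.
        pass

    def checkout_summary(cart):
        summary = {"items": cart["items"], "total": sum(cart["items"].values())}
        if not DEGRADED_MODE:
            # Optional extras that add load but aren't needed to complete checkout.
            validate_address_strictly(cart)  # non-essential validation
            summary["recommendations"] = []  # would be an extra downstream API call
            summary["loyalty_points"] = 0    # would be an extra DB lookup
        return summary

    if __name__ == "__main__":
        print(checkout_summary({"items": {"sku-1": 499, "sku-2": 999}}))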


Logging that we could actually use

We spent time making sure logs were in a good state before worrying about dashboards and alerts.

That meant:

  • Enough log statements in the code to debug issues quickly
  • Log levels configured correctly so the relevant logs actually showed up in Kibana
  • Every developer on the team knowing how to find and filter those logs in Kibana

When traffic spikes, logs are usually the first place you look. If they’re noisy or incomplete, you lose precious time.
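
Our actual setup isn’t shown here, but as one way to make logs easy to work with in Kibana, structured JSON logs with a request ID go a long way. The field names and logger name below are assumptions:

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        """Emit one JSON object per line so Kibana/Elasticsearch can index the fields."""
        def format(self, record):
            payload = {
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                # A request ID attached via `extra=` lets us follow one user's
                # journey across services.
                "request_id": getattr(record, "request_id", None),
            }
            return json.dumps(payload)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # INFO in production; DEBUG only when needed

    logger.info("payment initiated", extra={"request_id": "req-123"})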


Alerts that pointed to real problems

We were careful about alerts, because alert fatigue is very real.

We set up alerts across:

  • APMs for latency and error rates
  • Sentry for exception spikes
  • Load balancers for 4xx / 5xx errors
  • Infrastructure metrics like CPU, memory, network usage
  • Basic service health checks

The rule we followed was: if an alert fires, it should indicate something actionable, not just “something changed”.
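
As a toy illustration of “actionable, not just changed”, an error-rate rule can require both a meaningful error percentage and enough traffic to matter before paging anyone. The thresholds below are made up:

    # Toy alert rule: fire only when the error rate is high in relative terms
    # and backed by enough traffic to matter. The thresholds are illustrative.
    def should_page(total_requests, errors_5xx, min_requests=500, max_error_rate=0.02):
        if total_requests < min_requests:
            # Too little traffic to be meaningful; don't wake anyone up.
            return False
        return (errors_5xx / total_requests) > max_error_rate

    if __name__ == "__main__":
        print(should_page(total_requests=10_000, errors_5xx=350))  # True: 3.5% 5xx
        print(should_page(total_requests=120, errors_5xx=30))      # False: too little traffic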


Dashboards for different audiences

We didn’t try to create a single mega dashboard.

Instead, we had:

  • Technical dashboards for engineers (latency, errors, saturation)
  • Product dashboards for business and product teams (funnels, drop-offs, conversions)

This helped avoid situations where engineering thought things were “fine” while the business was seeing problems, or vice versa.


On-call planning without heroics

We made sure on-call schedules were clearly defined, backup DRIs were identified, and escalation matrices were configured properly for critical systems. We even tested whether the escalations actually worked.


Making sure events existed where they mattered

We reviewed both frontend and backend flows to make sure important events were being emitted.

This helped us answer questions like:

  • Are users getting stuck at a particular step?
  • Is a failure happening before or after a key action?
  • Is this a system issue or a UX issue?

Events, when correlated with logs, become a powerful debugging tool.
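
A minimal sketch of what we mean by emitting events, with a shared request ID so they can be lined up against the logs. The event names and the transport are hypothetical; in production the events would go to an analytics pipeline rather than stdout:

    import json
    import time
    import uuid

    def emit_event(name, request_id, **properties):
        # In production this would go to an analytics pipeline (Kafka, Segment, etc.);
        # printing stands in for the transport here.
        event = {
            "event": name,
            "request_id": request_id,  # same ID that appears in the service logs
            "ts": time.time(),
            "properties": properties,
        }
        print(json.dumps(event))

    # Example: instrumenting the key steps of a payment flow.
    request_id = str(uuid.uuid4())
    emit_event("payment_started", request_id, method="upi")
    emit_event("payment_failed", request_id, reason="gateway_timeout")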


No open P0s

This one is straightforward. Any P0 bug was fixed before the ad. We didn’t carry risk knowingly into the event.


A real code freeze

We enforced a proper code freeze a week in advance.

This gave QA enough time to test, and it also gave us time to observe system behavior without constantly changing variables. It wasn’t popular, but it was necessary.


Talking to third-party providers

From the end user’s point of view, if something in our app doesn’t work, they lose trust in us, even when the failure is really a third party that hadn’t scaled up its systems. So we informed all major third-party partners about the expected spike, and asked them to scale up and be prepared for a surge of traffic from our end. The heads-up was well received by most of them.


Load testing beyond “reasonable” numbers

We load tested critical APIs at 10–15x our expected traffic.

The goal wasn’t to hit some theoretical max number. It was to find weak spots early: slow queries, lock contention, connection pool exhaustion, cache misses, and so on. At each load level we recorded latencies, CPU and memory consumption, and 4xx/5xx error rates. The exercise showed us the holes that would probably open up at scale, and we made it mandatory to close every observation, repeating the load tests as many times as needed before go-live.
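
Our actual harness isn’t shown here, but as one way to run this kind of test, a Locust script hitting the critical endpoints looks roughly like the sketch below. The host, paths, and numbers are placeholders:

    # Minimal Locust sketch for hammering critical endpoints.
    # The paths, weights, and user counts are placeholders, not our real setup.
    from locust import HttpUser, between, task

    class SpikeUser(HttpUser):
        wait_time = between(0.5, 2)  # seconds between requests per simulated user

        @task(3)
        def browse_catalog(self):
            self.client.get("/api/v1/catalog")

        @task(1)
        def place_order(self):
            self.client.post("/api/v1/orders", json={"sku": "sku-1", "qty": 1})

    # Run with something like:
    #   locust -f loadtest.py --host https://staging.example.com --users 5000 --spawn-rate 200
    # and record latency, CPU/memory, and 4xx/5xx rates at each load level.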


Access checks for people, not just systems

We ran a mock drill of scale-related incidents and found that some engineers didn’t have access to the tools they needed (Sentry, Kibana, Opsgenie, etc.).

For critical services, we explicitly identified backup owners and verified their permissions too. This sounds mundane, but it has saved us more than once.


Fire drills and runbooks

We wrote down what we would do if specific things went wrong.

Not in abstract terms, but as concrete steps:

  • What happens if a service goes down?
  • Who do we contact?
  • What can we disable?
  • What can we recover manually?

Having this written down reduced decision-making during the event itself.
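
As a hypothetical example (every scenario, contact, and step below is a placeholder), a runbook entry kept as structured data is easy to render in a wiki and easy to keep honest:

    # Hypothetical runbook entry; names and steps are placeholders.
    RUNBOOK_ENTRY = {
        "scenario": "Payments service is down or timing out",
        "detect": "Sentry exception spike or p95 latency alert from the APM",
        "first_steps": [
            "Turn off non-essential feature flags to shed optional load",
            "Check the payment gateway's status page",
            "Fail over to the secondary gateway if the primary is degraded",
        ],
        "escalate_to": ["payments on-call", "backup DRI", "engineering manager"],
        "manual_recovery": "Re-drive failed orders from the retry queue after the spike",
    }

    if __name__ == "__main__":
        for step in RUNBOOK_ENTRY["first_steps"]:
            print("-", step)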


Security cleanup before scale

We also used this time to clean up security-related issues:

  • Deprecated unused APIs
  • Tightened security groups
  • Reviewed exposed endpoints

Traffic spikes tend to attract unwanted attention as well.


Making sure infra could scale on its own

Finally, we validated auto-scaling policies across services.

During load testing we found that some of the newly deployed services didn’t have an auto-scaling policy enabled at all, so we enabled it then and there. Beyond catching those gaps, this step worked as a checklist to make sure every policy was not just enabled but also configured properly to handle the expected scale.

We didn’t want anyone manually increasing capacity while the event was live. Scaling needed to happen automatically and predictably.
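
As one way to automate that verification (assuming AWS Auto Scaling groups; other platforms have equivalents), a small boto3 script can flag groups with no policy attached or no headroom. The region and the checks are illustrative:

    # Illustrative boto3 sketch, assuming AWS Auto Scaling groups.
    # Flags groups that have no scaling policy attached or no headroom
    # between desired and max capacity.
    import boto3

    def audit_auto_scaling(region="ap-south-1"):
        client = boto3.client("autoscaling", region_name=region)
        groups = client.describe_auto_scaling_groups()["AutoScalingGroups"]
        for group in groups:
            name = group["AutoScalingGroupName"]
            policies = client.describe_policies(AutoScalingGroupName=name)["ScalingPolicies"]
            if not policies:
                print(f"[WARN] {name}: no scaling policy attached")
            if group["MaxSize"] <= group["DesiredCapacity"]:
                print(f"[WARN] {name}: no headroom between desired and max capacity")

    if __name__ == "__main__":
        audit_auto_scaling()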


How it played out

The entire engineering team was on a Google Meet call, monitoring the dashboards for their respective services, and the same engineers were active in a Slack thread to respond to any scale-related incidents. When the ad finally aired, traffic spiked exactly as expected. A few minor issues were raised, but nothing that caused panic.

Being paranoid about preparation helped!


What this taught us

The biggest takeaway for me was that scale events are rarely about last-minute fixes. They’re about preparation, discipline, and being honest about where your system is fragile. We negotiated with management for three weeks of dedicated preparation time, and every team had to book slots with the DevOps and QA teams to repeat their load tests until all the issues were fixed.

If you know you’re heading into a high-traffic moment, start early. Most problems are obvious when you give yourself enough time to look for them.