Preparing for Scale


What We Did to Make Sure Our Systems Didn’t Fall Over During an IPL Ad

A few years back, we were about to run an IPL advertisement that would send a very large amount of traffic to our platform in a very short time.

This wasn’t one of those “traffic might increase” situations. We knew fairly well that once the ad aired, there would be a sudden spike, and if things went wrong, they would go wrong very publicly.

So instead of scrambling on the day itself, we put together a fairly exhaustive readiness plan and worked through it step by step in the weeks leading up to the event. Internally, it ended up becoming a 28-point agenda that we followed almost religiously, and we reused the same checklist every year to make sure we had things under control before airing an ad during the IPL.

This post is not about the checklist itself, but about the thought process behind creating it and how it helped us stay prepared.


Feature flags for everything

The first thing we enforced was simple: every feature had to be behind a feature flag.

Not just new features — everything that could reasonably be toggled.

The goal wasn’t experimentation. The goal was control. If something started misbehaving under load, we wanted the ability to turn it off without redeploying or making code changes under pressure.
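
As a rough illustration (not our actual implementation), a kill switch can be as small as a flag lookup wrapped around the non-essential code path. In the sketch below, the flag store, flag names, and helpers are all hypothetical; an environment variable stands in for whatever config service or feature-flag platform you use:

    import os

    # Minimal kill-switch sketch. In practice the flag would live in a config
    # service or feature-flag platform that can be flipped without a deploy;
    # an environment variable stands in for it here.
    def flag_enabled(name, default=True):
        value = os.environ.get(f"FLAG_{name.upper()}", str(default))
        return value.strip().lower() in ("1", "true", "yes")

    def fetch_offers(user_id):
        # Stand-in for an expensive downstream call we may want to switch off.
        return [f"offer-for-{user_id}"]

    def build_home_response(user_id):
        response = {"user_id": user_id, "offers": []}
        # Non-essential work sits behind a flag so it can be turned off
        # under load without a redeploy.
        if flag_enabled("personalised_offers"):
            response["offers"] = fetch_offers(user_id)
        return response

    if __name__ == "__main__":
        print(build_home_response(42))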


Reducing scope before reducing capacity

For every major user flow, we asked a slightly uncomfortable question:

“What is the absolute minimum version of this that still works?”

This led to things like:

  • Temporarily disabling non-essential validations
  • Removing optional API calls
  • Simplifying response payloads
  • Turning some flows read-only

None of this was permanent, but all of it reduced load and complexity during the spike.
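
To make “absolute minimum version” concrete, here is a hypothetical sketch of a flow shedding optional work behind a degraded-mode flag. The flow, fields, and the work being skipped are made up purely for illustration:

    # Hypothetical degraded-mode sketch: when the flag is on, the flow skips
    # optional validation and enrichment calls and serves a leaner payload.
    DEGRADED_MODE = True  # in reality this would come from the flag store

    def validate_address_strictly(cart):
        # Placeholder for a heavier validation we can live without during the spike.
        pass

    def checkout_summary(cart):
        summary = {"items": cart["items"], "total": sum(cart["items"].values())}
        if not DEGRADED_MODE:
            # Optional extras that add load but aren't needed to complete checkout.
            validate_address_strictly(cart)  # non-essential validation
            summary["recommendations"] = []  # would be an extra downstream API call
            summary["loyalty_points"] = 0    # would be an extra DB lookup
        return summary

    if __name__ == "__main__":
        print(checkout_summary({"items": {"sku-1": 499, "sku-2": 999}}))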


Logging that we could actually use

We spent time making sure logs were in a good state before worrying about dashboards and alerts.

That meant:

  • Enough log statements in the code to debug issues quickly
  • Log levels configured correctly so the relevant logs actually showed up in Kibana
  • Every developer on the team knowing how to find and filter those logs in Kibana

When traffic spikes, logs are usually the first place you look. If they’re noisy or incomplete, you lose precious time.
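
Our actual setup isn’t shown here, but as one way to make logs easy to work with in Kibana, structured JSON logs with a request ID go a long way. The field names and logger name below are assumptions:

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        """Emit one JSON object per line so Kibana/Elasticsearch can index the fields."""
        def format(self, record):
            payload = {
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                # A request ID attached via `extra=` lets us follow one user's
                # journey across services.
                "request_id": getattr(record, "request_id", None),
            }
            return json.dumps(payload)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # INFO in production; DEBUG only when needed

    logger.info("payment initiated", extra={"request_id": "req-123"})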


Alerts that pointed to real problems

We were careful about alerts, because alert fatigue is very real.

We set up alerts across:

  • APMs for latency and error rates
  • Sentry for exception spikes
  • Load balancers for 4xx / 5xx errors
  • Infrastructure metrics like CPU, memory, network usage
  • Basic service health checks

The rule we followed was: if an alert fires, it should indicate something actionable, not just “something changed”.
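
As a toy illustration of “actionable, not just changed”, an error-rate rule can require both a meaningful error percentage and enough traffic to matter before paging anyone. The thresholds below are made up:

    # Toy alert rule: fire only when the error rate is high in relative terms
    # and backed by enough traffic to matter. The thresholds are illustrative.
    def should_page(total_requests, errors_5xx, min_requests=500, max_error_rate=0.02):
        if total_requests < min_requests:
            # Too little traffic to be meaningful; don't wake anyone up.
            return False
        return (errors_5xx / total_requests) > max_error_rate

    if __name__ == "__main__":
        print(should_page(total_requests=10_000, errors_5xx=350))  # True: 3.5% 5xx
        print(should_page(total_requests=120, errors_5xx=30))      # False: too little traffic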


Dashboards for different audiences

We didn’t try to create a single mega dashboard.

Instead, we had:

  • Technical dashboards for engineers (latency, errors, saturation)
  • Product dashboards for business and product teams (funnels, drop-offs, conversions)

This helped avoid situations where engineering thought things were “fine” while the business was seeing problems, or vice versa.


On-call planning without heroics

We made sure on-call schedules were clearly defined, backup DRIs were identified, and escalation matrices were configured properly for critical systems. We even tested whether the escalations actually worked.


Making sure events existed where they mattered

We reviewed both frontend and backend flows to make sure important events were being emitted.

This helped us answer questions like:

  • Are users getting stuck at a particular step?
  • Is a failure happening before or after a key action?
  • Is this a system issue or a UX issue?

Events, when correlated with logs, become a powerful debugging tool.
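
A minimal sketch of what we mean by emitting events, with a shared request ID so they can be lined up against the logs. The event names and the transport are hypothetical; in production the events would go to an analytics pipeline rather than stdout:

    import json
    import time
    import uuid

    def emit_event(name, request_id, **properties):
        # In production this would go to an analytics pipeline (Kafka, Segment, etc.);
        # printing stands in for the transport here.
        event = {
            "event": name,
            "request_id": request_id,  # same ID that appears in the service logs
            "ts": time.time(),
            "properties": properties,
        }
        print(json.dumps(event))

    # Example: instrumenting the key steps of a payment flow.
    request_id = str(uuid.uuid4())
    emit_event("payment_started", request_id, method="upi")
    emit_event("payment_failed", request_id, reason="gateway_timeout")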


No open P0s

This one is straightforward. Any P0 bug was fixed before the ad. We didn’t carry risk knowingly into the event.


A real code freeze

We enforced a proper code freeze a week in advance.

This gave QA enough time to test, and it also gave us time to observe system behavior without constantly changing variables. It wasn’t popular, but it was necessary.


Talking to third-party providers

From the end user’s point of view, if something in our app doesn’t work, they lose trust in us, even when the failure is really a third party that hadn’t scaled up its systems. So we informed all major third-party partners about the expected spike, and asked them to scale up and be prepared for a surge of traffic from our end. The heads-up was well received by most of them.


Load testing beyond “reasonable” numbers

We load tested critical APIs at 10–15x our expected traffic.

The goal wasn’t to hit some theoretical max number. It was to find weak spots early: slow queries, lock contention, connection pool exhaustion, cache misses, and so on. At each load level we recorded latencies, CPU and memory consumption, and 4xx/5xx error rates. The exercise showed us the holes that would probably open up at scale, and we made it mandatory to close every observation, repeating the load tests as many times as needed before go-live.
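
Our actual harness isn’t shown here, but as one way to run this kind of test, a Locust script hitting the critical endpoints looks roughly like the sketch below. The host, paths, and numbers are placeholders:

    # Minimal Locust sketch for hammering critical endpoints.
    # The paths, weights, and user counts are placeholders, not our real setup.
    from locust import HttpUser, between, task

    class SpikeUser(HttpUser):
        wait_time = between(0.5, 2)  # seconds between requests per simulated user

        @task(3)
        def browse_catalog(self):
            self.client.get("/api/v1/catalog")

        @task(1)
        def place_order(self):
            self.client.post("/api/v1/orders", json={"sku": "sku-1", "qty": 1})

    # Run with something like:
    #   locust -f loadtest.py --host https://staging.example.com --users 5000 --spawn-rate 200
    # and record latency, CPU/memory, and 4xx/5xx rates at each load level.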


Access checks for people, not just systems

We ran a mock drill of scale-related incidents and found that some engineers didn’t have access to the tools they needed (Sentry, Kibana, Opsgenie, etc.).

For critical services, we explicitly identified backup owners and verified their permissions too. This sounds mundane, but it has saved us more than once.


Fire drills and runbooks

We wrote down what we would do if specific things went wrong.

Not in abstract terms, but as concrete steps:

  • What happens if a service goes down?
  • Who do we contact?
  • What can we disable?
  • What can we recover manually?

Having this written down reduced decision-making during the event itself.
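
As a hypothetical example (every scenario, contact, and step below is a placeholder), a runbook entry kept as structured data is easy to render in a wiki and easy to keep honest:

    # Hypothetical runbook entry; names and steps are placeholders.
    RUNBOOK_ENTRY = {
        "scenario": "Payments service is down or timing out",
        "detect": "Sentry exception spike or p95 latency alert from the APM",
        "first_steps": [
            "Turn off non-essential feature flags to shed optional load",
            "Check the payment gateway's status page",
            "Fail over to the secondary gateway if the primary is degraded",
        ],
        "escalate_to": ["payments on-call", "backup DRI", "engineering manager"],
        "manual_recovery": "Re-drive failed orders from the retry queue after the spike",
    }

    if __name__ == "__main__":
        for step in RUNBOOK_ENTRY["first_steps"]:
            print("-", step)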


Security cleanup before scale

We also used this time to clean up security-related issues:

  • Deprecated unused APIs
  • Tightened security groups
  • Reviewed exposed endpoints

Traffic spikes tend to attract unwanted attention as well.


Making sure infra could scale on its own

Finally, we validated auto-scaling policies across services.

During load testing we found that some of the newly deployed services didn’t have an auto-scaling policy enabled at all, so we enabled it then and there. Beyond catching those gaps, this step worked as a checklist to make sure every policy was not just enabled but also configured properly to handle the expected scale.

We didn’t want anyone manually increasing capacity while the event was live. Scaling needed to happen automatically and predictably.
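
As one way to automate that verification (assuming AWS Auto Scaling groups; other platforms have equivalents), a small boto3 script can flag groups with no policy attached or no headroom. The region and the checks are illustrative:

    # Illustrative boto3 sketch, assuming AWS Auto Scaling groups.
    # Flags groups that have no scaling policy attached or no headroom
    # between desired and max capacity.
    import boto3

    def audit_auto_scaling(region="ap-south-1"):
        client = boto3.client("autoscaling", region_name=region)
        groups = client.describe_auto_scaling_groups()["AutoScalingGroups"]
        for group in groups:
            name = group["AutoScalingGroupName"]
            policies = client.describe_policies(AutoScalingGroupName=name)["ScalingPolicies"]
            if not policies:
                print(f"[WARN] {name}: no scaling policy attached")
            if group["MaxSize"] <= group["DesiredCapacity"]:
                print(f"[WARN] {name}: no headroom between desired and max capacity")

    if __name__ == "__main__":
        audit_auto_scaling()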


How it played out

The entire engineering team was on a Google Meet call, monitoring the dashboards for their respective services, and the same engineers were active in a Slack thread to respond to any scale-related incidents. When the ad finally aired, traffic spiked exactly as expected. A few minor issues were raised, but nothing that caused panic.

Being paranoid about preparation helped!


What this taught us

The biggest takeaway for me was that scale events are rarely about last-minute fixes. They’re about preparation, discipline, and being honest about where your system is fragile. We negotiated with management for three weeks of dedicated preparation time, and every team had to book slots with the DevOps and QA teams to repeat their load tests until all the issues were fixed.

If you know you’re heading into a high-traffic moment, start early. Most problems are obvious when you give yourself enough time to look for them.