18th May 2020 Downtime Postmortem – WazirX

What happened today?

Today, at approximately 6 AM IST, our infrastructure experienced a degradation. Most infrastructure degradations self-correct without affecting our services. When the infrastructure cannot self-correct, we have alert systems in place to notify our team. Unfortunately, this particular scenario was not covered by our alert systems, so our team wasn’t alerted early.

The solution to this kind of infrastructure problem is simple: we spin up new instances and gracefully move the work from the degraded infrastructure to the new infrastructure. This usually completes within a few minutes, and our users do not experience any issues while the switchover is in progress.

Because there was a delay in identifying the issue, a separate piece of infrastructure that holds all pending orders filled up quickly as users kept sending in requests. By the time we fixed our infrastructure, the number of pending orders had grown large. We had to wait for all of these pending orders to be processed before we could allow users to start placing new orders. This wait accounted for most of the downtime. There’s not much we can do in such cases except wait for the pending orders to be processed.
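To illustrate why a late detection translates into a long wait, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which orders pile up at a steady rate while the problem goes unnoticed and are then processed at a steady rate once new orders are paused. The function names and every number below are hypothetical and chosen only for illustration; they are not WazirX's actual figures.

```python
def backlog_after_delay(arrival_rate: float, delay_seconds: float) -> float:
    """Pending orders accumulated while the degradation goes unnoticed.

    Simplified model: orders arrive at a constant rate and none are processed.
    """
    if arrival_rate < 0 or delay_seconds < 0:
        raise ValueError("arrival rate and delay must be non-negative")
    return arrival_rate * delay_seconds


def estimate_drain_seconds(pending_orders: float, orders_per_second: float) -> float:
    """Time to clear the backlog when no new orders are being accepted."""
    if orders_per_second <= 0:
        raise ValueError("processing rate must be positive")
    return pending_orders / orders_per_second


# Hypothetical numbers, for illustration only.
pending = backlog_after_delay(arrival_rate=50, delay_seconds=1800)
wait = estimate_drain_seconds(pending, orders_per_second=100)
print(f"{pending:.0f} pending orders, about {wait / 60:.0f} minutes to drain")
```

In this model the wait grows in direct proportion to the detection delay, which is why earlier alerting is the main lever for shortening this kind of downtime.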

What will we do in the future to avoid this?

We’ve identified the scenario that led to this. While not every infrastructure degradation can be avoided, we can ensure our team is alerted immediately so that a degradation does not turn into downtime like this. We’ll make sure we’re notified early if this scenario occurs again. We’ll also put more effort into identifying any other scenarios that may have escaped our alert systems.
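As a minimal sketch of what such a check could look like, the example below raises an alert when a degradation lasts longer than a grace period (giving the infrastructure a chance to self-correct) or when the pending order backlog crosses a ceiling. The class name, thresholds, and alert messages are all assumptions made for illustration; this is not a description of our actual alert systems.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DegradationWatch:
    grace_seconds: float        # hypothetical window allowed for self-correction
    max_pending_orders: int     # hypothetical backlog ceiling
    degraded_since: Optional[float] = None

    def check(self, now: float, healthy: bool, pending_orders: int) -> List[str]:
        """Return the alerts that should fire for this health sample."""
        alerts: List[str] = []
        if healthy:
            # Infrastructure recovered on its own; reset the timer.
            self.degraded_since = None
        else:
            if self.degraded_since is None:
                self.degraded_since = now
            # Alert only if the degradation has outlasted the grace period.
            if now - self.degraded_since >= self.grace_seconds:
                alerts.append("infrastructure degraded beyond grace period")
        # The backlog check is independent of the health signal, so it can
        # catch scenarios the health check does not cover.
        if pending_orders > self.max_pending_orders:
            alerts.append("pending order backlog above threshold")
        return alerts
```

Checking the backlog separately from the health signal is a deliberate choice in this sketch: even if a new kind of degradation slips past the health check, a growing number of pending orders would still notify the team.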

On behalf of my team, I apologize for the inconvenience caused to you. We’ll work harder on making WazirX more robust with each passing day.

Jai Hind!

Nischal Shetty
Founder & CEO, WazirX
