
Remaining up and running during AWS outages

Scott Hill on Oct 21, 2025

At PubNub, we know that transparency is key. Our customers want to see us as an extension of their ops team. So, our philosophy is to “over-communicate” wherever possible, including on our status page. We do this whenever latencies increase or when an intermittent increase in error rates requires retries, even if we remain within our SLA commitments. Yesterday, we put this into practice.
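
To make the "retries" point concrete, here is a minimal sketch of retrying a flaky call with exponential backoff and jitter. This is purely illustrative: the function name, parameters, and defaults are my own assumptions, not PubNub SDK behavior.

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a transiently failing call with exponential backoff plus jitter.

    Illustrative sketch only; names and defaults are assumptions, not the
    behavior of any PubNub SDK.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            # Give up only once the attempt budget is exhausted.
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt, with random jitter so that
            # many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

The jitter matters during an incident like this one: without it, thousands of clients retrying on the same schedule can create synchronized load spikes against a recovering region.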

PubNub Successfully Navigated A Huge AWS Incident

Throughout the massive AWS incident in US-East-1, PubNub stayed up and running. We re-routed traffic away from US-East-1 as the region failed, and as the day continued we added capacity in other regions to meet demand. As US-East-1 recovered, we moved regional traffic back to that data center.
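
The shape of that failover logic can be sketched as health-check-driven region selection: route to the highest-priority healthy region, shift down the list when checks fail, and shift back when they recover. The region names and functions below are hypothetical and greatly simplified; they are not PubNub's actual topology or routing implementation.

```python
# Hypothetical priority-ordered region list, for illustration only.
REGION_PRIORITY = ["us-east-1", "us-west-2", "eu-central-1"]

def healthy_regions(health_checks):
    """Return the regions whose health check currently passes.

    `health_checks` maps a region name to a callable that returns True
    while the region is serving traffic within its error budget.
    """
    return [r for r in REGION_PRIORITY if health_checks[r]()]

def pick_region(health_checks):
    """Route to the highest-priority healthy region.

    When a region starts failing its checks, traffic shifts to the next
    region on the list; when it recovers, traffic shifts back.
    """
    for region in healthy_regions(health_checks):
        return region
    raise RuntimeError("no healthy regions available")
```

In a real deployment this decision is usually made by DNS failover or a global load balancer rather than application code, but the priority-plus-health-check structure is the same.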

Throughout that period, we took a very transparent approach to communication. Once our systems detected the outage, we began routing traffic to other regions, updated our status page, and shared that warning with customers. And even after we re-routed traffic globally, we kept our service marked as “degraded” simply because, well, the added physical distance to other regions was going to increase latencies. The good news for our customers is that our service remained running throughout the incident, and the overall error rate during the data center transitions was very low. See the charts below from our internal Grafana metrics:

This chart shows the impact of the outage, as represented by the yellow line (errors). Can't see it? Let's zoom in.

Here, I've circled it.

The question then becomes: when do we give our customers the all-clear? Of course, there’s an argument to be made that if no subsequent issues are occurring, we should remove any status alerts. But for us, given the uncertain updates coming from our sources within AWS, it felt irresponsible to mark our status as “recovered” in US-East-1 while we remained in a heightened state of alert and readiness.

Ultimately, we knew that many of our customers were fighting their own fires related to the AWS incident, and for us, it was a no-brainer to communicate clearly that we were not one of those fires for them to worry about.

What Did We Learn?

We’ve added a number of new services to our real-time platform over the past couple of years, and this was the first big AWS incident for many of them: a great opportunity to look for ways to improve our runbooks for moments like this. We did find a number of small issues in the runbooks for the newer services, giving us an opportunity to improve some of our internal practices.

That's part of how you become a battle-tested service: by experiencing tough battles. I'm happy to report that none of these gaps were critical, but we can certainly be better next time something happens with the improvements we are making based on the experience we gained yesterday.

There were a lot of learnings from this incident: we run drills around these kinds of events to ensure that we are prepared when one happens, but there's nothing like the ‘thrill’ of experiencing the real thing, at midnight.

Transparency as a Philosophy

Transparency is key when partnering with a mission-critical infrastructure provider. Our global operations and support teams take this philosophy seriously, and we are proud that we maintained our SLAs, both for operations and for support response times. We encourage people to review our status page. We don’t just report incidents; we alert on latency increases and show real-time latency graphs across many services. Further, we offer our customers real-time dashboards that they can embed directly into their own operational dashboards, powered by the same data we use internally to operate PubNub.

As the VP of Engineering, what I look for when I evaluate vendors is not whether they have incidents, but how they react and communicate when incidents happen. I try to ensure that we behave the way I want my vendors to act. If you are evaluating a vendor and see that they report no incidents on their status page (or have no status page at all!), don't walk, run.

Going Forward

I’m proud to say that we did what we do best: we allowed our customers to focus on other things, trusting that we kept their real-time applications running smoothly. We strive to always do that, and to communicate as you'd expect from your own internal teams and close partners.

Feel free to reach out to our world-class support team if you have any feedback or questions stemming from yesterday's incident: support@pubnub.com.