There are so many outages in us-east-1. I've heard it's because that's where they roll out maintenance first, or something along those lines. Just look at this list of outages on Wikipedia [1] and scan for us-east-1 or Northern Virginia (they refer to the same place).
* It's the largest region (ever had an unexpected scaling bug?).
* It has more legacy stuff lying around. For example, old regions have EC2 Classic, while new regions are VPC only.
* There are more customers there. More whales, more use cases.
Most AWS teams explicitly try not to deploy to us-east-1 first, but because us-east-1 is so different on so many dimensions, it is more likely to have issues that don't manifest elsewhere.
> I've heard it's because that's where they roll out maintenance first
That doesn't make sense - why would they do maintenance in their largest (and oldest) region first? I'd expect them to roll out changes to smaller regions first, so problems affect fewer users.
I think the more likely explanation is that it's their largest (and oldest) region.
An AWS TAM once told me the same thing: us-east-1a gets the new stuff first. I never validated it against anything other than that one person's statement.
> Between 9:21 AM and 2:36 PM PDT we experienced increased query failures and latency in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.
> The issue with the Data Catalog APIs started with a software update in the US-EAST-1 Region that completed at 9:21 AM PDT. The software update was immediately rolled back[...]
Thankfully the Redshift outage only hit the APIs, not existing clusters. Our cluster was fine today, but external schemas that rely on Glue/Athena did time out.
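For anyone wondering why a Glue incident shows up as Redshift timeouts: external schemas created FROM DATA CATALOG resolve their table metadata through Glue Data Catalog API calls, so those queries fail even when the cluster itself is healthy. Here's a minimal sketch of probing that dependency with boto3 during an incident; the database name "analytics" and the timeout values are just placeholders, not anything from our setup.

```python
import boto3
from botocore.config import Config

# Short timeouts and no retries so a Glue API incident surfaces quickly
# instead of hanging the way a Spectrum query would.
glue = boto3.client(
    "glue",
    region_name="us-east-1",
    config=Config(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}),
)

try:
    # Redshift Spectrum external schemas resolve tables through calls like
    # this one; if it times out, queries against the external schema will too.
    # "analytics" is a hypothetical Glue database name.
    resp = glue.get_tables(DatabaseName="analytics")
    print(f"Glue Data Catalog reachable: {len(resp['TableList'])} tables")
except Exception as exc:
    print(f"Glue Data Catalog call failed: {exc}")
```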