A couple of points: Autoscaling is your friend. If you're not leveraging it (mul...

spullara · on Sept 25, 2014

They are not rotating capacity for updates. They are patching a Xen security issue that will be announced on Oct 1. That is why they are rebooting machines and not forcing moves off of those machines. Otherwise, I agree with the advice.

coolbeans01 · on Sept 25, 2014

Could you please confirm or provide evidence for such speculation?

I don't see anything about this on Google.

I heard it is because they are having power issues within their datacenter.

jabo · on Sept 25, 2014

Maybe this: http://xenbits.xen.org/xsa/

XSA-108 2014-10-01 12:00 (Prereleased, but embargoed)

coolbeans01 · on Sept 25, 2014

Cool speculation!

Anyone with anything concrete?

I still heard it is power issues.

reedloden · on Sept 25, 2014

Not sure how power issues would affect every single region. Logic dictates it's likely a security issue.

jsmthrowaway · on Sept 25, 2014

I'd just like to point out that you asked for concrete, then speculated based on something you heard.

aioprisan · on Sept 25, 2014

Aren't you speculating as well?

mdellabitta · on Sept 25, 2014

Restarting an instance (not stop-start) doesn't change the hardware you're on, so I don't think this has a physical explanation.

tehmasp · on Sept 25, 2014

All of our instances scheduled for maintenance are indeed system-reboot event types; it is indeed a stop/start situation.

A stop/start can possibly put you on another physical host, but it all depends on how AWS has set up the hypervisors and their instance schedulers.

This maintenance from AWS appears to be security related, but that doesn't mean AWS isn't also getting folks off of old hardware if they have that desire.
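For anyone wanting to check which event types their own instances were given, here is a minimal sketch that groups scheduled events by event code. It assumes input shaped like an EC2 `DescribeInstanceStatus` response; the helper name `group_events_by_code` and the sample instance IDs are hypothetical, and the boto3 call in the comment is only where such data would typically come from.

```python
from collections import defaultdict


def group_events_by_code(status_response):
    """Map each scheduled-event code to the instance IDs that carry it."""
    by_code = defaultdict(list)
    for status in status_response.get("InstanceStatuses", []):
        # Instances with no scheduled events may have no "Events" key at all.
        for event in status.get("Events", []):
            by_code[event["Code"]].append(status["InstanceId"])
    return dict(by_code)


# Hypothetical sample data (made-up instance IDs). With boto3 the real
# response would come from something like:
#   boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)
sample_response = {
    "InstanceStatuses": [
        {"InstanceId": "i-aaaa1111", "Events": [{"Code": "system-reboot"}]},
        {"InstanceId": "i-bbbb2222", "Events": [{"Code": "instance-reboot"}]},
        {"InstanceId": "i-cccc3333", "Events": [{"Code": "system-reboot"}]},
        {"InstanceId": "i-dddd4444"},  # nothing scheduled
    ]
}

grouped = group_events_by_code(sample_response)
for code, instance_ids in sorted(grouped.items()):
    print(code, instance_ids)
```

A `system-reboot` code means the underlying host is being rebooted, while `instance-reboot` only restarts the guest, so splitting the list this way shows at a glance which kind of maintenance you were handed.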

mdellabitta · on Sept 25, 2014

I think it's more complicated than that, because Amazon is claiming that if you let them handle it, you keep your instance data. That's not a stop-start (at least not the kind we ordinary users can do).

It's more like a system restart with a little downtime managed by them.

You can try a stop-start yourself, but it's not guaranteed to help. And restarting it yourself doesn't do anything.

count · on Sept 25, 2014

within their datacenter? Wtf? Do you have any idea how big AWS is?

Xorlev · on Sept 25, 2014

This helps services, but at some point you have to run a database layer too ;) Cassandra helps, but it's not the whole story during large-scale, closely spaced reboots like these.

It ends up being a decent bit of manual operator time spent when security patches force AWS to reboot. You have to be pretty careful about any service that can't lose members as quickly due to bootstrap times or technology limitations, e.g. RDBMSes.

For things running in an ASG, it's trivial to let it die or just kill it.

pas256 · on Sept 25, 2014

Each AZ looks to me like it is a day apart. Surely that is enough time.
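Whether a day is enough depends on your bootstrap times, so it may be worth measuring the actual spacing for your own fleet. A rough sketch, assuming you have already collected each instance's zone and reboot window (for example from the `NotBefore` and `NotAfter` fields of its scheduled event); the function name, zone names, and times below are all made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import combinations


def min_gap_between_zones(windows):
    """Return the smallest gap between reboot windows in different zones.

    windows: list of (zone, start, end) tuples with datetime start/end.
    Overlapping windows in different zones count as a gap of zero.
    Returns None when fewer than two zones are present.
    """
    gaps = []
    for (zone_a, start_a, end_a), (zone_b, start_b, end_b) in combinations(windows, 2):
        if zone_a == zone_b:
            continue
        # Distance between two intervals; negative means they overlap.
        gap = max(start_a, start_b) - min(end_a, end_b)
        gaps.append(max(gap, timedelta(0)))
    return min(gaps) if gaps else None


# Hypothetical schedule: one four-hour window per zone, a day apart.
sample_windows = [
    ("us-east-1a", datetime(2014, 9, 26, 2, 0), datetime(2014, 9, 26, 6, 0)),
    ("us-east-1b", datetime(2014, 9, 27, 2, 0), datetime(2014, 9, 27, 6, 0)),
    ("us-east-1c", datetime(2014, 9, 28, 2, 0), datetime(2014, 9, 28, 6, 0)),
]

smallest_gap = min_gap_between_zones(sample_windows)
print("Smallest gap between zones:", smallest_gap)
```

If the smallest gap is shorter than the time a replacement node needs to bootstrap and rejoin, that is the case where manual operator attention is still needed.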

kennu · on Sept 25, 2014

Autoscaling is your friend, but you can also use "auto-healing" if your stack is built on Amazon's AWS OpsWorks and you just want to keep a single instance alive. It will automatically spawn a replacement instance and reattach and mount any EBS volumes.

eropple · on Sept 25, 2014

My understanding is that that doesn't work correctly in the case of AZ failure; the EBS data isn't duped to another AZ and so your instance will fail to come up. So it's not really a solution.

kennu · on Sept 26, 2014

Quite possible. But it depends on what level of disaster you want to protect yourself from. Single EC2 instance termination is much more common than an entire availability zone going down. I'd say OpsWorks auto-healing is better than just running a standalone EC2 instance, and configuring a full auto-scaling setup with your own custom AMIs and boot scripts is even better, but also much more work.

eropple · on Sept 29, 2014

Fair enough. Not saying it's not usable for some stuff, by any means. (Though OpsWorks in general leaves me feeling a little itchy.)

madaxe_again · on Sept 25, 2014

Yes, isn't it wonderful that you have to completely remake your autoscaling groups every time you add/remove/modify an ELB? Thank Cthulhu (or Opscode...) for Chef.