The perils of running servers at Amazon EC2

Last week, some of our developers complained that they could not access our server that we have hosted at Amazon Web Services (AWS) / Elastic Cloud Computing (EC2) environment intermittently every few minutes, but I was having no problems.

The next day, I was now unable to get to the server.
HTTP, HTTPS, SSH all failed.

Cloudwatch showed all metrics as “flat” for the few minutes after I stopped getting a response.

I would like to note that we have been using this server for approximately 6 months without any problems whatsoever. We have a very standard procedure for accessing the server, and we had tested many things before reporting the problem to Amazon. We were sure that we were not mis-typing the SSH host name, and were using the correct instance ID.

Additionally, I saw lots of other posts in the Amazon EC2 support discussion forum which were similar to this post about suddenly not being able to access their server that same morning and/or the night before.

The end result was that Amazon notified us that quote: “There does appear to be a problem with the underlying hardware.” and “I would relaunch the instance using your latest AMI. Since this instance is not EBS, what ever changes you made may be lost.”

Here are some hard lessons that I learned from the experience.

A) As redundant and high availability oriented we thought Amazon was, IT IS NOT.

B) Amazon’s support in such incidents is actually just a discussion forum for the various services that it offers. In this case, it is the EC2 discussion forum, where you are lumped in with thousands of other people running potentially hundreds of other AMIs that contain unlimited number of “popular software” on several popular operating systems. In other words, there is alot of traffic on there because of people running Windows, Oracle, WordPress, Linux (all the flavors you can think of), etc. etc. etc. The traffic is so high that your chances of getting responded to depend on the Amazon staff manning the forum, and whether your needle doesn’t get lost in the high traffic hay stack.

C) Amazon does offer phone support, but this is only available from the Gold and Platinum support levels. Gold costs $400/month, so unless you are enterprise, you can forget it. Bronze and Silver cost $49 and $100/month respectively.

Full details: http://aws.amazon.com/premiumsupport/

I was really surprised that there was nothing for the average joe running a couple of servers. It costs me more a month to run at Amazon than it does my dedicated server hosting company, yet at least I can get someone on the phone at the dedicated server hosting company.

D) Amazon’s attitude in their response and the response I’ve seen many times since having this issue is that you should never rely on an instance whatsoever. Count on it going down at any time. Instead, have load balancers and custom built, EBS boot AMIs ready to go at any time.

This is a difficult philosophy to totally grasp, but one that must be grasp if you are moving to the cloud.

E) We actually had an EBS volume attached where our /var/www and MySQL resides. We had previously tested disaster recovery in the event the instance went away and had successfully detached, spun up a new instance, and re-attached the volume and ran a script to let the server know where the /var/www and MySQL data was.

This is a good practice for the data, but it should also be the case for the server.

F) Instance store is no way to go, unless you are literally spinning up a temporary test server to run for a few minutes or a few hours, or you are spinning up an instance for the CPU cycles.

G) EBS Boot instance should really be the rule of thumb, and generate a master AMI from it. If you make changes (periodic operating system updates, etc.) – also periodically re-bundle your AMI.

H) Make sure your data is also stored on a separate EBS volume, and make automatic nightly snapshots of that EBS volume. Think of EBS snapshotting in the same way you think of incremental backups in the tradition IT infrastructure sense.

By having your data on a separate EBS volume, you don’t necessarily have to re-bundle your AMI every night.

Someone out in the websphere correct me if I’m wrong, but it seems to me that snapshotting a data EBS volume every night is easier than re-bundling your instance every night in the case that you store your data on the same EBS volume as your instance operating system and software.

Thoughts and feedback appreciated.

One Response to “The perils of running servers at Amazon EC2”

Leave a Reply

Consulting

I'm currently available
for Lotus Notes / Domino consulting engagements.

LinkedIn

Connect with me:

LinkedIn

Advertisement
Advertisement
Categories