Postmortem of Oct. 5, 2011 downtime

Posted by nicole at October 5th, 2011

As most of you are aware, Lighthouse had server issues early this morning (or afternoon, depending on your timezone).

The application master on our Amazon cloud instance locked up around 3:30am PST (the root cause is still being investigated). Our hosting provider's stack then failed to promote one of the other application servers to master. We had our hosting provider's developers come in to manually remove the instance and add a new one, but the resulting flood of reloads and access attempts brought Lighthouse to its knees. Once Lighthouse was back up, we started seeing errors from various processes that either didn't restart as expected or needed time to spin up.

We have escalated this issue all the way to the executive level at our hosting provider, because a cloud app solution that doesn't automatically remove a failing server and promote an existing server to take its place isn't much of a cloud app solution. The stability of our application stack is very important to us, and we're currently evaluating what steps we can take to ensure this doesn't happen again. We are in active (and frequent) communication with our hosting provider about this today, and we will do everything we can to ensure the stability of Lighthouse, including changing hosting providers if necessary.
