Let's start off by saying that I woke up this morning with a case of the Mondays.

Our servers decided to hop on board that boat as well. Go figure.
A rundown of events which occurred on Monday, Mar. 30, 2009:
At 9:15 a.m. PDT, both our Lighthouse and Tender services went down. We immediately contacted our host, Engine Yard, who was aware of the problem.
A number of other services were also affected, including GitHub, Pivotal Tracker and Engine Yard's own website.
Engine Yard support contacted their West Coast data center, Herakles, which was reporting upstream issues with their connectivity.
At 11:56 a.m. the issue was escalated to Cisco engineers, who came in to assist with the matter.
At 12:22 p.m. I noticed 42 new gray hairs and was filing an insurance claim for premature joint disease due to clicking refresh in my browser 36,000 times while restlessly waiting for updates from the data center.
At 2:49 p.m. service had been partially restored. Both Lighthouse and Tender were up and running but continued to experience intermittent connectivity issues while engineers at the data center were finalizing the repairs.
At 3:45 p.m. Engine Yard gave the final green light on the stability of service.
The initial report states that the outage was due to a failed hardware card in a pair of redundant access switches.
While service is stable at the moment, there will be a scheduled maintenance window to bring the system's redundancy back online and ensure the integrity of the repairs. We'll post an announcement if there is going to be any scheduled downtime.
We sincerely apologize for the inconvenience. When our services are down it affects not only your work, but our own client work as well.
For future reference, you can follow the Lighthouse Status blog, http://lhstatus.com, as well as the @lighthouseapp and @tenderapp Twitter accounts for status updates and announcements.
We would like to thank Engine Yard for the prompt response times and frequent status updates while the situation was being handled. They have been an amazing supporter of the community and provided great service for us since we launched Lighthouse two years ago. If today was stressful on us, we can only imagine how it was on their end.
We'll be present for the weekly Engine Yard conference call in hopes of discussing measures that will be taken to prevent such an outage from occurring in the future. Monday wasn't fun for anyone.


2 Comments
Guys. Let's hug.
Yesterday was the sort of day that could happen to any business at any time. There will always be unknown unknowns…
The thing that matters is how well you reacted and the fact that you post explanations and public apologies like this.
I admit I was tearing my hair out at the time, and we had to pretty much down tools during a busy period while our work flow was eaten by no access to Lighthouse. But if I reflect further still, I can remember near countless occasions that we have been praising Lighthouse and the way it’s enabled us to work.
All said, please don’t let that happen again.
x
Update: Engine Yard has posted a formal statement on their blog about yesterday’s events: http://blog.engineyard.com/2009/03/31/march-31st-outage