Postmortem of Tender issues

Posted by rick at October 27th, 2009

As most of you are well aware, we’ve spent the last few days troubleshooting slow response times on Tender Support.

Current status

As of 4:30 p.m. PDT, the brunt of the problem appears to be behind us. We’re keeping a close watch on things for when traffic picks up on Tuesday morning, and we’re still going through the data to double-check that nothing was lost. So far, all data has been kept intact and delivered.

We would also like to formally apologize for the severity of the issue. We’re just as dependent on our support system as our users are, and we felt their pain every step of the way.

For those of you curious about the specifics of the server issues, as well as the steps we are taking to prevent them from recurring, please continue reading…

The long story

When issues arose on the Tender Support server late Thursday night, we had the initial symptoms correct, but not the root cause. We thought the backlogged emails were creating the huge load when in fact it was likely due to the European users waking up and starting their day. Increased hits on Tender would lead to increased load on the servers, right?

The weekend was spent poring over logs, identifying slow requests and optimizing them. This work actually began over a week ago, so you may notice a few minor features here and there (spoilers!).

The hardware

One thing that seemed odd was that server resource usage was not spiking. Memory usage was fine, and CPU usage was completely normal compared to previous weeks.

img

As you can see, CPU usage actually went down!

Tender Traffic as a whole

Google Analytics showed no signs of a spike in traffic. Current Tender Support sites were pulling in an anticipated number of hits.

img

Resources and setup

On Sunday night, Engine Yard bumped up our resources and added the HAProxy load balancer.

HAProxy gave us a window into the queueing situation. We could now see which application servers were busy and how many requests were waiting in the queue.
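For readers who want the same view outside the stats page, here is a minimal sketch that reads HAProxy's CSV stats export and reports active sessions and queued requests per server. The column names (pxname, svname, qcur, scur) are the ones HAProxy uses in its CSV header. The backend name, server names and numbers in the sample are made up for illustration and are not our real data.

```ruby
# Parse an HAProxy CSV stats export into an array of hashes, one per row.
# The first line is a header that starts with "# " followed by column names.
def parse_haproxy_stats(csv_text)
  lines = csv_text.lines.map(&:strip).reject(&:empty?)
  header = lines.shift.sub(/\A#\s*/, "").split(",")
  lines.map { |line| header.zip(line.split(",", -1)).to_h }
end

# Keep only the individual servers (skip the FRONTEND/BACKEND summary rows)
# and pull out current sessions (scur) and current queue depth (qcur).
def queue_report(rows)
  rows
    .reject { |row| %w[FRONTEND BACKEND].include?(row["svname"]) }
    .map do |row|
      { server: row["svname"], active: row["scur"].to_i, queued: row["qcur"].to_i }
    end
end

# Hypothetical sample with a shortened header; a real export has many more columns.
HAPROXY_SAMPLE = <<~CSV
  # pxname,svname,qcur,qmax,scur,smax
  tender,FRONTEND,,,12,40
  tender,app1,3,9,1,1
  tender,app2,0,4,1,1
  tender,app3,0,2,0,1
  tender,BACKEND,5,14,2,3
CSV

queue_report(parse_haproxy_stats(HAPROXY_SAMPLE)).each do |s|
  puts format("%-6s active=%d queued=%d", s[:server], s[:active], s[:queued])
end
```

A server that is busy and has requests queued behind it is the pattern to look for: it means requests are waiting on a process that is stuck on something slow.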

Monday on Tender Support

As Monday morning rolled around, things started to get nasty. In response, Engine Yard support technicians jumped in to help out.

One of the suggestions we received was to test out Mongrel as a Ruby application server instead of Thin. Though the switch itself didn’t settle anything, it did lead to the first big discovery that made a difference in our server issues.

HAProxy was showing all of our application servers maxed out. We noticed them hanging on to requests for several seconds, which led to more requests getting backed up. As soon as the requests were completed, we downloaded the logs looking for the long-running requests, only to come up empty-handed.

Since one of the servers was now using Mongrel, we installed the mongrel_proctitle plugin. It puts the current request into each process title, so we could watch from the top command what the processes were doing while they hung:

mongrel_rails [5000]: handling GET help.tenderapp.com/login
mongrel_rails [5001]: handling GET customer.tenderapp.com/support.php
mongrel_rails [5002]: handling GET customer.tenderapp.com/posts.rss
mongrel_rails [5003]: \m/ (no request)

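As a result, we started to see random URLs from other support systems. These looked like requests aimed at the forums or support software that our customers ran before they moved to Tender Support. Since those URLs don’t exist in Tender, they ended up in our 404 handling, which turned out to be the slow part.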
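Watching top by eye works, but the same process titles can also be tallied. The sketch below assumes the title format shown in the sample output above and treats anything that doesn't match as idle. The helper names are hypothetical, and the bracketed number is kept as an opaque id.

```ruby
# Matches titles such as:
#   mongrel_rails [5001]: handling GET customer.tenderapp.com/support.php
PROCTITLE_PATTERN = /mongrel_rails \[(\d+)\]: handling (\S+) (\S+)/

# Return one hash per busy process; idle processes are skipped.
def busy_requests(lines)
  lines.each_with_object([]) do |line, busy|
    match = PROCTITLE_PATTERN.match(line)
    next unless match
    host, path = match[3].split("/", 2)
    busy << { id: match[1].to_i, verb: match[2], host: host, path: "/#{path}" }
  end
end

# Count how many processes are tied up on each path.
def count_by_path(requests)
  requests.each_with_object(Hash.new(0)) { |req, counts| counts[req[:path]] += 1 }
end

# The sample titles from above, quoted literally.
PROCTITLE_SAMPLE = <<~'TITLES'.lines.map(&:chomp)
  mongrel_rails [5000]: handling GET help.tenderapp.com/login
  mongrel_rails [5001]: handling GET customer.tenderapp.com/support.php
  mongrel_rails [5002]: handling GET customer.tenderapp.com/posts.rss
  mongrel_rails [5003]: \m/ (no request)
TITLES

count_by_path(busy_requests(PROCTITLE_SAMPLE)).each do |path, count|
  puts "#{count} #{path}"
end
```

Sampled repeatedly while the queue is backed up, a tally like this makes paths the application doesn't serve (such as /support.php) stand out quickly.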

After deploying an update to optimize the 404 handling, the queues have stayed empty and response times have returned to normal.

img
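We aren't going into the details of the 404 update here, so treat the following as one hypothetical way to make such requests cheap rather than a description of what we deployed. It is a Rack-style middleware that answers obviously foreign URLs before they reach the application. The class name and the list of file extensions are assumptions made up for this example.

```ruby
# Answer requests for URLs the app never serves with a cheap, static 404.
class CheapNotFound
  # Hypothetical rule: extensions typical of other forum/support software.
  LEGACY_PATTERN = /\.(php|asp|aspx|cgi)\z/i

  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"].to_s =~ LEGACY_PATTERN
      body = "Not Found"
      headers = { "Content-Type" => "text/plain", "Content-Length" => body.bytesize.to_s }
      [404, headers, [body]]
    else
      # Everything else goes through the normal application stack.
      @app.call(env)
    end
  end
end

# Stand-in for the real application.
demo_app = ->(env) { [200, { "Content-Type" => "text/plain" }, ["ok"]] }
demo_stack = CheapNotFound.new(demo_app)

puts demo_stack.call("PATH_INFO" => "/support.php").first
puts demo_stack.call("PATH_INFO" => "/login").first
```

The design choice is to reject early: the cheap path never touches the application, so a burst of stray requests no longer ties up a process for seconds at a time.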

This particular problem was more difficult to track down because the server logs neglect to show the benchmarked request time for exceptions.

Also, since a lot of our traffic comes from public sources, we tend to get a lot more of these requests than with private apps like Lighthouse.
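The logging gap mentioned above, where requests that end in an exception never get a timing entry, is the kind of thing a small wrapper can close. This is a minimal sketch, not our production code: a Rack-style middleware that records the elapsed time in an ensure block so the entry is written even when the application raises. The class name and log format are made up for this example.

```ruby
# Log the elapsed time of every request, including ones that raise.
class TimedRequests
  def initialize(app, log)
    @app = app
    @log = log # anything that responds to <<, e.g. an Array or an IO
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    outcome = "error" # overwritten below if the app returns normally
    status, headers, body = @app.call(env)
    outcome = status.to_s
    [status, headers, body]
  ensure
    # Runs on both the normal and the exceptional path; the exception,
    # if any, keeps propagating after the line is logged.
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    @log << "#{env['REQUEST_METHOD']} #{env['PATH_INFO']} #{outcome} #{elapsed_ms}ms"
  end
end

# Usage with a stand-in app that always fails.
demo_log = []
failing_app = ->(env) { raise "boom" }
begin
  TimedRequests.new(failing_app, demo_log).call("REQUEST_METHOD" => "GET", "PATH_INFO" => "/support.php")
rescue RuntimeError
  # The failure still surfaces; the point is that it was timed first.
end
puts demo_log
```

With an entry like this for every request, a slow failing request shows up in the logs next to the slow successful ones.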

The next steps…

In the short term, we have resolved the current issue and have more hardware arriving tomorrow to give us a little more breathing room.

We’ll also be running additional tests to find exactly why things were so slow, as well as finding and eradicating some odd caching bugs that have been cropping up from the extra measures taken over the weekend.

We’ve also started preliminary planning with Engine Yard for a major hosting change. With a more flexible solution in place, we could just turn the dial up to 11 to keep things steady while we diagnose a problem like this one.
