The adventures of scaling, Stage 4 April 3rd
What is this series about?
While a couple of high-traffic sites are being powered by Rails and while the Rails book has a handful of instructions to scale your application, it was apparent to us that you’re on your own at a certain point. This series of articles is meant to serve as a case study rather than a generic “How To Scale Your Rails Application” piece of writing, which may or may not be possible to write. I’m outlining what we did to improve our applications’ performance; your mileage may obviously vary.
Our journey is broken up into 4 separate articles, each covering one milestone in scaling the eins.de codebase. The articles are scheduled to be posted a week apart.
Stage 4 is the last part of the scaling series, containing the last polishing steps, a summary of what helped and what didn’t, as well as a look at future optimization plans.
See also:
- The adventures of scaling, Stage 1
- Questions and answers for Stage 1
- The adventures of scaling, Stage 2
- The adventures of scaling, Stage 3
Stage IV, Fast and stable
A lot happened between November 2005 and March 2006. A lot of optimizations were put in place, and a few workarounds had to be set up (like the dispatcher monitoring mentioned in the last stage). But finally, over a period of a few weeks, the site proved stable and fairly quick. Plus, we’ve been able to implement a few feature requests coming in from both users and community operators.
In February, a few small tweaks helped to further polish the finish of the site and its performance.
First, we got rid of the Ajaxy live-previews when writing personal messages, forum posts etc. While they weren’t clearly a performance hog, it made sense to remove them in favour of an on-demand replacement to lighten the load of the site even further. Oh, and the AOL browser tends to crash with the prototype observers.
Additionally, lighttpd got a daemontools treatment. While crashes became rare around versions 1.4.8 and later, it’s still better to have something supervise a process of this importance. If lighty dies, the site dies. So better watch it.
Getting lighttpd to run within daemontools is simple enough. After the usual setup (which is perfectly described elsewhere) you set up the lighttpd service in your /service tree with a one-liner
run script that you’ll know and love from Rails’ original implementation of
#!/bin/sh
/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
This gets you up and running. To reload lighttpd’s configuration you simply send the process a SIGINT signal so that it terminates gracefully; supervise then immediately restarts it with the new configuration. Please be aware, however, that if your site gets a lot of traffic you might have to send lighty a SIGKILL, as it’ll never finish serving requests.
Speaking of lighttpd: with the release of lighttpd 1.4.11, dispatcher hangs seem to have become fewer and fewer and might actually be gone for good. We’ll keep our monitoring script running, though.
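The monitoring script itself was covered in the previous stage and is not reproduced here. As a rough illustration of the idea only, the following sketch treats a probe that does not return within a deadline as a hung dispatcher and triggers a restart action. The names `responsive?` and `check_dispatcher`, the probe and the restart action are all hypothetical, not what eins.de actually runs.

```ruby
require "timeout"

# Runs the probe block and reports whether it finished within the deadline.
def responsive?(deadline_seconds)
  Timeout.timeout(deadline_seconds) { yield }
  true
rescue Timeout::Error
  false
end

# Hypothetical watchdog step: the restart action is called only when the
# probe hangs past the deadline. Both probe and restart are callables.
def check_dispatcher(deadline_seconds, probe, restart)
  return :ok if responsive?(deadline_seconds) { probe.call }
  restart.call
  :restarted
end
```

In a real setup the probe would be something like an HTTP request against a cheap action, and the restart action would signal the affected dispatcher process.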
For this series, this is the final stage diagram. This setup serves 1.2M page impressions (100GB traffic) a day:
Summary and future plans
Here’s a summary of 4 months of optimizing effort, or rather of those points that have proven worthwhile:
Systems optimization
- Use Linux 2.6 instead of 2.4
- Use self-compiled Ruby 1.8.4 instead of all else
- Use MySQL-supplied binaries
- Use lighttpd 1.4.11 instead of all else
- Use memcache-client instead of Ruby-MemCache
- Use a smaller amount of dispatchers
- Watch your dispatchers
- Avoid components
- Use memcached to store expensive computations
- Use memcached for sessions
- Don’t use live-previews if your site is popular
- Use the exception notification to be aware of raised exceptions
Don’t let that summary fool you. There’s no guarantee that your site’s going to handle 2M page impressions a day if you just follow the above. Hell, there’s no guarantee our site’s going to handle 1.5M page impressions a day in a few weeks’ time.
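To make the “use memcached to store expensive computations” item from the summary a bit more concrete, here is a minimal cache-aside sketch. It assumes a client object that offers `get(key)` and `set(key, value, ttl)`; the `FakeCache` class is a hypothetical in-memory stand-in so the example runs without a memcached server, and `fetch_cached` is an illustrative helper name, not part of any library.

```ruby
# In-memory stand-in for a memcached client (hypothetical, for illustration).
# It only mimics the get/set pair that the sketch relies on.
class FakeCache
  def initialize
    @store = {}
  end

  def get(key)
    value, expires_at = @store[key]
    return nil if value.nil?
    if expires_at && Time.now >= expires_at
      @store.delete(key)
      return nil
    end
    value
  end

  def set(key, value, ttl = 0)
    expires_at = ttl > 0 ? Time.now + ttl : nil
    @store[key] = [value, expires_at]
    value
  end
end

# Cache-aside: return the cached value if present, otherwise run the block,
# store its result for ttl seconds and return it. nil results are not cached.
def fetch_cached(cache, key, ttl = 300)
  cached = cache.get(key)
  return cached unless cached.nil?
  value = yield
  cache.set(key, value, ttl)
  value
end
```

A call such as `fetch_cached(cache, "forum:42:post_count") { count_posts(42) }` (with `count_posts` being whatever expensive query you have) would then hit the database only when the key is missing or expired.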
Optimization is an ongoing effort
You have to be constantly monitoring your site, your servers and all the announcements surrounding the software you use.
It’s recommended to not only monitor whether the services are up, but also monitor the load on the servers, the response times, etc. For these jobs the combination of Nagios and Cacti has proven useful for us.
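As a sketch of what a response-time check could look like, the following maps a measured response time to the exit codes that Nagios plugins use by convention (0 for OK, 1 for WARNING, 2 for CRITICAL). The function names and the thresholds of 1 and 3 seconds are assumptions made for illustration; they are not the values we use.

```ruby
# Nagios plugin convention: exit code 0 = OK, 1 = WARNING, 2 = CRITICAL.
STATES = { 0 => "OK", 1 => "WARNING", 2 => "CRITICAL" }

# Maps a response time in seconds to a Nagios exit code.
# The default thresholds are hypothetical.
def response_time_state(seconds, warn_at = 1.0, crit_at = 3.0)
  return 2 if seconds >= crit_at
  return 1 if seconds >= warn_at
  0
end

# Builds the exit code and the single status line a plugin would print.
# The part after "|" is performance data that graphing tools can pick up.
def nagios_line(seconds, warn_at = 1.0, crit_at = 3.0)
  code = response_time_state(seconds, warn_at, crit_at)
  text = format("RESPONSE %s - %.3fs|time=%.3fs;%.3f;%.3f",
                STATES[code], seconds, seconds, warn_at, crit_at)
  [code, text]
end
```

A plugin script would measure one request, print the text and call `exit` with the code.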
As mentioned, read the changelogs of all the application packages you use to see if new versions either solve existing issues or potentially create new ones. It’s not mandatory to hop on every new version as it’s announced. But it might solve issues that have been plaguing you for weeks. In this context, try to use a staging system if you’d like to avoid downtime caused by upgrades and potential rollbacks to older versions of software you’re using.
Please also be mindful of making drastic changes to your site code. In general, think about what you do. A clever framework like Rails gives you the chance to think, as you’re not typing in dull code repetition all day. Use this time wisely.
An SQL statement or each loop might be fast on your development laptop, but it might cause the whole site to stall when executed a few thousand times in parallel on what might be a much bigger dataset than in your testbed environment.
In general, it’s not easy to really profile your site.
One alternative is to profile in a sort of non-live state, where the traffic you’re generating is just not real: it’s not the same as a user (or thousands, for that matter) executing their clicking habits. It’s also likely you’re not using the same dataset sitting behind the live site. The numbers you get out of this (possibly using Rails Analyzer Tools or the like) have to be interpreted and put into relation to your live site.
The other alternative would be building profiling steps into your live site. This has the added benefit that you have real users using your code and your systems, and as such the data gathered out of such tests is actually worthwhile. The problem is, if your site is particularly busy, your production.log will fill your hard drives faster than you can say “smorgasbord”.
Apart from the hard-drive filling, relating the actual log messages to each other isn’t exactly easy either. So be sure to redirect log traffic to Syslog (via SysLogger, also available in the Rails Analyzer Tools package), which tags each log message with the ID of the process that wrote it.
Writing huge logfiles also means that your total system IO is going to suffer. Your site actually does a little better than what your profiling will reveal, as you’re typically not writing logfiles of that level of detail in production use.
Oh, and the distraction your users suffer from when profiling your live site is actually for the better — in the end you’re trying to improve their site experience.
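To show why the process ID helps when relating log messages to each other, here is a small sketch that groups syslog-style lines by PID, so the interleaved output of several dispatchers can be read per process. The line layout (`tag[pid]: message`) and the sample data are assumptions for illustration; check what your syslog daemon actually writes.

```ruby
# Matches the "tag[pid]: message" part of a syslog-style line (assumed format).
LINE_PATTERN = /\s(\w+)\[(\d+)\]:\s(.*)\z/

# Returns a hash of pid => messages, keeping the original order per process.
# Lines that do not match the assumed format are skipped.
def group_by_pid(lines)
  groups = Hash.new { |hash, pid| hash[pid] = [] }
  lines.each do |line|
    match = LINE_PATTERN.match(line.chomp)
    next unless match
    groups[match[2].to_i] << match[3]
  end
  groups
end

# Made-up sample lines from two dispatcher processes.
sample = [
  "Apr  3 12:00:01 web1 rails[4711]: Processing ForumController#show",
  "Apr  3 12:00:01 web1 rails[4712]: Processing MessageController#index",
  "Apr  3 12:00:02 web1 rails[4711]: Completed in 0.412",
]
grouped = group_by_pid(sample)
```

Here `grouped[4711]` holds the two messages of the first process in order, which is the per-request view that a plain interleaved production.log does not give you.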
The tools used
Apart from the mentioned Rails Analyzer Tools package the tools at hand come with every UNIX-like operating system. You need
Putting it all together requires time, patience and common sense. And an occasional Google search.
What the future holds
Along with the memcache-client library, the Robot Coop released another little library named cached_model, which also relies on memcached to relieve the database from doing repeated queries by subclassing ActiveRecord::Base and checking the contents of memcached prior to querying the database.
I looked at version 1.0 when it came out, as it looked promising. It didn’t integrate smoothly at that time, and random exceptions were raised that clearly pointed to cached_model. Since we were busy debugging other problems at that time, it was clearly not the best timing to extend the puzzle.
In the meantime cached_model reached version 1.1.0 and numerous fixes went in. This will be the next step on my performance optimization roadmap.
As we’re still suffering from the “hanging dispatcher” problem mentioned in stage 3, we’ll also take a look at alternatives to FastCGI. The more traditional approach would be SCGI which also has native lighttpd support.
The new kid on the block is Mongrel by Zed Shaw. Originally intended as a “better WEBrick”, it evolved into a pure-HTTP alternative to FastCGI that is definitely worth checking out.
In the reader comments to earlier articles, Dan Kubb mentioned the use of Conditional GET and its potential benefit: browsers make more use of their cached pages and as such no longer re-render pages that haven’t changed. I have only briefly looked at this subject, but his Rails plugin sure looks promising and easy to integrate.
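For readers unfamiliar with Conditional GET, the mechanism can be sketched in a few lines. This is not Dan Kubb’s plugin, just a minimal illustration under the assumption that an MD5 digest of the rendered body serves as the ETag; `conditional_get` is a hypothetical helper name.

```ruby
require "digest/md5"

# Returns [status, headers, body] for a rendered page. If the client already
# holds this exact version (its If-None-Match equals our ETag), answer with
# 304 Not Modified and an empty body instead of sending the page again.
def conditional_get(request_headers, body)
  etag = %("#{Digest::MD5.hexdigest(body)}")
  if request_headers["If-None-Match"] == etag
    [304, { "ETag" => etag }, ""]
  else
    [200, { "ETag" => etag }, body]
  end
end
```

Note that in this sketch the server still renders the page in order to compute the digest; the saving is in bandwidth and in work on the browser side.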
On a completely different note and although I’ve been advocating the use of MySQL’s FULLTEXT indices for quite a while now, I’m working on a
schema.rb in Rails 1.1.
Speaking of which, an upgrade to the recently released Rails 1.1 will of course be in order. Although not strictly for performance reasons, its additions are welcome in terms of code beauty and DRY.
Thanks for being with me for this whole series (assuming you indeed were).
I truly hope that describing our case in detail saves you from having to do the same mind-boggling research and debugging we had to do to get a better idea of what’s going wrong.