The adventures of scaling, Stage 1 March 13th
What is this series about?
While a couple of high-traffic sites are powered by Rails, and while the Rails book has a handful of instructions for scaling your application, it became apparent to us that you’re on your own at a certain point. This series of articles is meant to serve as a case study rather than a generic “How To Scale Your Rails Application” piece, which may or may not be possible to write. I’m outlining what we did to improve our application’s performance; your mileage may obviously vary.
Our journey is broken up into 4 separate articles, each covering one milestone in scaling the eins.de codebase. The articles are scheduled to be posted a week apart.
Our mission was to rewrite the codebase behind the online community network eins.de, since the former PHP-based codebase was both bloated and poorly architected. Being an online community site, eins.de has everything you’d expect from one: user forums, galleries with comments, user profiles, personal messaging, editorial content, and more. Additionally, eins.de has local partners that are the driving forces behind all of the available sub-communities, mostly forming around the bigger German cities. User interaction is possible globally; as such, there’s only a single dataset behind everything.
The old codebase consisted of roughly 50,000 lines of PHP code (plus a closed-source CMS that’s not included in this calculation). We’ve rewritten most of it (some features were left out on purpose) in about 5,000 lines of Rails code.
eins.de serves about 1.2 million dynamic page impressions on a good day. The new incarnation is serving up the 25 sub-communities on different domains in a single Rails application. It was, however, not until February of this year that our iterative optimizations of both system configuration and application code led to a point where we were able to deal with this amount of traffic.
The site largely lives through dynamic pages and information rendered based upon user preferences or things like online status or relationship status. This kept us from taking the easy way out by just using page or fragment caching provided by Rails itself.
The application servers are dual Xeon 3.06GHz, 2GB RAM, SCSI U320 HDDs RAID-1. The database servers are dual Xeon 3.06GHz, 4GB RAM, SCSI U320 HDDs RAID-1. The proxy server is a single P4 3.0GHz, 2GB RAM, SCSI U320 HDDs RAID-1.
Without changing the hardware we were able to improve the performance of our setup while still adding features to the application by configuration optimization and changes to the application code.
In numbers: We were maxed out at about 750,000 page impressions per day in November (about 60GB of traffic) and now easily handle 1,200,000 page impressions per day (about 100GB of traffic) in March. That is a 1.6x improvement!
At peak times about 20Mbit/s leave the proxy server’s ethernet interface.
Update March 18: A follow-up article addressing reader comments is now available here.
Update March 20: Stage 2 is online.
Update March 27: Stage 3 is online.
Update April 03: Stage 4 is online
So, what did you start out with?
Well, you cannot change history. That’s what our configuration was back then. A bit more versioning detail about the diagram above:
- Debian 3.1
- Kernel 2.4.27
- lighttpd 1.4.6
- Ruby 1.8.3 from Debian packages
- MySQL 5.0.16 from Debian packages
- Rails 0.14.3 from RubyGems
- Ruby-MySQL 2.7 from RubyGems
- Ruby-MemCache 0.0.4
The two database servers were replicated in a master-master setup, spacing the auto increment generation apart through auto_increment_offset (see the MySQL manual for more information).
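To illustrate how spacing the auto increment generation apart avoids key collisions between two masters, here is a minimal Ruby sketch. It is a simplified model, not MySQL's actual implementation. It assumes auto_increment_offset is paired with its companion variable auto_increment_increment, and the concrete values (an increment of 2 with offsets 1 and 2) are assumptions for illustration, not our production settings. The method name generated_ids is hypothetical.

```ruby
# Simplified model of the id sequence a MySQL server hands out for a given
# auto_increment_offset / auto_increment_increment pair.
# (Hypothetical helper; the real server also accounts for existing rows.)
def generated_ids(offset:, increment:, count:)
  Array.new(count) { |i| offset + i * increment }
end

# Assumed values: two masters, increment 2, offsets 1 and 2.
master_a = generated_ids(offset: 1, increment: 2, count: 5)
master_b = generated_ids(offset: 2, increment: 2, count: 5)

puts "master A: #{master_a.inspect}"
puts "master B: #{master_b.inspect}"
puts "shared ids: #{(master_a & master_b).inspect}"
```

With these assumed values one master only ever generates odd ids and the other only even ids, so inserts on either side can replicate to the other without primary key conflicts.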
haproxy was used to balance both the external FastCGI listeners sitting on the application servers as well as the database connection from the dispatchers to the MySQL servers.
Basically, as outlined in the introductory paragraph above, the relaunch performance was a disaster. The old and crufty PHP-based site was able to handle about 900,000 page impressions before it collapsed (that said, it only had half the number of application servers as well) and the newly architected one fell over at a whopping 150,000 page impressions fewer. Not the turnaround you’d have hoped for. Even less so after spending days and nights programming. Good thing the “I cannot deal with change”-mob had different things to worry about.
The emergency plan
Yes, we pondered cashing our checks and flying to the Bahamas. We stayed.
As a first measure the number of FastCGI listeners was decreased from 20 to 10. To be honest, with the old setting the site was truly unusable. Pages would start to load but stall every once in a while, leaving boatloads of disappointed and grumpy users hitting reload on us and making things even worse. With the new setting, things calmed down a bit; pages loaded, albeit anything but quickly.
Over the next few days after the relaunch we took additional measures to improve performance and fix little issues that hadn’t cropped up in private testing. Sleep was a rare commodity.
A couple of things we did to put out the fire, with varying degrees of success:
- Rip out haproxy as it introduced yet another variable that could be tweaked and the immediate benefit of using it wasn’t really obvious. MySQL connections of all application servers were statically configured to connect to a single MySQL host. The distribution of the FastCGI connections was handed back to lighttpd. Tip: We found that in order to really have equally loaded application servers you should order your fastcgi.server directives by port and not by host, like so:
    "http-1-01" => ( "host" => "10.10.1.10", "port" => 7000 ),
    "http-2-01" => ( "host" => "10.10.1.11", "port" => 7000 ),
    "http-3-01" => ( "host" => "10.10.1.12", "port" => 7000 ),
    "http-4-01" => ( "host" => "10.10.1.13", "port" => 7000 ),
    "http-1-02" => ( "host" => "10.10.1.10", "port" => 7001 ),
    "http-2-02" => ( "host" => "10.10.1.11", "port" => 7001 ),
    "http-3-02" => ( "host" => "10.10.1.12", "port" => 7001 ),
    "http-4-02" => ( "host" => "10.10.1.13", "port" => 7001 ),
- Play with fragment caching, although it introduced inconveniences for the users (stale data, no longer personalized). No improvement; the changes were reverted at a later time.
- Back out of the idea of using two memcached hosts simultaneously, as the Ruby-MemCache library apparently doesn’t handle that too well. Things got distributed not on a per-key basis but randomly, giving us headaches about distributed expiration of dirty keys.
- Refactor the sidebar code, which was originally written as a component. Talking to bitsweat revealed that components are a performance killer: you basically set up yet another full controller environment for each sidebar you render. Yes, that one was obvious. (See RailsExpress if you need more convincing.)
- Add gzip compression as an after_filter (based on the examples in the Rails book)
- Identify various slow queries in the MySQL slow query log and refactor the culprits by eliminating joins, optimizing index columns, etc. (This is obviously not Rails specific.)
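The port-first ordering from the fastcgi.server tip in the list above gets tedious to maintain by hand once listeners are added or removed. As a sketch only, the entries could be generated with a few lines of Ruby. The helper name backend_entries and the constants are hypothetical, the host addresses and base port are taken from the excerpt, and the listener count of 2 per host is an assumption matching what the excerpt shows.

```ruby
# Hypothetical generator for fastcgi.server backend entries, ordered by
# port first and host second so that consecutive entries always point
# at different application servers.
HOSTS = %w[10.10.1.10 10.10.1.11 10.10.1.12 10.10.1.13].freeze # from the excerpt
BASE_PORT = 7000                                              # from the excerpt
LISTENERS_PER_HOST = 2 # assumption: the excerpt shows two ports per host

def backend_entries(hosts, base_port, listeners_per_host)
  (0...listeners_per_host).flat_map do |slot|
    hosts.each_with_index.map do |host, index|
      name = format('http-%d-%02d', index + 1, slot + 1)
      format('"%s" => ( "host" => "%s", "port" => %d ),', name, host, base_port + slot)
    end
  end
end

puts backend_entries(HOSTS, BASE_PORT, LISTENERS_PER_HOST)
```

The outer loop walks the ports and the inner loop walks the hosts, which is exactly the reverse of the more obvious host-by-host nesting. The output would still need to be pasted into the surrounding fastcgi.server block by hand.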
This got us into December at least, to the point where we were able to handle 850,000 page impressions a day, though that was still hardly something you’d put a sticker labeled “easily” on.
Our new, simplified setup was as follows:
Stay tuned for the second part of the scaling series, due for posting on Monday, March 20th, containing MySQL tuning tips, tuning of FastCGI dispatchers, and further system optimization techniques.