The adventures of scaling, Stage 3 March 27th
What is this series about?
While a couple of high-traffic sites are powered by Rails, and while the Rails book offers a handful of instructions for scaling your application, it was apparent to us that at a certain point you’re on your own. This series of articles is meant to serve as a case study rather than a generic “How To Scale Your Rails Application” piece, which may or may not even be possible to write. I’m outlining what we did to improve our applications’ performance; your mileage may obviously vary.
Our journey is broken up into 4 separate articles, each containing what a certain milestone in scaling the eins.de codebase was about. The articles are scheduled for posting a week apart from the previous.
Stage 3 contains more memcached best practices, session optimization, and further system optimization techniques. See also:
- The adventures of scaling, Stage 1
- Questions and answers for Stage 1
- The adventures of scaling, Stage 2
- The adventures of scaling, Stage 4
Stage III, The New Year
With the Christmas and New Year season behind us (traditionally a rather slow period traffic-wise, as some people do spend time with their families rather than in online communities), we were ready to roll in another set of changes and optimizations to further improve the site’s performance and responsiveness.
Unwilling to strip the fancy stuff out of the site code, we turned once again to the application du jour: memcached. Some research with debugging turned on for our memcache wrapper (which is responsible for automatically keying lookups to community and/or usernames) revealed that many lookups to memcached would fail. The lookup itself didn’t actually fail, though; it was the instantiation of the objects returned from memcached that failed.
What did that mean? Well, expensive computations were cached but retrieving them from the cache was unsuccessful. As such, the computations were recalculated (as a fallback measure from our memcache wrapper). Therefore the time and performance savings weren’t quite as effective as they could’ve been.
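To make the failure mode concrete, here is a minimal sketch of a cache wrapper along the lines described above (class and method names are hypothetical, not the actual eins.de code). Keys are auto-namespaced per community, and any failure while reading from the cache, including a marshalling error, degrades into recomputing the value — which is exactly why the broken lookups went unnoticed for a while:

```ruby
# Hypothetical sketch of an auto-namespacing cache wrapper with a
# recompute-on-failure fallback, as described in the text.
class CacheWrapper
  def initialize(store, community)
    @store = store        # anything responding to [] and []=
    @community = community
  end

  def fetch(key)
    full_key = "#{@community}:#{key}"
    begin
      cached = @store[full_key]
      return cached unless cached.nil?
    rescue StandardError
      # Treat lookup/unmarshalling errors as a plain cache miss.
    end
    value = yield               # recompute the expensive result
    begin
      @store[full_key] = value  # repopulate the cache
    rescue StandardError
      # A failed write shouldn't take the page down either.
    end
    value
  end
end

cache = CacheWrapper.new({}, "community1")
cache.fetch("top_users") { %w[alice bob] }   # computed and stored
cache.fetch("top_users") { raise "not hit" } # served from the cache
```

The fallback keeps the site correct, but when unmarshalling breaks silently, every "cached" request quietly pays the full recomputation cost.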
This was not, however, related to Ruby class declarations not being available prior to the instantiation. It was apparently related to the marshalled data that was returned. Querying Google for the error message didn’t reveal any obvious culprits and/or solutions either.
Digging through the Debian changelogs revealed some changes in Ruby 1.8.4’s behavior with regard to marshalling, and around the same time we noticed the following passage on Rubyonrails.org’s download page:
“We recommend Ruby 1.8.4 for use with Rails. Ruby 1.8.2 is fine too, but version 1.8.3 is not.”
So an upgrade of Ruby was in order. We upgraded to version 1.8.4, recompiled all C extensions such as Ruby-MySQL and RMagick, took the opportunity to swap Ruby-MemCache for the Robot Coop’s memcache-client library, and went live again.
Using the new memcache-client library was much smoother. It even made larger parts of our memcache wrapper obsolete (which, apart from the auto-namespacing, mostly served as a safety net for exceptions raised from within Ruby-MemCache). Three cheers for the Robot Coop, please.
Happy with the new memcached performance, we were ready to risk another step forward: we moved session storage from ActiveRecordStore (read: MySQL table storage) to memcached. This was supposed to take another nice share of write requests away from our multi-master replication setup, with its aforementioned constraint of a single thread handling all the writes from the opposing master. Coupled with the loss of the token-based authentication, we brought the number of write requests hitting the databases for every page request down to about a third of what we launched with in November.
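For reference, switching the session store looked roughly like this in the Rails 1.x era. This is a sketch, not our literal configuration; option names and the session-store classes varied between Rails and memcache-client versions, so check the documentation for yours:

```ruby
# config/environment.rb (Rails 1.x style sketch; server address,
# namespace, and exact option keys are examples, not our real values)
require 'memcache'

session_cache = MemCache.new('127.0.0.1:11211', :namespace => 'app_sessions')

# Tell the CGI session machinery to store sessions in memcached
# instead of the ActiveRecord-backed sessions table.
ActionController::Base.session_options[:database_manager] = CGI::Session::MemCacheStore
ActionController::Base.session_options[:cache] = session_cache
```

The payoff is that every page request’s session write becomes a memcached set instead of a replicated MySQL write.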
Something else we gained from the Robot Coop memcache client was the ability to reasonably distribute storing of memcache keys across multiple servers. Most of our boxes had memory to spare and memcached is very CPU friendly.
So temporarily, we configured all of our boxes to handle memcache connections.
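The reason this distributes nicely: the client, not the server, decides which box holds a given key, by hashing the key and mapping it to one of the configured servers. Here is a simplified illustration of the idea (the real memcache-client logic also handles server weights and failover; the addresses are made up):

```ruby
require 'zlib'

# Simplified client-side key distribution: hash the key, pick a
# server by modulo. Every key deterministically lands on one box,
# so each server only holds a share of the total cache.
SERVERS = %w[10.0.1.1:11211 10.0.1.2:11211 10.0.1.3:11211 10.0.1.4:11211]

def server_for(key)
  SERVERS[Zlib.crc32(key) % SERVERS.size]
end

server_for('community1:top_users') # always maps to the same server
```

Note the flip side: with plain modulo hashing, adding or removing a server remaps most keys, effectively flushing the cache.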
Why temporarily? Well, we had login issues to debug around that time, something we were never able to reproduce on our own machines. Users sitting in bigger companies behind overzealous firewalls and content filters were unable to log in at all.
Further debugging revealed that they didn’t even see the cookies we sent, which were deliberately set to expire some time in 2010. We even tried different cookie names (to avoid any foolish assumption that cookies with “session” in their names are to expire when the browser closes).
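The two cookie flavors involved can be illustrated with Ruby’s stdlib (cookie name and value here are made up, not our real ones). A cookie with an explicit far-future expiry should persist across browser restarts; one without an expiry attribute is discarded when the browser closes:

```ruby
require 'cgi'

# A persistent cookie: carries an explicit expires attribute.
persistent = CGI::Cookie.new(
  'name'    => 'login_token',
  'value'   => 'abc123',
  'expires' => Time.gm(2010, 1, 1)
)

# A session-only cookie: no expires attribute, so the browser
# drops it when it is closed.
session_only = CGI::Cookie.new(
  'name'  => 'login_token',
  'value' => 'abc123'
)

persistent.to_s   # header value includes "expires=..."
session_only.to_s # header value has no expires attribute
```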
Could that have been related to the new multi-box memcached configuration for session storage? Well, no one was able to tell. Nobody actually remembered when their login woes had started, and we had only recently switched to memcached for session storage (which turned out to be a great relief for the system as a whole).
As a consequence and in order to simplify debugging (and to exclude potential culprits from the equation) we returned to our single box configuration for both memcached and MySQL. memcached got to sit on one of our database servers (with a mostly idle MySQL daemon only serving the adserver and the replicated writes from the other box) and the live site got a dedicated connection to the other database server.
memcached configuration is simple enough, by the way. The only variable you’re likely to change is the amount of memory you allocate for the daemon. Keep in mind, though, that this is a maximum: memcached only grabs as much memory as it needs to hold your cached information, growing toward the configured limit over time. We’re currently running with 1024MB of memcache storage, which is plenty for text-only information.
For the statistics folks, based on around 7 weeks of uptime (don’t ask about the ratio between bytes read and written; I think those are just reversed):
get_misses: 59,571,775
get_hits: 235,552,563
total_connections: 2,002,697
bytes_read: 79,799,051,834
bytes_written: 734,299,301,670
curr_items: 1,421,982
total_items: 76,452,455
cmd_set: 76,453,343
cmd_get: 76,453,343 → 295,124,338
bytes: 717,612,826
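For what it’s worth, the hit rate implied by those numbers works out to roughly 80 percent:

```ruby
# Cache hit rate computed from the stats above.
get_hits   = 235_552_563
get_misses =  59_571_775

hit_rate = get_hits.to_f / (get_hits + get_misses)
printf("hit rate: %.1f%%\n", hit_rate * 100) # => hit rate: 79.8%
```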
The login trouble was solved later on by making cookies expire when the browser closes. Why that helped we can’t say; there was no logic whatsoever behind this solution. It just worked, and the tradeoff was manageable.
New slowdowns show up
While we were able to top out at 1.1M pageviews a day in the first half of January (with traffic as high as 95GB on a single day), the second half of January wasn’t served nearly as well. Despite all the modifications and tuning we had made (which originally made things considerably better), new slowdowns showed up on the radar.
Bad slowdowns? Really bad. In numbers, the last week of January was as bad as early December.
But why? Well, good question. We had been optimizing every part of the system (as you know, if you’ve been following along with this series). Things looked good for weeks. And all of a sudden, we were thrown back to where we started.
So back to the drawing board it was. Or rather debugging board.
The first findings centered around the fact that the system as a whole was slow, almost unusable at times, while load on all servers was low, almost too low. Turning on lighttpd’s fastcgi.debug revealed that listeners were scheduled to handle connections and then sat there doing nothing. With half the available dispatchers in such a hung state, the site was obviously not as responsive as it could have been with all dispatchers available.
(For those of you who seem to recognize this pattern, I’ve been writing about it before in Killing me softly on poocs.net.)
Using tcpdump to monitor the traffic on the listener ports showed... nothing. Not a single byte crossing the line. Using strace to check what a “stuck” listener was busy doing showed it sitting in a “Waiting..” state, also doing nothing.
Now the stunning part: if you restarted lighttpd or the dispatcher, things started working again. In the end, this didn’t indicate either side as being responsible for the hang, and we started looking elsewhere.
Having handled various firewall configurations in the other half of my life, I started tweaking the /proc parameters of both the application servers and the lighttpd proxy machine, on the assumption that something must be hitting its limits. netstat pointed in the same direction, as a couple of hundred connections were stuck in the CLOSE_WAIT state, almost as if we were being SYN-flooded or similar. But these were internal servers, not exposed to the outside world.
Here is the /proc tweaking we tried, according to various publicly available resources:
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
echo 1800000 > /proc/sys/net/ipv4/tcp_max_tw_buckets
echo 256000 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo 1024 > /proc/sys/net/core/somaxconn
echo "128000 200000 262144" > /proc/sys/net/ipv4/tcp_mem
echo 2097152 > /proc/sys/fs/file-max
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
Don’t try these at home. They didn’t help our cause at all. Dispatchers were still stalling, the site performance was still lousy.
Another attempt to address the issue was to set up lighttpd on each application server with local dispatchers (instead of the remote FastCGI listeners we had been using up to this point) and put a load-balancing reverse-proxy lighttpd in front of this quartet. You probably guessed it: it didn’t fix anything. Dispatchers hung, requests stalled between the reverse proxy and the local lighttpd instances, and things were still slow.
With only a brute-force variant left in the quiver, I wrote a script to periodically probe all available listeners for responsiveness and simply kill off the ones that don’t respond within a certain timeframe. A killed dispatcher is restarted almost immediately by Rails’ spawner duo, and lighttpd takes a few additional seconds to reconnect to that socket. By constantly monitoring the dispatchers, you’re back in business with your full armada in almost no time.
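The core of such a watchdog fits in a few lines. This is a hypothetical sketch of the idea, not the actual script: probe a listener port and report whether it accepts a connection within the allowed time; a real version would map each port back to a dispatcher PID and kill the ones that fail the probe, letting the spawner restart them.

```ruby
require 'socket'
require 'timeout'

# Probe a FastCGI listener: true if the port accepts a TCP
# connection within the timeout, false on refusal or timeout.
def responsive?(host, port, timeout_seconds = 5)
  Timeout.timeout(timeout_seconds) do
    TCPSocket.open(host, port) { true }
  end
rescue StandardError
  false
end

# The cron-driven loop might then look like (LISTENERS hypothetical):
#   LISTENERS.each do |host, port, pid|
#     Process.kill('KILL', pid) unless responsive?(host, port)
#   end
```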
It’s not pretty, but it works, and it at least got us into February with our new, further simplified setup.
Stay tuned for the last part of the scaling series due for posting on Monday, April 2nd containing last polishing steps, a summary of what helped and what didn’t, as well as a look at future optimization plans.