Killing me softly: Keeping dispatchers alive February 14th
This is an interim post ahead of my long-promised in-depth review of scaling a Rails site to a million dynamic page impressions a day.
When the site in question finally stabilized somewhat, a new problem crept up that I’ve been unable to fully resolve over the past weeks. The net effect is that my FastCGI dispatchers become unresponsive after a while, possibly following a huge traffic spike. They sit there doing nothing, and lighttpd is unable to talk to them.
The site is powered by 4 application servers running 7 dispatchers each, plus a dedicated lighttpd proxy. After a while, half of those dispatchers are unresponsive and no longer serve any requests. Page load times slow to a crawl.
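To make the size of the problem concrete, here is a minimal sketch that enumerates every dispatcher endpoint in this setup. The addresses and the port range 7000 to 7006 are the ones used by the watchdog script in this post; the constant and method names are made up for illustration.

```ruby
# Hosts and port range as used by the watchdog script in this post.
# The names APP_SERVERS, DISPATCHER_PORTS and dispatcher_endpoints are
# illustrative, not part of Rails or lighttpd.
APP_SERVERS      = %w{ 10.10.1.10 10.10.1.11 10.10.1.12 10.10.1.13 }
DISPATCHER_PORTS = (7000..7006).to_a

# Every [host, port] pair that lighttpd can hand a request to.
def dispatcher_endpoints(hosts = APP_SERVERS, ports = DISPATCHER_PORTS)
  hosts.product(ports)
end

endpoints = dispatcher_endpoints
puts "#{endpoints.size} dispatchers in total"
puts "about #{endpoints.size / 2} left when half of them hang"
```

With 4 servers and 7 ports each, that is 28 dispatchers, so losing half of them leaves roughly 14 to carry the full load.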
Currently, I’m on Ruby 1.8.4, lighttpd 1.4.10 and Rails 1.0 on Linux 2.6.14.
I’ve tried everything from upgrading Ruby and all gems, to debugging potentially exceeded TCP connection limits on my servers, to talking to weigon, the brains behind lighttpd. To no avail.
The weird thing is that it doesn’t matter which end I restart: be it the dispatcher *or* lighttpd, everything goes back to normal. So I cannot even tell for sure whether Ruby or my application is to blame. It could just as well be lighttpd or my local machine configuration.
Since I was in desperate need of an operational site, I whipped up a script that probes all available dispatchers for responsiveness and kills them by brute force if they don’t respond. I’m using the process scripts that come with Rails, namely the spinner/spawner duo. As such, a killed dispatcher is restarted immediately and is available to lighttpd again within a couple of seconds.
As this is obviously more of a band-aid than anything else, the script below is provided as-is, with no claims being made about being functional for anyone else, being pretty, well documented or not eating your cat. You absolutely need Net::SSH installed in order to be able to kill dispatchers not running on localhost. I’m running the script inside of a screen session in order to keep an eye on what’s happening with my dispatchers and how often they get killed. Your mileage may vary.
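The probing idea on its own, separate from the killing part, can be sketched roughly like this. It is a simplified illustration rather than the watchdog itself: the method name `probe_dispatcher` and the status symbols are made up for this example, and it only distinguishes "answered or closed the connection", "accepted but stayed silent" and "refused the connection".

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper: classify one dispatcher port.
# Returns :ok, :hung or :refused. Other socket errors are not handled
# here and propagate to the caller.
def probe_dispatcher(host, port, wait_seconds = 15)
  socket = TCPSocket.new(host, port)
  begin
    # Send some garbage; a live dispatcher reacts to it one way or another.
    socket.puts "dummytext\n\ndummytext"
    # gets returns a line, or nil if the peer closes the connection.
    # Either way the peer reacted, so only a timeout counts as hung.
    Timeout.timeout(wait_seconds) { socket.gets }
    :ok
  rescue Timeout::Error
    :hung
  ensure
    socket.close
  end
rescue Errno::ECONNREFUSED
  :refused
end
```

A call such as `probe_dispatcher('10.10.1.10', 7000)` would then tell a watchdog whether that dispatcher needs to be killed. Treating a closed connection as "alive" is a design choice: the point is only to detect processes that accept a connection and then never react.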
In case you’re having similar issues with your Rails application, feel free to leave a comment. The script only takes care of dispatchers that are already hung. It is by no means meant as a final cure and I’m more than eager to find out what’s causing the freezes in the first place.
The script is available in the body of this article or as a download here.
#!/usr/bin/env ruby
#
# watch-listener.rb by Patrick Lenz
# THIS SCRIPT IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND
require 'socket'
require 'timeout'
require 'rubygems'
require 'net/ssh'
HOSTS = %w{ 10.10.1.10 10.10.1.11 10.10.1.12 10.10.1.13 }
class WatchListener
  attr_accessor :host

  def initialize(host)
    self.host = host
    probe_ports
  end

  # Probe each dispatcher port; restart any dispatcher that accepts the
  # connection but does not answer within 15 seconds.
  def probe_ports
    7000.upto(7006) do |port|
      begin
        socket = TCPSocket.new @host, port
        socket.puts "dummytext\n\ndummytext"
        begin
          Timeout.timeout(15) { socket.gets }
        rescue Timeout::Error
          log "%d IS HUNG! RESTARTING..." % port
          restart_listener_on port
        else
          log "%d working fine" % port if ENV['DEBUG']
        ensure
          # Do not leak one socket per probe
          socket.close
        end
      rescue Errno::ECONNREFUSED
        log "%d refuses connection" % port
      end
    end
  end
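  # Find the PID listening on the given port and kill it; the spinner
  # respawns the dispatcher afterwards. Matching ":port " instead of the
  # bare number avoids hitting other ports or PIDs that contain it.
  def restart_listener_on(port)
    exec %{
      PID=`netstat -a -n -p | grep ":#{port} " | grep LISTEN | \
        awk '{print $7}' | cut -d'/' -f1`
      echo "killing $PID" && kill -9 $PID
    }
  end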
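  # Run a shell command on the watched host over SSH and log its output.
  # Note that this shadows Kernel#exec inside this class.
  def exec(command)
    log "connecting.."
    Net::SSH.start(host) do |session|
      input, output, error = session.process.popen3(command)
      Timeout.timeout(20) { log output.read } rescue nil
      input.puts "quit"
    end
    log "done"
  end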
  def log(text)
    puts "[%s] (%s) %s" % [ Time.now.strftime("%H:%M:%S"), @host, text ]
  end
end
# Main loop
while true do
  HOSTS.each do |host|
    begin
      WatchListener.new host
    rescue => error
      puts "Exception raised: #{error}"
    end
  end

  print "Sleeping 300 seconds: "
  5.times do |i|
    print "%d .. " % (300 - i * 60)
    STDOUT.flush
    sleep 60
  end
  puts
end
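Picking the PID out of netstat output with grep and awk depends on matching exactly the right line; a bare port number such as 7000 would also match port 17000, for instance. As a hedged alternative, the same selection can be done in Ruby with an exact match on the local address. The method name `listener_pid` is made up for this sketch, and it assumes the column layout that Linux `netstat -a -n -p` prints for TCP sockets (PID/program name in the seventh column, which is also what the awk call relies on).

```ruby
# Hypothetical helper: extract the PID of the process listening on exactly
# +port+ from `netstat -a -n -p` style output. Returns nil if no listening
# TCP socket matches or if netstat did not report a PID (shown as "-").
def listener_pid(netstat_output, port)
  netstat_output.each_line do |line|
    fields = line.split
    # Expected columns: Proto Recv-Q Send-Q Local Foreign State PID/Program
    next unless fields[0].to_s.start_with?('tcp') && fields[5] == 'LISTEN'
    next unless fields[3].end_with?(":#{port}")
    pid = fields[6].to_s.split('/').first
    return Integer(pid) if pid =~ /\A\d+\z/
  end
  nil
end
```

The returned PID could then be passed to `kill -9` on the remote host in place of the shell pipeline. Because the match is anchored to the end of the local address, a lookup for port 7000 ignores a listener on 17000.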
