30 Days of Tech: Day 1 - SystemTimer
Since this is the first 30 Days of Tech post, I’ll give you just a bit of background. If you read my earlier posts, it should be obvious that this commitment to 30 Days of Tech is an extension of the “30 Days of Gratitude” that Annie embarked on. Participating in her experiment, by trying to post a grateful comment on her blog every day in response to her grateful post, had some real, tangible benefits for me. If I’m lucky, similar benefits will come on the technical side by committing to tech posting. So, without any further ado, on with the tech! Since I’m traveling today, we’ll start with something that is pretty easy for me to post.
Philippe Hanrigou mentioned SystemTimer yesterday during his Mongrel talk as a potential solution to what he referred to as the second most common scenario in which Mongrel hangs. Philippe has a great post on the issue and on using SystemTimer here, so I’ll just give you a bit of my perspective.
So what’s the problem? On a client project we ran into a situation in production where Mongrel would hang forever waiting for a webservice call to an external system to finish. This was particularly frustrating since we had planned for this situation by setting timeouts and implementing fallback strategies for when the webservice failed. Unfortunately for us, the external system was unreliable enough that this had a significant impact.
When we investigated why the timeout and fallback were not happening, we discovered a disconcerting interaction between the OS native threads, Ruby green threads, and blocking system calls. Ruby’s timeout.rb is implemented using green threads. Basically it starts a homicidal thread charged with taking a nap, and then waking up and killing the working thread. If the working thread can finish its work and kill the homicidal thread in its sleep, everything is copasetic. If the homicidal thread wakes up and kills the working thread (by raising an exception), the work is effectively aborted. (Props to Patrick for sharing the homicidal thread metaphor.)
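To make the metaphor concrete, here is a minimal sketch of the idea in plain Ruby. It is not the actual timeout.rb source: naive_timeout is a hypothetical name made up for this post, and the sketch ignores the race conditions a real implementation has to worry about. Note also that on newer Ruby versions threads are backed by native threads, so this sketch shows the shape of the mechanism rather than reproducing the hang.

```ruby
require "timeout" # used only for the Timeout::Error class

# Hypothetical helper for illustration; not the real timeout.rb.
def naive_timeout(seconds)
  worker = Thread.current

  # The "homicidal" thread: take a nap, then raise an exception
  # inside the working thread.
  killer = Thread.new do
    sleep seconds
    worker.raise(Timeout::Error, "execution expired")
  end

  yield # the work being guarded
ensure
  # The work finished (or was aborted): kill the killer in its sleep.
  if killer
    killer.kill
    killer.join
  end
end
```

Calling naive_timeout(2) { ... } around a slow piece of work raises Timeout::Error in the caller once the two seconds are up, provided the killer thread actually gets a chance to run.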
When a blocking system call enters the picture, we have a problem. Ruby’s green threads can never be scheduled to run unless the Ruby interpreter’s native thread is scheduled to run by the OS. The OS assumes that as long as Ruby is waiting on a blocking system call, there is no reason to schedule the Ruby native thread; it has no idea that green threads even exist. So if the working thread makes a blocking system call while the homicidal thread is sleeping, the entire Ruby process will be put to sleep by the OS. Since the homicidal thread cannot wake up without the OS scheduling the Ruby process, there is no way for it to kill the working thread. In a typical Rails setup, this can add up to very sleepy Mongrels that simply can’t wake up to handle another request… poor Mongrels ;(
Clearly Ruby’s timeout facility was not sufficient to time out our webservice calls. Philippe and I worked together to implement a solution based on Kurtis Seebaldt’s original idea. We created SystemTimer, which is a gem with a C extension to Ruby that uses OS-level signals to implement the timeout rather than Ruby’s green threads. The homicidal killer metaphor still applies, but instead of launching a green thread, SystemTimer asks the OS to deliver a signal after a certain amount of time to indicate that it should go medieval on the working thread. Since the OS will deliver the signal regardless of whether there is a blocking system call, this technique works to overcome the timeout issues we encountered. Using SystemTimer is easy: just check out the info in Philippe’s article.
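For a feel of the signal-based approach, here is a rough sketch. It is not SystemTimer's code or API: SystemTimer does its work in a C extension, while this sketch swaps in Ruby's Fiddle library to call the C library's alarm function and a Signal.trap handler to raise the error. The names signal_timeout and libc_alarm are made up for illustration, and the sketch assumes a POSIX system, whole-second timeouts, and a single timer per process.

```ruby
require "fiddle"
require "timeout" # used only for the Timeout::Error class

# Hypothetical wrapper around the C library's alarm(), which asks the OS
# to deliver SIGALRM to this process after the given number of seconds.
# Passing 0 cancels any pending alarm.
def libc_alarm(seconds)
  @libc_alarm ||= Fiddle::Function.new(
    Fiddle::Handle::DEFAULT["alarm"],
    [Fiddle::TYPE_INT],
    Fiddle::TYPE_INT
  )
  @libc_alarm.call(seconds)
end

# Hypothetical helper for illustration; not SystemTimer's API.
def signal_timeout(whole_seconds)
  # When SIGALRM arrives, abort the work by raising in the main thread.
  previous = Signal.trap("ALRM") do
    raise Timeout::Error, "time's up"
  end
  libc_alarm(whole_seconds)
  yield # the work being guarded
ensure
  libc_alarm(0) # cancel the alarm if the work finished in time
  Signal.trap("ALRM", previous || "DEFAULT") # restore the old handler
end
```

The point of the design is that the countdown lives in the OS rather than in a Ruby thread, so nothing inside the Ruby process has to be scheduled for the timer to expire.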
One can hope you’ll never run into the situation of Ruby’s timeout not working for you. If you do hit this problem, however, I hope that SystemTimer can relieve your pain. We extracted it from our application and released it to RubyForge (with the client’s blessing) so that you wouldn’t have to recreate the solution we found. Install it, use it, and let us know what you think!