Discussion: locking trouble?
Dave Hrycyszyn
2011-09-05 11:18:19 UTC
Hi,

I'm using xmpp4r on a project, and every few days, Passenger locks up completely, with one or more Ruby processes running at full CPU utilization.

Hitting those Ruby processes with a "kill -6" as described in the Passenger docs generates a handy stack trace. It appears that the Ruby processes are stuck in these lines of code, in xmpp4r's lib/xmpp4r/semaphore.rb file:

# Waits until some free tickets are available
def wait
  @lock.synchronize {
    @cond.wait(@lock) while !(@tickets > 0)
    @tickets -= 1
  }
end

Specifically, it's stuck in line 24, @cond.wait(@lock) while !(@tickets > 0).

This is a little nerve-wracking, but as I can't see any references on the net to people having locking problems with xmpp4r, I'm hoping there's a simple fix.

Our application is a Padrino app which accepts incoming form parameters and, as a byproduct of one action, sends off an XMPP message from a method in an ActiveRecord model. There's nothing particularly exotic about it:

def xmpp_notify
  jabber = Jabber::Simple.new(sharer.jid, sharer.password)
  msg = Message.new(:subject => link.url)
  msg.body = message unless message.nil?
  recipients.each do |recipient|
    jabber.deliver(recipient.jid, msg)
  end
end

We can reliably reproduce the locking problem by running something like:

ab -c 2 -n 1000 http://my.system.domain/path/to/controller/action

The app also locks up at low levels of non-automated concurrency. Has anyone else ever run across this?

Is there a fix anyone can recommend? Would bumping up Openfire's receive rate help at all?

Regards,
Dave
bulk555-BUHhN+
2011-09-05 17:08:45 UTC
Hi Dave,

I get locking issues at the same point. As a quick and ugly fix, I simply inserted a sleep just before line 24, and the problem mostly goes away. Oddly, a short sleep (0.1s) is not as effective as a long sleep (1s), faster networks produce fewer lockups than slower ones, and using a single-core VM for the client results in no lockups at all.
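
For the record, the hack amounts to something like this in lib/xmpp4r/semaphore.rb (a sketch of what I did locally, not a proper fix - the sleep value is just whatever you can tolerate):

def wait
  @lock.synchronize {
    sleep 1   # ugly throttle inserted before the wait; 0.1 was less effective than 1 for me
    @cond.wait(@lock) while !(@tickets > 0)
    @tickets -= 1
  }
end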

I can't believe we're the only ones, as I can reproduce this with nothing more than a connection to gchat, Openfire or ejabberd, so please let us all know if you find a solution.

Regards,

Alex.
Dave Hrycyszyn
2011-09-06 10:56:57 UTC
Post by bulk555-BUHhN+
Hi Dave,
I get locking issues at the same point. As a quick and ugly fix, I simply inserted a sleep just before line 24, and the problem mostly goes away. Oddly, a short sleep (0.1s) is not as effective as a long sleep (1s), faster networks produce fewer lockups than slower ones, and using a single-core VM for the client results in no lockups at all.
Hm, ok, that's bad :|. I'll dig around in the implementation today; I guess there are some multithreading issues in there. I always shudder when I write sentences like that, thinking back to Stallman's talks about the development of the GNU/Hurd kernel - "we found out that these multi-threaded processes were kind of hard to debug" - and twenty years later it's still being worked on...
Post by bulk555-BUHhN+
I can't believe we're the only ones, as I can reproduce this with nothing more than a connection to gchat, Openfire or ejabberd, so please let us all know if you find a solution.
My needs for this system are pretty minimal; traffic will never be high. I just need things to stay stable and to be reasonably certain that messages will be delivered. I may go with some kind of simple queuing system which fires off the messages one at a time.
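
Something along these lines is what I have in mind - just an untested sketch, and the Notification struct, JIDs and credentials below are placeholders rather than anything from my app:

require 'thread'
require 'xmpp4r'

# Rough idea only: a single worker thread drains an in-process queue, so
# outgoing notifications go out one at a time over a single connection.
Notification = Struct.new(:to, :subject, :body)
OUTBOX = Queue.new

Thread.new do
  client = Jabber::Client.new(Jabber::JID.new('sharer@example.org/notifier'))
  client.connect
  client.auth('secret')

  loop do
    note = OUTBOX.pop                 # blocks until something is queued
    msg = Jabber::Message.new(note.to, note.body)
    msg.subject = note.subject
    msg.type = :chat
    client.send(msg)
  end
end

# From the web request, just enqueue and return immediately:
# OUTBOX << Notification.new('recipient@example.org', link.url, message)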

Thanks for letting me know I'm not the only one seeing this behaviour. http://home.gna.org/xmpp4r/ says that bugs can be reported to this list, so for the record, my environment is as follows:

* xmpp4r-0.5 gem
* RHEL 5 Linux
* Apache 2.x running the web app (the clients run inside Passenger)
* Openfire 3.6.4

I've put a gist up at https://gist.github.com/1197232

I've also detailed the behaviour in a GitHub issue at https://github.com/ln/xmpp4r/issues/23 (I hope it's ok to use the GitHub tracker for this purpose; it's not mentioned in the docs).

I'm not sure whether my gist is enough to reproduce the problem exactly. Running under Thin, xmpp4r hits 100% CPU whenever it can't immediately deliver messages due to network latency or lack of XMPP karma, and CPU stays pinned until all messages are delivered, at which point it returns to normal. I may try running the test-harness Sinatra app under Apache on a multicore box to see if I can reliably get a permanent lockup, as happens on my production app.

Greets,
Dave
Dave Hrycyszyn
2011-09-06 16:54:18 UTC
Post by Dave Hrycyszyn
I'm not sure whether my gist is enough to reproduce the problem exactly. Running under Thin, xmpp4r hits 100% CPU whenever it can't immediately deliver messages due to network latency or lack of XMPP karma, and CPU stays pinned until all messages are delivered, at which point it returns to normal. I may try running the test-harness Sinatra app under Apache on a multicore box to see if I can reliably get a permanent lockup, as happens on my production app.
Ok - thanks to astro's comment on the GitHub ticket at https://github.com/ln/xmpp4r/issues/23, I messed around a bit with connection pooling.

Ensuring that connections are re-used results in a massive increase in throughput (my way of saying "sorry for my stupid post") and entirely gets rid of the locking problem. Note to self: xmpp4r-simple spawns a lot of connections and needs to be used sparingly.

A revised gist, which uses xmpp4r's native connection handling instead of xmpp4r-simple's, is here: https://gist.github.com/1197857
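
The shape of it is roughly as follows (paraphrased, not the gist verbatim; XmppConnections and client_for are just names I'm using here for illustration):

require 'thread'
require 'xmpp4r'

# Keep one Jabber::Client per sender and reuse it across requests, instead
# of letting xmpp4r-simple open a fresh connection every time.
module XmppConnections
  @clients = {}
  @mutex = Mutex.new

  def self.client_for(jid, password)
    @mutex.synchronize do
      @clients[jid] ||= begin
        client = Jabber::Client.new(Jabber::JID.new(jid))
        client.connect
        client.auth(password)
        client
      end
    end
  end
end

def xmpp_notify
  client = XmppConnections.client_for(sharer.jid, sharer.password)
  recipients.each do |recipient|
    msg = Jabber::Message.new(recipient.jid, message)
    msg.subject = link.url
    msg.type = :chat
    client.send(msg)
  end
end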

Dave
Jon Tara
2011-09-06 19:29:03 UTC
Post by Dave Hrycyszyn
I'm using xmpp4r on a project, and every few days, Passenger locks up completely, with one or more Ruby processes running at full CPU utilization.
Ruby 1.9.x has a bug related to this: it can deadlock on Mutex#synchronize. See Ruby bug #4266. There is a patch, though it is for Ruby head; I successfully applied it to 1.9.2-p290. There is also a patch for a similar bug in monitor.rb (referenced from bug #4266). The monitor.rb patch will probably have to be applied manually, since head has diverged too much from 1.9.2. Both patches move some code from Ruby source into a C extension: in the case of Mutex, into the already-existing thread.c extension, and in the case of monitor.rb, into a new extension.

See if your problem goes away if you use Ruby 1.8.7 or apply the patch to your Ruby 1.9.x.

I had some trouble because I don't know exactly how to add a new extension to Ruby. I had to manually copy the .bundle file (OSX - would be .so on Linux or .dll on Windows) to the proper library locations in src and lib after building the monitor extension. The Mutex code is in thread.rb, and with the patch some code is moved to the already-existing thread.c extension.

There is a test application in the bug report that can be used to verify that the problem exists in your Ruby or that it has been fixed.
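
To give a rough flavor (this is not the actual test from the bug report, just the general kind of stress test involved): many threads contending on a single Mutex#synchronize. On an affected Ruby a loop like this can wedge; on a fixed Ruby it runs to completion.

require 'thread'

mutex = Mutex.new
counter = 0

threads = (1..50).map do
  Thread.new do
    10_000.times { mutex.synchronize { counter += 1 } }
  end
end

threads.each(&:join)
puts "done: #{counter}"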

Let us know if this patch fixes your problem!
Dave Hrycyszyn
2011-09-07 11:54:50 UTC
Post by Jon Tara
Post by Dave Hrycyszyn
I'm using xmpp4r on a project, and every few days, Passenger locks up completely, with one or more Ruby processes running at full CPU utilization.
Ruby 1.9.x has a bug related to this: it can deadlock on Mutex#synchronize. See Ruby bug #4266. There is a patch, though it is for Ruby head; I successfully applied it to 1.9.2-p290. There is also a patch for a similar bug in monitor.rb (referenced from bug #4266). The monitor.rb patch will probably have to be applied manually, since head has diverged too much from 1.9.2. Both patches move some code from Ruby source into a C extension: in the case of Mutex, into the already-existing thread.c extension, and in the case of monitor.rb, into a new extension.
See if your problem goes away if you use Ruby 1.8.7 or apply the patch to your Ruby 1.9.x.
I had some trouble because I don't know exactly how to add a new extension to Ruby. I had to manually copy the .bundle file (OSX - would be .so on Linux or .dll on Windows) to the proper library locations in src and lib after building the monitor extension. The Mutex code is in thread.rb, and with the patch some code is moved to the already-existing thread.c extension.
There is a test application in the bug report that can be used to verify that the problem exists in your Ruby or that it has been fixed.
Let us know if this patch fixes your problem!
Hi Jon,

I was actually using 1.8.7 already, so no gains to be made there! I think it was my crappy connection-handling code that was causing the problem, although until I deploy it into production I won't be totally sure. I'm very impressed by the community response, though, I have to say - thanks to everyone for all the help, it's been an interesting problem.

D
