Saturday, December 10, 2011

Celery 2.3+ crashes when channel.close() commands are issued...

This AMQPChannelException issue has been hitting us for the last three weeks, so I decided to dig in and understand why we were getting AMQPChannelExceptions that caused our Celery workers to simply die. It turns out this exception appeared frequently in our logs:

(404, u"NOT_FOUND - no exchange 'reply.celeryd.pidbox' in vhost 'myhost'", (60, 40), 'Channel.basic_publish')

The basic problem, I suspect, is that we have a task that checks for long-running Celery tasks. It relies on the celeryctl command, a mechanism Celery uses to broadcast messages to all boxes running Celery (e.g. celeryctl inspect active).

The celeryctl command implements a multicast protocol on top of the AMQP standard (see the RabbitMQ tutorial). On startup, every Celery machine binds to the exchange celery.pidbox. When you send a celeryctl command, RabbitMQ receives the message and delivers it to every machine listening on that exchange.
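To make the broadcast mechanics concrete, here is a minimal sketch of the same pattern using kombu (the messaging library underneath Celery). The exchange name mirrors the one above, but the helper functions, queue name, and broker URL are mine for illustration, not Celery's actual code:

    from kombu import Connection, Consumer, Exchange, Producer, Queue

    # Fanout exchange: every worker binds its own queue to it, so each
    # broadcast message is delivered to all of them. This mirrors what
    # the celery.pidbox exchange does.
    broadcast = Exchange("celery.pidbox", type="fanout")

    def send_broadcast(conn, payload):
        # Publish a control-style message; the fanout exchange copies it
        # to every bound queue.
        producer = Producer(conn.channel(), exchange=broadcast)
        producer.publish(payload, routing_key="")

    def listen_for_broadcasts(conn, hostname, on_message):
        # Each worker declares its own auto-delete queue bound to the
        # fanout exchange (queue name is illustrative).
        queue = Queue("%s.celery.pidbox" % hostname,
                      exchange=broadcast, auto_delete=True)
        with Consumer(conn.channel(), queues=[queue], callbacks=[on_message]):
            conn.drain_events(timeout=5)  # raises socket.timeout when idle

    if __name__ == "__main__":
        with Connection("amqp://guest:guest@localhost//") as conn:
            send_broadcast(conn, {"method": "active"})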

The machines send their replies back on a separate exchange called reply.celery.pidbox, which the process that issued the celeryctl command uses to collect all the responses. Once that program completes, it deletes the reply exchange since it's no longer needed. Unfortunately, if a worker receives the command and attempts to respond too late, it can trigger an "exchange not found" error (the 404 above), causing RabbitMQ to issue a channel.close() command. I suspect this happens especially under heavy load and/or during intermittent network failures, since that's when the problem usually shows up.
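The failure mode is easy to see in isolation: publish to an exchange that no longer exists and the broker answers with a 404 and closes the channel. The sketch below, assuming kombu and a local broker, shows the shape of that failure; the routing key is made up and the exact exception class varies by client library, so it catches broadly:

    from kombu import Connection, Exchange, Producer

    def publish_reply(conn, payload):
        # The reply exchange may already have been deleted by the caller
        # that collected its responses, so don't re-declare it here.
        reply = Exchange("reply.celery.pidbox", type="direct", auto_delete=True)
        channel = conn.channel()
        try:
            producer = Producer(channel, exchange=reply, auto_declare=False)
            # If the exchange is gone, the broker answers with 404 NOT_FOUND
            # and closes the channel; depending on the client library the
            # error surfaces here or on the next synchronous call.
            producer.publish(payload, routing_key="replies")
        except Exception as exc:  # exact exception class varies by library
            print("broker closed the channel: %r" % (exc,))
        finally:
            try:
                channel.close()
            except Exception:
                pass  # the channel may already be dead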

Celery handles connection failures fine, but it doesn't seem to deal with situations where the AMQP host issues a close command. I addressed this in two ways: first, by allowing Celery to gracefully reset the entire connection when such an event happens (a pull request to the Celery framework), and second, by increasing the window in which we wait for replies so the exchange isn't deleted too soon (i.e. celeryctl inspect active --timeout=60). The latter is the quicker fix, but the former should help avoid the situation altogether (although it may cause other issues).
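As a rough sketch of the first approach, kombu's connection.ensure() can wrap a publish so that recoverable broker errors trigger a reconnect and retry instead of killing the worker. The function name, retry limits, and routing key below are illustrative, not what the pull request actually does:

    from kombu import Connection, Exchange, Producer

    reply_exchange = Exchange("reply.celery.pidbox", auto_delete=True)

    def publish_with_recovery(conn, payload):
        # Illustrative: publish a pidbox-style reply, reconnecting and
        # retrying if the broker drops the connection underneath us.
        producer = Producer(conn.channel(), exchange=reply_exchange)

        def errback(exc, interval):
            print("publish failed (%r), retrying in %s seconds" % (exc, interval))

        # conn.ensure() wraps the callable so that recoverable connection
        # errors trigger a reconnect and a retry of the call.
        safe_publish = conn.ensure(producer, producer.publish,
                                   errback=errback, max_retries=3)
        safe_publish(payload, routing_key="replies")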

The fix, then, appears to be to trap the exception and try to establish a new connection. This approach is already used for Celery control commands (an error message, "Error occurred while handling control command", gets logged, but the underlying exception is caught). It seems this exception occurs when RabbitMQ sends a close() command to terminate the connection, causing the entire process to die.

    def on_control(self, body, message):
        """Process remote control command message."""
        try:
            self.pidbox_node.handle_message(body, message)
        except KeyError, exc:
            self.logger.error("No such control command: %s", exc)
        except Exception, exc:
            self.logger.error(
                "Error occurred while handling control command: %r\n%r",
                exc, traceback.format_exc(), exc_info=sys.exc_info())
            self.reset_pidbox_node()

We'll see in this pull request whether the Celery authors think this is a good idea... I suspect it would be better to create a new channel than to restart the connection altogether.

https://github.com/ask/celery/pull/564
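For what it's worth, a rough, hypothetical sketch of that lighter-weight recovery (re-creating just the channel for the pidbox node rather than the whole connection) might look like the following; the attribute names beyond those in the on_control snippet above are my own guesses, not Celery's code:

    def reset_pidbox_node(self):
        # Hypothetical sketch -- not the actual Celery implementation.
        # Close the (possibly already dead) channel the pidbox node was
        # using, then rebuild the broadcast consumer on a fresh channel
        # from the existing connection.
        try:
            self.pidbox_node.channel.close()
        except Exception:
            pass  # the broker may have closed it for us already
        channel = self.connection.channel()
        self.pidbox_node.channel = channel
        # Re-attach the broadcast consumer so control commands keep
        # flowing to on_control.
        self.broadcast_consumer = self.pidbox_node.listen(
            channel=channel, callback=self.on_control)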
