Sunday, January 30, 2011

RabbitMQ resource management talks

Friday, January 28, 2011

Integrating PyFlakes into Hudson...

There are a few web links out there that discuss how to add PyFlakes to your Hudson integration. There's also a PyFlakes Hudson plug-in, but then you'd have to re-compile the plug-in into the main Hudson code base. An alternative is to use the Violations plug-in and coerce the output of PyFlakes into the format that the pylint parser expects. You can then provide the path of this output in place of where the pylint violations are usually recorded.

The following command line can be used:
pyflakes [path_to_src] | awk -F\: '{printf "%s:%s: [E]%s\n", $1, $2, $3}' > pyflakes.txt
How does it work? To best understand, you need to check out the Violations plug-in source:
svn co (see
Within the plugin/violations/types/pylint dir, the regex inside is listed as follows:
    /**
     * Constructor - create the pattern.
     */
    public PyLintParser() {
        pattern = Pattern.compile("(.*):(\\d+): \\[(\\D\\d*).*\\] (.*)");
    }
Keep in mind that in Java, regexps require two backslashes, so \d is actually expressed as \\d. The regex pattern that is effectively generated is:
re.compile(r"(.*):(\d+): \[(\D\d*).*\] (.*)")
We can also see which message types are allowed. Right now all results are used to generate either warnings or errors.
    /**
     * Returns the Severity level as an int from the PyLint message type.
     *
     * The different message types are:
     * (C) convention, for programming standard violation
     * (R) refactor, for bad code smell
     * (W) warning, for python specific problems
     * (E) error, for much probably bugs in the code
     * (F) fatal, if an error occured which prevented pylint from doing
     *     further processing.
     * @param messageType the type of PyLint message
     * @return an int matched to the message type.
     */
    private void setServerityLevel(Violation violation, String messageType) {

One undocumented feature in the Hudson Violations plug-in is that after each build, it creates a violations.xml file and violations/file directory containing each of the individual files with violations. These XML files are generated from the PyFlakes output and saved into the corresponding subdirectory and file.
        File xmlFile = new File(
            MagicNames.VIOLATIONS + "/" + MagicNames.VIOLATIONS + ".xml");
        try {
            model = new BuildModel(xmlFile);
                xmlFile, new BuildModelParser().buildModel(model));
        } catch (Exception ex) {
            LOG.log(Level.WARNING, "Unable to parse " + xmlFile, ex);
            return null;
        }

        modelReference = new WeakReference(model);
        return model;
For instance, we would have a violations/file/modulename/ entry like this:
  <type name='pylint'
        count='2'>
      <severity level="0" count="2"/>
      <source name="E1" count="2"/>
  </type>
If you see "No violations found" when drilling into a file on the Hudson control panel, it's most likely that your output includes relative "./" paths. Since Hudson depends on URL matching, the "./" can prevent the Violations plug-in from locating the appropriate XML file. The following sed expression removes these "./" prefixes:
pyflakes <dir> | sed 's/^[./]*//' 

regexps in Java...
Regular Expressions in Java
The Java class library provides two classes to support regexes, namely Pattern and Matcher from the java.util.regex package. Pattern simply stores a regex that we want to match strings against. We can get a Matcher object by matching a Pattern against a String, and then use the Matcher object to see if the pattern matched, extract captures, etc.
An Annoying Java Issue
Many regexes use sequences such as \d to match a digit. As in many programming languages, a regex in Java is represented as a string. This is a problem because "\d" is seen by the Java compiler as a Java escape sequence, and the compiler will complain that it doesn't understand the escape sequence \d. Therefore, it is important to add extra backslashes, e.g. "\\d". Be careful - in the case of \b this will go quietly unnoticed at compile time, then your regex won't work out as you expect at runtime.
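Python has the same property (regexes travel as ordinary strings), but raw string literals sidestep the double-backslash dance; the \b pitfall is easy to demonstrate:

```python
import re

# "\\d" in an ordinary string and r"\d" in a raw string compile to the same regex
assert re.match("\\d+", "123")
assert re.match(r"\d+", "123")

# "\b" in an ordinary string is a backspace character (0x08), not a word
# boundary -- the pattern silently stops matching, just as described above
assert re.search(r"\bword\b", "a word here")         # word boundary: matches
assert re.search("\bword\b", "a word here") is None  # backspace: no match
```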

:visible in jQuery resolves to true even with visibility: hidden...
Elements with visibility: hidden or opacity: 0 are considered to be visible, since they still consume space in the layout. During animations that hide an element, the element is considered to be visible until the end of the animation. During animations to show an element, the element is considered to be visible at the start at the animation.

Tuesday, January 25, 2011

Install gstack.

Ubuntu AMD64 doesn't come with pstack/gstack, but it can be downloaded from other RPM distributions. Here are the basic instructions that I got to work:

1. wget

2. sudo apt-get install rpm2cpio

3. sudo apt-get install gdb

4. rpm2cpio gdb-7.1-34.fc13.x86_64.rpm | cpio -idmv

These files will then be extracted into the current working directory.

5. Edit the gstack file to use #!/bin/bash instead of #!/bin/sh

Upgrading Celery v1.06 to Celery v2.1.4 (or any specific version)..

1. pip install --upgrade celery==2.1.4
2. Determine where old versions of celery/carrot are being stored:
import celery
print celery.__file__
import carrot
print carrot.__file__
3. If the files are .egg files, rename or remove them.
4. You should restart any daemons (e.g. apscheduler) that rely on Celery to avoid Python import clashes.

If you have upgraded Celery but not Carrot, you may see this message:

Value: send() got an unexpected keyword argument 'exchange'

Stack trace:
 File "/mydirs/django/core/handlers/", line 100, in get_response
   response = callback(request, *callback_args, **callback_kwargs)

 File "/home/mydir/tasks/", line 86, in finish

 File "/usr/local/lib/python2.6/dist-packages/celery/task/", line 348, in delay
   return self.apply_async(args, kwargs)

 File "/usr/local/lib/python2.6/dist-packages/celery/task/", line 364, in apply_async
   return apply_async(self, args, kwargs, **options)

 File "/usr/local/lib/python2.6/dist-packages/celery/", line 294, in _inner
   return fun(*args, **kwargs)

 File "/usr/local/lib/python2.6/dist-packages/celery/execute/", line 106, in apply_async
   expires=expires, **options)

 File "/usr/local/lib/python2.6/dist-packages/celery/", line 99, in delay_task

Celery v2.0 versus Celery v1.06

Celery v1.06 doesn't have support for the queue= keyword argument. You may try to dispatch a call (e.g. function.apply_async(kwargs={'arg1' : arg1}, queue='queue-low-priority')), but it may still be sent through the default queue. The issue is that v1.06 doesn't have support for this keyword argument. The solution? Upgrade to Celery v2.0!

celery/execute/ (v1.0.6):
extract_exec_options = mattrgetter("routing_key", "exchange",
                                   "immediate", "mandatory",
                                   "priority", "serializer",
celery/execute/ (v2.0.0):
extract_exec_options = mattrgetter("queue", "routing_key", "exchange",
                                   "immediate", "mandatory",
                                   "priority", "serializer",
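The mattrgetter helper collects the named attributes from the task into an options dict; a toy version (my approximation of its behavior, defaulting missing attributes to None) makes the difference concrete:

```python
def mattrgetter(*attrs):
    """Toy version of celery's mattrgetter: gather the named attributes
    from an object into a dict, using None when an attribute is absent."""
    return lambda obj: dict((attr, getattr(obj, attr, None))
                            for attr in attrs)

class Task(object):
    routing_key = "default"

extract_v1 = mattrgetter("routing_key", "exchange")           # v1.0.6: no "queue"
extract_v2 = mattrgetter("queue", "routing_key", "exchange")  # v2.0.0: "queue" added

task = Task()
print("queue" in extract_v1(task))  # False: v1.0.6 silently drops any queue= option
print("queue" in extract_v2(task))  # True
```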
New Task option: Task.queue

If set, message options will be taken from the corresponding entry in CELERY_QUEUES. exchange, exchange_type and routing_key will be ignored.

Python APScheduler does not follow crontab conventions...

If you've ever used Python's APScheduler, you may try to specify the day of week for jobs to trigger and wonder why it doesn't fire at the scheduled date/time. Naturally you might look at the crontab format to see how day of week is indexed.

*     *     *   *    *        command to be executed
-     -     -   -    -
|     |     |   |    |
|     |     |   |    +----- day of week (0 - 6) (Sunday=0)
|     |     |   +------- month (1 - 12)
|     |     +--------- day of        month (1 - 31)
|     +----------- hour (0 - 23)
+------------- min (0 - 59)

If you look inside apscheduler/, the days are indexed from 0-6 with Monday as the first day:

WEEKDAYS = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
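So a crontab day-of-week number has to be shifted before it means the same thing to APScheduler; a small helper (my own, not part of APScheduler) shows the mapping:

```python
WEEKDAYS = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']  # APScheduler order

def cron_to_apscheduler(cron_dow):
    """Translate a crontab day-of-week index (Sunday=0) into the
    equivalent APScheduler index (Monday=0)."""
    return (cron_dow - 1) % 7

print(WEEKDAYS[cron_to_apscheduler(0)])  # cron Sunday=0 -> 'sun'
print(WEEKDAYS[cron_to_apscheduler(1)])  # cron Monday=1 -> 'mon'
```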

The accompanying regexes are also supported (the named groups were mangled by the blog software; restored here from the APScheduler source):
value_re = re.compile(r'(?P<first>[a-z]+)(?:-(?P<last>[a-z]+))?', re.IGNORECASE)

value_re = re.compile(r'(?P<first>\d+)(?:-(?P<last>\d+))?')

They correspond to the following:
Expression types
The following table lists all the available expressions applicable in cron-style schedules.

Expression Field Description
* any Fire on every value
*/a any Fire every a values, starting from the minimum
a-b any Fire on any value within the a-b range (a must be smaller than b)
a-b/c any Fire every c values within the a-b range
xth y day Fire on the x-th occurrence of weekday y within the month
last x day Fire on the last occurrence of weekday x within the month
x,y,z any Fire on any matching expression; can combine any number of any of the above expressions

Unsubscribing from MyBo?

Unsubscribing from specific mailing groups doesn't work, but this link seemed to do the trick:

Monday, January 24, 2011

Difference between /usr/bin/env python and /usr/local/bin/python...


With #!/usr/local/bin/python, you are specifying the location of the Python executable on your machine that the rest of the script needs to be interpreted with: you are pointing to a Python located at /usr/local/bin/python.

Consider the possibility that on a different machine, Python may be installed at /usr/bin/python or /bin/python; in those cases, the above #! will fail. For those cases, we instead call the env executable with python as its argument, which determines python's path by searching $PATH:

#!/usr/bin/env python

env figures out the correct location of python (/usr/bin/python or /bin/python from $PATH) and makes that the interpreter for the rest of the script. (env is almost always located in /usr/bin, so one need not worry about env not being present at /usr/bin.)
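What env does can be approximated in Python: scan each directory on $PATH and return the first matching executable (this which() helper is illustrative, not how env is actually implemented):

```python
import os

def which(program):
    """Rough sketch of env's lookup: search $PATH for the first
    executable file with the given name."""
    for d in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(d, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

print(which("sh"))  # e.g. /bin/sh or /usr/bin/sh, depending on the machine
```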


The celeryd documentation only includes a few examples of how celeryd-multi should be invoked, so here
they are:

1. You can use this init script as a template:

Or you can invoke celeryd-multi the old-fashioned way:

a. You need a /var/run/celery directory; the pidfiles depend on it.
b. You also need to create a /var/log/celeryd directory.

2. If you type:
celeryd-multi show 2 -Q:1-1 default -Q:2-2 nightly --loglevel=INFO --pidfile=/var/run/celery/${USER} --logfile=/var/log/celeryd.${USER}%n.log
> Starting nodes...
celeryd -Q celery,default --pidfile=/var/run/celery/ -n celery1.myhost-dev-0 --loglevel=INFO --logfile=/var/log/celeryd.me1.log
celeryd -Q nightly --pidfile=/var/run/celery/ -n celery2.myhost-dev-0 --loglevel=INFO --logfile=/var/log/celeryd.me2.log

You can see how things will get invoked.

To start, you would do:
celeryd-multi start 2 -Q:1-1 celery,default -Q:2-2 nightly --loglevel=INFO --pidfile=/var/run/celery/${USER} --logfile=/var/log/celeryd.${USER}%n.log

To stop:

This should start up two separate workers: one for celery,default and one for nightly. The celeryd daemon will fork up to the number of CPUs, so if you have 4 CPUs per machine, you could have a total of 8 celery workers running. In both cases, CELERY_CONFIG_MODULE needs to be set.

Then you set up the queues in your Celery config similar to the following:

CELERY_QUEUES = {
    "default" : {
        "exchange" : "default",
        "binding_key" : "default" },
    "nightly" : {
        "exchange" : "nightly",
        "binding_key" : "nightly",
        "exchange_type" : "direct" },
}


RabbitMQ and the delay function..

When you call delay() on your tasks, Celery will publish a message to RabbitMQ with the task info. All of these messages get thrown into a default queue ('celery' in the original configuration), and Celery workers listening to these queues will start to process them. You need to have celeryd running in order for these workers to be running. By default, celeryd will also listen on the 'celery' queue.

In addition, Celery also publishes messages to record the result/statuses of these messages onto your vhost into a RabbitMQ exchange called 'celeryresults', and then creates a temporary queue named with the task ID (without the dashes). If you set CELERY_RESULT_PERSISTENT to transient (by default, it's set this way), then all of these messages will be stored in-memory in this latter queue. If you set it to persistent, then all of these messages get stored both in-memory and on disk. Neither scenario works great if you have a lot of tasks getting dispatched each night. Normally Celery is supposed to set a 1-day expiration on these result queues (it can be decreased via TASK_RESULT_EXPIRES), but I don't see anywhere in the code that actually sends periodic instructions to delete these queues unless explicitly directed to do so.
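The result queue name is simply the task ID (a UUID) with the dashes removed, something like:

```python
import uuid

task_id = str(uuid.uuid4())              # what Celery assigns to a dispatched task
result_queue = task_id.replace("-", "")  # the temporary result queue name
print(task_id, "->", result_queue)
```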

What ignore_result does is simply not publish these status messages into a temporary queue. Then no messages are created, and no additional memory gets consumed. Obviously you can't get back the results if ignore_result=True, and doing a get() on a task with this flag set will just make your code block forever. The GitHub code also shows that messages were not even being purged during apply_async() if one did a get(), so queues would not have been deleted even if we had attempted to get the result back from tasks that we dispatched.

If you are dispatching tasks within tasks and don't want result messages to be generated, you can set CELERY_IGNORE_RESULT to True. However, if you still want Celery results to be set in some places, here are things that you can do:

1. Add @task(ignore_result=True) to your task definition instead of @task.

2. Add an options={} to the task function call as a keyword argument.

 def test_myfunction_task(self, options={}):
3. Most of your tasks should be changed from .delay() to .apply_async(). Your task will then be equipped to send to different queues.

 def test_group_task(self, options={}):
        """ Checks graph profile crawler """
        individual_task.apply_async(kwargs={'arg1' : arg1,
                                            'options' : options},
                                    **options)
a. FYI - The delay() function is simply a wrapper for apply_async(); apply_async() accepts extra options, whereas with delay() you aren't allowed to specify options such as which queue to send to.

b. You need to wrap all keyword arguments into a kwargs={} dictionary. Celery will unwrap this dictionary and pass in **kwargs into your function.

c. You'll notice two parameters: options get passed into kwargs, and we also unwrap **options into the apply_async. If we weren't dispatching more tasks within tasks, then we wouldn't need to pass 'options' into kwargs.

Another FYI -- it turns out that Celery injects a bunch of other keyword arguments into each task call (i.e. task_name), so if you try to pass **kwargs into nested tasks, calling apply_async will inject task_name again: not only does 'task_name' live inside **kwargs, but it also gets passed as another keyword argument into the apply_async call:

def test_group_task(arg1, options={}, **kwargs):
    individual_task.apply_async(**kwargs, **options)  # fails -- equivalent to something like apply_async({'task_name' : 'orig_func'}, task_name='new_func')
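The clash is reproducible with plain Python: passing the same name both inside **kwargs and as an explicit keyword raises a TypeError (the stub below stands in for apply_async):

```python
def apply_async_stub(**options):
    """Stand-in for apply_async(): just echoes the keywords it received."""
    return options

kwargs = {'task_name': 'orig_func'}  # imagine Celery injected this on the outer call
try:
    apply_async_stub(task_name='new_func', **kwargs)
except TypeError as exc:
    print("clash: %s" % exc)  # ...got multiple values for keyword argument...
```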

Sunday, January 23, 2011

More bug fixes in RabbitMQ

- queue.declare and exchange.declare raise precondition_failed rather
  than not_allowed when attempting to redeclare a queue or exchange
  with parameters different than those currently known to the broker

Bug in camqadm -- too many values to unpack

Apparently this GitHub issue in camqadm prevents you from declaring exchanges beyond the defaults of durable=False, auto_delete=False, internal=False:

passive: the exchange will not get declared but an error will be thrown if it does not exist.
durable: the exchange will survive a broker restart.
auto-delete: the exchange will get deleted as soon as there are no more queues bound to it. 
Exchanges to which queues have never been bound will never get auto deleted.
camqadm queue.declare myqueue no yes no no

Parameter exchange.declare.internal (bit)
Ordinal: 7
Domain: bit
Label: create internal exchange
If set, the exchange may not be used directly by publishers, but only when bound to other exchanges.
Internal exchanges are used to construct wiring that is not visible to applications.

Where AMQP queues & exchanges are defined...

    def declare(self):
        """Declares the queue, the exchange and binds the queue to                                                                                                                                                                           
        the exchange."""
        arguments = None
        routing_key = self.routing_key
        if self.exchange_type == "headers":
            arguments, routing_key = routing_key, ""

        if self.queue:
            self.backend.queue_declare(queue=self.queue, durable=self.durable,
        if self.queue:
        self._closed = False
        return self

Friday, January 21, 2011

Using regex with grep..

grep -E '2011-01-2[0-1]'

The -E flag denotes extended regular expressions (POSIX ERE).

git rm --cached

Use it only to remove new files from the staging area (and only in the case of a new file):

git rm --cached FILE
Use git rm --cached only for new files accidentally added.

Otherwise, use git reset HEAD .....

RabbitMQ talks

Three valuable links to understanding RabbitMQ and Celery:

In Celery the routing_key is the key used to send the message, while binding_key is the key the queue is bound with. In the AMQP API they are both referred to as the routing key.

Celery automatically creates the entities necessary for the queues in CELERY_QUEUES to work (except if the queue’s auto_declare setting is set to False). This statement implies then it will do all the work to talk to the RabbitMQ server to create the defined exchange name.

Useful commands:
camqadm queue.purge 
celeryctl inspect active
rabbitmqctl list_queues name memory
rabbitmqctl -p  list_queues
rabbitmqctl -p  list_exchanges

The non-AMQP backends like ghettoq do not support exchanges, so they require the exchange to have the same name as the queue. Using this design ensures it will work for them as well.

Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.

Sansa Clip+ MP3 clip and audio books..

The Sansa Clip+ 2GB MP3 player is small and extremely portable, but one of its quirks is that it has two USB transfer modes, MSC and MTP. MTP appears to be a haphazard attempt to provide DRM support for Windows Media Player, and MSC mode has to be set manually in order to access the MP3 player in Ubuntu Linux.

One issue not reported about this device is why files copied through MTP mode have a special MTP folder that can be accessed, which is the only way you can get files to auto-advance. In other words, copying files through MSC mode, which is how one can copy files via Ubuntu, works but the MP3 files don't advance to the next one. If you're listening to audio books, this issue can be a royal pain. The workaround right now is to copy files via MTP mode (i.e. on Windows), and access this MTP folder to play.

Sandisk seems to have issues in their firmware in allowing files copied via MTP mode to sequence automatically....

Thursday, January 20, 2011

Remote debugging Celery tasks...

Use import pdb; pdb.set_trace()? Well, if you've ever tried to debug Celery task queue processes, there's an rdb import that lets you debug processes as they run:

How Django deals with Unicode

If you've read the Django documentation about Unicode, it reads something like the following:

All of Django’s database backends automatically convert Unicode strings into the appropriate encoding for talking to the database. They also automatically convert strings retrieved from the database into Python Unicode strings. You don’t even need to tell Django what encoding your database uses: that is handled transparently.

So what's happening internally for CharField types? It turns out that within the fields, smart_unicode() is invoked on the field, converting the value back to a Python unicode type (through get_prep_value()). The entire SQL query gets generated as a Unicode object, and at the very end, if we're using a MySQL back-end, encode(charset) is invoked on the query:


    def execute(self, query, args=None):
        """Execute a query.

        query -- string, query to execute on server
        args -- optional sequence or mapping, parameters to use with query.

        Note: If args is a sequence, then %s must be used as the
        parameter placeholder in the query. If a mapping is used,
        %(key)s must be used as the placeholder.

        Returns long integer rows affected, if any
        """
        charset = db.character_set_name()
        if isinstance(query, unicode):
            query = query.encode(charset)
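In other words, the query lives as a unicode object right up until the MySQL driver encodes it with the connection's charset; a minimal sketch (utf8 here is a stand-in for db.character_set_name()):

```python
charset = "utf8"  # stand-in for db.character_set_name()

# Django hands the driver a text (unicode) query...
query = u"INSERT INTO names (name) VALUES ('caf\u00e9')"

# ...and the driver encodes it to raw bytes at the last moment
raw = query.encode(charset)
print(type(raw).__name__)
```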

bash script to check disk space

Need a script to check whether a certain drive is out of space or hits a certain threshold?

# Disk capacity threshold (percentage)
ALERT=90

# Extract the disk space percentage capacity -- df dumps things out, sed strips the first line,
# awk grabs the fifth column, and cut removes the trailing percentage.
DISKSPACE=`df -H /dev/sda1 | sed '1d' | awk '{print $5}' | cut -d'%' -f1`

if [ ${DISKSPACE} -ge ${ALERT} ]; then
    echo "Running low on disk space....${DISKSPACE}% capacity."
else
    echo "Still enough disk space....${DISKSPACE}% capacity."
fi
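The same check can be sketched in Python with the standard library (mount point and threshold here are placeholders):

```python
import os

def disk_usage_percent(path):
    """Return the used-space percentage for the filesystem containing path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total

ALERT = 90  # threshold percentage
usage = disk_usage_percent("/")
if usage >= ALERT:
    print("Running low on disk space: %.0f%% used" % usage)
else:
    print("Still enough disk space: %.0f%% used" % usage)
```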

Wednesday, January 19, 2011

Regular expressions and using find

The find Unix command has the ability to do searches with different regular expression formats, but how does one actually use it? What if we want to search for a specific pattern (i.e. a date prefix such as 2010-01-)?

I decided to try out a few combinations:

# Match against the full date/time
find ${MYDIR} -maxdepth 1 -mindepth 1 -regextype posix-egrep -regex ${MYDIR}[0-9]{4}-[0-9]{2}-[0-9]{2}.* -type d -print
# For POSIX awk types, we must use [[:digit:]] to match the first 4 digits.  We can match with {4}
find ${MYDIR} -maxdepth 1 -mindepth 1 -regextype posix-awk -regex ${MYDIR}[[:digit:]]{4}.* -type d -print
# Emacs regex (default) can only support [0-9] and can't do multiple matches (i.e. {4})?
find ${MYDIR} -maxdepth 1 -mindepth 1 -regex ${MYDIR}[0-9].* -type d -print
# Extended regexp: we can also use [0-9], and use {4}
find ${MYDIR} -maxdepth 1 -mindepth 1 -regextype posix-extended -regex ${MYDIR}[0-9]{4}.* -type d -print

The link below summarizes all the various differences between the regular expressions:

According to the document, \d matching can apparently only be done in Python, Perl, and Tcl. We use egrep if we want to specify multiple matching characters with {4} (instead of \{4\} in regular grep). Within Emacs itself multiple matching works, but at the command line with find's default emacs regextype it doesn't seem to work.

M2Crypto and Facebook's SDK hangs...

Another issue is that even after fixing the issue above, the Facebook SDK still hangs when trying to read a response with urllib.urlopen() and M2Crypto:
import M2Crypto
import urllib
print urllib.urlopen("" + urllib.urlencode({'access_token': '[your access token here]'})).read()

...but this works:
import M2Crypto
import urllib

urllib._urlopener = urllib.FancyURLopener()
urllib._urlopener.addheader('Connection', 'close')
u = urllib.urlopen("" + urllib.urlencode({'access_token': '[your access token here]'}))
data =
print data

The urllib._urlopener code is essentially what is done by urllib.urlopen(). The major difference is that we add a Connection: close header.

If the Facebook SDK code would just change to use urllib2 instead of urllib, the issue goes away (most likely because M2Crypto isn't hijacking that code path). urllib2 automatically adds a Connection: close header:
import M2Crypto
import urllib
import urllib2
print urllib2.urlopen("" + urllib.urlencode({'access_token': '[your access token here]'})).read()

M2Crypto and the Facebook Python SDK...

While doing a basic Facebook Graph API graph.get_object("me") call, I noticed the following stack trace:
Traceback (most recent call last):
    profile = graph.get_object("me")
  File "/home/projects/external/", line 88, in get_object
    return self.request(id, args)
  File "/home/projects/external/", line 173, in request
    urllib.urlencode(args), post_data)
  File "/usr/lib/python2.6/", line 87, in urlopen
  File "/usr/lib/python2.6/", line 206, in open
    return getattr(self, name)(url)
  File "/usr/lib/pymodules/python2.6/M2Crypto/", line 58, in open_https
  File "/usr/lib/python2.6/", line 892, in endheaders
  File "/usr/lib/python2.6/", line 764, in _send_output
  File "/usr/lib/python2.6/", line 723, in send
  File "/usr/lib/pymodules/python2.6/M2Crypto/", line 50, in connect
    self.sock.connect((, self.port))
  File "/usr/lib/pymodules/python2.6/M2Crypto/SSL/", line 172, in connect
    if not check(self.get_peer_cert(), self.addr[0]):
  File "/usr/lib/pymodules/python2.6/M2Crypto/SSL/", line 61, in __call__
    raise NoCertificate('peer did not return certificate')
NoCertificate: peer did not return certificate

What was the cause of the NoCertificate error? Why did the issue occur only periodically? The mystery deepened when I traced things down to the M2Crypto library. It turns out that if you import the M2Crypto library before running a Facebook Graph API call, M2Crypto will do two things:

1) First, it will alias urllib2 to M2Crypto.m2urllib in the file of the M2Crypto package, so that a later import urllib2 actually resolves to M2Crypto's version (see M2Crypto/
# Backwards compatibility.                                                                                                                                                                                                                   
urllib2 = m2urllib

2) Second, it will modify the urllib open_https() call to use the M2Crypto open_https() call, replacing the standard HTTPS opener with its own (see M2Crypto/
from urllib import *
# Minor brain surgery.                                                                                                                                             
URLopener.open_https = open_https
You can verify this issue by commenting/uncommenting the last lines before invoking a Facebook Graph API call:

import urllib
orig = urllib.URLopener.open_https
import M2Crypto.m2urllib
urllib.URLopener.open_https = orig   # uncomment this line back and forth
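This "brain surgery" is ordinary Python monkey-patching: replacing an attribute on the class changes behavior for every caller until it is restored. A toy version (class and function names here are stand-ins):

```python
class Opener(object):  # stand-in for urllib.URLopener
    def open_https(self, url):
        return "stdlib: %s" % url

def m2_open_https(self, url):  # stand-in for M2Crypto's replacement
    return "m2crypto: %s" % url

orig = Opener.open_https          # keep a reference, as in the snippet above
Opener.open_https = m2_open_https
print(Opener().open_https("https://example"))  # patched for all instances

Opener.open_https = orig          # restore the original behavior
print(Opener().open_https("https://example"))
```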

In theory, M2Crypto with urllib should work fine. M2Crypto is meant to replace the https connection to provide features such as SSL certificate validation (the urllib that comes with the Python 2.6 code does not -- see You can do the following to verify that M2Crypto should work to replace urllib without any glitches:
import M2Crypto
import urllib

urllib.urlopen("" + urllib.urlencode({'access_token': '[insert your token here]'}), None)
After more investigation, I found that I could reproduce the issue when using socket.setdefaulttimeout(), which we use in other parts of our code to extend the timeout of our socket connections.
import M2Crypto
import urllib

import socket
socket.setdefaulttimeout(30)  # any non-None default timeout triggers the bug
urllib.urlopen("" + urllib.urlencode({'access_token': '[insert your token here]'}), None)

You will see the "no peer certificate" error. It turns out to be a known, reported bug in the M2Crypto library.

The offending code is in the M2Crypto/ code, which appears to use the self.blocking flag to determine whether to be in blocking/non-blocking mode:
    def __init__(self, ctx, sock=None):
        self.blocking = self.socket.gettimeout()

    def write(self, data):
        if self.blocking:
            return self._write_bio(data)
        return self._write_nbio(data)
    sendall = send = write

    def read(self, size=1024):
        if self.blocking:
            return self._read_bio(size)
        return self._read_nbio(size)
    recv = read

Another good workaround is here:

Also, I tried to install the latest version posted at
sudo apt-get install swig
pip install --upgrade M2Crypto

...and the issue still seems not to have been resolved. It seems as if M2Crypto has had this problem for a while, but despite a few proposed fixes nothing has yet been integrated. So for the time being, the only way to resolve it without patching your own code is to avoid using settimeout() when performing urlopen's with M2Crypto libraries imported.

FYI -- if you import Google's GData Python code, it will use the M2Crypto library too. So even if you think you're not using it, some other library may be importing it!

Tuesday, January 18, 2011

What BASH_SOURCE does..

What does BASH_SOURCE do? The documentation is really vague, so you really need to try it.

echo ${BASH_SOURCE[0]}

echo ${BASH_SOURCE[0]} ${BASH_SOURCE[1]}

/tmp$ source
rhu@rhu-linux:/tmp$ source

Monday, January 17, 2011

Hudson / Git plug-in issues

Update: This issue has now been resolved in the GitHub repository.

One of the issues I've noticed with the Git plug-in for Hudson (Git Hudson plug-in v1.1.4) is that the wrong email address is used for committers. The email address used is based on the person's full name, rather than the email address listed on the git commit. This issue was reported on Hudson's site but there was no resolution (despite it being an open ticket since August 2010), so I decided to look into it.

The main error message that gets reported is "Illegal whitespace in address in string". What appears to be happening is that Hudson is using the full name of the Git committer for the email prefix instead of the Git committer/author address:
Incorrect mail address used for sending email notifications

I managed to trace the issue to from the Git plug-in and the Hudson files. The quick short-term fix is simply to invoke user.addProperty() regardless of whether user.getProperty(Mailer.UserProperty.class) has already been set. You can review the diff from this GitHub fork:

You can recompile the Hudson plug-in by downloading Maven2 and building the .hpi with mvn. You then shut down Hudson, copy the target/git.hpi into your plugins/ dir, rename/remove the current expanded plugin dir, and create a .pinned file to inform Hudson not to replace the .hpi build when you restart:
Deploying a custom build of a core plugin
Stop Hudson
Copy the custom HPI file to $HUDSON_HOME/plugins
Remove the previously expanded plugin directory
Create an empty file called .hpi.pinned - e.g. maven-plugin.hpi.pinned
Start Hudson
The following diff worked for me. The old code in the Git plug-in attempted to check to see if the Mailer.UserProperty.class had already been set. The problem with this approach is that this class is set to null when a User object is first initialized.
diff --git a/src/main/java/hudson/plugins/git/ b/src/main/java/hudson/plugins/git/
index 7fb0327..3e9455f 100644
--- a/src/main/java/hudson/plugins/git/
+++ b/src/main/java/hudson/plugins/git/
@@ -232,13 +232,16 @@ public class GitChangeSet extends ChangeLogSet.Entry {
         User user = User.get(csAuthor, true);
-        // set email address for user if needed
-        if (fixEmpty(csAuthorEmail) != null && user.getProperty(Mailer.UserProperty.class) == null) {
-            try {
-                user.addProperty(new Mailer.UserProperty(csAuthorEmail));
-            } catch (IOException e) {
-                // ignore error
-            }
-        }
+        // won't work because it creates a recursive loop:
+        //     String adrs = fixEmpty(user.getProperty(Mailer.UserProperty.class).getAddress());
+        // set email address for user -- replaces the default one stored in the Users table (null)
+        if (fixEmpty(csAuthorEmail) != null /* && adrs == null */) {
+            try {
+                // addProperty() will overwrite the existing property
+                user.addProperty(new Mailer.UserProperty(csAuthorEmail));
+            } catch (IOException e) {
+                // ignore error
+            }
+        }
         return user;

The problem is that when the User object is initialized inside Hudson, it also calls a load() function, which attempts to initialize all UserProperty extensions, including the one defined by the Mailer UserProperty:

private User(String id, String fullName) {
        this.id = id;
        this.fullName = fullName;
        // allocate default instances if needed.
        // doing so after load makes sure that newly added user properties do get reflected
        for (UserPropertyDescriptor d : UserProperty.all()) {
            if (getProperty(d.clazz) == null) {
                UserProperty up = d.newInstance(this);

By invoking addProperty(new Mailer.UserProperty(csAuthorEmail)) directly, Hudson will first look for an existing Mailer.UserProperty, remove it, and replace it with the new one. Therefore, this patch causes the e-mail address of the Git committer to take the highest precedence. We can see how this works by looking at the addProperty() function in core/src/main/java/hudson/model/

     * Updates the user object by adding a property.
    public synchronized void addProperty(UserProperty p) throws IOException {
        UserProperty old = getProperty(p.getClass());
        List<UserProperty> ps = new ArrayList<UserProperty>(properties);
        if (old != null)
            ps.remove(old);
        ps.add(p);
        properties = ps;
        save();
    }
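The remove-and-replace semantics are what let the patch win: addProperty() drops any existing property of the same class before appending the new one. A small Python sketch of that behavior (the class and function names here are made up, purely to mirror the Java logic):

```python
class MailerUserProperty:
    """Stand-in for Hudson's Mailer.UserProperty (hypothetical, for illustration)."""
    def __init__(self, address):
        self.address = address

def add_property(properties, new_prop):
    # Mirror User.addProperty(): remove any existing property of the
    # same class, then append the new one.
    kept = [p for p in properties if type(p) is not type(new_prop)]
    kept.append(new_prop)
    return kept

# The User constructor leaves behind a property with a null address...
props = [MailerUserProperty(None)]
# ...which the patched plug-in simply overwrites with the committer's address.
props = add_property(props, MailerUserProperty("jsmith@example.com"))
# props now holds a single property whose address is "jsmith@example.com"
```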
More background about the issue: If a default e-mail suffix is configured in Hudson, then Hudson will attempt to use the full name and append the suffix, causing the javax.mail package to throw an exception. The code that resolves this info is the getAddress() function of the Mailer UserProperty, which first checks whether the private emailAddress variable is set before attempting to look elsewhere:

        public String getAddress() {
            if (emailAddress != null)
                return emailAddress;

            // try the inference logic
            return MailAddressResolver.resolve(user);
        }

Since emailAddress is null, the MailAddressResolver tries a bunch of guesses (i.e. parsing the string to see if there is an '@') before it settles on using the full name + default hostname suffix. If you don't have a User explicitly defined in your Hudson configuration (i.e. a config.xml inside your users/ dir, or the user listed within the Manage Users screen), Hudson will set the default e-mail address to null when the User object is initialized. When this happens, the Git plug-in will check to see if the user already has a Mailer.UserProperty. Since it does, the if check in the Hudson Git plug-in will fail and the Git committer's e-mail address will never be added.
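To see why the fallback produces an illegal address, here is a rough Python sketch of the resolution order described above (the function name and default suffix are hypothetical; the real logic lives in the Mailer UserProperty's getAddress() and in MailAddressResolver):

```python
def resolve_address(stored_address, full_name, default_suffix="@example.com"):
    # Sketch of the fallback: prefer the stored address; otherwise guess
    # from the user's name, finally appending the default suffix.
    if stored_address is not None:
        return stored_address
    if "@" in full_name:                  # the name already looks like an address
        return full_name
    return full_name + default_suffix     # full name + suffix -- may contain spaces

addr = resolve_address(None, "John Smith")
# addr is "John Smith@example.com" -- the embedded space is what makes
# javax.mail complain about "Illegal whitespace in address in string".
```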

What's the long term fix? The issue with the patch above is that if a User is explicitly defined in Hudson and is also a Git committer, then the Git committer's email will always win. What if we want to keep it such that the e-mail address of an explicitly defined Hudson user still has the highest priority?

The issue is the getAddress() function within the core/src/main/java/hudson/tasks/ file, which declares the emailAddress variable as private final. Therefore, emailAddress can only be set once (for auto-created users it will be set to null), and it cannot be modified from outside the class.

public static class UserProperty extends hudson.model.UserProperty {
    /**
     * The user's e-mail address.
     * Null to leave it to default.
     */
    private final String emailAddress;

    public UserProperty(String emailAddress) {
        this.emailAddress = emailAddress;
    }

    public String getAddress() {
        if (emailAddress != null)
            return emailAddress;

        // try the inference logic
        return MailAddressResolver.resolve(user);
    }

    public static final class DescriptorImpl extends UserPropertyDescriptor {
        public String getDisplayName() {
            return Messages.Mailer_UserProperty_DisplayName();
        }

        public UserProperty newInstance(User user) {
            return new UserProperty(null);
        }

        public UserProperty newInstance(StaplerRequest req, JSONObject formData) throws FormException {
            return new UserProperty(req.getParameter("email.address"));
        }
    }
}

If emailAddress were not final and/or private, then we could modify the existing emailAddress, or check whether emailAddress is null before overwriting it with the Git committer's email address.

FYI -- calling getAddress() unfortunately has the side-effect of calling MailAddressResolver.resolve(), which will attempt to call getProjects(), which will in turn eventually call getAddress() again. So either we make emailAddress public or we need to expose a different function that explicitly returns the emailAddress without calling MailAddressResolver.resolve(). A fix that makes emailAddress accessible, or a function that sets the UserProperty only when emailAddress is null, would then allow us to use Hudson-created accounts as the last word in e-mail commit notification messages.

Note: Hudson depends on source-code management plug-ins to implement a function called getAuthor(), which is the main function at the heart of this issue. When the mailer wants to look for the culprits who broke the code (getCulprits), it delegates to the plug-in's getAuthor() function, which in turn can add a Mailer.UserProperty() to the User object for use in mail notifications. Unfortunately, there seems to have been an unexpected consequence in allowing null Mailer.UserProperty's to be instantiated by the User constructor. Since I could find no other source-code management plugin for Hudson that implemented anything more complex than User.get() calls, that may be why this issue was not caught earlier.

Saturday, January 15, 2011

Compiling Hudson plug-ins..

While trying to debug this issue with the Git plug-in for Hudson, I had to download the source code package and compile it to create my own custom plug-in. There isn't a very good HOWTO out there for building Hudson plug-ins on Ubuntu, so I'll outline the steps here.

First, you need the Sun (now Oracle) Java version, not the OpenJDK version that comes with Ubuntu. Hudson appears to have dependencies on some Sun-specific libraries, so it's best to replace OpenJDK with the Sun version. Be aware that Ubuntu may have OpenJDK installed already, so to ensure you're running the right version, I did an apt-get remove of the openjdk-6-jdk and openjdk-6-jre packages before installing the Sun JDK (the full JDK is needed, not just the JRE).

Next, you need Maven2, a build and dependency management tool for Java. One thing I did notice is that downloads are horrendously slow with the stock /etc/maven2/settings.xml, but renaming it helped solve the issue. Then I did a git clone of the Hudson Git plug-in code (Maven will fetch the Hudson core dependencies), and then copied the dotm2_settings.xml into place.

1. sudo apt-get install maven2
2. sudo mv /etc/maven2/settings.xml /etc/maven2/settings.xml.orig
3. git clone
4. cd Hudson-GIT-plugin
5. cp dotm2_settings.xml ~/.m2/settings.xml
6. mvn compile

After watching Maven download a large number of files, these steps allowed me to compile the Hudson Git plug-in.

Friday, January 14, 2011

How the Git plug-in's changelog parsing works..

The parsing code (do a git clone of the plug-in repository) works great. The way the plug-in works is to run "git whatchanged --no-abbrev -M --pretty=raw" (see GitAPI.java). The commit structures that are parsed appear as follows:

tree 92b84d3b8bba31b80f011bf9f957190ad14b299c
parent 15a87109192f74dc9f82af23c9ffa047d8379622
author John Smith  1294997088 -0800
committer John Smith  1294997548 -0800

Within the changelog parser, regexes extract the author, committer, email address, timestamp, and time zone. Everything looks great. Within the getAuthor() method, there is a line that attempts to create the user if the name does not exist. This line creates a User object with the csAuthor field passed in, and a boolean second parameter that determines whether to create a new account if the user id does not already exist.
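The extraction can be sketched in Python; the plug-in does the equivalent in Java, and the e-mail address below is an invented example since the raw header format is `author NAME <EMAIL> TIMESTAMP TZ`:

```python
import re

# Parse one "author" header line from `git whatchanged --pretty=raw` output.
AUTHOR_RE = re.compile(
    r"^author (?P<name>.*) <(?P<email>[^>]*)> (?P<ts>\d+) (?P<tz>[+-]\d{4})$")

line = "author John Smith <jsmith@example.com> 1294997088 -0800"
m = AUTHOR_RE.match(line)
name, email = m.group("name"), m.group("email")
ts, tz = int(m.group("ts")), m.group("tz")
# name == "John Smith", email == "jsmith@example.com",
# ts == 1294997088, tz == "-0800"
```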

We can see a similar pattern in the Hudson core account sign-up code (you need to git clone the Hudson source):

// register the user
User user = createAccount(si.username, si.password1);
user.addProperty(new Mailer.UserProperty(si.email));
return user;

Thursday, January 13, 2011

Installing Hudson with Django

The latest Hudson Ubuntu repositories seem to be available from the main Hudson site. The mirror site at one point seemed to have all of its links unavailable. They seem to be downloadable now, but in the interim, I ended up downloading from the first site.

The Caktus blog does a pretty good job explaining the basic steps of getting Hudson up and running. One thing to do after installing the Debian package is to change HTTP_PORT to 8081. This way, when using the Django test dev server, you don't need to remind yourself to change the port # to 8080.

If you setup a virtual domain, you can then setup a reverse proxy with the following Apache config.

<VirtualHost *:80>
  ProxyPass         /  http://localhost:8081/
  ProxyPassReverse  /  http://localhost:8081/
  ProxyRequests     Off

  <Proxy *>
    Order deny,allow
    Allow from all
  </Proxy>
</VirtualHost>
Another thing to note is that when using virtualenvwrapper, invoking the 'workon' function within the Hudson build script triggers a non-zero exit code. If you invoke the shell script with "#!/bin/bash -ex", the non-zero exit code will cause Hudson to abort prematurely. A Google discussion group talks about the different ways to resolve the issue within the virtualenvwrapper script itself, but not how you can bypass it until virtualenvwrapper is truly fixed. One way is to simply append a "|| true" to the offending commands.

#!/bin/bash -ex
source /usr/local/bin/virtualenvwrapper.sh || true
workon hudson-dev || true

./manage.py test setup --settings=settings.hudson

Tuesday, January 11, 2011

Mutable arguments in function definitions...

This one always seems weird....


Default parameter values are evaluated when the function definition is executed. This means that the expression is evaluated once, when the function is defined, and that that same “pre-computed” value is used for each call. This is especially important to understand when a default parameter is a mutable object, such as a list or a dictionary: if the function modifies the object (e.g. by appending an item to a list), the default value is in effect modified. This is generally not what was intended. A way around this is to use None as the default, and explicitly test for it in the body of the function, e.g.:

def whats_on_the_telly(penguin=None):
  if penguin is None:
    penguin = []
  penguin.append("property of the zoo")
  return penguin
Function call semantics are described in more detail in section Calls. A function call always assigns values to all parameters mentioned in the parameter list, either from positional arguments, from keyword arguments, or from default values. If the form “*identifier” is present, it is initialized to a tuple receiving any excess positional parameters, defaulting to the empty tuple. If the form “**identifier” is present, it is initialized to a new dictionary receiving any excess keyword arguments, defaulting to a new empty dictionary.

It is also possible to create anonymous functions (functions not bound to a name), for immediate use in expressions. This uses lambda forms, described in section Lambdas. Note that the lambda form is merely a shorthand for a simplified function definition; a function defined in a “def” statement can be passed around or assigned to another name just like a function defined by a lambda form. The “def” form is actually more powerful since it allows the execution of multiple statements.

Programmer’s note: Functions are first-class objects. A “def” form executed inside a function definition defines a local function that can be returned or passed around. Free variables used in the nested function can access the local variables of the function containing the def. See section Naming and binding for details.
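The pitfall described above is easy to reproduce: the default list is created once when the def statement executes, so every call that relies on the default mutates the same object:

```python
def broken(item, bucket=[]):   # the default list is built once, at def time
    bucket.append(item)
    return bucket

first = broken("a")
second = broken("b")
# first and second are the very same list object, and both are ["a", "b"]
```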

Thursday, January 6, 2011

Downloading YouTube clips, stitching together MP3 files, and encoding MP4 files in Ubuntu 10.04 Lucid

Ever wanted to download YouTube clips, extract various audio clips from MP3 files and stitch them together, or simply re-encode DVD-clips from your home video collection into MP4 files so you can share?

All of these tools are available for download in Ubuntu. However, many of them don't work out of the box. Here are the tools of the trade:

youtube-dl - The version that comes with Ubuntu Lucid seems to have trouble downloading YouTube clips. You often see "ERROR: format not available for video" error messages. It seems that YouTube keeps changing their API interface, rendering the distro version obsolete. There is a bug filed about this, but it may be better simply to download the latest Python script from the project site, which is what I had to do to get anything to work.

ffmpeg - The one that comes with Ubuntu 10.04 has issues encoding AAC, MP3, and MPEG4 files, effectively rendering the distribution version not very useful. What you want to do is install Medibuntu, the media package not distributed with vanilla Ubuntu because of licensing restrictions. Medibuntu comes with an ffmpeg version that is compiled to support AAC/MP3/MP4 encoding (assuming the accompanying libraries are installed). You can use ffmpeg to extract MP3 clips from video files.

handbrake-gtk / handbrake-cli - You use this tool to encode DVD clips into MP4 files. The version that comes with Ubuntu 10.04 has the "Start" button disabled, which makes it unusable too. Video file sharing sites like SmugMug limit uploads to 10 minutes and 1GB, but you can save the output into separate files using handbrake-gtk by designating the chapter #'s to save. You can use the Handbrake GTK interface to add these chapters into a queue, so that you can batch everything up. You can also use the Handbrake CLI interface, such as in the following example, which searches for all VOB files, creates a HandBrake CLI command for each file, and dumps it to stdout (you can pipe to 'sh' to execute):
find *.VOB | awk '{print "HandBrakeCLI -e x264  -q 20.0 -r 29.97 --pfr  -a 1 -E faac -B 160 -6 dpl2 -R Auto  -D 0.0 -f mp4 -4 -X 1024 --loose-anamorphic -m -i \"/mnt/cdrom/VIDEO_TS/"$0"\" -o \"/destdir/"$0".mp4\""}' 

You may be better off invoking "HandBrakeCLI --preset-list" and "HandBrakeCLI --help" to see how these commands work. But basically this find command will look for all VOB files and output them to their appropriate MP4 files.

For appending MP3 files together, first make sure they use the same bit-rate encoding:
ffmpeg -i originalA.mp3 -f mp3 -ab 128k -ar 44100 -ac 2 intermediateA.mp3
ffmpeg -i originalB.mp3 -f mp3 -ab 128k -ar 44100 -ac 2 intermediateB.mp3
Then, at runtime, concat your files together:

cat intermediateA.mp3 intermediateB.mp3 > output.mp3
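The cat step works because (ignoring ID3 tags) MP3 frames are self-contained, so a plain byte-level concatenation is enough once both files share the same encoding parameters. The same operation as a Python sketch (file names are hypothetical):

```python
def concat_files(paths, out_path):
    # Byte-for-byte concatenation, equivalent to `cat a b > out`.
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                out.write(src.read())

# concat_files(["intermediateA.mp3", "intermediateB.mp3"], "output.mp3")
```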

SmugMug video uploading: Also, keep in mind that SmugMug gives you very little feedback about why an "Upload failed" with their new HTML5 drag-and-drop uploader. You have to rename the files to .mp4 because it doesn't recognize the file format automatically, and limit the size of your uploads to less than 1GB.

Wednesday, January 5, 2011

What the oauth_verifier is..

To make sure that the resource owner granting access is the same
resource owner returning back to the client to complete the process,
the server MUST generate a verification code: an unguessable value
passed to the client via the resource owner and REQUIRED to complete
the process. The server constructs the request URI by adding the
following REQUIRED parameters to the callback URI query component:

The temporary credentials identifier received from the client.

The verification code.

If the callback URI already includes a query component, the server
MUST append the OAuth parameters to the end of the existing query.

For example, the server redirects the resource owner's user-agent to
make the following HTTP request:

GET /cb?x=1&oauth_token=hdk48Djdsa&oauth_verifier=473f82d3 HTTP/1.1

If the client did not provide a callback URI, the server SHOULD
display the value of the verification code, and instruct the resource
owner to manually inform the client that authorization is completed.
If the server knows a client to be running on a limited device, it
SHOULD ensure that the verifier value is suitable for manual entry.
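The construction rule above (append oauth_token and oauth_verifier to the callback URI, preserving any existing query) can be sketched in Python; the host name is made up, and the token/verifier values are taken from the example request:

```python
from urllib.parse import urlencode

def build_callback(callback_uri, token, verifier):
    # Append the OAuth parameters to the end of any existing query,
    # per the rule quoted above.
    sep = "&" if "?" in callback_uri else "?"
    params = urlencode({"oauth_token": token, "oauth_verifier": verifier})
    return callback_uri + sep + params

url = build_callback("http://example.com/cb?x=1", "hdk48Djdsa", "473f82d3")
# url == "http://example.com/cb?x=1&oauth_token=hdk48Djdsa&oauth_verifier=473f82d3"
```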

Saturday, January 1, 2011

Installing MoinMoin on Ubuntu v10.04

If you follow the MoinMoin Ubuntu v10.04 instructions, there are simply no directions on how to get some of the features such as RecentChanges working. It turns out that you need to login as a superuser, go to the LanguageSetup page, and install the system pages.

1. First, if you're using HTTP authentication, you can autocreate logins by modifying your wiki's configuration file:

from farmconfig import FarmConfig

# now we subclass that config (inherit from it) and change what's different:
class Config(FarmConfig):
    from MoinMoin.auth import GivenAuth
    auth = [GivenAuth(autocreate=True)]

2. Next, you'll need to temporarily add yourself as the superuser:

# This is checked by some rather critical and potentially harmful actions,
# like despam or PackageInstaller action:
superuser = [u"myuserid", ]

3. Go to the LanguageSetup page, and click on "Install system pages".

From here there is a link to install packages that weren't included with the default install of MoinMoin.

Thanks for the heads-up.