Saturday, June 16, 2012

Using Cython and lxml

If you've ever had a chance to use the lxml library, you may have noticed that it's written in Cython. Cython lets you write Python-like code in .pyx files, which are translated into .c code and then compiled and linked into a shared object (.so) file that Python can import. External declarations go in .pxd files (much like .h header files), and C typedefs are also supported.

There are a few quirks about Cython that I discovered after making improvements to the library. If you're learning Cython, you'll also need some C background, since the language lets you write hybrid Python and C code.

 * You can intermingle Python objects and C objects in your code. Variables without an explicit type declaration are treated as Python objects, while declaring a variable with cdef and a C type (e.g. int, float, char) makes it a C variable:

x = 1234              # Python integer
cdef int f = 1234     # C integer

 * To declare a C function, use "cdef"; use "def" for Python functions. If the C function returns a C value, you also need to declare its return type:

cdef int my_c_func():    # returns a C integer
   return 1234

def my_py_func():        # returns a Python integer
   return 1

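 * Cython takes care of converting between Python and C objects according to the type conversion table in the Cython documentation. You can therefore move between C values and their Python counterparts, such as C strings (char *) and Python string objects (bytes), just by assigning one to the other:

# Declarations
b = 1234
cdef int a

# Assignment
a = b    # the Python integer is converted to a C int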
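Since the Cython snippet above only runs after compilation, here is a rough plain-Python sketch of the same round trip using the standard ctypes module. It is only an analogy: ctypes needs the wrapping spelled out, whereas Cython does it on assignment. The helper names round_trip_int and round_trip_bytes are made up for illustration.

```python
import ctypes


def round_trip_int(py_value):
    """Python int -> C int -> Python int (hypothetical helper)."""
    c_value = ctypes.c_int(py_value)      # explicit wrap into a C int
    return c_value.value                  # read back as a Python int


def round_trip_bytes(py_bytes):
    """Python bytes -> C string (char *) -> Python bytes (hypothetical helper)."""
    c_string = ctypes.c_char_p(py_bytes)  # pointer to a NUL-terminated C string
    return c_string.value                 # read back as a Python bytes object


print(round_trip_int(1234))
print(round_trip_bytes(b"abc"))
```

Both helpers return a value equal to what went in, which is the behaviour the assignment between a and b relies on.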
 * Cython has no unary '*' operator for dereferencing pointers. Instead of *p, use p[0]. Check out the section on "Differences between C and Cython expressions" in the Cython documentation.

cdef int a = 1

cdef int *my_ptr = &a    # take the address of a
my_ptr[0] = 2            # same as *my_ptr = 2 in C; a is now 2
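As it happens, Python's standard ctypes module uses the same p[0] convention for pointers, so the idea can be tried out without compiling anything. This is just a sketch in plain Python, not Cython; the variable names mirror the example above and the value 42 is arbitrary.

```python
import ctypes

a = ctypes.c_int(1)          # plays the role of a C int variable
my_ptr = ctypes.pointer(a)   # plays the role of an int * holding &a

before = my_ptr[0]           # read through the pointer, like *my_ptr in C
my_ptr[0] = 42               # write through the pointer, like *my_ptr = 42 in C
after = a.value              # the original variable sees the change

print(before, after)
```

Index 0 means "the thing the pointer points at", which is exactly what the unary * would give you in C.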
 * Cython performs casts with angle brackets instead of parentheses:

x = <int> 1.23

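To see what such a cast yields, here is a tiny plain-Python stand-in. It assumes the value fits in a C int, and the function name c_style_int_cast is invented for this sketch. Like a C cast from double to int, Python's int() drops the fractional part and truncates toward zero.

```python
def c_style_int_cast(value):
    """Mimic a C (int) cast on a float by dropping the fractional part."""
    return int(value)  # int() truncates toward zero, it does not round


print(c_style_int_cast(1.23))
print(c_style_int_cast(-1.23))
```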
 * Cython does not appear to convert automatically between Python lists and double pointers (char **). Therefore, if you have an array of character strings that you need to return, you need to declare a cdef function with the return type and malloc() the appropriate amount of memory. You'll also need to make sure to free() the memory after using the data to avoid memory leaks!

cdef char **convertfunc(py_array=None):
  cdef char **c_array

  if not py_array:
    py_array = []

  num_entries = len(py_array)

  # Allocate one extra slot for the terminating NULL
  c_array = <char **>python.PyMem_Malloc(sizeof(char *) * (num_entries + 1))
  if c_array is NULL:
    return NULL

  # Converting Python object to C type. Each char * points into the
  # Python string's own buffer, so py_array must outlive c_array.
  for n, p_entry in enumerate(py_array):
    c_array[n] = p_entry

  c_array[num_entries] = NULL  # last entry needs to be NULL
  return c_array

To release the allocated memory, you would need to do:

cdef char **my_char_ptr

entries = [b"abc", b"def"]    # keep a reference while the pointers are in use
my_char_ptr = convertfunc(entries)
python.PyMem_Free(my_char_ptr)
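The layout being built here (one pointer per string plus a trailing NULL) can also be sketched in plain Python with the standard ctypes module. This is an illustration only, not what lxml does; the function names to_c_string_array and from_c_string_array are hypothetical.

```python
import ctypes


def to_c_string_array(entries=None):
    """Build a NULL-terminated array of C strings from a list of bytes."""
    entries = list(entries or [])
    # One extra slot for the terminating NULL pointer.
    array_type = ctypes.c_char_p * (len(entries) + 1)
    c_array = array_type()
    for n, entry in enumerate(entries):
        c_array[n] = entry
    c_array[len(entries)] = None  # None is stored as a NULL pointer
    return c_array


def from_c_string_array(c_array):
    """Read entries back until the NULL terminator is reached."""
    result = []
    n = 0
    while c_array[n] is not None:
        result.append(c_array[n])
        n += 1
    return result


c_strings = to_c_string_array([b"abc", b"def"])
print(from_c_string_array(c_strings))
```

The main difference from the Cython version is ownership: ctypes manages this array's memory itself, so there is no counterpart to the PyMem_Free() call, whereas memory obtained from PyMem_Malloc() has to be released by hand.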

Disclaimer: some of the examples above have been created to help illustrate a concept.  If you find a typo, please send a note to correct it!

Also, if you want to recompile the lxml library, make sure you pip install Cython first. Otherwise, lxml will use the pre-generated .c files that came with the package when you type "make".
