Thursday, May 5, 2011

Facebook's shortener on Facebook pages, hashbangs, and urlopen in Python

Facebook's link shortener is actually a CNAME for another host, which sits behind some type of nginx front-end that requires the Host: header to be passed in. urlopen() sets Host: for you (this is done in the AbstractHTTPHandler base class), but if you're testing with telnet, you need to set the Host: header explicitly.

It seems you have to add a Host: line to the request headers:
telnet 80
Connected to
Escape character is '^]'.
GET /[your shortened link here] HTTP/1.1
Host: [shortener hostname here]

HTTP/1.1 301 Moved
Server: nginx
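
If you'd rather script the telnet test, here is a minimal sketch of the request text you would be typing by hand. The function name build_get_request, the host shortener.example, and the path /abc123 are all made up for illustration; the only point is that the Host: line has to be there (HTTP/1.1 requires it).

```python
def build_get_request(host, path):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,      # the header the front-end insists on
        "Connection: close",    # ask the server to close when done
        "",                     # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Hypothetical host and path, not the real shortener.
request = build_get_request("shortener.example", "/abc123")
print(request.decode("ascii"))
```

These bytes could be written to a plain socket connected to port 80, which is roughly what urlopen() ends up doing on your behalf.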

What happens, though, if you wish to use urlopen() on a Facebook shortened link that points to a Facebook page? The big issue is that Facebook is starting to use #! (hashbang) fragments in its page URLs, and so urlopen() will throw 404 errors.

>>> import urllib2
>>> urllib2.urlopen('[your shortened link here]')
Traceback (most recent call last):
    result = func(*args)
  File "/usr/lib/python2.6/", line 516, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
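
To see which part of a hashbang URL is the troublemaker, it helps to split one apart. This is only a sketch: the URL below is made up for illustration, and the try/except import is there so the snippet runs on both Python 2 (urlparse) and Python 3 (urllib.parse).

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2, as used in this post

# Hypothetical redirect target, not a real Facebook URL.
url = "http://www.example.com/home.php?ref=short#!/pages/Example/12345"
parts = urlparse(url)

print(parts.path)      # /home.php
print(parts.query)     # ref=short
print(parts.fragment)  # !/pages/Example/12345
```

Everything after the '#' lands in the fragment, which is meant for the browser only. The server should only ever be asked for the path and query.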

If we want to fix this issue, we have to build a custom HTTPRedirectHandler. For now, we just strip out the entire URL fragment if the redirect URL contains a '#' symbol.

import urllib2
import urlparse


class CustomHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    # Called whenever a redirect response (such as a 301) is received.

    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        parsed_url = urlparse.urlparse(newurl)

        # Strip out the hash fragment, since fragments are never (by
        # specification) sent to the server.  If you do send one, a 404 error
        # can occur.  urllib2.urlopen() will also die a glorious death if you
        # try, so you must remove it.
        # Facebook does not really conform to the Google standard, so we can't
        # send the fragment as _escaped_fragment_=key=value.

        # If the redirect URL contains a '#', rebuild it from the scheme, host,
        # and path only.  Note that this also drops any query string.
        if '#' in newurl:
            newurl = "%s://%s%s" % (parsed_url.scheme, parsed_url.netloc, parsed_url.path)
        return urllib2.HTTPRedirectHandler.redirect_request(self, req, fp, code, msg, hdrs, newurl)
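
One thing to watch: rebuilding the URL from the scheme, host, and path throws away any query string along with the fragment. If the query matters to you, urldefrag() removes only the fragment. The sketch below compares the two approaches on a made-up URL; the helper names strip_to_path and strip_fragment_only are just for illustration, and the import fallback lets it run on Python 2 or 3.

```python
try:
    from urllib.parse import urlparse, urldefrag   # Python 3
except ImportError:
    from urlparse import urlparse, urldefrag       # Python 2


def strip_to_path(url):
    """Same reconstruction as the handler: scheme, host, and path only."""
    parsed = urlparse(url)
    return "%s://%s%s" % (parsed.scheme, parsed.netloc, parsed.path)


def strip_fragment_only(url):
    """Drop just the '#...' part and keep the query string."""
    return urldefrag(url)[0]


# Hypothetical redirect target, not a real Facebook URL.
url = "http://www.example.com/home.php?ref=short#!/pages/Example/12345"
print(strip_to_path(url))        # http://www.example.com/home.php
print(strip_fragment_only(url))  # http://www.example.com/home.php?ref=short
```

Which one you want depends on whether the page you're after needs its query parameters.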

We can then do:
opener = urllib2.build_opener(CustomHTTPRedirectHandler())
req = urllib2.Request('[your shortened link here]')
response = opener.open(req)

Also, urlopen() just does not like the '#' symbol and will report a 404. Why just isn't obvious until you step through the code, or install this redirect handler and add breakpoints to see what's going on.
