Thursday, May 5, 2011

Facebook's on.fb.me shortener, Facebook pages, hashbangs, and urlopen() in Python

Facebook's link shortener is actually a CNAME to bit.ly, and bit.ly has set up some kind of nginx front end that requires the Host: on.fb.me header to be passed in. urlopen() sets a Host: header for you (this is done in the AbstractHTTPHandler base class), but if you're testing with telnet, you need to set the Host: header explicitly.

With telnet, it seems you have to add Host: on.fb.me to the request headers yourself:
telnet on.fb.me 80
Trying 168.143.174.97...
Connected to cname.bit.ly.
Escape character is '^]'.
GET /[your shortened link here] HTTP/1.1
Host: on.fb.me

HTTP/1.1 301 Moved
Server: nginx
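
The telnet session above can be approximated from Python with a raw socket, which makes the explicit Host: header easy to see. This is only a minimal sketch: the function names build_get_request and fetch_status_line are made up for illustration, the Connection: close header is an addition that was not in the telnet session, and '/abc' stands in for a real shortened link.

```python
import socket


def build_get_request(path, host):
    """Return the bytes of a minimal HTTP/1.1 GET request with an explicit Host header."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,       # the header the nginx front end insists on
        "Connection: close",     # assumption: ask the server to close after replying
        "",
        "",                      # blank line terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_status_line(path, host="on.fb.me", port=80, timeout=10):
    """Send the request over a plain socket and return the first line of the reply."""
    sock = socket.create_connection((host, port), timeout)
    try:
        sock.sendall(build_get_request(path, host))
        response = sock.recv(4096)  # the status line should arrive in the first chunk
    finally:
        sock.close()
    return response.split(b"\r\n", 1)[0]
```

Calling fetch_status_line('/abc') performs the same exchange as the telnet session; build_get_request on its own never touches the network, so it is handy for checking exactly what gets sent.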

What happens, though, if you want to call urlopen() on a Facebook shortened link that points to a Facebook page? The big issue is that Facebook is starting to use #! (hashbang) fragments in its page URLs, so the redirect target contains a '#' and urlopen() fails with a 404 error:

>>> import urllib2
>>> urllib2.urlopen('http://on.fb.me/[your shortened link here]')
Traceback (most recent call last):
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 516, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

To fix this, we have to build a custom HTTPRedirectHandler. For now, it just strips the fragment from the redirect URL whenever that URL contains a '#' symbol.

import urllib2
import urlparse

class CustomHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    # If a redirect happens within a 301, we deal with it here.

    def redirect_request(self, req, fp, code, msg, hdrs, newurl):

        parsed_url = urlparse.urlparse(newurl)

        # See http://code.google.com/web/ajaxcrawling/docs/getting-started.html
        #
        # Strip out the hash fragment, since fragments are never (by
        # specification) sent to the server.  If you do send one, a 404 error
        # can occur.  urllib2.urlopen() also will die a glorious death if you
        # try, so you must remove it.  See
        # http://stackoverflow.com/questions/3798422 for more info.
        # Facebook does not really conform to the Google standard, so we can't
        # send the fragment as _escaped_fragment_=key=value.

        # Strip out the URL fragment and reconstruct everything if a '#' exists.
        # Note: only scheme, host, and path are kept, so any query string on
        # the redirect URL is dropped as well.
        if '#' in newurl:
            newurl = "%s://%s%s" % (parsed_url.scheme, parsed_url.netloc, parsed_url.path)
        return urllib2.HTTPRedirectHandler.redirect_request(self, req, fp, code, msg, hdrs, newurl)
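
If the redirect URL also carries a query string that you want to keep, the standard library's urldefrag() may be a gentler way to drop only the fragment. The sketch below is an alternative to the string reconstruction in the handler, not what the handler does; strip_fragment is a hypothetical helper name, the Facebook-style URL in the usage note is made up, and the try/except import is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag      # Python 2


def strip_fragment(url):
    """Return url without its '#...' fragment, leaving the query string intact."""
    # urldefrag() returns a (url_without_fragment, fragment) pair.
    return urldefrag(url)[0]
```

For example, strip_fragment('http://www.facebook.com/pages/Example/123?ref=ts#!/pages/Example/123') keeps the '?ref=ts' part, and a URL without a '#' comes back unchanged.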

We can then do:
opener = urllib2.build_opener(CustomHTTPRedirectHandler())
req = urllib2.Request('http://on.fb.me/[your shortened link here]')
print opener.open(req).read()
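
A redirect handler like this can also be checked without touching the network by calling redirect_request() directly with a made-up redirect target. The following is a rough sketch under a few assumptions: the class name FragmentStrippingRedirectHandler and the make_opener helper are invented for illustration, the import shim lets it run on Python 3 (where urllib2 became urllib.request) as well as Python 2, and it uses urldefrag() rather than rebuilding the URL by hand.

```python
try:
    import urllib.request as urllib_request  # Python 3
    from urllib.parse import urldefrag
except ImportError:
    import urllib2 as urllib_request          # Python 2
    from urlparse import urldefrag


class FragmentStrippingRedirectHandler(urllib_request.HTTPRedirectHandler):
    """Drop the '#...' fragment from a redirect target before following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # urldefrag() returns (url_without_fragment, fragment); keep the first part.
        newurl = urldefrag(newurl)[0]
        return urllib_request.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)


def make_opener():
    """Build an opener that follows redirects through the handler above."""
    return urllib_request.build_opener(FragmentStrippingRedirectHandler())


# Offline check: simulate a 301 to a hashbang URL and inspect the new request.
original = urllib_request.Request("http://on.fb.me/example")
redirected = FragmentStrippingRedirectHandler().redirect_request(
    original, None, 301, "Moved", {},
    "http://www.facebook.com/pages/Example/123#!/pages/Example/123")
print(redirected.get_full_url())
```

The printed URL should no longer contain the '#!' part, which is a quick way to confirm the handler is doing its job before pointing make_opener() at a real shortened link.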

Also, urlopen() just does not like the '#' symbol and will report a 404 error. It isn't obvious why until you step through the urllib2.py code, or install this redirect handler and add breakpoints to see what's going on:
http://stackoverflow.com/questions/3589003/urllib2-urlopen-throws-404-exception-for-urls-that-browser-opens
