Thursday, May 5, 2011

Facebook's on.fb.me shortener on Facebook pages, hashbangs, and urlopen in Python

Facebook's link shortener is actually a CNAME to bit.ly, and bit.ly has set up some type of nginx front-end that requires the Host: on.fb.me header to be passed in. urlopen() sets a Host: header for you (this is done in the AbstractHTTPHandler base class), but if you're testing with telnet, you need to set the Host: header explicitly.

With telnet, it seems you have to add Host: on.fb.me to the request headers yourself:
telnet on.fb.me 80
Trying 168.143.174.97...
Connected to cname.bit.ly.
Escape character is '^]'.
GET /[your shortened link here] HTTP/1.1
Host: on.fb.me

HTTP/1.1 301 Moved
Server: nginx
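
The same request can be put together by hand in Python, which makes it easier to see what the telnet session is sending. This is only a rough sketch: build_get_request and fetch_status_line are hypothetical helper names (not part of urllib2), the Connection: close header is my own addition so the server hangs up after one response, and the path you pass in stands in for your own shortened link.

```python
import socket


def build_get_request(path, host):
    """Return the raw bytes of a minimal HTTP/1.1 GET with an explicit Host header."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,       # the header the nginx front-end appears to require
        "Connection: close",     # assumption: ask the server to close after one reply
        "",                      # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_status_line(host, path, port=80, timeout=10):
    """Send the request over a plain socket and return the first response line."""
    sock = socket.create_connection((host, port), timeout)
    try:
        sock.sendall(build_get_request(path, host))
        data = b""
        # Read until the status line is complete or the server closes the socket.
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    finally:
        sock.close()
    return data.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

Calling fetch_status_line('on.fb.me', '/[your shortened link here]') should then return the status line of the response, the same line telnet prints first in the session above.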

What happens, though, if you wish to use urlopen() on a Facebook shortened link that points to a Facebook page? The big issue is that Facebook is starting to use #! (hashbangs) in its page URLs, so the URL you get redirected to contains a fragment and urlopen() fails with a 404 error:

>>> import urllib2
>>> urllib2.urlopen('http://on.fb.me/[your shortened link here]')
Traceback (most recent call last):
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 516, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

To fix this issue, we have to build a custom HTTPRedirectHandler. For now, we just strip out the entire fragment if the redirect URL contains a '#' symbol.

import urllib2
import urlparse

class CustomHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    # Called whenever the server answers with a redirect (such as the 301 above).

    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        # See http://code.google.com/web/ajaxcrawling/docs/getting-started.html
        #
        # Strip out the hash fragment, since fragments are never (by
        # specification) sent to the server.  If you do send one, a 404 error can
        # occur.  urllib2.urlopen() will also die a glorious death if you try, so
        # you must remove it.  See http://stackoverflow.com/questions/3798422 for
        # more info.  Facebook does not really conform to the Google standard, so
        # we can't send the fragment as _escaped_fragment_=key=value.

        # If a fragment exists, rebuild the URL from scheme, host, and path only
        # (note that this also drops any query string).
        if '#' in newurl:
            parsed_url = urlparse.urlparse(newurl)
            newurl = "%s://%s%s" % (parsed_url.scheme, parsed_url.netloc, parsed_url.path)
        return urllib2.HTTPRedirectHandler.redirect_request(self, req, fp, code, msg, hdrs, newurl)

We can then do:
opener = urllib2.build_opener(CustomHTTPRedirectHandler())
req = urllib2.Request('http://on.fb.me/[your shortened link here]')
print opener.open(req).read()
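
The handler rebuilds the URL from its scheme, host, and path, which throws away any query string along with the fragment. If you would rather keep the query, the standard library's urldefrag() might be a gentler way to do the stripping. The sketch below is only an illustration: strip_fragment is a hypothetical helper name, the Facebook URLs in the usage note are made up, and the import is written so that it works with both the urlparse module used here and its Python 3 home in urllib.parse.

```python
try:
    from urllib.parse import urldefrag   # Python 3 location
except ImportError:
    from urlparse import urldefrag       # Python 2, as used in this post


def strip_fragment(url):
    """Return url without its '#...' fragment, leaving the query string intact."""
    # urldefrag() returns a (url_without_fragment, fragment) pair.
    return urldefrag(url)[0]
```

With a made-up hashbang URL such as 'http://www.facebook.com/pages/example?ref=ts#!/pages/example', strip_fragment() keeps everything up to and including '?ref=ts' and drops the rest. A URL with no '#' in it comes back unchanged, so the function could stand in for the if block inside redirect_request().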

Also, urlopen() just does not like the '#' symbol and will report a 404 error. It just isn't obvious until you step through the urllib2.py code, or install this redirect handler and add breakpoints to see what's going on:
http://stackoverflow.com/questions/3589003/urllib2-urlopen-throws-404-exception-for-urls-that-browser-opens
