It seems you have to add a Host: on.fb.me header to the request:
    telnet on.fb.me 80
    Trying 168.143.174.97...
    Connected to cname.bit.ly.
    Escape character is '^]'.
    GET /[your shortened link here] HTTP/1.1
    Host: on.fb.me

    HTTP/1.1 301 Moved
    Server: nginx
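The telnet session above amounts to sending a few lines of text terminated by a blank line. As a rough sketch of what goes over the wire, the snippet below builds that request string without opening a connection. The function name build_get_request and the path /abc123 are made up for illustration and are not from the original session.

```python
def build_get_request(host, path):
    """Return the raw text of a minimal HTTP/1.1 GET request."""
    lines = [
        "GET %s HTTP/1.1" % path,
        "Host: %s" % host,  # HTTP/1.1 requires a Host header
        "",                 # blank line marks the end of the headers
        "",
    ]
    # HTTP uses CRLF line endings.
    return "\r\n".join(lines)


# '/abc123' is a placeholder path, not a real shortened link.
request_text = build_get_request("on.fb.me", "/abc123")
print(repr(request_text))
```

The session shows on.fb.me resolving to cname.bit.ly, so the Host header is presumably how the server tells which short domain the request is for.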
What happens, though, if you wish to use urlopen() on a Facebook shortened link that points to a Facebook page? The big issue is that Facebook is starting to use #! in its page URLs, so urlopen() will throw 404 errors.
    >>> import urllib2
    >>> urllib2.urlopen('http://on.fb.me/[your shortened link here]')
    Traceback (most recent call last):
        result = func(*args)
      File "/usr/lib/python2.6/urllib2.py", line 516, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 404: Not Found
To fix this issue, we have to build a custom HTTPRedirectHandler. For now, we just strip out the entire fragment if the redirect URL contains a '#' symbol.
    import urllib2
    import urlparse

    class CustomHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
        # If a redirect happens within a 301, we deal with it here.
        def redirect_request(self, req, fp, code, msg, hdrs, newurl):
            parsed_url = urlparse.urlparse(newurl)

            # See http://code.google.com/web/ajaxcrawling/docs/getting-started.html
            #
            # Strip out the hash fragment, since fragments are never (by
            # specification) sent to the server. If you do send one, a 404 error
            # can occur. urllib2.urlopen() also will die a glorious death if you
            # try, so you must remove it. See
            # http://stackoverflow.com/questions/3798422 for more info.
            # Facebook does not really conform to the Google standard, so we can't
            # send the fragment as _escaped_fragment_=key=value.

            # Strip out the URL fragment and reconstruct everything if a hash tag
            # exists. Note that this rebuild keeps only the scheme, host, and path.
            if newurl.find('#') != -1:
                newurl = "%s://%s%s" % (parsed_url.scheme, parsed_url.netloc, parsed_url.path)

            return urllib2.HTTPRedirectHandler.redirect_request(self, req, fp, code, msg, hdrs, newurl)
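The handler above rebuilds the URL from the scheme, host, and path only, so a query string on the redirect target would be dropped along with the fragment. If that matters, one possible alternative is the standard library's urldefrag(), which removes only the fragment. The sketch below is not from the original code: strip_fragment is a hypothetical helper name, the example.com URL is made up, and the import fallback is only there so the snippet runs on both Python 2 (urlparse) and Python 3 (urllib.parse).

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2, as used in this post


def strip_fragment(url):
    """Return url without its '#fragment' part; the query string is kept."""
    # urldefrag() returns a (url_without_fragment, fragment) pair.
    return urldefrag(url)[0]


# Hypothetical redirect target with a query string and a '#!' fragment.
cleaned = strip_fragment('http://example.com/pages/demo?ref=ts#!/pages/demo')
print(cleaned)
```

Inside redirect_request(), a line such as newurl = strip_fragment(newurl) could stand in for the manual rebuild. This has not been tried against Facebook's redirects, so treat it as a starting point.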
We can then do:
    opener = urllib2.build_opener(CustomHTTPRedirectHandler())
    req = urllib2.Request('http://on.fb.me/[your shortened link here]')
    print opener.open(req).read()
Also, urlopen() simply does not handle the '#' symbol and reports a 404 error. This isn't obvious until you step through the urllib2.py code, or install this redirect handler and add breakpoints to see what's going on.
http://stackoverflow.com/questions/3589003/urllib2-urlopen-throws-404-exception-for-urls-that-browser-opens