Does anyone know of a library for fixing "broken" URLs? When I try to open a URL such as
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
urllib2.urlopen chokes and gives me an HTTPError traceback. Does anyone know of a library that can fix these sorts of things?
Answer 0 (score: 2)
Something like this:
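# Python 2 (urlparse module, print statement)
import re
import urlparse

urls = '''
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
'''.split()

def main():
    for u in urls:
        pieces = list(urlparse.urlparse(u))
        # Collapse any leading run of dots and slashes in the path to a single '/'
        pieces[2] = re.sub(r'^[./]*', '/', pieces[2])
        # Drop the fragment (the '#stuff' part)
        pieces[-1] = ''
        print urlparse.urlunparse(pieces)

main()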
It prints, as you wanted:
http://www.domain.com/page.html
http://www.domain.com/page.html
http://www.domain.com/page.html
which seems to roughly match what you need, if I've understood you correctly.
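Note that the regex above only cleans up the start of the path, so something like /a/../b or /a//b in the middle would pass through unchanged. If that matters, here is a rough sketch of a segment-by-segment cleanup. It assumes Python 3, where urlparse lives in urllib.parse, and fix_url is just a name made up for this example, not a function from any library:

```python
from urllib.parse import urlsplit, urlunsplit

def fix_url(url):
    """Collapse '.', '..' and empty path segments, and drop the fragment."""
    parts = urlsplit(url)
    segments = []
    for seg in parts.path.split('/'):
        if seg == '..':
            # Go up one level, but never above the root
            if segments:
                segments.pop()
        elif seg and seg != '.':
            # Skip empty segments (from '//') and '.' segments
            segments.append(seg)
    path = '/' + '/'.join(segments)
    # Keep a trailing slash if the original path had one
    if segments and parts.path.endswith('/'):
        path += '/'
    # Passing an empty fragment removes the '#...' part
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ''))

if __name__ == '__main__':
    for u in ('http://www.domain.com/../page.html',
              'http://www.domain.com//page.html',
              'http://www.domain.com/page.html#stuff'):
        print(fix_url(u))
```

For the three URLs in the question this should give the same http://www.domain.com/page.html each time. Whether collapsing '//' is safe depends on the server, since some sites treat doubled slashes as significant, so treat that part as a judgment call.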
Answer 1 (score: 1)