I got this URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions
url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
I want to feed it back into urllib2.urlopen:
import urllib2
source = urllib2.urlopen(url).read()
The error I get:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence
So I tried:
source = urllib2.urlopen(url.encode("utf-8")).read()
This returns a page source, but it is not the same as the one returned for the original URL:
originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source
The comparison evaluates to False. Any ideas on how to fix this URL? How can I convert u'\xae' back to the original ®?
Answer 0 (score: 3)
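A URL must be a valid byte string, with any non-ASCII code points encoded correctly. Encode the URL to UTF-8 first, then percent-encode (URL-quote) its path: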
import urllib
import urllib2
import urlparse
originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
source = urllib2.urlopen(encoded_link).read()
Demo:
>>> import urllib
>>> import urllib2
>>> import urlparse
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> encoded_link = parsed_link.geturl()
>>> encoded_link
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
>>> source = urllib2.urlopen(encoded_link).read()
>>> len(source)
68758
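The steps in this answer could be wrapped in a small helper, sketched below. The name quote_url_path is hypothetical and not part of any library, and the try/except import is an assumption added so that the same code runs on Python 2 (as used in this thread) and on Python 3. It expects a text (unicode) URL and only quotes the path, as the answer does.

```python
try:
    # Python 2, as used in this thread
    from urllib import quote
    from urlparse import urlsplit
except ImportError:
    # Python 3 keeps the same functions in urllib.parse
    from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode the path of a text URL as UTF-8 (hypothetical helper)."""
    parts = urlsplit(url)
    # Encode the path to UTF-8 bytes first, so that the registered sign
    # u'\xae' becomes the two bytes C2 AE and is quoted as %C2%AE.
    quoted_path = quote(parts.path.encode('utf8'))
    # Scheme, host, query and fragment are left untouched.
    return parts._replace(path=quoted_path).geturl()


url = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
encoded_link = quote_url_path(url)
```

The resulting encoded_link can then be passed to urllib2.urlopen as in the demo. Note that quote also escapes a literal %, so a URL whose path is already percent-encoded would be encoded a second time by this sketch.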