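I have URLs that contain accented characters, such as:
https://www.janes.com/...tamandaré ... and so on
When I try to request such a link, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 89: invalid continuation byte
Here is my code: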
import requests

def request_site(url):
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0)'})

if __name__ == '__main__':
    url = 'https://www.janes.com/article/87665/laad-2019-united-kingdom-s-sea-signs-mou-with-brazilian-siatt-for-tamandaré-class-corvette-torpedo-tubes'
    print(request_site(url))
Full error:
Traceback (most recent call last):
  File "D:/OneDrive/PhD/Web Crawler/playground.py", line 104, in <module>
    print(request_site(url))
  File "D:/OneDrive/PhD/Web Crawler/playground.py", line 73, in request_site
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0)'})
  File "C:\Python35\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python35\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 668, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 668, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 149, in resolve_redirects
    url = self.get_redirect_target(resp)
  File "C:\Python35\lib\site-packages\requests\sessions.py", line 115, in get_redirect_target
    return to_native_string(location, 'utf8')
  File "C:\Python35\lib\site-packages\requests\_internal_utils.py", line 25, in to_native_string
    out = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 89: invalid continuation byte
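I found many similar questions, but none of them offers a solution to this exact problem, and all of the earlier solutions target Python 2.
Answer 0 (score: 1)
You just need to percent-encode the URL, but you have to remove the http:// from url first, because quote would encode it as well: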
import urllib.parse
import requests

def request_site(url):
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0)'})

if __name__ == '__main__':
    # The scheme is left off so that quote() does not escape the ':' in 'http://'.
    url = 'www.janes.com/article/87665/laad-2019-united-kingdom-s-sea-signs-mou-with-brazilian-siatt-for-tamandaré-class-corvette-torpedo-tubes'
    # Encode to Latin-1 bytes first, so that 'é' is quoted as %E9.
    url_encode = 'http://' + urllib.parse.quote(url.encode('latin-1'))
    print(request_site(url_encode))
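
If you would rather keep the full URL in one string, a possible variation is to split it and quote only the path, so the scheme and host are never passed through quote. This is only a sketch: encode_url_path is a hypothetical helper name of mine, not part of requests or urllib, and it assumes that only the path contains non-ASCII characters and that Latin-1 is the encoding you want.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url, encoding="latin-1"):
    """Percent-encode only the path of url (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() accepts bytes and leaves '/' unescaped by default.
    encoded_path = quote(parts.path.encode(encoding))
    # Scheme, host, query and fragment are passed through unchanged.
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    url = (
        "https://www.janes.com/article/87665/laad-2019-united-kingdom-s-sea-signs-"
        "mou-with-brazilian-siatt-for-tamandar\u00e9-class-corvette-torpedo-tubes"
    )
    print(encode_url_path(url))
```

The result can then be passed to request_site as before. Whether the server accepts https with the Latin-1 form is something I have not checked, so treat that part as an assumption.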