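I don't understand why some of these sites return a 403 error.
If I visit the URLs manually, the pages load fine. There is no error message other than the 403 response, so I don't know how to diagnose the problem.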
from bs4 import BeautifulSoup
import requests

test_sites = [
    'http://fashiontoast.com/',
    'http://becauseimaddicted.net/',
    'http://www.lefashion.com/',
    'http://www.seaofshoes.com/',
]

for site in test_sites:
    print(site)
    # get page source
    response = requests.get(site)
    print(response)
    # print(response.text)
Running the code above gives:
http://fashiontoast.com/
<Response [403]>
http://becauseimaddicted.net/
<Response [403]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>
Can anyone help me understand the cause of the problem and how to fix it?
Answer 0 (score: 1)
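Some sites reject GET requests whose User-Agent they do not recognize.
Open the page in a browser (Chrome), right-click and choose "Inspect", then copy the User-Agent header of the GET request (shown in the "Network" tab).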
from bs4 import BeautifulSoup
import requests

with requests.Session() as se:
    # Send browser-like headers with every request made through this session
    se.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    })

    test_sites = [
        'http://fashiontoast.com/',
        'http://becauseimaddicted.net/',
        'http://www.lefashion.com/',
        'http://www.seaofshoes.com/',
    ]

    for site in test_sites:
        print(site)
        # get page source
        response = se.get(site)
        print(response)
        # print(response.text)
Output:
http://fashiontoast.com/
<Response [200]>
http://becauseimaddicted.net/
<Response [200]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>
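
If you want to check what a script actually sends before blaming the site, one option is to build the request without sending it and look at its headers. The sketch below is only an illustration: `preview_headers` is a hypothetical helper name, not part of requests, and it assumes the 403 responses are caused by the default User-Agent, which the sites themselves do not confirm. It makes no network calls.

```python
import requests

# Browser User-Agent string copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
)


def preview_headers(session, url):
    """Hypothetical helper: return the headers the session would send
    for a GET to `url`, without actually sending the request."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return dict(prepared.headers)


# A fresh session uses the library defaults.
default_session = requests.Session()
default_headers = preview_headers(default_session, "http://fashiontoast.com/")
print(default_headers["User-Agent"])  # identifies itself as python-requests/<version>

# A session with the browser User-Agent merged into its defaults.
browser_session = requests.Session()
browser_session.headers.update({"User-Agent": BROWSER_UA})
browser_headers = preview_headers(browser_session, "http://fashiontoast.com/")
print(browser_headers["User-Agent"])
```

After a real request, `response.request.headers` holds the headers that were sent, so the same comparison can be done on a live 403 response.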