BeautifulSoup returns a 403 error on some websites

Asked: 2019-01-11 21:57:21

Tags: python-3.x beautifulsoup http-status-code-403

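I don't understand why some of these websites return a 403 error.

If I visit the URLs manually, the pages load fine. There is no error message other than the 403 response, so I don't know how to diagnose the problem.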

from bs4 import BeautifulSoup
import requests    

test_sites = [
 'http://fashiontoast.com/',
 'http://becauseimaddicted.net/',
 'http://www.lefashion.com/',
 'http://www.seaofshoes.com/',
 ]

for site in test_sites:
    print(site)
    # get page source
    response = requests.get(site)
    print(response)
    #print(response.text)

Running the code above gives...

http://fashiontoast.com/
<Response [403]>
http://becauseimaddicted.net/
<Response [403]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>

Can someone help me understand the cause of the problem and how to fix it?

1 answer:

Answer 0 (score: 1)

Some sites reject GET requests whose User-Agent they do not recognize.
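
To see what a server receives by default, the request can be prepared without sending it. This is a minimal sketch that assumes only that the requests library is installed; nothing here goes over the network, and the variable names are illustrative.

```python
import requests

# Prepare the request without sending it, so the headers can be inspected offline.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", "http://fashiontoast.com/"))

# The session's default headers are merged into the prepared request.
default_agent = prepared.headers["User-Agent"]
print(default_agent)  # a string of the form "python-requests/<version>"
```

That default value identifies the client as a script rather than a browser, which is a plausible reason for a site to answer 403.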

Open the page in a browser (Chrome), right-click and choose "Inspect", then copy the User-Agent header of the GET request (shown in the Network tab).


from bs4 import BeautifulSoup
import requests

test_sites = [
 'http://fashiontoast.com/',
 'http://becauseimaddicted.net/',
 'http://www.lefashion.com/',
 'http://www.seaofshoes.com/',
 ]

with requests.Session() as se:
    # Send browser-like headers; update() keeps the session's other defaults.
    se.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    })

    # Make the requests inside the with block, before the session is closed.
    for site in test_sites:
        print(site)
        # get page source
        response = se.get(site)
        print(response)
        # print(response.text)

Output:

http://fashiontoast.com/
<Response [200]>
http://becauseimaddicted.net/
<Response [200]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>
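
Once the sites answer 200, the imported BeautifulSoup can be used on the response body. The following is only a sketch: `parse_title` and `fetch_title` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and `html.parser` is assumed as the parser. `raise_for_status()` turns a 403 into an exception, so an error page is not parsed by mistake.

```python
from bs4 import BeautifulSoup
import requests


def parse_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


def fetch_title(session, url):
    """Fetch url with the given session and return the page title."""
    response = session.get(url, timeout=10)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return parse_title(response.text)


# Offline check of the parsing step on a literal HTML string.
print(parse_title("<html><head><title>Example</title></head></html>"))  # Example
```

Inside the loop above, `fetch_title(se, site)` would replace the bare `se.get(site)` call.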