Web scraping Crunchbase: access denied even when using a User-Agent header

Time: 2019-04-18 15:44:14

Tags: python web-scraping beautifulsoup python-requests wget

I am trying to scrape Crunchbase to find the total funding amount for certain companies. Here is a link as an example.

At first I just tried using Beautiful Soup, but I kept getting this error message:

  

Access to this page has been denied because we believe you are using automation tools to browse the website.

So I looked up how to fake a browser visit and changed my code, but I still get the same error. What am I doing wrong?

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

1 answer:

Answer 0: (score: 2)

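Overall, your code looks fine. The website you are trying to scrape appears to require a more complete set of headers than the single User-Agent you are sending. The following code should solve your problem: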

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content)
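
Since a blocked request can still return an HTML page, it may be worth checking whether you actually got the company page before handing the body to BeautifulSoup. Below is a minimal sketch of that idea, not a guaranteed fix. The names `BROWSER_HEADERS`, `BLOCK_MARKERS`, `looks_blocked` and `fetch_html` are hypothetical helpers introduced here, and both the marker text (taken from the error message in the question) and the 403/429 status check are assumptions about how the site signals a block.

```python
from typing import Mapping

# Same browser-like headers as in the snippet above.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

# Assumption: the block page contains this phrase (from the error in the question).
BLOCK_MARKERS = ("access to this page has been denied",)


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when a response looks like a bot-detection page."""
    # Assumption: 403 and 429 are the status codes used to refuse automated clients.
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def fetch_html(url: str, headers: Mapping[str, str] = BROWSER_HEADERS,
               timeout: float = 10.0) -> str:
    """Fetch a page with browser-like headers and fail loudly if it was blocked."""
    import requests  # imported here so looks_blocked() can be used on its own

    # A Session sends the same headers on every request and keeps cookies.
    with requests.Session() as session:
        session.headers.update(headers)
        response = session.get(url, timeout=timeout)

    if looks_blocked(response.status_code, response.text):
        raise RuntimeError(f"Request to {url} appears to have been blocked "
                           f"(HTTP {response.status_code})")
    return response.text
```

You would then call `fetch_html(url)` with the `url` from the snippet above and pass the returned string to `BS(html, "html.parser")`. If it raises, the headers alone were not enough for that request.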

Hope this helps!