Question

I'm trying to scrape information from www.instacart.com using BeautifulSoup. This is my code so far:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
response = session.get('https://www.instacart.com')

content = BeautifulSoup(response.text, "html.parser")

print(content)

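I'm using a session because I plan to make authenticated requests later. This code has worked on every other site I've tried, but for Instacart, for some reason, it just prints a "very sorry" message in my VSCode console. This is my first time using Python, and googling this particular error got me nowhere. Can anyone help?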

Answer 1

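Some websites do not allow web scraping, and Instacart may be one of them.

According to Instacart's terms, which you can read here:

... you may only access the Services through the interfaces that Instacart provides for that purpose (for example, you may not "scrape" the Services through automated means or "frame" any part of the Services) ...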
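
The terms of service are what actually govern here, but many sites also publish a machine-readable robots.txt that a script can check before fetching a page. Below is a minimal sketch using the standard library's urllib.robotparser. The function name `is_allowed`, the sample rules, and the example.com URLs are made up for illustration; they are not Instacart's real rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() expects an iterable of lines, not a single string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only to demonstrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /store/
"""

blocked = is_allowed(SAMPLE_ROBOTS, "my-bot", "https://example.com/store/items")
permitted = is_allowed(SAMPLE_ROBOTS, "my-bot", "https://example.com/about")
print(blocked, permitted)
```

In a real script you would download the site's own /robots.txt and pass its text in. Passing this check does not override what the terms of service say.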

Answer 2

To make the server think your script is not a bot, just send a user-agent header.
Be warned: if you overdo it with the number of requests, they can block your IP.

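import requests

session = requests.Session()
# Override the default python-requests User-Agent header.
headers = {'user-agent': "I'm tricking you"}
# Call get() on the session (not requests.get) so cookies persist for later requests.
response = session.get('https://www.instacart.com', headers=headers)
print(response.text)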
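
Since the question uses a session and BeautifulSoup, the same idea can be sketched with the header set once on the session, so every later request sends it. This is only a sketch under assumptions: `make_session`, `fetch_soup`, and the `BROWSER_UA` string are hypothetical names and values, and nothing here guarantees that Instacart will accept the request.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical browser-like User-Agent string; substitute any value you prefer.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_soup(session, url, timeout=10):
    """Fetch url with the session and return the parsed HTML.

    Raises requests.HTTPError on a 4xx/5xx response instead of
    silently parsing an error page.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Example usage (performs a real network request, so it is left commented out):
# soup = fetch_soup(make_session(), "https://www.instacart.com")
# print(soup.title)
```

Calling raise_for_status() makes a blocked request fail with an exception instead of quietly parsing an error page, which is easier to debug.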
