Question

I am trying to access https://instacart.com/store/wegmans/storefront for web scraping, but when I try to log in with Python's requests library using the following code:

from requests import session
url = 'https://www.instacart.com'
payload = {
    'action': 'submit',
    'email': 'my_email@gmail.com',
    'password': 'my_password'
}

with session() as c:
    c.post(url, data=payload)
    response = c.get('https://instacart.com/store/wegmans/storefront')
    print(response.headers)
    print(response.text)

I get "Very sorry." as response.text, and the following as response.headers:

{'Date': 'Tue, 02 Jul 2019 02:58:57 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'nginx', 'Set-Cookie': 'build_sha=8f3eb623f91516ad5369c4c373e577ec406c0fa1;Path=/;', 'Cache-Control': 'no-cache', 'X-Request-Id': 'a13241fe-fdce-4eb5-bfa2-958118c7690c', 'X-Runtime': '0.007429', 'Vary': 'Origin'}

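I'm not sure what this means, but my guess is that "Very sorry." is an automatic response sent when the POST request is not recognized. The password and email work when I log in manually, and I think the 'action': 'submit' part is correct, because inspecting the login button shows that its type is "submit".

I wonder whether this has to do with instacart.com not directing you to a dedicated URL for the login page. There is a login form on the home page, but you have to click "Already have an account? Log in" before it pops up. Is that the problem, or is something wrong with my code?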

Answer 1

This seems to log in:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

session = requests.Session()

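# Load the home page first so the session picks up cookies and the CSRF token
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
# The token is exposed in <meta name="csrf-token" content="...">
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "your_email", "password": "your_password"},
        "authenticity_token": token}
# Submit the credentials together with the token
res2 = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(res2)
# Request the storefront page with the same (now logged-in) session
res3 = session.get('https://instacart.com/store/wegmans/storefront', headers=headers)
print(res3)
session.close()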
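One thing that may be worth checking in the payload above: as far as I can tell, when a dict is passed through data=, requests form-encodes only flat values, so a nested dict such as "user" is reduced to its keys rather than its values. If the server expects Rails-style bracketed field names such as user[email] (an assumption on my part; compare with what DevTools shows), the payload could be flattened first. A minimal sketch, where flatten_form is a hypothetical helper and the token value is a placeholder:

```python
from urllib.parse import urlencode


def flatten_form(payload, parent=None):
    """Flatten nested dicts into bracketed form keys, e.g. user[email]."""
    pairs = []
    for key, value in payload.items():
        name = f"{parent}[{key}]" if parent else str(key)
        if isinstance(value, dict):
            # Recurse so that {"user": {"email": ...}} becomes user[email]
            pairs.extend(flatten_form(value, name))
        else:
            pairs.append((name, value))
    return pairs


data = {
    "user": {"email": "your_email", "password": "your_password"},
    "authenticity_token": "token-from-the-meta-tag",  # placeholder value
}

flat = flatten_form(data)
print(flat)             # list of (field name, value) tuples
print(urlencode(flat))  # the form-encoded body this list corresponds to
```

requests accepts a list of tuples for data=, so the result could be passed as session.post(..., data=flatten_form(data)).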

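As @andreilozhkin said, Chrome DevTools shows exactly which payload is passed to the POST request, and it includes the "authenticity_token". I first make a GET request to http://www.instacart.com and then use that token in the POST request that logs in.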
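Note that soup.find(...).get('content') raises an AttributeError if the page has no csrf-token meta tag. Below is a rough, standard-library-only sketch of the same token lookup that returns None instead. The class name, function name, and sample HTML are made up for illustration and are not taken from the real page:

```python
from html.parser import HTMLParser


class CsrfMetaParser(HTMLParser):
    """Collects the content of <meta name="csrf-token" content="...">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Only the first csrf-token meta tag is kept
        if tag != "meta" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrf-token":
            self.token = attributes.get("content")


def extract_csrf_token(html):
    """Return the CSRF token from the HTML, or None if it is absent."""
    parser = CsrfMetaParser()
    parser.feed(html)
    return parser.token


# Made-up sample markup standing in for the home page
SAMPLE_HTML = """
<html><head>
<meta name="csrf-param" content="authenticity_token" />
<meta name="csrf-token" content="abc123==" />
</head><body></body></html>
"""

print(extract_csrf_token(SAMPLE_HTML))
print(extract_csrf_token("<html><head></head></html>"))
```

With a real response, the call would be extract_csrf_token(res1.text), and a None result is a hint that the page layout has changed.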

Hope this helps.

Answer 2

I believe Kamal's answer no longer works. If you look at the POST request, besides the email, password, and authenticity_token, there is another field named captcha.

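I think this is an invisible Google reCAPTCHA field. The login page does load Google's reCAPTCHA library (https://www.google.com/recaptcha/api.js), but it has no field with the usual g-recaptcha class. I'm not sure how to obtain the captcha value.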
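The observation above can be checked programmatically. The following is only a sketch using the standard library; the RecaptchaProbe class, the probe_recaptcha helper, and the sample markup are invented for illustration and do not reproduce the actual login page:

```python
from html.parser import HTMLParser


class RecaptchaProbe(HTMLParser):
    """Records whether a page loads reCAPTCHA and has a g-recaptcha element."""

    def __init__(self):
        super().__init__()
        self.loads_library = False
        self.has_widget = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        if tag == "script" and "google.com/recaptcha/api.js" in src:
            self.loads_library = True
        classes = (attributes.get("class") or "").split()
        if "g-recaptcha" in classes:
            self.has_widget = True


def probe_recaptcha(html):
    """Return (loads_library, has_widget) for the given HTML."""
    probe = RecaptchaProbe()
    probe.feed(html)
    return probe.loads_library, probe.has_widget


# Made-up markup: the library is loaded, but no g-recaptcha element exists
SAMPLE = (
    '<html><head>'
    '<script src="https://www.google.com/recaptcha/api.js"></script>'
    '</head><body><form><input name="captcha" type="hidden"></form>'
    '</body></html>'
)

print(probe_recaptcha(SAMPLE))
```

A result where the library is loaded but no widget is found would match the situation described here, where the token is presumably produced by JavaScript at submit time rather than by a visible widget.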

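However, another way to bypass the login entirely is to use the browser_cookie3 library. You can then log in with your browser and load its cookies when making requests.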

import requests
import browser_cookie3

# make sure to login to instacart.com before running this script
cookies = browser_cookie3.chrome(domain_name='instacart.com')

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}

url = 'https://www.instacart.com/v3/retailers'  # example endpoint that requires authentication

req = requests.get(url, headers=headers, cookies=cookies)
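Before relying on the loaded cookies, it can help to confirm that the jar actually contains unexpired cookies for the site. The sketch below assumes browser_cookie3 returns a standard http.cookiejar.CookieJar (which I believe it does). It builds a stand-in jar by hand so that it runs without a browser, and the cookie names are invented:

```python
import time
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, expires):
    """Build a Cookie for the demo jar (a stand-in for real browser cookies)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


def live_cookie_names(jar, domain):
    """Sorted names of unexpired cookies set for `domain` or its subdomains."""
    names = []
    for cookie in jar:
        host = cookie.domain.lstrip(".")
        matches = host == domain or host.endswith("." + domain)
        if matches and not cookie.is_expired():
            names.append(cookie.name)
    return sorted(names)


now = int(time.time())
jar = CookieJar()
jar.set_cookie(make_cookie("session_id", "abc", ".instacart.com", now + 3600))
jar.set_cookie(make_cookie("old_token", "xyz", ".instacart.com", now - 3600))
jar.set_cookie(make_cookie("other", "1", ".example.com", now + 3600))

print(live_cookie_names(jar, "instacart.com"))
```

With the script above, the call would be live_cookie_names(cookies, 'instacart.com'). An empty list suggests the browser session has expired or the wrong browser profile was read.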
