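How do I scrape data from a website that requires a login?

Time: 2020-05-04 13:46:35

Tags: python web-scraping beautifulsoup

I am a complete beginner trying to scrape data for the first time. I have watched some videos and read a number of articles on how to scrape data. This is the code I have written so far: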

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://mijn.makelaarsland.nl/aanbod/kaart'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parser
page_soup = soup(page_html, "html.parser")
page_soup.body.div

When I try to parse the data, this is all I get back:

<div class="login-background"></div>

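I have watched many videos and tried to write code that makes it all work, but I do not understand what is going on. Maybe someone can help me and tell me what I am doing wrong.

Here is some information that may be useful:

This is the login URL: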
LOGIN_URL = "https://mijn.makelaarsland.nl/inloggen"


content-type: application/x-www-form-urlencoded

An overview of the network page when I right-click on 'Inspect'

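2 answers:

Answer 0 (score: 0)

As I wrote in the comments, I recommend the requests Python package. It has excellent documentation, and you can find many tutorials online. Log in to the website inside a requests.Session(), navigate to the page you need, and then scrape it with beautifulsoup.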

Below is a code sample adapted from https://stackoverflow.com/a/17633072/5666087:

import requests
from bs4 import BeautifulSoup as soup

# Fill in your details here to be posted to the login form.
payload = {
    "MyAccount.Username": "username",
    "MyAccount.Password": "password"
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post("https://mijn.makelaarsland.nl/inloggen", data=payload)
    # An authorized request.
    r = s.get("https://mijn.makelaarsland.nl/aanbod/kaart")
    print("status code:", r.status_code)
    page_soup = soup(r.text, "html.parser")
    page_soup.body.div
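
One way to tell whether the login actually worked is to check whether the response still contains the login page marker shown in the question (`<div class="login-background">`). The following is only a minimal sketch of such a check. The helper name `is_login_page` is made up for illustration, it assumes the site keeps using that class name on its login page, and it uses the standard library's `html.parser` so that it runs without a network connection or extra packages.

```python
from html.parser import HTMLParser


class _LoginMarkerFinder(HTMLParser):
    """Sets `found` when a tag carrying the given CSS class is seen."""

    def __init__(self, marker_class):
        super().__init__()
        self.marker_class = marker_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.marker_class in classes:
            self.found = True


def is_login_page(html, marker_class="login-background"):
    """Return True if the HTML contains an element with the marker class."""
    finder = _LoginMarkerFinder(marker_class)
    finder.feed(html)
    return finder.found
```

Inside the `with` block above, `is_login_page(r.text)` returning `True` would suggest that the POST did not authenticate the session and the site served the login page again.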

Answer 1 (score: 0)

I think I have solved the BeautifulSoup problem. In addition, I believe I need to add the _RequestVerificationToken.

import requests
from bs4 import BeautifulSoup

headers = {"user-agent" : "Mozilla/5.0 ... etc."
          }

login_data = {
    "MyAccount.Username": "myusername",
    "MyAccount.Password": "mypassword",
    "RembemberMe" : "false"
}


with requests.Session() as s:
    url = 'https://mijn.makelaarsland.nl/inloggen?ReturnUrl=%2faanbod%2fkaart'
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    login_data[_RequestVerificationToken] = soup.find('input', attrs={'name' : '_RequestVerificationToken'})['value']
    r = s.post(url, data=login_data, headers=headers)

    print(r.content)

But it returns:

TypeError                                 Traceback (most recent call last)
<ipython-input-52-5509032e4ad3> in <module>
     16     r = s.get(url, headers=headers)
     17     soup = BeautifulSoup(r.content, 'html.parser')
---> 18     login_data[_RequestVerificationToken] = soup.find('input', attrs={'name' : '_RequestVerificationToken'})['value']
     19     r = s.post(url, data=login_data, headers=headers)
     20 

TypeError: 'NoneType' object is not subscriptable

What should I do here?
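
The `TypeError` means that `soup.find(...)` returned `None`, so no `<input>` with exactly that name was found in the parsed HTML. One possible cause, which is an assumption and not something confirmed for this site, is the field name: ASP.NET's anti-forgery helper normally emits a hidden field called `__RequestVerificationToken`, with two leading underscores. Note also that the dictionary key has to be a quoted string. The sketch below avoids guessing the name by collecting every hidden input from the login form. The helper `hidden_fields` and the sample HTML are hypothetical, and the standard library parser is used only to keep the example dependency-free.

```python
from html.parser import HTMLParser


class _HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML."""
    collector = _HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical login form; the real page may use different field names.
sample = (
    '<form method="post">'
    '<input name="__RequestVerificationToken" type="hidden" value="abc123">'
    '<input name="MyAccount.Username" type="text">'
    '</form>'
)

login_data = {
    "MyAccount.Username": "myusername",
    "MyAccount.Password": "mypassword",
}
# Merge whatever hidden fields the form carries into the POST payload.
login_data.update(hidden_fields(sample))
```

With the real page, `hidden_fields(r.text)` on the GET response would take the place of `sample`, and printing the result shows which field names the form actually uses. If it comes back empty, the form is probably not in the HTML that was fetched.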