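I'm a complete beginner trying to scrape data for the first time. I've watched some videos and read a lot of articles to learn how scraping works. The code I have so far is: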
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://mijn.makelaarsland.nl/aanbod/kaart'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parser
page_soup = soup(page_html, "html.parser")
page_soup.body.div
The problem is that when I try to parse the data, all I get back is this:
<div class="login-background"></div>
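I've watched many videos and tried writing code to get past the login, but I don't understand how it works. Maybe someone can help and tell me what I'm doing wrong.
Some information that may be useful: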
This is the log in URL:
LOGIN_URL = "https://mijn.makelaarsland.nl/inloggen"
content-type: application/x-www-form-urlencoded
An overview of the network page when I right-click on 'Inspect'
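For reference, a request body with the content type listed above is just a string of URL-encoded key/value pairs. Here is a minimal sketch using only the standard library. The field names `username` and `password` and their values are placeholders for illustration, not necessarily the names the site's login form expects.

```python
from urllib.parse import urlencode

# Placeholder field names and values; the real login form may use different names.
form_fields = {"username": "me@example.com", "password": "s3cret pass"}

# urlencode builds the body format that goes with
# content-type: application/x-www-form-urlencoded
body = urlencode(form_fields)
print(body)
```

Special characters such as `@` and spaces are escaped, which is what the browser does when it submits the login form.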
Answer 0: (score: 0)
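As I wrote in the comments, I suggest using the requests Python package. It has excellent documentation, and you can find plenty of tutorials online. Log in to the site within a requests.Session(), navigate to the page you want, and then scrape it with BeautifulSoup.
Below is a code sample adapted from https://stackoverflow.com/a/17633072/5666087:
import requests
from bs4 import BeautifulSoup as soup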
# Fill in your details here to be posted to the login form.
payload = {
    "MyAccount.Username": "username",
    "MyAccount.Password": "password"
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post("https://mijn.makelaarsland.nl/inloggen", data=payload)
    # An authorized request.
    r = s.get("https://mijn.makelaarsland.nl/aanbod/kaart")
    print("status code:", r.status_code)
page_soup = soup(r.text, "html.parser")
page_soup.body.div
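A 200 status code alone does not prove the login worked, because the site may simply serve the login page again. One way to check, sketched below, is to look for the `login-background` div from the question in the returned HTML. The helper name `is_login_page` and the two HTML snippets are made up for illustration; they are not real responses from the site.

```python
from bs4 import BeautifulSoup


def is_login_page(html):
    """Return True if the HTML contains the login-background div from the question."""
    page = BeautifulSoup(html, "html.parser")
    return page.find("div", class_="login-background") is not None


# Tiny hand-written HTML samples (illustrative, not real responses from the site).
logged_out = '<html><body><div class="login-background"></div></body></html>'
logged_in = '<html><body><div class="map">listings</div></body></html>'

print(is_login_page(logged_out))
print(is_login_page(logged_in))
```

Inside the session you could call `is_login_page(r.text)` after the `s.get(...)` request. If it returns True, the POST most likely did not authenticate, for example because the form expects additional fields.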
Answer 1: (score: 0)
I think I have solved the BeautifulSoup part. In addition, I believe I need to send the _RequestVerificationToken.
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent" : "Mozilla/5.0 ... etc."
}
login_data = {
    "MyAccount.Username": "myusername",
    "MyAccount.Password": "mypassword",
    "RembemberMe" : "false"
}
with requests.Session() as s:
    url = 'https://mijn.makelaarsland.nl/inloggen?ReturnUrl=%2faanbod%2fkaart'
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    login_data[_RequestVerificationToken] = soup.find('input', attrs={'name' : '_RequestVerificationToken'})['value']
    r = s.post(url, data=login_data, headers=headers)
    print(r.content)
But it returns:
TypeError Traceback (most recent call last)
<ipython-input-52-5509032e4ad3> in <module>
16 r = s.get(url, headers=headers)
17 soup = BeautifulSoup(r.content, 'html.parser')
---> 18 login_data[_RequestVerificationToken] = soup.find('input', attrs={'name' : '_RequestVerificationToken'})['value']
19 r = s.post(url, data=login_data, headers=headers)
20
TypeError: 'NoneType' object is not subscriptable
What should I do here?
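The TypeError means that `soup.find(...)` returned None, i.e. the page has no `<input>` whose name is exactly `_RequestVerificationToken`. One possible cause, which is an assumption and not verified against this site, is that the field is actually named `__RequestVerificationToken` with two leading underscores, the name ASP.NET's anti-forgery helper commonly uses. Rather than guessing the name, a sketch like the one below collects every hidden input from the login page so the real field names can be inspected. The helper `hidden_form_fields` and the sample HTML are hypothetical. Note also that the dictionary key on the failing line needs to be a quoted string, otherwise Python raises a NameError once the lookup itself succeeds.

```python
from bs4 import BeautifulSoup


def hidden_form_fields(html):
    """Collect name/value pairs of all hidden <input> elements in the page."""
    page = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in page.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:  # skip inputs without a name attribute
            fields[name] = tag.get("value", "")
    return fields


# Hand-written sample form; the token name and value here are assumptions.
sample_html = """
<form method="post">
  <input name="__RequestVerificationToken" type="hidden" value="abc123">
  <input name="MyAccount.Username" type="text">
  <input name="MyAccount.Password" type="password">
</form>
"""

fields = hidden_form_fields(sample_html)
print(fields)

# find() returns None when nothing matches, so check before indexing with ['value'].
page = BeautifulSoup(sample_html, "html.parser")
token_input = page.find("input", attrs={"name": "_RequestVerificationToken"})
print(token_input is None)
```

In the real script, the result of `hidden_form_fields(r.text)` from the initial GET could be merged into `login_data` with `login_data.update(...)` before the POST, so whatever hidden fields the form carries are sent back unchanged.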