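I want to scrape data from a web page that contains a dynamic table. The table holds information about train rides.
This is the website: https://www.laerm-monitoring.de/zug/?mp=3/
I tried requesting the data with a simple mounted requests session, but I only get the basic HTML and not the data in the table.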
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504, 429),
    session=None,
):
    # Reuse the caller's session if one is given, otherwise create a new one.
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    # Mount the retrying adapter for both URL schemes.
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

session = requests_retry_session()
response = session.get('https://www.laerm-monitoring.de/zug/?mp=3/')
response.content
How do I do this correctly?
Answer 0 (score: 3)
The data is loaded dynamically from a different URL. You can use this example to load it using only requests/beautifulsoup:
import json
import requests
from bs4 import BeautifulSoup

data = {
    "sort": "Einfahrtzeit-desc",
    "page": "1",
    "pageSize": "10",
    "group": "",
    "filter": "",
    "__RequestVerificationToken": "",
    "locid": "1",
}
headers = {"X-Requested-With": "XMLHttpRequest"}
url = "https://www.laerm-monitoring.de/zug/"
api_url = "https://www.laerm-monitoring.de/zug/train_read"

with requests.Session() as s:
    # load the landing page first to obtain the verification token
    soup = BeautifulSoup(s.get(url).content, "html.parser")
    data["__RequestVerificationToken"] = soup.select_one(
        '[name="__RequestVerificationToken"]'
    )["value"]
    data = s.post(api_url, data=data, headers=headers).json()

# pretty print the data
print(json.dumps(data, indent=4))
Prints:
{
"Data": [
{
"id": 2536954,
"Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 7.3,
"Zugl\u00e4nge": 181.85884,
"Geschwindigkeit": 115.57797,
"Maximalpegel": 88.611084,
"Vorbeifahrtpegel": 85.421326,
"G\u00fcltig": "OK"
},
{
"id": 2536944,
"Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 6.3,
"Zugl\u00e4nge": 211.10226,
"Geschwindigkeit": 152.60104,
"Maximalpegel": 91.81743,
"Vorbeifahrtpegel": 87.95224,
"G\u00fcltig": "OK"
},
{
"id": 2536929,
"Einfahrtzeit": "2021-04-24T20:44:31.4703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 5.3,
"Zugl\u00e4nge": 104.69964,
"Geschwindigkeit": 110.10052,
"Maximalpegel": 82.100815,
"Vorbeifahrtpegel": 79.98168,
"G\u00fcltig": "OK"
},
{
"id": 2536924,
"Einfahrtzeit": "2021-04-24T20:42:30.3703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 2.9,
"Zugl\u00e4nge": 49.305683,
"Geschwindigkeit": 125.18,
"Maximalpegel": 98.63289,
"Vorbeifahrtpegel": 97.25019,
"G\u00fcltig": "OK"
},
{
"id": 2536925,
"Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 0.0,
"Zugl\u00e4nge": 0.0,
"Geschwindigkeit": 0.0,
"Maximalpegel": 0.0,
"Vorbeifahrtpegel": 0.0,
"G\u00fcltig": "-"
},
{
"id": 2536911,
"Einfahrtzeit": "2021-04-24T20:35:19.3703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 4.1,
"Zugl\u00e4nge": 103.97647,
"Geschwindigkeit": 132.2034,
"Maximalpegel": 87.111984,
"Vorbeifahrtpegel": 85.6776,
"G\u00fcltig": "OK"
},
{
"id": 2536907,
"Einfahrtzeit": "2021-04-24T20:33:31.2703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "GZ",
"Zugkategorie": "G\u00fcterzug",
"Vorbeifahrtdauer": 23.8,
"Zugl\u00e4nge": 583.19586,
"Geschwindigkeit": 95.63598,
"Maximalpegel": 88.02967,
"Vorbeifahrtpegel": 85.02115,
"G\u00fcltig": "OK"
},
{
"id": 2536890,
"Einfahrtzeit": "2021-04-24T20:25:36.1703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 3.5,
"Zugl\u00e4nge": 104.63446,
"Geschwindigkeit": 160.47487,
"Maximalpegel": 88.60612,
"Vorbeifahrtpegel": 86.46721,
"G\u00fcltig": "OK"
},
{
"id": 2536882,
"Einfahrtzeit": "2021-04-24T20:22:05.8703+02:00",
"Gleis": 2,
"Richtung": "Hamburg",
"Category": "GZ",
"Zugkategorie": "G\u00fcterzug",
"Vorbeifahrtdauer": 26.6,
"Zugl\u00e4nge": 653.52515,
"Geschwindigkeit": 94.59859,
"Maximalpegel": 91.9396,
"Vorbeifahrtpegel": 85.50632,
"G\u00fcltig": "OK"
},
{
"id": 2536869,
"Einfahrtzeit": "2021-04-24T20:16:24.3703+02:00",
"Gleis": 1,
"Richtung": "Kiel",
"Category": "PZ",
"Zugkategorie": "Personenzug",
"Vorbeifahrtdauer": 3.3,
"Zugl\u00e4nge": 87.8222,
"Geschwindigkeit": 160.01207,
"Maximalpegel": 91.3928,
"Vorbeifahrtpegel": 89.54336,
"G\u00fcltig": "OK"
}
],
"Total": 8657,
"AggregateResults": null,
"Errors": null
}
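The response above reports "Total": 8657 while only 10 rows come back, so fetching everything would mean requesting further pages. The following is a minimal sketch of that loop, assuming the endpoint honors the "page" field for pages beyond the first and that "Total" is the overall row count (neither is confirmed in the answer). The names page_count, build_payload and iter_rows are hypothetical helpers, and post_json stands for whatever callable actually sends the request.

```python
def page_count(total, page_size):
    """Number of pages needed to cover `total` rows at `page_size` rows per page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (total + page_size - 1) // page_size


def build_payload(token, page, page_size=10):
    """Form fields for one request, mirroring the dictionary used in the answer."""
    return {
        "sort": "Einfahrtzeit-desc",
        "page": str(page),
        "pageSize": str(page_size),
        "group": "",
        "filter": "",
        "__RequestVerificationToken": token,
        "locid": "1",
    }


def iter_rows(post_json, token, page_size=10):
    """Yield every row, calling post_json(payload) once per page.

    ASSUMPTION: the server returns later pages when "page" is increased.
    post_json is any callable that sends the payload and returns decoded JSON.
    """
    first = post_json(build_payload(token, 1, page_size))
    yield from first["Data"]
    for page in range(2, page_count(first["Total"], page_size) + 1):
        yield from post_json(build_payload(token, page, page_size))["Data"]
```

Inside the with block of the answer, post_json could be written as lambda p: s.post(api_url, data=p, headers=headers).json(). Passing the sender in as a callable keeps the paging logic testable without touching the network.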
Answer 1 (score: 2)
With a simple GET request you can retrieve the HTML of the landing page.
import requests
response = requests.get('https://www.laerm-monitoring.de/zug/') # even without query-parameters: ?mp=3/
print( response.content )
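This can also be done in any browser.
In the source view (Ctrl + U on Windows/Linux, Cmd + U on Mac) you will find the token needed for all subsequent requests to the REST API: __RequestVerificationToken.
It sits in a hidden <input> form field on this page: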
<input name="__RequestVerificationToken" type="hidden" value="CfDJ8B_eKmsiQC9Esc7ZjyC063dp6MzAtP3Sawnrfz3SCqxOMoPCYMV4sjDbrhDbuOsPcLnOiElgqQWTdMxCgfmhNVx1eC6oR81kZT3os2z3DJxtu6H9V7fKt9z9bdSJwB1ACYSSYWHsmPzt-AMWvSk4eYU" />
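If you would rather not depend on an HTML library, the hidden field can also be read with Python's standard library. This is only a sketch: TokenParser and extract_token are hypothetical names, and it assumes the token always appears as an <input> with name="__RequestVerificationToken", as in the snippet above.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Remembers the value of the first __RequestVerificationToken input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "__RequestVerificationToken":
            self.token = attributes.get("value")


def extract_token(html):
    """Return the token string, or None if the page has no such field."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token
```

Calling extract_token(response.text) on the landing page response would then give the value to send along with later requests.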
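When the page loads in your browser, this token is used to load the data dynamically via JavaScript XMLHttpRequests (XHR), as you already assumed.
To see these XHR requests, open the Network tab of your browser's developer tools (shortcut F12).
Both requests fetch the measurement data as JSON. For security reasons, the Web API being called requires the token, sent with a POST request. It is submitted in the body as x-www-form-urlencoded, together with the paging parameters.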
See the following example from the command line using cURL:
curl -vi 'https://www.laerm-monitoring.de/zug/train_read' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data-raw 'sort=Einfahrtzeit-desc&page=1&pageSize=10&group=&filter=&__RequestVerificationToken=CfDJ8...'
(the token is shortened for illustration)
Tip: in the browser's Network tab you can usually right-click a request and choose copy as cURL.
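To see what such a form-encoded body looks like when built from Python, here is a small sketch using only the standard library. form_body is a hypothetical helper; the field names come from the cURL example, and the sort value is written as "Einfahrtzeit-desc" as in the first answer.

```python
from urllib.parse import urlencode


def form_body(token, page=1, page_size=10):
    """Build the x-www-form-urlencoded body for one train_read request."""
    fields = {
        "sort": "Einfahrtzeit-desc",
        "page": page,
        "pageSize": page_size,
        "group": "",
        "filter": "",
        "__RequestVerificationToken": token,
    }
    # urlencode percent-escapes any character that is not safe in a form body.
    return urlencode(fields)
```

When you pass a dictionary as data= to requests, it performs this encoding for you and sets the Content-Type header, so building the string by hand is mainly useful for comparing against what the browser sends.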
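Answer 2 (score: 1)
I did something similar with Selenium in Python. Not sure whether this suits you. Basically, open the website, right-click the table and choose inspect element. Then go to the div the table belongs to, right-click it and copy the full xpath. Once you have the xpath, you can scrape it with Selenium. See this answer.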
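The only issue is that Selenium actually opens a browser instead of running in the background. I think you can do it silently, but I have never done that.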
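For what it is worth, Chrome can be started without a visible window by passing it a headless switch. The sketch below assumes Selenium 4 with a recent Chrome installed; headless_chrome_args and scrape_table_text are hypothetical names, and the xpath argument is whatever you copied from the inspector.

```python
def headless_chrome_args():
    """Command-line switches for running Chrome without a visible window."""
    return ["--headless=new", "--window-size=1920,1080"]


def scrape_table_text(url, xpath, timeout=15):
    """Open `url` in headless Chrome and return the text of the element at `xpath`."""
    # Imported lazily so this file still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for arg in headless_chrome_args():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the dynamically loaded element is present in the DOM.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
        return element.text
    finally:
        driver.quit()
```

The explicit wait matters here because the table is filled in by JavaScript after the page itself has loaded.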
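Another thing: if repeated automated requests come from a single IP, the website may block you. You can use Tor to send each request from a new IP. I did something similar with Twitter here.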