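Scraping data from a dynamic web table

Date: 2021-04-24 18:57:14

Tags: python web-scraping python-requests

I want to scrape data from a web page that contains a dynamic table. The table holds information about passing trains.

This is the website: https://www.laerm-monitoring.de/zug/?mp=3/

I tried requesting the data through a simple requests session with a mounted retry adapter, but I only get the basic HTML and not the data in the table.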

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504, 429),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session 

session = requests_retry_session()
response = session.get('https://www.laerm-monitoring.de/zug/?mp=3/')
response.content

How do I do this correctly?

3 answers:

Answer 0 (score: 3)

The data is loaded dynamically from a different URL. Here is an example of how to load it using only requests and BeautifulSoup:

import json
import requests
from bs4 import BeautifulSoup

data = {
    "sort": "Einfahrtzeit-desc",
    "page": "1",
    "pageSize": "10",
    "group": "",
    "filter": "",
    "__RequestVerificationToken": "",
    "locid": "1",
}

headers = {"X-Requested-With": "XMLHttpRequest"}

url = "https://www.laerm-monitoring.de/zug/"
api_url = "https://www.laerm-monitoring.de/zug/train_read"

with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).content, "html.parser")
    data["__RequestVerificationToken"] = soup.select_one(
        '[name="__RequestVerificationToken"]'
    )["value"]
    data = s.post(api_url, data=data, headers=headers).json()

# pretty print the data
print(json.dumps(data, indent=4))

Prints:

{
    "Data": [
        {
            "id": 2536954,
            "Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 7.3,
            "Zugl\u00e4nge": 181.85884,
            "Geschwindigkeit": 115.57797,
            "Maximalpegel": 88.611084,
            "Vorbeifahrtpegel": 85.421326,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536944,
            "Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 6.3,
            "Zugl\u00e4nge": 211.10226,
            "Geschwindigkeit": 152.60104,
            "Maximalpegel": 91.81743,
            "Vorbeifahrtpegel": 87.95224,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536929,
            "Einfahrtzeit": "2021-04-24T20:44:31.4703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 5.3,
            "Zugl\u00e4nge": 104.69964,
            "Geschwindigkeit": 110.10052,
            "Maximalpegel": 82.100815,
            "Vorbeifahrtpegel": 79.98168,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536924,
            "Einfahrtzeit": "2021-04-24T20:42:30.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 2.9,
            "Zugl\u00e4nge": 49.305683,
            "Geschwindigkeit": 125.18,
            "Maximalpegel": 98.63289,
            "Vorbeifahrtpegel": 97.25019,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536925,
            "Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 0.0,
            "Zugl\u00e4nge": 0.0,
            "Geschwindigkeit": 0.0,
            "Maximalpegel": 0.0,
            "Vorbeifahrtpegel": 0.0,
            "G\u00fcltig": "-"
        },
        {
            "id": 2536911,
            "Einfahrtzeit": "2021-04-24T20:35:19.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 4.1,
            "Zugl\u00e4nge": 103.97647,
            "Geschwindigkeit": 132.2034,
            "Maximalpegel": 87.111984,
            "Vorbeifahrtpegel": 85.6776,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536907,
            "Einfahrtzeit": "2021-04-24T20:33:31.2703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "GZ",
            "Zugkategorie": "G\u00fcterzug",
            "Vorbeifahrtdauer": 23.8,
            "Zugl\u00e4nge": 583.19586,
            "Geschwindigkeit": 95.63598,
            "Maximalpegel": 88.02967,
            "Vorbeifahrtpegel": 85.02115,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536890,
            "Einfahrtzeit": "2021-04-24T20:25:36.1703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 3.5,
            "Zugl\u00e4nge": 104.63446,
            "Geschwindigkeit": 160.47487,
            "Maximalpegel": 88.60612,
            "Vorbeifahrtpegel": 86.46721,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536882,
            "Einfahrtzeit": "2021-04-24T20:22:05.8703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "GZ",
            "Zugkategorie": "G\u00fcterzug",
            "Vorbeifahrtdauer": 26.6,
            "Zugl\u00e4nge": 653.52515,
            "Geschwindigkeit": 94.59859,
            "Maximalpegel": 91.9396,
            "Vorbeifahrtpegel": 85.50632,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536869,
            "Einfahrtzeit": "2021-04-24T20:16:24.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 3.3,
            "Zugl\u00e4nge": 87.8222,
            "Geschwindigkeit": 160.01207,
            "Maximalpegel": 91.3928,
            "Vorbeifahrtpegel": 89.54336,
            "G\u00fcltig": "OK"
        }
    ],
    "Total": 8657,
    "AggregateResults": null,
    "Errors": null
}
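
The response reports "Total": 8657 while only 10 records come back per request, so fetching everything means walking through the pages. Below is a minimal sketch of that loop. It assumes the endpoint keeps accepting the same token for later pages and tolerates a larger pageSize; neither is confirmed here. page_count and iter_records are hypothetical helper names, and session is expected to be a requests.Session (or anything with a compatible post method).

```python
import math

API_URL = "https://www.laerm-monitoring.de/zug/train_read"


def page_count(total, page_size):
    """Number of pages needed to cover `total` records."""
    return math.ceil(total / page_size)


def iter_records(session, token, page_size=100, max_pages=None):
    """Yield records page by page.

    Assumes the "Data"/"Total" response shape printed above and that the
    token stays valid across requests (an assumption, not verified).
    """
    page = 1
    while True:
        payload = {
            "sort": "Einfahrtzeit-desc",
            "page": str(page),
            "pageSize": str(page_size),
            "group": "",
            "filter": "",
            "__RequestVerificationToken": token,
            "locid": "1",
        }
        response = session.post(
            API_URL,
            data=payload,
            headers={"X-Requested-With": "XMLHttpRequest"},
            timeout=30,
        )
        response.raise_for_status()
        body = response.json()
        yield from body["Data"]

        # Stop after the last page, or earlier if the caller set a limit.
        last_page = page_count(body["Total"], page_size)
        if page >= last_page or (max_pages is not None and page >= max_pages):
            break
        page += 1
```

Inside the with block of the example above, this could be used as for record in iter_records(s, token, max_pages=3), where token is the value read from the hidden input. Keeping max_pages small while testing avoids hammering the server.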

Answer 1 (score: 2)

With a simple GET request you can retrieve the HTML of the landing page.

import requests

response = requests.get('https://www.laerm-monitoring.de/zug/')  # even without query-parameters: ?mp=3/
print(response.content)

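Analyzing the dynamic requests (browser)

This can also be done in any browser. In the page source view (Ctrl + U on Windows/Linux, Cmd + U on Mac) you will find the token required for all subsequent requests against the REST API: __RequestVerificationToken.

It sits in a hidden <input> form field on this page: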

<input name="__RequestVerificationToken" type="hidden" value="CfDJ8B_eKmsiQC9Esc7ZjyC063dp6MzAtP3Sawnrfz3SCqxOMoPCYMV4sjDbrhDbuOsPcLnOiElgqQWTdMxCgfmhNVx1eC6oR81kZT3os2z3DJxtu6H9V7fKt9z9bdSJwB1ACYSSYWHsmPzt-AMWvSk4eYU" />
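
To read that hidden field from Python without opening a browser, the standard library's HTML parser is enough. This is a minimal sketch: TokenParser and extract_token are hypothetical names, and the sample markup uses a placeholder value rather than a real token.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collects the value of the hidden __RequestVerificationToken input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "__RequestVerificationToken":
            self.token = attributes.get("value")


def extract_token(html):
    """Return the token value, or None if the field is not present."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


# Placeholder markup for illustration; a real page would come from requests.get(...).text
sample = '<form><input name="__RequestVerificationToken" type="hidden" value="CfDJ8-example" /></form>'
print(extract_token(sample))
```

In practice the html argument would be the text of the landing page, fetched in the same session that later sends the POST request.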

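When the page loads in your browser, this token is used to load the data dynamically via JavaScript XMLHttpRequests (XHR), as you already assumed.

To see these XHR requests, open the Network tab of your browser's developer tools (shortcut: F12):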

[Screenshot: the browser dev-tools Network tab shows 2 XHR requests]

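Both requests fetch the measurement data as JSON. For security reasons, the Web API being called requires the token, which is sent with a POST request. It is submitted in the request body as x-www-form-urlencoded data, together with the paging parameters.

See the following example, run from the command line with cURL: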

curl -vi 'https://www.laerm-monitoring.de/zug/train_read' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data-raw 'sort=Einfahrtzeit-desc&page=1&pageSize=10&group=&filter=&__RequestVerificationToken=CfDJ8...'

(The token is shortened for illustration purposes.)

Tip: in the browser's Network tab you can usually right-click a request and copy it as a cURL command.
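
To see what an x-www-form-urlencoded body looks like before sending it, the form fields can be encoded with the standard library. This is only a sketch: the field values mirror the payload from the first answer, and PLACEHOLDER_TOKEN stands in for a real token read from the page.

```python
from urllib.parse import parse_qs, urlencode

# Field values mirror the first answer's payload; the token is a placeholder.
form_fields = {
    "sort": "Einfahrtzeit-desc",
    "page": "1",
    "pageSize": "10",
    "group": "",
    "filter": "",
    "__RequestVerificationToken": "PLACEHOLDER_TOKEN",
}

# urlencode produces the key=value&key=value string sent as the POST body.
body = urlencode(form_fields)
print(body)

# Decoding it again shows that empty fields survive the round trip.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["pageSize"], decoded["group"])
```

When a dict is passed as data= to requests.post, requests performs this encoding itself and sets the matching Content-Type header, so the manual step is mainly useful for debugging or for comparing against what the browser sends.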

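Answer 2 (score: 1)

I did something similar with Selenium in Python. Not sure whether this suits you. Basically, open the website, right-click the table, and choose Inspect Element. Then go to the div the table belongs to, right-click it, and copy the full XPath. Once you have the XPath, you can scrape the element with Selenium. See this answer.

The only catch is that Selenium actually opens a browser window instead of running in the background. I think you can run it silently (headless), but I have never done so.

Another thing: a website may block you if repeated automated requests come from a single IP. You can use Tor to send each request from a new IP. I did something similar with Twitter here.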
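
For what it is worth, here is an untested sketch of how a headless run might look with Selenium 4 and Chrome. fetch_table_text and HEADLESS_ARGS are hypothetical names, the XPath has to come from the inspection step described above, and the "--headless=new" flag assumes a recent Chrome (older versions use plain "--headless").

```python
# Chrome flags for running without a visible window (assumes a recent Chrome).
HEADLESS_ARGS = ["--headless=new", "--window-size=1920,1080"]


def fetch_table_text(url, xpath, timeout=15):
    """Open `url` in headless Chrome and return the text of the element at `xpath`."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element exists in the DOM; the rows are loaded by JavaScript.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath))
        )
        return element.text
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The wait only guarantees that the element is present, not that all rows have been filled in, so a more specific XPath (for example one pointing at a table row) may be needed.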