Question

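I want to scrape all the running times (not just the top 10 results) from the data tables at https://www.ijsselsteinloop.nl/uitslagen-2019. However, the data shown on the web page does not appear in the page source. Below each data table there is a hyperlink ("hier") that leads to a page with the full table, but those links are not in the page source either.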

Any suggestions or code snippets on how to scrape this data with Python and BeautifulSoup or Scrapy would be appreciated.

Answer 1

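Use the same endpoint that the page itself uses to load that content. You can find it in the Network tab of your browser's developer tools.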

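import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Fetch the results index that the page loads its content from
r = requests.get('https://www.ijsselsteinloop.nl/uitslag/2019/index.html')
soup = bs(r.content, 'lxml')
# Collect every link whose href starts with "uitslag" and make it absolute
links = ['https://www.ijsselsteinloop.nl/uitslag/2019/' + item['href'] for item in soup.select('[href^=uitslag]')]

for link in links:
    # read_html returns a list of DataFrames; take the first table on the page
    table = pd.read_html(link)[0]
    print(table)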
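
If you would rather keep the results than print them, the per-page tables can be stacked into a single DataFrame. This is only a minimal sketch: `combine_tables` is a hypothetical helper, the `source` column name is my own choice, and the demo frames are made-up stand-ins for what `pd.read_html(link)[0]` would return.

```python
import pandas as pd


def combine_tables(tables_by_link):
    """Stack per-page tables into one DataFrame, tagging each row with its source link."""
    frames = []
    for link, table in tables_by_link.items():
        tagged = table.copy()
        tagged["source"] = link  # remember which page each row came from
        frames.append(tagged)
    return pd.concat(frames, ignore_index=True)


# Made-up stand-ins for the tables that pd.read_html(link)[0] would return.
demo = {
    "page-a.html": pd.DataFrame({"Name": ["Runner A"], "Time": ["0:40:00"]}),
    "page-b.html": pd.DataFrame(
        {"Name": ["Runner B", "Runner C"], "Time": ["0:41:30", "0:42:10"]}
    ),
}

combined = combine_tables(demo)
print(combined)
```

With the real data you would build the dictionary inside the loop, for example `{link: pd.read_html(link)[0] for link in links}`, and then call `combined.to_csv(...)` to save everything in one file.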

Answer 2

You can use BeautifulSoup. First:

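from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# Assumption: my_url is the index endpoint from Answer 1; substitute the page that holds your table
my_url = 'https://www.ijsselsteinloop.nl/uitslag/2019/index.html'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")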

Then call find_all('tr') to get every row. Loop over the rows and call find_all('td') on each one to get the cells of that row.
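
The row-and-cell loop described above might look roughly like this. It is a sketch under assumptions: `extract_rows` is a hypothetical helper, and `SAMPLE_HTML` is invented markup used so the example runs without network access; the real result pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Pos</th><th>Name</th><th>Time</th></tr>
  <tr><td>1</td><td>Runner A</td><td>0:40:00</td></tr>
  <tr><td>2</td><td>Runner B</td><td>0:41:30</td></tr>
</table>
"""


def extract_rows(html):
    """Return one list of cell strings for each table row that has <td> cells."""
    page_soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in page_soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows use <th>, so they produce no <td> cells
            rows.append(cells)
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

To use it on a live page, pass the downloaded markup (`page_html` in the snippet above) to `extract_rows` instead of the sample string.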

How to scrape a data table that does not appear in the page source

2 answers: