Question

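The following page is an example of the pages I am trying to collect information from: https://www.hockey-reference.com/boxscores/201610130TBL.html. It is a little hard to tell, but there are actually 8 tables, because the Scoring summary and Penalty summary use the same class name as the other tables.

I am trying to access the tables with the following code, which I have modified slightly while trying to solve the problem.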

from bs4 import BeautifulSoup # imports BeautifulSoup

# Read the saved copy of the page
with open("Detroit_vs_Tampa.txt") as file:
    data = file.read()

soup = BeautifulSoup(data,'lxml')
get_table = soup.find_all(class_="overthrow table_container")

print(len(get_table))

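My code outputs 6, which is clearly wrong. Looking further, I found that the tables it misses are the two tables under the Advanced Stats Report heading.

I would also like to point out that, because I thought this might be a parser problem, I tried both html.parser and lxml directly against the website (rather than the text file I'm using in the example code), so I don't think the HTML is corrupted.

I had a friend take a quick look, in case this was a small oversight on my part, and he noticed that the site is using an old IE hack that wraps the tables in comment tags.

I am not 100% sure that is why this does not work, but I have searched for this problem and found nothing. I hope someone can point me in the right direction.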

Answer 1

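The last tables are loaded by JavaScript, but as you noticed, they are also in the static HTML, inside comment tags. You can get them with bs4 if you search for Comment objects.

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.hockey-reference.com/boxscores/201610130TBL.html'
data = requests.get(url).text
soup = BeautifulSoup(data,'lxml')
# Tables present in the regular markup
get_table = soup.find_all(class_="overthrow table_container")
# First comment node whose text contains a table container
comment = soup.find(text=lambda text:isinstance(text, Comment) and 'table_container' in text)
# Parse the comment's contents as HTML and collect the tables inside it
get_table += BeautifulSoup(comment.string,'lxml').find_all(class_="overthrow table_container")
print(len(get_table))

Or you could use selenium, but it is much heavier than requests or urllib.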
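The code in this answer parses only the first comment that contains a table container. If the hidden tables turn out to sit in separate comments, a loop over every comment may be needed. The following is a minimal sketch under that assumption, not a tested solution for the real page: `SAMPLE_HTML`, its table ids, and the helper name `find_all_containers` are hypothetical, and `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: one visible table and two
# tables hidden inside separate HTML comments.
SAMPLE_HTML = """
<html><body>
<div class="overthrow table_container"><table id="scoring"></table></div>
<!--
<div class="overthrow table_container"><table id="advanced_home"></table></div>
-->
<!-- an unrelated comment -->
<!--
<div class="overthrow table_container"><table id="advanced_away"></table></div>
-->
</body></html>
"""


def find_all_containers(html, css_class="overthrow table_container",
                        parser="html.parser"):
    """Collect matching containers from the markup and from every comment."""
    soup = BeautifulSoup(html, parser)
    containers = list(soup.find_all(class_=css_class))
    # Comment nodes are plain strings to the parser, so find_all never looks
    # inside them; each one has to be parsed again as its own document.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "table_container" in comment:
            hidden = BeautifulSoup(str(comment), parser)
            containers.extend(hidden.find_all(class_=css_class))
    return containers


containers = find_all_containers(SAMPLE_HTML)
print(len(containers))  # 3 for SAMPLE_HTML: one visible, two from comments
```

On the real page, the `html` argument would be the text fetched with requests or read from the saved file, and `parser` could be set to `'lxml'` as in the rest of this thread.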

BeautifulSoup find_all does not find everything
