Question

I'm new to web scraping, so I appreciate any help. I'm trying to build a model that pulls values from the NHL reference table found here: https://www.hockey-reference.com/leagues/NHL_2019.html#

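I only want the values from the "Team Statistics" table, which holds the aggregated data for each team. I'm making some progress, but I'm stuck on getting each team's row data and storing it for future calculations. Here is my code so far: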

from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.hockey-reference.com/leagues/NHL_2019.html"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")


all_stats = soup.find('div', {'id': 'all_stats'})
print(all_stats)

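With this code I can see the row information I want in HTML form, but every attempt to extract that data returns None. I think I need to assign each team and its td values to a variable so I can refer to them later. I need to collect 30 rows of data.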

Thanks for your help, George

Answer 1

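The reason is that the Team Statistics table sits inside an HTML comment, so it is never parsed as regular markup. In this case you can use Comment from bs4 like this: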

from urllib.request import urlopen

from bs4 import BeautifulSoup, Comment

search_url = 'https://www.hockey-reference.com/leagues/NHL_2019.html#'

page = urlopen(search_url)
soup = BeautifulSoup(page, "html.parser")

tables = soup.find_all('table')  # tables in the regular (uncommented) HTML
# Collect every HTML comment node; the Team Statistics table is inside one of them.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for c in comments:
    # Parse the comment body as HTML in its own right.
    a = BeautifulSoup(c, "html.parser")
    teams = a.find_all('td', attrs={'class': 'left'})    # team names
    values = a.find_all('td', attrs={'class': 'right'})  # stats

    for value_cell in values:
        print(value_cell.text)
    for team_cell in teams:
        print(team_cell.text)
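
To keep each team's values together rather than printing two flat lists, one option is to walk the commented-out table row by row. The following is only a sketch that runs offline: SAMPLE_HTML is a made-up miniature page with placeholder team names and numbers, and the table id "stats" and the helper name rows_from_commented_table are assumptions for illustration, not something taken from the real site.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table is wrapped in an HTML comment,
# mimicking the situation described in this answer. All values are placeholders.
SAMPLE_HTML = """
<div id="all_stats">
<!--
<table id="stats">
  <tr><td class="left">Team A</td><td class="right">10</td><td class="right">20</td></tr>
  <tr><td class="left">Team B</td><td class="right">30</td><td class="right">40</td></tr>
</table>
-->
</div>
"""


def rows_from_commented_table(html, table_id):
    """Return {team name: [cell text, ...]} for a table hidden inside a comment."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        # Re-parse the comment body as HTML and look for the wanted table.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is None:
            continue
        rows = {}
        for tr in table.find_all("tr"):
            name_cell = tr.find("td", class_="left")
            if name_cell is None:
                continue  # skip rows without a team name cell
            rows[name_cell.get_text(strip=True)] = [
                td.get_text(strip=True) for td in tr.find_all("td", class_="right")
            ]
        return rows
    return {}


team_rows = rows_from_commented_table(SAMPLE_HTML, "stats")
print(team_rows)
```

Keying the dictionary by team name means a later lookup such as team_rows["Team A"] returns that team's cells in column order.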

Output, for the stats:

27.1
62
47
11
4
98
.790
239
162
5
1
1.26
-0.05
6.47
172
131
61 ..UP TO END

For the teams:

Tampa Bay Lightning
Calgary Flames
Boston Bruins
San Jose Sharks
New York Islanders
Toronto Maple Leafs
Winnipeg Jets
Nashville Predators
Washington Capitals
Columbus Blue Jackets .. UP TO END
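
The cell text comes back as strings, so it has to be converted before it can be used in calculations. Below is a minimal sketch of one way to do that; to_number is a hypothetical helper name, and the sample row reuses the first few values from the stats output above.

```python
def to_number(text):
    """Convert scraped cell text to int or float, leaving other text unchanged."""
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text  # not numeric, for example a team name


# First few values from the stats output, as the strings the scraper returns.
raw_row = ["27.1", "62", "47", "11", "4", "98", ".790"]
parsed_row = [to_number(value) for value in raw_row]
print(parsed_row)
```

Trying int before float keeps whole-number columns as integers, and anything that is not numeric falls through unchanged.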

Answer 2

A variation on @Omer Tekbiyik's answer that also puts the data into a DataFrame:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.hockey-reference.com/leagues/NHL_2019.html#"

res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Index 16 is the comment that held the table here; it depends on the page layout.
# read_html returns a list of DataFrames, so take the first one.
my_table = pd.read_html(StringIO(str(comments[16])))[0]
my_table

The output is a DataFrame holding the "Team Statistics" table; from there you can run any pandas operation on it.
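
As a rough illustration of a follow-up calculation, the sketch below builds a one-row stand-in for the scraped table and derives a points percentage. The Tampa Bay figures are the ones shown in the Answer 1 output; reading them as games played, wins and points, and the column labels GP, W, PTS and PTS_PCT, are assumptions for this example.

```python
import pandas as pd

# Stand-in for the scraped table. The figures come from the Answer 1 output;
# the column labels GP, W and PTS are assumed, not read from the page.
team_stats = pd.DataFrame(
    {"Team": ["Tampa Bay Lightning"], "GP": [62], "W": [47], "PTS": [98]}
).set_index("Team")

# Points percentage: points earned out of the two available per game.
team_stats["PTS_PCT"] = team_stats["PTS"] / (2 * team_stats["GP"])

tampa = team_stats.loc["Tampa Bay Lightning"]
print(round(tampa["PTS_PCT"], 3))
```

Under that reading the result rounds to 0.790, which agrees with the .790 in the output. Indexing by team name lets later code fetch a single team's row with .loc.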

Can BeautifulSoup scrape table data and store it as values for future calculations
