Question

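I'm trying to extract basketball player data from Basketball Reference for a project I'm working on. On B-R, each player page has several data tables, and I want to scrape all of them. However, when I try to get the tables from the page, I only get the first instance of the table tag, i.e. only the first table.

I searched the HTML and found that, apart from the first instance of the table tag, all the table tags sit inside comment blocks. When I parse the parent tag and search it for the child tag that holds the table data, it returns nothing. Here is my code, using an example page: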

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/j/jamesle01.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

per_36 = soup.find(id='all_per_minute')
table = per_36.find('table')  # comes back as None

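This returns nothing, but if I look for the first table instead, it returns content. I don't know what is going on, but I think it might have something to do with those comment blocks?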

Answer 1

To scrape content that sits inside HTML comments with BeautifulSoup, you can use the following script:

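import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/players/j/jamesle01.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the placeholder inside the per-minute wrapper, then take the next HTML comment after it.
pl = soup.select_one('#all_per_minute .placeholder')
comment = pl.find_next(string=lambda text: isinstance(text, Comment))

# Parse the comment text as HTML so the table inside it becomes searchable.
table_soup = BeautifulSoup(comment, 'html.parser')

# Collect the text of every header and data cell, row by row.
rows = []
for tr in table_soup.select('tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td, th')])

# Print each row with every cell centered in a 7-character column.
for row in rows:
    print(''.join('{: ^7}'.format(td) for td in row))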
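Since the question asks for all of the tables, the same idea can be applied to every comment on the page rather than a single one. The following is only a sketch, run against a small made-up HTML string instead of the live site: `SAMPLE_HTML`, its placeholder cell values, and the helper names `table_rows` and `extract_all_tables` are assumptions for illustration, and the real page layout may differ.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a player page: one live table and one table
# hidden inside an HTML comment. The cell values are placeholders, not real stats.
SAMPLE_HTML = """
<div id="all_per_game">
  <table id="per_game">
    <tr><th>Season</th><th>PTS</th></tr>
    <tr><td>S1</td><td>20</td></tr>
  </table>
</div>
<div id="all_per_minute">
  <div class="placeholder"></div>
  <!--
  <table id="per_minute">
    <tr><th>Season</th><th>PTS</th></tr>
    <tr><td>S1</td><td>10</td></tr>
  </table>
  -->
</div>
"""


def table_rows(table):
    """Return the table as a list of rows, each row a list of cell texts."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        for tr in table.find_all('tr')
    ]


def extract_all_tables(html):
    """Map table id -> rows, for live tables and for tables inside comments."""
    soup = BeautifulSoup(html, 'html.parser')
    tables = {}

    # Tables that are part of the parsed tree.
    for table in soup.find_all('table'):
        tables[table.get('id')] = table_rows(table)

    # A comment is just a string to the parser, so re-parse each one as HTML.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, 'html.parser')
        for table in inner.find_all('table'):
            tables[table.get('id')] = table_rows(table)

    return tables


if __name__ == '__main__':
    for table_id, rows in extract_all_tables(SAMPLE_HTML).items():
        print(table_id, rows)
```

On the sample string, a plain `find('table')` inside `#all_per_minute` still returns `None`, which matches the behaviour described in the question; the second loop is what recovers the commented-out table. To try it on the real page, pass `response.text` to `extract_all_tables` instead of the sample.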

Trying to scrape a web page with multiple data tables, but only the first table is extracted?
