Question

A website has the following HTML elements:

<td class="right " data-stat="week_num">1</td>
<td class="right " data-stat="week_num">2</td>
<!-- etc -->

I am able to capture these elements with the following code:

import requests
from bs4 import BeautifulSoup
    
url = "https://www.pro-football-reference.com/players/H/HopkDe00.htm"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

for item in soup.find_all(attrs={'data-stat':'week_num'}):
    print(item)

This fetches the requested HTML elements without any problem.

I am now trying to retrieve a different set of elements:

<tr id = 'stats.111' data-row='0'>
<tr id = 'stats.112' data-row='1'>
<!-- etc -->

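To get these, I assumed I would only need to change the code above slightly, but that did not work. Note that below I tried passing both the string 'True' and the boolean True; the program ran both times, but no elements were printed to the console.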

for item in soup.find_all(attrs={'data-row':True}): # attempted to get all elements with the `data-row` attribute; this returned an empty result
    print(item)

Then I tried to get just one element, to test whether the same code could do that.

for item in soup.find_all(attrs={'data-row':'1'}): # target just the <tr data-row='1'> element  
    print(item)

But this returned nothing either. How can I use the data-row attribute to target this set of elements?

Answer 1

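The data-row attribute is added dynamically by JavaScript, so it is not present in the HTML that requests downloads, and you need to target the rows differently. For example, get all rows under the table with id="stats":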

import requests
from bs4 import BeautifulSoup


url = 'https://www.pro-football-reference.com/players/H/HopkDe00.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for row in soup.select('table#stats tbody tr'):
    tds = [td.get_text(strip=True) for td in row.select('td, th')]
    print(*tds)

Prints:

2020-09-13 1 ARI @ SFO W 24-20 * 16 14 151 10.79 0 87.5% 9.44 0 0 0 0 0 0 0 77 94% 0 0% 0 0%
2020-09-20 2 ARI  WAS W 30-15 * 9 8 68 8.50 1 88.9% 7.56 1 0 0 0 0 0 0 75 97% 0 0% 0 0%
2020-09-27 3 ARI  DET L 23-26 * 12 10 137 13.70 0 83.3% 11.42 0 0 0 0 0 0 0 61 94% 0 0% 0 0%
2020-10-04 4 ARI @ CAR L 21-31 * 9 7 41 5.86 0 77.8% 4.56 0 0 0 0 0 0 0 54 95% 0 0% 0 0%

...and so on.
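
To see why the original attempts printed nothing, here is a minimal offline sketch. The two HTML strings are hypothetical stand-ins (not copied from the site): one for what the server might send, one for what the browser shows after scripts have run. The helper names `rows_with_data_row` and `week_numbers` are also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup as the server might send it: no data-row attribute yet.
STATIC_HTML = """
<table id="stats"><tbody>
<tr id="stats.111"><td class="right " data-stat="week_num">1</td></tr>
<tr id="stats.112"><td class="right " data-stat="week_num">2</td></tr>
</tbody></table>
"""

# Hypothetical markup as seen in the browser, after JavaScript added data-row.
RENDERED_HTML = """
<table id="stats"><tbody>
<tr id="stats.111" data-row="0"><td class="right " data-stat="week_num">1</td></tr>
<tr id="stats.112" data-row="1"><td class="right " data-stat="week_num">2</td></tr>
</tbody></table>
"""


def rows_with_data_row(html):
    """Return every <tr> that carries a data-row attribute, whatever its value."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("tr", attrs={"data-row": True})


def week_numbers(html):
    """Read the week_num cells through the table id, ignoring data-row."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.select('table#stats tbody tr td[data-stat="week_num"]')
    return [td.get_text(strip=True) for td in cells]


# The attribute filter itself is fine; it only fails when the attribute is absent.
print(len(rows_with_data_row(STATIC_HTML)), len(rows_with_data_row(RENDERED_HTML)))
# The structural selector works on the static markup.
print(week_numbers(STATIC_HTML))
```

A quick way to check a real page along the same lines is to test whether the string `data-row` appears in `page.text` at all; if it does not, no BeautifulSoup filter on that attribute can match.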

Parsing data with Beautiful Soup, targeting data attributes
