Question

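I'm having trouble parsing table data with BeautifulSoup, even though I've tried many of the solutions found here, here, and here. I hate to re-ask, but maybe my problem is unique and that's why those solutions didn't work, or maybe I'm just an idiot.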

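I'm trying to retrieve the flood triggers for any given river from water.weather.gov. I'm using the Mississippi River data because it has the most active gauging stations. Each station has 4 stage triggers that I want to get: action, flood, moderate, and major. I can already extract the table data when the values are present, but when a station's table reads "Not Available" that row is skipped, so when I slice the values into the stages they no longer line up with the right station.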

The table data I'm trying to extract looks like this:

<div class="box_square">
  <b><b>Flood Categories (in feet)</b><br> </b>
  <table width="150" cellspacing="0" cellpadding="0" border="0">
    <tbody>
      <tr><td nowrap="">Not Available</td></tr>
    </tbody>

<div class="box_square">
  <b><b>Flood Categories (in feet)</b><br> </b>
  <table width="150" cellspacing="0" cellpadding="0" border="0">
    <tbody>
      <tr style="display:'';line-height:20px;background-color:#CC33FF;color:black">
        <td scope="col" nowrap="">Major Flood Stage:</td>
        <td scope="col">18</td>
      </tr>
      <tr style="display:'';line-height:20px;background-color:#FF0000;color:white">
        <td scope="col" nowrap="">Moderate Flood Stage:</td>
        <td scope="col">15</td>
      </tr>
      <tr style="display:'';line-height:20px;background-color:#FF9900;color:black">
        <td scope="col" nowrap="">Flood Stage:</td>
        <td scope="col">13</td>
      </tr>
      <tr style="display:'';line-height:20px;background-color:#FFFF00;color:black">
        <td scope="col" nowrap="">Action Stage:</td>
        <td scope="col">12</td>
      </tr>
      <tr style="display:none;line-height:20px;background-color:#906320;color:white">
        <td scope="col" nowrap="">Low Stage (in feet):</td>
        <td scope="col">-9999</td>
      </tr>
    </tbody>
  </table><br></div>

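The last row, Low Stage, is unnecessary, and I already filter it out. Here is the code I use to populate alert_list with the appropriate values; it does not capture the "Not Available" entries that I need: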

alert_list = []
alert_values = []
alerts = soup.findAll('td', attrs={'scope':'col'})
for alert in alerts:
    alert_list.append(alert.text.strip())
a_values = alert_list[1::2]
alert_list.clear()
major_lvl = a_values[::5]
moderate_lvl = a_values[1::5]
flood_lvl = a_values[2::5]
action_lvl = a_values[3::5]

And the result:

>>> major_lvl
['18', '26', '0', '11', '0', '17', '17', '18', '0', '683', '16', '0', '20', '16', '18', '665', '661', '18', '651', '645', '15.5', '636', '20', '631', '22', '21', '20.5', '21.5', '20', '20', '20.5', '13.5', '18', '18', '20', '18.5', '17', '14', '18', '19', '25', '25', '25', '26', '25', '24', '22', '25', '33', '34', '29', '34', '40', '40', '0', '0', '0', '42', '42', '0', '0', '0', '0', '0', '44', '47', '43', '35', '46', '52', '55', '0', '44', '57', '50', '57', '64', '40', '34', '26', '20']

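I just noticed why the "Not Available" entry isn't being grabbed: it sits under a tr tag in a td that has no scope="col" attribute, so my findAll call never matches it. How can I add this value so that my values line up?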

Answer 1

If you are only interested in the columns with scope=col, you can do this neatly with a CSS selector.

In [23]: from bs4 import BeautifulSoup as BS

In [24]: soup = BS(html, "html.parser")  # html holds the page source

In [25]: major_list = [td.get_text(strip=True) for td in soup.select("tr > td:nth-of-type(2)[scope=col]")[:-1]]

In [26]: major_list
Out[26]: ['18', '15', '13', '12']

To get all the rows together with their columns, select the rows first, then retrieve the column data for each row.

for tr in soup.select("div[class=box_square] tr"):
    print([td.get_text(strip=True) for td in tr.find_all("td")])
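Building on the row-by-row idea above, one possible way to keep stations aligned is to process each box_square div separately and record None for any stage a station does not list. This is only a sketch under assumptions: SAMPLE_HTML, STAGES, and station_triggers are hypothetical names made up for illustration, and the sample markup assumes every station box is a separate, properly closed div (the real page may differ).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two station boxes, the first with no flood categories.
SAMPLE_HTML = """
<div class="box_square"><table><tbody>
<tr><td nowrap="">Not Available</td></tr>
</tbody></table></div>
<div class="box_square"><table><tbody>
<tr style="x"><td scope="col">Major Flood Stage:</td><td scope="col">18</td></tr>
<tr style="x"><td scope="col">Moderate Flood Stage:</td><td scope="col">15</td></tr>
<tr style="x"><td scope="col">Flood Stage:</td><td scope="col">13</td></tr>
<tr style="x"><td scope="col">Action Stage:</td><td scope="col">12</td></tr>
<tr style="x"><td scope="col">Low Stage (in feet):</td><td scope="col">-9999</td></tr>
</tbody></table></div>
"""

# The four stages the question asks for; Low Stage is deliberately left out.
STAGES = ("Major Flood Stage", "Moderate Flood Stage", "Flood Stage", "Action Stage")


def station_triggers(html):
    """Return one dict per box_square div, with None for missing stages."""
    soup = BeautifulSoup(html, "html.parser")
    stations = []
    for box in soup.select("div.box_square"):
        levels = dict.fromkeys(STAGES)  # every stage starts as None
        for tr in box.select("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) == 2:  # label/value rows only; "Not Available" has one cell
                label = cells[0].rstrip(":")
                if label in levels:
                    levels[label] = cells[1]
        stations.append(levels)
    return stations


stations = station_triggers(SAMPLE_HTML)
major_lvl = [s["Major Flood Stage"] for s in stations]
action_lvl = [s["Action Stage"] for s in stations]
print(major_lvl)
print(action_lvl)
```

Because every station contributes exactly one dict, the per-stage lists always have one entry per station, whether or not that station publishes flood categories.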

Answer 2

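You can also do this with a function. In your case, only the rows you want have a style attribute, so you can walk through all the tags and accept only those that are tr elements with a style attribute.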

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('weather.htm'), 'lxml')
>>> def acceptable(tag):
...     return tag.name=='tr' and tag.has_attr('style')
... 
>>> for tag in soup.find_all(acceptable):
...     tag.text.replace('\n', '').split(':')
...     
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']

Edit, in response to a comment:

Drop acceptable and use this loop instead.

>>> for tag in soup.find_all('tr'):
...     if tag.has_attr('style'):
...         tag.text.replace('\n', '').split(':')
...     elif 'not available' in tag.text.lower():
...         tag.text
...     else:
...         pass
...     
'Not Available'
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']
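If the goal is to keep the stride-of-5 slicing from the question, the loop above could be adapted so that a "Not Available" row contributes five placeholder values, one for each row a full station would have. The following is a rough sketch, not a definitive implementation: flat_values, ROWS_PER_STATION, the "N/A" placeholder, and SAMPLE_HTML are all hypothetical names and data chosen for illustration.

```python
from bs4 import BeautifulSoup

ROWS_PER_STATION = 5  # Major, Moderate, Flood, Action, Low

# Hypothetical sample: one station without categories, one with all five rows.
SAMPLE_HTML = """
<table><tbody><tr><td nowrap="">Not Available</td></tr></tbody></table>
<table><tbody>
<tr style="x"><td scope="col">Major Flood Stage:</td><td scope="col">18</td></tr>
<tr style="x"><td scope="col">Moderate Flood Stage:</td><td scope="col">15</td></tr>
<tr style="x"><td scope="col">Flood Stage:</td><td scope="col">13</td></tr>
<tr style="x"><td scope="col">Action Stage:</td><td scope="col">12</td></tr>
<tr style="x"><td scope="col">Low Stage (in feet):</td><td scope="col">-9999</td></tr>
</tbody></table>
"""


def flat_values(html, placeholder="N/A"):
    """Return a flat list of stage values, padded for stations with no data."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for tr in soup.find_all("tr"):
        if tr.has_attr("style"):
            # Labelled row: keep only the text after the last colon.
            values.append(tr.get_text(strip=True).rsplit(":", 1)[1].strip())
        elif "not available" in tr.get_text().lower():
            # Pad with one placeholder per expected row so strides stay aligned.
            values.extend([placeholder] * ROWS_PER_STATION)
    return values


a_values = flat_values(SAMPLE_HTML)
major_lvl = a_values[::5]
moderate_lvl = a_values[1::5]
flood_lvl = a_values[2::5]
action_lvl = a_values[3::5]
print(major_lvl, action_lvl)
```

This padding only works if every station with data really has exactly five styled rows; if that assumption does not hold on the live page, grouping values per station is the safer route.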

Parsing multiple tables with BeautifulSoup
