I have an HTML structure like this:
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
These attributes are not always present: sometimes I only have Brand, other times Brand and Flavoring.
To scrape this, I wrote code like this:
from collections import namedtuple
BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows = soup.find('table', id='stats').find_all('tr')
# Positional lookup (0-based): assumes all four rows exist, in this exact order
bi = BlendInfo(brand = stats_rows[0].td.get_text(),
               type = stats_rows[1].td.get_text(),
               contents = stats_rows[2].td.get_text(),
               flavoring = stats_rows[3].td.get_text())
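But as expected, when the table is ordered differently (Blend Type before Brand) or some rows are missing (no Contents), it fails with an index out of bounds error (or the values end up in the wrong fields).
Is there a better way to do this, something like: give me the data from the row whose header is the string 'Brand'?
Answer 0 (score: 2)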
It is definitely possible. Take a look at this:
from bs4 import BeautifulSoup
html_content='''
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html_content,"lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    # Strip the padding whitespace around each cell's text
    header = item.text.strip()
    value = item.find_next_sibling("td").text.strip()
    print(header, value)
Output:
Brand 2 Guys Smoke Shop
Blend Type Aromatic
Contents Black Cavendish, Virginia
Flavoring Other / Misc
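To get closer to what the question asks for (the value of the row whose header is, say, 'Brand'), the same idea can be wrapped in a small lookup function. This is only a sketch: `get_stat` is a hypothetical helper name, the sample table deliberately omits two rows to show the missing-row case, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample table with only some of the rows present (an assumption for the demo)
html_content = """
<table class="info" id="stats">
<tbody>
<tr><th> Brand </th><td> 2 Guys Smoke Shop </td></tr>
<tr><th> Flavoring </th><td> Other / Misc </td></tr>
</tbody>
</table>
"""

def get_stat(soup, header):
    """Return the <td> text of the row whose <th> matches header, or None."""
    table = soup.find("table", id="stats")
    if table is None:
        return None
    for th in table.find_all("th"):
        # Compare case-insensitively, ignoring padding whitespace
        if th.get_text(strip=True).lower() == header.lower():
            td = th.find_next_sibling("td")
            return td.get_text(strip=True) if td else None
    return None

soup = BeautifulSoup(html_content, "html.parser")
print(get_stat(soup, "Brand"))     # 2 Guys Smoke Shop
print(get_stat(soup, "Contents"))  # None
```

Because the lookup goes by header text rather than row position, neither a different row order nor a missing row raises an IndexError; a missing row simply yields None.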
Answer 1 (score: 1)
This builds a dictionary for you:
from bs4 import BeautifulSoup
# Lowercased header texts to keep; the table header is "Blend Type", not "Type"
valid_headers = ['brand', 'blend type', 'contents', 'flavoring']
t = """<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>"""
bs = BeautifulSoup(t, "html.parser")
results = {}
for row in bs.find_all('tr'):
    th = row.find('th')
    td = row.find('td')
    # Skip rows that lack a header cell or a value cell
    if th is None or td is None:
        continue
    header = th.get_text(strip=True)
    if header.lower() in valid_headers:
        results[header] = td.get_text(strip=True)
print(results)
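If you still want the `BlendInfo` namedtuple from the question, a dictionary like this one can be mapped onto it, with None standing in for rows that are absent. A minimal sketch: `HEADER_TO_FIELD` and `to_blend_info` are hypothetical names, and the header-to-field mapping is an assumption based on the sample table.

```python
from collections import namedtuple

BlendInfo = namedtuple("BlendInfo", ["brand", "type", "contents", "flavoring"])

# Hypothetical mapping from table header text to BlendInfo field names
HEADER_TO_FIELD = {
    "Brand": "brand",
    "Blend Type": "type",
    "Contents": "contents",
    "Flavoring": "flavoring",
}

def to_blend_info(results):
    """Build a BlendInfo from a {header: value} dict; missing fields become None."""
    fields = dict.fromkeys(BlendInfo._fields)  # every field starts as None
    for header, value in results.items():
        field = HEADER_TO_FIELD.get(header.strip())
        if field is not None:
            fields[field] = value.strip()
    return BlendInfo(**fields)

bi = to_blend_info({"Brand": "2 Guys Smoke Shop", "Flavoring": "Other / Misc"})
print(bi)
# BlendInfo(brand='2 Guys Smoke Shop', type=None, contents=None, flavoring='Other / Misc')
```

Unknown headers are ignored and the order of the rows does not matter, so the two failure cases from the question (reordered or missing rows) no longer break the construction.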