我正在尝试使用美丽的汤解析行表并保存dict中每行的值。
一个打嗝是表的结构有一些行作为节标题。因此,对于具有类“header”的任何行,我想定义一个名为“section”的变量。这就是我所拥有的,但它不起作用,因为它说['class'] TypeError: string indices must be integers
这就是我所拥有的:
for i in credits.contents:
if i['class'] == 'header':
section = i.contents
DATA_SET[section] = {}
else:
DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents
数据示例:
<table class="credits">
<tr class="header">
<th colspan="3"><h1>HEADER NAME</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr class="header">
<th colspan="3"><h1>HEADER NAME</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
</table>
答案 0 :(得分:3)
这是一个解决方案,稍微调整您的示例数据,以便结果更清晰:
from BeautifulSoup import BeautifulSoup
from pprint import pprint
html = '''<body><table class="credits">
<tr class="header">
<th colspan="3"><h1>HEADER 1</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA11</p></td>
<td class="data_point_2"><p>DATA12</p></td>
<td class="data_point_3"><p>DATA12</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA21</p></td>
<td class="data_point_2"><p>DATA22</p></td>
<td class="data_point_3"><p>DATA23</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA31</p></td>
<td class="data_point_2"><p>DATA32</p></td>
<td class="data_point_3"><p>DATA33</p></td>
</tr>
<tr class="header">
<th colspan="3"><h1>HEADER 2</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA11</p></td>
<td class="data_point_2"><p>DATA12</p></td>
<td class="data_point_3"><p>DATA13</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA21</p></td>
<td class="data_point_2"><p>DATA22</p></td>
<td class="data_point_3"><p>DATA23</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA31</p></td>
<td class="data_point_2"><p>DATA32</p></td>
<td class="data_point_3"><p>DATA33</p></td>
</tr>
</table></body>'''
soup = BeautifulSoup(html)
rows = soup.findAll('tr')
section = ''
dataset = {}
for row in rows:
if row.attrs:
section = row.text
dataset[section] = {}
else:
cells = row.findAll('td')
for cell in cells:
if cell['class'] in dataset[section]:
dataset[section][ cell['class'] ].append( cell.text )
else:
dataset[section][ cell['class'] ] = [ cell.text ]
pprint(dataset)
产地:
{u'HEADER 1': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
u'data_point_3': [u'DATA12', u'DATA23', u'DATA33']},
u'HEADER 2': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
u'data_point_3': [u'DATA13', u'DATA23', u'DATA33']}}
编辑适应您的解决方案
您的代码很整洁,只有几个问题。您可以在contents
或text
的地方使用findAll
- 我修复了以下内容:
soup = BeautifulSoup(html)
credits = soup.find('table')
section = ''
DATA_SET = {}
for i in credits.findAll('tr'):
if i.get('class', '') == 'header':
section = i.text
DATA_SET[section] = {}
else:
DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents
print DATA_SET
请注意,如果连续的单元格具有相同的data_point
类,则连续的行将替换之前的行。我怀疑这不是你的真实数据集中的问题,但这就是为什么你的代码会返回这个,缩写,结果:
{u'HEADER 2': {'data_point_2': [u'DATA32'],
'data_point_3': [u'DATA33'],
'data_point_1': [u'DATA31']},
u'HEADER 1': {'data_point_2': [u'DATA32'],
'data_point_3': [u'DATA33'],
'data_point_1': [u'DATA31']}}