<!-- This is the first table, from which I get 4 ids (abc1 to abc4) that I need to match against the table below to get the required data -->
<table width="100%" border="0" class="BigClass">
<tbody>..</tbody>
</table>
<!-- This is the second table -->
<table width="100%" border="0" class="BigClass">
<tbody>
<tr align="left">
<td valign="top" colspan="2">
<strong> 1.
First Topic
</strong>
<a name="abc1" id="abc1"></a>
</td>
</tr>
<!-- This is where the first speaker and his/her text appear -->
<tr align="left">
<td style="text-align:justify;line-height:2;padding-right:10px;" colspan="2">
<strong> " First Speaker " </strong>
<br>
" Some Text "
</td>
</tr>
<!-- This is where the second speaker comes in -->
<tr align="left">
<td style="text-align:justify;line-height:2;padding-right:10px;" colspan="2">
<strong> " Second Speaker " </strong>
<br>
" Some Text "
</td>
</tr>
<tr><td colspan="2"><br></td></tr>
<tr><td colspan="2"><br></td></tr>
<!-- Then comes the row with the next id -->
<tr align="left">
<td valign="top" colspan="2">
<strong> 2.
Second Topic
</strong>
<a name="abc2" id="abc2"></a>
</td>
</tr>
<!-- Just like before, this is followed by a set of speakers, each with some text -->
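I have two tables with the same class name, BigClass. From the first table I extracted 4 ids: abc1, abc2, abc3 and abc4. Now I want to check whether each of these ids exists in the second table (it does). Once an id has been matched in the second table, I want to extract the speakers and the text of those speakers. The code above shows the structure of the second table, from which I want to extract the data.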
Answer 0 (score: 0)
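The simplest way to extract the speaker and text information seems to be to collect all the ids in one list and all the speaker information in another, then cross-reference the required ids to get the corresponding speaker information.
Here I pair the two lists up as key and value, with the id as the key and the speaker information as the value. The condition I use to find speaker information is that the td element has a style attribute, which is defined on every cell that contains speaker information.
To extract the information from the HTML, I use the BeautifulSoup library.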
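from bs4 import BeautifulSoup

# Parse the saved page; naming the parser explicitly avoids a bs4 warning.
with open('table.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

idList = []
speakerList = []
idsRequired = ['abc1', 'abc2']

# Collect the id of every anchor that has one.
for a in soup.find_all('a'):
    if a.has_attr('id'):
        idList.append(a['id'])

# In this markup, the cells with a style attribute are the speaker cells.
for td in soup.find_all('td'):
    if td.has_attr('style'):
        speakerList.append(td.text)

# zip pairs the i-th id with the i-th speaker cell, in document order.
for key, value in zip(idList, speakerList):
    if key in idsRequired:
        print(value)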
This gives the output:
" First speaker "
" Some text "
" Second speaker "
" Some text "
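The loop above pairs ids and speaker cells purely by position, so a topic with several speakers is not kept together. If each id should map to all the speakers that follow it, one option is to walk the cells in document order and file every styled cell under the most recent id. Below is a minimal sketch under that assumption. It uses the standard library's `html.parser` rather than BeautifulSoup only so that it runs without extra installs, and the names `SpeakerCollector`, `speakers_for` and the `SAMPLE` markup (including its third speaker) are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample shaped like the second table in the question.
SAMPLE = """
<table class="BigClass"><tbody>
<tr><td valign="top" colspan="2"><strong>1. First Topic</strong><a name="abc1" id="abc1"></a></td></tr>
<tr><td style="text-align:justify;" colspan="2"><strong>First Speaker</strong><br>Some Text</td></tr>
<tr><td style="text-align:justify;" colspan="2"><strong>Second Speaker</strong><br>Other Text</td></tr>
<tr><td colspan="2"><br></td></tr>
<tr><td valign="top" colspan="2"><strong>2. Second Topic</strong><a name="abc2" id="abc2"></a></td></tr>
<tr><td style="text-align:justify;" colspan="2"><strong>Third Speaker</strong><br>More Text</td></tr>
</tbody></table>
"""


def _clean(parts):
    """Join text chunks and collapse runs of whitespace."""
    return " ".join("".join(parts).split())


class SpeakerCollector(HTMLParser):
    """Group styled <td> cells under the most recent <a id=...> anchor."""

    def __init__(self):
        super().__init__()
        self.by_id = {}            # topic id -> list of (speaker, text)
        self.current_id = None     # id of the last topic heading seen
        self.in_speaker_td = False
        self.in_strong = False
        self.speaker, self.text = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "id" in attrs:
            self.current_id = attrs["id"]
            self.by_id.setdefault(self.current_id, [])
        elif tag == "td" and "style" in attrs:
            # Assumption from the answer: only speaker cells carry a style.
            self.in_speaker_td = True
            self.speaker, self.text = [], []
        elif tag == "strong" and self.in_speaker_td:
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
        elif tag == "td" and self.in_speaker_td:
            self.in_speaker_td = False
            if self.current_id is not None:
                self.by_id[self.current_id].append(
                    (_clean(self.speaker), _clean(self.text))
                )

    def handle_data(self, data):
        if self.in_speaker_td:
            # Text inside <strong> is the speaker name, the rest is the speech.
            (self.speaker if self.in_strong else self.text).append(data)


def speakers_for(html, wanted_ids):
    """Return {id: [(speaker, text), ...]} for the ids found in the markup."""
    parser = SpeakerCollector()
    parser.feed(html)
    parser.close()
    return {i: parser.by_id[i] for i in wanted_ids if i in parser.by_id}


for topic_id, entries in speakers_for(SAMPLE, ["abc1", "abc2"]).items():
    for speaker, text in entries:
        print(topic_id, speaker, text)
```

Ids that do not occur in the markup are simply left out of the result, which also covers the "check whether the id exists in the second table" step from the question. The same walk could be written with BeautifulSoup by iterating over `soup.find_all('td')` in document order.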