How to extract an HTML table using BeautifulSoup

Asked: 2014-05-22 14:25:35

Tags: python html html-parsing beautifulsoup parent

Take the following HTML snippet as an example:

>>> soup
<table>
<tr><td class="abc">This is ABC</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>

<table>
<tr><td class="efg">This is EFG</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>

Suppose the only way I can identify the table I want is by the class of one of its td cells:

>>> soup.findAll("td", {"class": "abc"})
[<td class="abc">This is ABC</td>]

How can I extract the whole enclosing table, like this?

<table>
<tr><td class="abc">This is ABC</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>

1 answer:

Answer 0 (score: 1)

Find the td, then call find_parent() on it to get the table that contains it:

soup.find("td", {"class":"abc"}).find_parent('table')
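One caveat with the one-liner: find() returns None when no cell matches, so the chained find_parent() call would then raise AttributeError. A minimal defensive sketch follows. The helper name table_for_cell_class is hypothetical (it is not part of BeautifulSoup), and the standard-library "html.parser" backend is an assumption chosen so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup


def table_for_cell_class(soup, css_class):
    """Return the <table> enclosing the first <td> with css_class, or None.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    cell = soup.find("td", class_=css_class)
    if cell is None:
        # No matching cell: avoid calling find_parent() on None.
        return None
    return cell.find_parent("table")


html = """
<table>
<tr><td class="abc">This is ABC</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
<table>
<tr><td class="efg">This is EFG</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = table_for_cell_class(soup, "abc")      # the first table
missing = table_for_cell_class(soup, "nope")   # None, no such class
```

The class_ keyword used here is equivalent to passing {"class": "abc"} as the attrs dictionary.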

Demo:

>>> from bs4 import BeautifulSoup
>>> data = """
... <div>
...     <table>
...         <tr><td class="abc">This is ABC</td>
...         </tr>
...         <tr><td class="firstdata"> data1_xxx </td>
...         </tr>
...     </table>
... 
...     <table>
...         <tr><td class="efg">This is EFG</td>
...         </tr>
...         <tr><td class="firstdata"> data1_xxx </td>
...         </tr>
...     </table>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> print(soup.find("td", {"class": "abc"}).find_parent("table"))
<table>
        <tr><td class="abc">This is ABC</td>
        </tr>
        <tr><td class="firstdata"> data1_xxx </td>
        </tr>
    </table>
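Once the enclosing table has been located this way, its contents can be read row by row. The following is a small sketch rather than part of the original answer; the variable names and the choice of "html.parser" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div>
    <table>
        <tr><td class="abc">This is ABC</td></tr>
        <tr><td class="firstdata"> data1_xxx </td></tr>
    </table>
    <table>
        <tr><td class="efg">This is EFG</td></tr>
        <tr><td class="firstdata"> data1_xxx </td></tr>
    </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the table through one of its cells, as in the answer above.
table = soup.find("td", {"class": "abc"}).find_parent("table")

# Collect the stripped text of every cell, one inner list per <tr>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(rows)
```

Only the rows of the matched table are collected; the second table, which shares the "firstdata" class, is not touched.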