Question

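I am using BeautifulSoup to parse an HTML page. I need to process the first table on the page. That table contains several rows. Each row contains some 'td' tags, and one of the 'td' tags contains an 'img' tag. I want to get all the information in that table, but when I print the table I do not get any data related to the 'img' tag.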

I use soup.findAll("table") to get all the tables and then pick the first one to process. The HTML looks like this:

<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border = "0"> <tr> <td class="gridcell">
               <img id="img_id" title="img_title" src="img_src" alt="img_alt" /> </td> </tr>
      </table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>

How can I get all the data in the table, including the 'img' tag? Thanks.

Answer 1

You have a nested table, so you need to check where you are in the tree before parsing the tr / td / img tags.

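from bs4 import BeautifulSoup

# Read the page; test.html is assumed to hold the markup from the question.
with open('test.html', 'rb') as f:
    html = f.read()

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')

for table in tables:
    # Only handle tables that sit inside another table.
    if table.find_parent("table") is not None:
        for tr in table.find_all('tr'):
            # Search the current row rather than the whole table,
            # so cells are not visited once per row.
            for td in tr.find_all('td'):
                for img in td.find_all('img'):
                    print(img['id'])
                    print(img['src'])
                    print(img['title'])
                    print(img['alt'])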

With your example, this prints the following:

img_id
img_src
img_title
img_alt
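
If the goal is every cell of the first table together with any images inside it, one possible approach is to walk only the outer table's own rows and cells and collect the text and the img attributes for each cell. The following is a minimal sketch, not a definitive solution: the helper name first_table_cells and the HTML constant are hypothetical, the sample markup is modelled on the question with its tags closed, and the html.parser backend is assumed (other parsers may insert a tbody element, which would need an extra step).

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (hypothetical, with tags closed).
HTML = """
<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border="0"><tr><td class="gridcell">
        <img id="img_id" title="img_title" src="img_src" alt="img_alt" />
      </td></tr></table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>
"""


def first_table_cells(html):
    """Return one dict per top-level cell of the first table (hypothetical helper)."""
    # Assumption: html.parser, which does not add <tbody> around the rows.
    soup = BeautifulSoup(html, "html.parser")
    first = soup.find("table")  # first table in document order
    cells = []
    # recursive=False stays on the outer table's own rows and cells,
    # so the nested table is reported as part of its enclosing cell.
    for tr in first.find_all("tr", recursive=False):
        for td in tr.find_all("td", recursive=False):
            cells.append({
                "text": td.get_text(strip=True),
                # find_all is recursive by default, so nested images are found.
                "images": [dict(img.attrs) for img in td.find_all("img")],
            })
    return cells


if __name__ == "__main__":
    for cell in first_table_cells(HTML):
        print(cell)
```

Keeping the text and the image attributes side by side in one record per cell means the nested table does not have to be handled as a separate case by the caller.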

Parsing a table with an img tag in Python with BeautifulSoup
