How to parse names and values from an HTML file

Asked: 2015-08-12 14:29:18

Tags: python beautifulsoup html-parsing

This question is related to my other question, How to get next form content in python.

I have some HTML content:

<tr>
<td><strong>User key: </strong></td>
<td>0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8</td>
</tr>
<tr>
<td><strong>Institute id: </strong></td>
<td>
      030780ffa3641183273ad548ae09872f9dcf4b0c4267<br/>000d6f0004c468345445535453454341010910830123<br/>4567890a<br/> </td>
</tr>
<tr>
<td><strong>part id:</strong></td>
<td>00ecd01536ff66296f9d572219d7acac02d59b24c6</td>
</tr>
<tr>

I need to parse it and produce this output:

User key: 0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8
Institute id: 030780ffa3641183273ad548ae09872f9dcf4b0c4267000d6f0004c4683454455354534543410109108301234567890a
part id: 00ecd01536ff66296f9d572219d7acac02d59b24c6

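I have gone through http://www.crummy.com/software/BeautifulSoup/bs4/doc/ and tried a few things, but I could not work out what I need to do to get the desired output. I am new to Python programming. Here is my attempt: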

html_doc = """
<tr>
<td><strong>User key: </strong></td>
<td>0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8</td>
</tr>
<tr>
<td><strong>Institute id: </strong></td>
<td>
      030780ffa3641183273ad548ae09872f9dcf4b0c4267<br/>000d6f0004c468345445535453454341010910830123<br/>4567890a<br/> </td>
</tr>
<tr>
<td><strong>part id:</strong></td>
<td>00ecd01536ff66296f9d572219d7acac02d59b24c6</td>
</tr>
<tr>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')


# Prints each <strong> tag, which gives only the labels and not the values
for link in soup.find_all('strong'):
    print(link)

1 Answer:

Answer 0 (score: 1)

One approach is to find all the tr tags first, then find all the td tags inside each tr and print their text. Example:

>>> for i in soup.findAll('tr'):
...     for tdi in i.findAll('td'):
...         print(tdi.text.strip(), end=' ')
...     print()
...
User key: 0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8
Institute id: 030780ffa3641183273ad548ae09872f9dcf4b0c4267000d6f0004c4683454455354534543410109108301234567890a
part id: 00ecd01536ff66296f9d572219d7acac02d59b24c6
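
If the pairs are needed as data rather than printed text, the same row-by-row idea can be sketched as a small function that returns a dict. This is only an illustration under some assumptions: parse_pairs is a hypothetical helper name, the sample is a trimmed copy of the HTML from the question, and each useful row is assumed to hold exactly two td cells (label, then value). The trailing colon is dropped from the label so it can serve as a key.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (sample data).
html_doc = """
<tr>
<td><strong>User key: </strong></td>
<td>0200fde8a7f3d1084224962a4e7c54e69ac3f04da6b8</td>
</tr>
<tr>
<td><strong>Institute id: </strong></td>
<td>
      030780ffa3641183273ad548ae09872f9dcf4b0c4267<br/>000d6f0004c468345445535453454341010910830123<br/>4567890a<br/> </td>
</tr>
<tr>
<td><strong>part id:</strong></td>
<td>00ecd01536ff66296f9d572219d7acac02d59b24c6</td>
</tr>
"""


def parse_pairs(html):
    """Hypothetical helper: map each row's label to its value."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # assumption: skip rows that are not label/value pairs
        # Label text with surrounding whitespace and the trailing colon removed.
        name = cells[0].get_text(strip=True).rstrip(":")
        # stripped_strings yields each text fragment around the <br/> tags,
        # already trimmed, so joining them rebuilds the multi-line value.
        value = "".join(cells[1].stripped_strings)
        pairs[name] = value
    return pairs


pairs = parse_pairs(html_doc)
for name, value in pairs.items():
    print(f"{name}: {value}")
```

Joining stripped_strings instead of calling text.strip() also removes whitespace inside the value, which matters if the source HTML indents the lines between the br tags.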