BeautifulSoup help: how do I extract content from malformed tags in an HTML file?

Date: 2014-03-10 05:27:38

Tags: python beautifulsoup

<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>

<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap="">  </td>
</tr0>

How can I parse this with Python? Please help. I would like the result to be the list ['good1', 1, 'good2', None]

1 answer:

Answer 0 (score: 0)

Find all the tr tags and collect every td inside them:

from bs4 import BeautifulSoup


page = """<tr>
<td nowrap> good1 </td>
<td nowrap class = "td_left"> 1 </td>
</tr>

<tr>
<td nowrap> good2 </td>
<td nowrap class = "td_left"> 2 </td>
</tr>"""

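# Name the parser explicitly; html.parser does not add <html>/<body>
# wrappers, so search from the soup itself rather than soup.body.
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('tr')
print([td.text.strip() for row in rows for td in row.find_all('td')])

This prints:

['good1', '1', 'good2', '2']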

Note that strip() removes the leading and trailing whitespace from each cell's text.
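The markup in the question also contains a malformed `<tr0>` tag and an empty cell, and the expected result has `1` as a number and `None` for the empty cell. The following is only a sketch of one way to handle that. It assumes that `html.parser` keeps an unknown tag such as `tr0` as an ordinary element, and that row tags always look like `tr` followed by optional digits. The helper `cell_value` and the pattern `^tr\d*$` are illustrative names and choices, not part of the original answer.

```python
import re

from bs4 import BeautifulSoup

# The markup exactly as given in the question, including the <tr0> tag.
page = """<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>

<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap="">  </td>
</tr0>"""

soup = BeautifulSoup(page, "html.parser")


def cell_value(td):
    """Hypothetical helper: map a cell to None, an int, or a stripped string."""
    text = td.get_text(strip=True)
    if not text:
        return None          # empty or whitespace-only cell
    if text.isdigit():
        return int(text)     # purely numeric cell
    return text


# Assumption: row tags are named "tr", optionally followed by digits ("tr0").
rows = soup.find_all(re.compile(r"^tr\d*$"))
result = [cell_value(td) for row in rows for td in row.find_all("td")]
print(result)
```

If the row tags are too irregular to match with a pattern, calling `soup.find_all("td")` directly and ignoring the rows may be simpler, as long as every `td` in the document is wanted.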

Hope that helps.