Question

<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>

<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap="">  </td>
</tr0>

How can I parse this with Python? Please help. I would like the result to be the list ['good1', 1, 'good2', None].

Answer 1

Find all the tr tags and collect every td inside them:

from bs4 import BeautifulSoup


page = """<tr>
<td nowrap> good1 </td>
<td nowrap class = "td_left"> 1 </td>
</tr>

<tr>
<td nowrap> good2 </td>
<td nowrap class = "td_left"> 2 </td>
</tr>"""

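# Name the parser explicitly; html.parser adds no <body> wrapper, so search from the root.
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('tr')
print([td.text.strip() for row in rows for td in row.find_all('td')])

This prints (under Python 3):

['good1', '1', 'good2', '2']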

Note that strip() removes the leading and trailing whitespace.
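The sample above uses cleaned-up markup, so here is a rough sketch of one way to handle the markup exactly as posted in the question, with the non-standard tr0 tag and the empty cell. It assumes the html.parser backend, which keeps unknown tags in the tree, and it looks up td directly so the wrapper's name does not matter. The convert_cell helper is hypothetical (not part of BeautifulSoup): it assumes empty cells should become None and digit-only cells should become int.

```python
from bs4 import BeautifulSoup

# The question's original markup, including the non-standard <tr0> tag
# and the empty second cell.
page = """<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>

<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap="">  </td>
</tr0>"""


def convert_cell(text):
    """Hypothetical helper: '' -> None, digit-only strings -> int, else unchanged."""
    text = text.strip()
    if not text:
        return None
    if text.isdigit():
        return int(text)
    return text


soup = BeautifulSoup(page, 'html.parser')
# Search for td directly so the unknown <tr0> wrapper does not matter.
result = [convert_cell(td.get_text()) for td in soup.find_all('td')]
print(result)
```

With the input above this should give ['good1', 1, 'good2', None]. If the real cells can hold negative or decimal numbers, isdigit() will not match them and the conversion would need to be adjusted.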

Hope that helps.
