I have some HTML that looks like this:
<tr>
<td>some text</td>
<td>some other text</td>
<td>some <b>problematic</b> other <br /> text</td>
</tr>
And some Python that tries to get the value of each tag and print each inner value:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
soup = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
for row in soup.findAll('tr'):
    print repr(row)  # this prints the whole 'tr' element text just fine.
    for col in row.contents:
        print col.string
So the full row prints the captured HTML correctly, but 'col' prints None for the last element:
some text
some other text
None
I'm not familiar with BeautifulSoup or Python, but it seems the inner tags in the last element are causing a parsing problem?
Thanks
Answer 0 (score: 0)
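This is not a parsing problem: .string is None whenever a tag has more than one child node, which is the case for the last cell. You can upgrade to BeautifulSoup version 4 and use .stripped_strings: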
from bs4 import BeautifulSoup  # BeautifulSoup 4
soup = BeautifulSoup(data)
for row in soup.find_all('tr'):
    print '\n'.join(row.stripped_strings)
In BeautifulSoup 3, you need to search for all contained text:
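for row in soup.findAll('tr'):
    # findAll(text=True) returns every text node under the row;
    # skip the whitespace-only ones between tags.
    print '\n'.join(el.strip() for el in row.findAll(text=True) if el.strip())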
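
If you want one line per cell, as in the expected output from the question, rather than one line per text fragment, something like the following may be closer. It is only a sketch: it assumes Python 3 with the bs4 package installed and the built-in "html.parser" backend, and cell_texts is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
data = """
<tr>
<td>some text</td>
<td>some other text</td>
<td>some <b>problematic</b> other <br /> text</td>
</tr>
"""

soup = BeautifulSoup(data, "html.parser")


def cell_texts(row):
    """Return one string per <td>, joining all nested text fragments."""
    # stripped_strings yields every text node under the cell, with
    # surrounding whitespace removed and whitespace-only nodes skipped.
    return [" ".join(td.stripped_strings) for td in row.find_all("td")]


rows = [cell_texts(row) for row in soup.find_all("tr")]
for cells in rows:
    for text in cells:
        print(text)
```

Iterating over find_all("td") instead of row.contents also skips the newline text nodes that sit between the cells.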