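I have some very messy HTML markup and want to extract the paragraph text without the HTML, but I find that I can only get the first part. For example, the HTML looks like: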
<p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>'s work <strong>"Blah <nobr><strong><span style="font-size:14pt"><strong>blah</strong></span></strong></nobr> and <nobr><strong><span style="font-size:14pt"><strong>Nothing</strong></span></strong></nobr> quote"</strong>lalal</p>
<p>More text<strong><nobr><strong>More text</strong></nobr></strong> blah blah</p>
I am trying:
converted = html.fromstring(body)
para = converted.xpath('//*[starts-with(name(), "p")]')
and looping over the result with:
string_content = ''
for p in para:
    if p.text is not None:
        string_content += ' ' + p.text
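But I only get one <p> element, the first one. This code does not seem to capture everything I need; it usually returns only the first piece of information.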
Answer 0 (score: 0)
If you want to get everything inside the p tags, you can do the following:
from lxml import html

# Triple quotes keep the apostrophe in "people's" from ending the string early.
body = '''<p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>'s work <strong>"Blah <nobr><strong><span style="font-size:14pt"><strong>blah</strong></span></strong></nobr> and <nobr><strong><span style="font-size:14pt"><strong>Nothing</strong></span></strong></nobr> quote"</strong>lalal</p><p>More text<strong><nobr><strong>More text</strong></nobr></strong> blah blah</p>'''
converted = html.fromstring(body)
# Select every <p> element in the document.
para = converted.xpath('//p')
# text_content() returns the text of an element together with all of its descendants.
content = [p.text_content() for p in para if p.text_content()]
content = ' '.join(content)
print(content)
The result is:
BLAH BLAHpeople's work "Blah blah and Nothing quote"lalal More textMore text blah blah
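For anyone wondering why the loop in the question seemed to return only the first piece of each paragraph: in lxml, an element's .text attribute holds only the text that comes before its first child element, and anything after a child lives in that child's .tail. The sketch below uses a shortened, hypothetical version of the question's markup (the wrapping div and the variable names are mine, not from the original post) to compare .text with .text_content() for each paragraph.

```python
from lxml import html

# Shortened stand-in for the markup in the question; the outer <div> is
# only there so the fragment has a single root element.
body = (
    "<div>"
    "<p>BLAH BLAH<strong><strong>people</strong></strong>'s work</p>"
    "<p>More text<strong>More text</strong> blah blah</p>"
    "</div>"
)

root = html.fromstring(body)
paragraphs = root.xpath("//p")

# .text stops at the first child element of each <p>.
leading_text = [p.text for p in paragraphs]

# .text_content() collects the text of the whole subtree, including the
# .tail text that follows each child element.
full_text = [p.text_content() for p in paragraphs]

print(leading_text)
print(full_text)
```

Both paragraphs are matched in either case; the difference is only in how much of each paragraph's text comes back. Keeping the results as a list, rather than joining them right away, also makes it easy to process the paragraphs one at a time.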