XPath not working as expected

Time: 2018-05-12 13:23:42

Tags: python xpath

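I have some very messy HTML and want to extract the paragraph text without the markup, but I find that I only get the first paragraph. For example, the HTML looks like: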

   <p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>&#39;s work <strong>&quot;Blah <nobr><strong><span style="font-size:14pt"><strong>blah</strong></span></strong></nobr> and <nobr><strong><span style="font-size:14pt"><strong>Nothing</strong></span></strong></nobr> quote&quot;</strong>lalal</p>

<p>More text<strong><nobr><strong>More text</strong></nobr></strong> blah blah</p>

I am trying:

from lxml import html

converted = html.fromstring(body)
para = converted.xpath('//*[starts-with(name(), "p")]')

and looping over the result with:

string_content = ''
for p in para:          
    if p.text is not None:
        string_content += ' ' + p.text

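But I only get one <p> element, the first one. This code does not seem to retrieve everything I need; it usually returns only the first piece of information.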

1 Answer:

Answer 0: (score: 0)

If you want to get everything inside the p tags, you can do the following:

from lxml import html

body = '<p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>&#39;s work <strong>&quot;Blah <nobr><strong><span style="font-size:14pt"><strong>blah</strong></span></strong></nobr> and <nobr><strong><span style="font-size:14pt"><strong>Nothing</strong></span></strong></nobr> quote&quot;</strong>lalal</p><p>More text<strong><nobr><strong>More text</strong></nobr></strong> blah blah</p>'

converted = html.fromstring(body)
para = converted.xpath('//p')

content = [p.text_content() for p in para if p.text_content()]
content = ' '.join(content)
print(content)

The result is:

BLAH BLAHpeople's work "Blah blah and Nothing quote"lalal More textMore text blah blah
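
One likely reason the loop in the question seems to return only the first piece of information is that an element's .text attribute holds only the text that comes before its first child element; everything inside or after nested tags is stored on the descendants. The sketch below illustrates this with the standard library's xml.etree.ElementTree rather than lxml, so that it runs without third-party packages. lxml elements expose the same .text, .tail and itertext() members. The shortened markup and the wrapping <div> root are assumptions made for the example, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Shortened version of the markup from the question, wrapped in a
# hypothetical <div> root so that it parses as well-formed XML.
body = (
    '<div>'
    '<p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>'
    "'s work</p>"
    '<p>More text<strong>More text</strong> blah blah</p>'
    '</div>'
)

root = ET.fromstring(body)
paragraphs = root.findall('.//p')

# .text holds only the text that precedes the first child element.
leading_only = [p.text for p in paragraphs]

# itertext() yields the text of the element and all of its descendants
# (including tail text) in document order.
full_text = [''.join(p.itertext()) for p in paragraphs]

print(leading_only)
print(full_text)
```

With lxml.html, p.text_content() plays the role of ''.join(p.itertext()) here. Note also that both paragraphs are matched in either case; only the amount of text read from each one differs.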