How to select text without the HTML markup

Asked: 2015-04-01 19:02:38

Tags: python html xpath web-scraping lxml

I am working on a web scraper (in Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I would like to extract the text from this element. Right now, I could use something along the lines of

//p[@class='something']//text()

but this results in each chunk of text ending up as a separate result element, like this:

('This class has some ', 'text', ' and a few ', 'links', ' in it.')

The desired output would contain all the text in a single element, like this:

This class has some text and a few links in it.

Is there a simple or elegant way to achieve this?

Edit: here is the code that produces the result shown above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

3 Answers:

Answer 0 (score: 3)

You can use normalize-space() in the XPath expression. Then

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will produce

This class has some text and a few links in it.
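
One caveat worth keeping in mind: in XPath 1.0, a string function such as normalize-space() that receives a node-set only uses the first node in document order, so the query above returns the text of the first matching paragraph only. A minimal sketch of one way to handle several matches follows. The two-paragraph snippet and the variable names are made up for illustration and are not from the question. The idea is to select the elements first and then evaluate normalize-space(.) relative to each one:

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs (not from the question).
html_snippet = (
    '<div>'
    '<p class="something">First   <strong>bold</strong> paragraph.</p>'
    '<p class="something">Second <a href="http://www.example.com">link</a>\n paragraph.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# normalize-space() applied to a node-set only looks at the first node.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluate normalize-space(.) once per matched element instead.
texts = [p.xpath("normalize-space(.)") for p in tree.xpath("//p[@class='something']")]

print(first_only)
print(texts)
```

Here first_only holds just the first paragraph's text, while texts holds one whitespace-normalized string per matching paragraph.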

Answer 1 (score: 1)

Instead of using XPath to get the text, you can call .text_content() on the lxml element.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
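
Unlike normalize-space(), .text_content() returns the text exactly as it appears in the source, including line breaks and indentation. If the scraped HTML is pretty-printed, a common follow-up step is to collapse the whitespace in plain Python. The sketch below assumes a hypothetical snippet with a line break inside the paragraph:

```python
from lxml import html

# Hypothetical snippet with a line break and indentation inside the paragraph.
html_snippet = '<p class="something">This class has\n    some <strong>text</strong> in it.</p>'

element = html.fromstring(html_snippet)

raw_text = element.text_content()        # whitespace kept as in the source
clean_text = " ".join(raw_text.split())  # collapse whitespace runs, trim the ends

print(repr(raw_text))
print(repr(clean_text))
```

Splitting on whitespace and re-joining with a single space gives roughly the same effect as normalize-space(), but it is done on the Python side after extraction.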

Answer 2 (score: 0)

An alternative one-liner on top of the original code: use join with an empty string as the separator

print("".join(query_results))
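
The one-liner relies on query_results from the question's code and would merge the text of every matching paragraph into one string. A self-contained sketch that joins the text nodes per element might look like the following. The relative query .//text() and the variable name joined are choices made here for illustration:

```python
from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

tree = html.fromstring(html_snippet)

# Join the text nodes of each matched element separately, so that
# several matching paragraphs are not merged into a single string.
joined = ["".join(p.xpath(".//text()")) for p in tree.xpath("//p[@class='something']")]

print(joined)
```

With the snippet from the question this yields a list containing the single string "This class has some text and a few links in it.".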