lxml cuts off text at the first nested tag

Time: 2018-06-02 09:45:17

Tags: lxml

Please look at this code:

# -*- coding: utf-8 -*-
from lxml import etree
html_fragment = "<body><p>This is html, you can <a href='wikpedia'>learn more</a> on the wikipedia page</p></body>"

tree = etree.fromstring(html_fragment, etree.HTMLParser())

for x in tree.findall(".//p"):
    print(x.text)

This prints:

This is html, you can 

The text is cut off at the a tag. How can I get all of the text inside the p tag?

1 answer:

Answer 0: (score: 0)

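Found the solution: use .text_content() instead of .text. Note that .text_content() is only defined on elements created by lxml.html; on plain etree elements, such as those returned by etree.fromstring in the question, "".join(x.itertext()) gives the same result.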

official doc of lxml
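Some background on why .text stops early: in lxml, an element's .text holds only the text before its first child, and any text following a child is stored on that child's .tail. The sketch below shows two ways to collect the full text of the p element from the question. It is an illustration rather than the answerer's exact code; the variable names texts_html and texts_etree are made up for this example, and html.document_fromstring is used here in place of the question's etree.fromstring so that the elements carry text_content().

```python
from lxml import etree, html

html_fragment = "<body><p>This is html, you can <a href='wikpedia'>learn more</a> on the wikipedia page</p></body>"

# Option 1: parse with lxml.html, whose elements provide text_content().
# document_fromstring returns the <html> root, so ".//p" finds the paragraph.
doc = html.document_fromstring(html_fragment)
texts_html = [p.text_content() for p in doc.findall(".//p")]

# Option 2: keep the question's etree.HTMLParser and join the pieces
# yielded by itertext() (element text, child text, and tails, in order).
tree = etree.fromstring(html_fragment, etree.HTMLParser())
texts_etree = ["".join(p.itertext()) for p in tree.findall(".//p")]

for text in texts_html:
    print(text)
```

Both lists should contain the complete sentence, including the link text and the words after the link.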
