Question

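I'd like to write a code snippet that grabs all of the text inside the <content> tag, using lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?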

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Answer 1

Does text_content() do what you need?

Answer 2

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
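
To make the behaviour concrete, here is a minimal sketch that runs the itertext() approach over the three cases from the question. The names CASES and text_only are illustrative only, and the fallback to the standard library's xml.etree.ElementTree is an assumption added so the sketch runs where lxml is not installed. Note that itertext() yields text only, so the <div> markup the question asks for is dropped.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree offers the same fromstring()/itertext()
    # calls, so the sketch still runs without lxml.
    import xml.etree.ElementTree as etree

# The three inputs from the question (hypothetical constant name).
CASES = [
    "<content>\n<div>Text inside tag</div>\n</content>",
    "<content>\nText with no tag\n</content>",
    "<content>\nText outside tag <div>Text inside tag</div>\n</content>",
]

def text_only(xml):
    """Join every text chunk below the root element, dropping all markup."""
    node = etree.fromstring(xml)
    return ''.join(node.itertext()).strip()

results = [text_only(xml) for xml in CASES]
for r in results:
    print(repr(r))
```

The first and third cases come back without their <div> tags, so this suits plain-text extraction rather than the tag-preserving output requested in the question.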

Answer 3

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Answer 4

The following snippet, which uses a Python generator, works perfectly and is very efficient.

''.join(node.itertext()).strip()

Answer 5

A version of albertov's stringify-content that fixes the bugs reported by hoju:

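def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # encoding='unicode' makes tostring() return str rather than bytes;
    # with_tail=False keeps each child's tail from being emitted twice.
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail) for child in node)),
            (node.tail,)) if chunk)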

Answer 6

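from urllib.request import urlopen
from lxml import etree
url = 'some_url'

Fetch the URL:

page = urlopen(url).read()

Parse all of the HTML code, including the table tag:

tree = etree.HTML(page)

XPath selector:

# xpath() returns a list of matches, so take the first one
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the HTML code of the table. This did the job for me.

So you can extract a tag's text content with an XPath text() query, and a tag together with its content with tostring():

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode', with_tail=False)
# slice off the opening and closing tags; str.strip() would remove a set of characters, not a prefix
div_3_res = div_3_res[len('<content>'):-len('</content>')]

The last line, which slices off the tags, is not very nice, but it works as long as <content> has no attributes.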

Answer 7

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

Or as a one-liner:

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

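The rationale is the same as in this answer: leave the serialization of the child nodes to lxml. The tail part of node is not interesting in this case, since it sits "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and then strip the start and end tags:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

This is somewhat horrible. The code is correct only if node has no attributes, and I don't think anyone would want to use it even then.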

Answer 8

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

Answer 9

One of the simplest code snippets that actually worked for me, and that follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

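where html is the node/tag whose complete text you are trying to read. Beware, though, that it does not get rid of the contents of script and style tags.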
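
A minimal sketch of this call on a small, made-up document follows. The sample markup and variable names are illustrative, encoding="unicode" is added so the result is a str, and the fallback to the standard library's ElementTree is an assumption so the code runs without lxml. It also shows the caveat about script content staying in the output.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib ElementTree also accepts method="text" in tostring().
    import xml.etree.ElementTree as etree

# Hypothetical sample input containing a script element.
node = etree.fromstring(
    "<content>Hi <b>there</b><script>var x = 1;</script></content>")

# method="text" serializes only the text nodes, in document order.
as_text = etree.tostring(node, method="text", encoding="unicode")
print(repr(as_text))
```

For this input the result is the same string that ''.join(node.itertext()) produces, script body included, so any filtering of script or style elements has to happen separately.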

Answer 10

A quick enhancement to the answer already given, in case you want to clean up the inner text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
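
One thing to be aware of: when a chunk consists only of whitespace, stripping each chunk and joining with spaces can leave doubled spaces. The sketch below compares that approach with a split-and-rejoin variant. The sample markup and the names per_chunk and collapsed are made up for illustration, and the ElementTree fallback is an assumption so the code runs without lxml.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib ElementTree as a stand-in with the same itertext().
    import xml.etree.ElementTree as etree

# Hypothetical input with whitespace-only text between the inline elements.
node = etree.fromstring("<content> <b>a</b> <i>b</i>\n</content>")

# Approach from this answer: strip every chunk, then join with single spaces.
per_chunk = ' '.join([n.strip() for n in node.itertext()]).strip()

# Variant: join first, then collapse every run of whitespace to one space.
collapsed = ' '.join(''.join(node.itertext()).split())

print(repr(per_chunk))
print(repr(collapsed))
```

The trade-off of the variant is that adjacent elements with no whitespace between them are glued together, whereas the per-chunk version always separates chunks with a space.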

Answer 11

This is a working solution. We can get the content together with the parent tag and then cut the parent tag out of the output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
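    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    # tostring() returns ASCII bytes by default, with non-ASCII characters
    # written as numeric character references, so decoding as ASCII is safe
    content_with_parent = etree.tostring(parent_element).decode('ascii')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            # turn a numeric character reference back into the character itself
            return chr(int(m.group(1)))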

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

parent_element must be of type Element.

Please note: if you need the literal text content (rather than HTML entities in the text), leave the html_entities parameter set to False.

Answer 12

lxml has a method for this:

node.text_content()

Answer 13

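If this is a tag, you can try the following (note that it returns the element's attribute values rather than its text):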

node.values()

Answer 14

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))

Get all text inside a tag in lxml

14 answers: