Question

How can I get all the text content of an XML document using Python, like this Ruby/hpricot example?

I want to replace the XML tags with a single space.

Answer 1

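Using the stdlib xml.etree:

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml')
# tostring() expects an Element, so pass the root rather than the ElementTree.
# encoding='unicode' returns a str instead of bytes.
print(ET.tostring(tree.getroot(), encoding='unicode', method='text'))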
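
The call above concatenates the text without inserting anything where the tags were. A minimal sketch of one way to get the single-space behaviour the question asks for, using only the stdlib; the sample document and the helper name all_text are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = "<doc><title>Hello</title><body>big <b>wide</b> world</body></doc>"

def all_text(xml_string):
    """Return the document's text with tags replaced by single spaces."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment (element text and tails)
    # in document order.
    fragments = root.itertext()
    # Joining with a space puts a separator where each tag was; the
    # split()/join() pair then collapses runs of whitespace into one space.
    return " ".join(" ".join(fragments).split())

print(all_text(SAMPLE))  # Hello big wide world
```

Note that this also collapses whitespace inside the text itself, which may or may not be what you want.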

Answer 2

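I really like BeautifulSoup, and would rather not use regular expressions on HTML if we can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup

# txt is simply a string with the contents of your XML file.
# The "xml" parser requires lxml to be installed.
soup = BeautifulSoup(txt, "xml")
page_text = soup.find_all(string=True)  # every text node in the document
print(' '.join(page_text))

Of course, you can (and should) use BeautifulSoup to navigate the page and find what you are looking for.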

Answer 3

A solution that uses the built-in sax parsing framework and doesn't require an external library like BeautifulSoup:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def parse(self, filename):
        self.text = []
        sax.parse(filename, self)
        return ''.join(self.text)

    def characters(self, data):
        self.text.append(data)

result = MyHandler().parse("yourfile.xml")

If you need all whitespace preserved in the text, also define the ignorableWhitespace method in the handler class, in the same way characters is defined.
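
The handler above joins the text with no separator, so text on either side of a tag runs together. A rough sketch of one way to put a single space where each tag was, parsing from a string instead of a file; the class name TextCollector and the helper text_from_string are hypothetical names chosen for this example:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Collects character data, adding a space at every tag boundary."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def startElement(self, name, attrs):
        self.chunks.append(" ")  # a space in place of the opening tag

    def endElement(self, name):
        self.chunks.append(" ")  # a space in place of the closing tag

    def characters(self, content):
        # The parser may deliver one run of text in several pieces.
        self.chunks.append(content)

def text_from_string(xml_string):
    handler = TextCollector()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    # Collapse every run of whitespace into a single space.
    return " ".join("".join(handler.chunks).split())

print(text_from_string("<a>one<b>two</b>three</a>"))  # one two three
```

Because the final step normalizes whitespace, this variant is only suitable when the original spacing does not need to be preserved.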

Answer 4

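This exact problem is actually an example in the lxml tutorial, which suggests one of the following XPath expressions to get the text content of the document: string() returns it as one concatenated string, while //text() returns the individual bits of text as a list of strings: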

root.xpath("string()")
root.xpath("//text()")

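You will then want to join these bits of text into one big string with str.join, probably using str.strip to remove leading and trailing whitespace from each bit and ignoring bits that consist entirely of whitespace: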

>>> from lxml import etree
>>> root = etree.fromstring("""
... <node>
...   some text
...   <inner_node someattr="someval">   </inner_node>
...   <inner_node>
...     foo bar
...   </inner_node>
...   yet more text
...   <inner_node />
...   even more text
... </node>
... """)
>>> bits_of_text = root.xpath('//text()')
>>> print(bits_of_text)  # Note that some bits are whitespace-only
['\n  some text\n  ', '   ', '\n  ', '\n    foo bar\n  ', '\n  yet more text\n  ', '\n  even more text\n']
>>> joined_text = ' '.join(
...     bit.strip() for bit in bits_of_text
...     if bit.strip() != ''
... )
>>> print(joined_text)
some text foo bar yet more text even more text

Note, by the way, that if you don't want to insert spaces between the bits of text, you can just do:

etree.tostring(root, method='text', encoding='unicode')

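If you are dealing with HTML rather than XML, and are using lxml.html to parse your HTML, you can call the .text_content() method of your root node to get all the text it contains (although, again, no spaces will be inserted).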

Answer 5

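You asked about lxml:

# element.text is None for elements that have no leading text, so skip those.
# Text that follows a closing tag lives in element.tail and is not collected here.
result = ' '.join(element.text for element in root.iter() if element.text)

Or:

result = ''
for element in root.iter():
    if element.text:  # skip elements whose text is None or empty
        result += element.text + ' '
result = result[:-1]  # Remove trailing space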
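
One thing to watch with this approach is that .text only holds the text before an element's first child. A small sketch of the difference, using the stdlib xml.etree.ElementTree as a stand-in on the assumption that you only need the text/tail model it shares with lxml.etree; the sample document is made up:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<node>some text<inner>foo bar</inner>yet more text</node>"
)

# .text covers only the text before an element's first child ...
texts_only = [el.text for el in root.iter() if el.text]
# ... while text that follows a closing tag is stored on .tail.
tails = [el.tail for el in root.iter() if el.tail]
# itertext() walks both kinds of text in document order.
everything = list(root.itertext())

print(texts_only)  # ['some text', 'foo bar']
print(tails)       # ['yet more text']
print(everything)  # ['some text', 'foo bar', 'yet more text']
```

If the document has text after nested elements, ' '.join(root.itertext()) is probably the simpler way to collect all of it.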
