Question

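I have a file that contains HTML inside an XML tag, and I want to get that HTML as raw text instead of having it parsed into child elements of the XML tag. Here is an example: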

import xml.etree.ElementTree as ET
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")

If I try:

root.find('text').text

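it returns nothing (the value is None).

However, root.find('text/p').text returns the paragraph text without the tags. I want everything inside the text tag as raw text, but I can't figure out how to get it.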

Answer 1

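Your solution is reasonable. An Element object behaves like a list of its children. Its .text attribute only holds the content (usually text) that sits directly inside the element before its first child; it does not include anything that belongs to nested elements.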
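To make that concrete, here is a small sketch showing what .text, iteration, and .tail each see. The mixed-content sample string is made up for illustration and is not from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with text before, inside, and after a nested element.
sample = ET.fromstring("<text>lead <p>inner</p> trailing</text>")

lead = sample.text                          # text before the first child
children = [child.tag for child in sample]  # iterating yields child elements
inner = sample[0].text                      # text inside the nested <p>
tail = sample[0].tail                       # text following </p>

# The element from the question has nothing before <p>, so .text is None.
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")
question_text = root.find('text').text
```

This is why root.find('text').text comes back empty in the question: all of the content lives inside the nested p element.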

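There are a few things in your code that could be improved. In Python, repeated string concatenation in a loop can be an expensive operation. It is better to build a list of substrings and join them afterwards, like this: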

output_lst = []  
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))

output_text = ''.join(output_lst)

The list can also be built with a Python list comprehension, so the code becomes:

output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]  
output_text = ''.join(output_lst)

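.join accepts any iterable that yields strings, so the list does not need to be built in advance. Instead, you can use a generator expression (that is, the part you would see inside the [] of a list comprehension):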

output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))

The one-liner can be split across several lines to make it more readable:

output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))
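One caveat, offered as a sketch rather than as part of the approach above: iterating over the children skips any text that sits directly inside the text element before its first child. The helper name inner_xml and the sample string below are hypothetical. It prepends elem.text, and relies on ET.tostring already including each child's tail text:

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Return the serialized content of elem without its own start and end tags."""
    # elem.text is None when nothing precedes the first child.
    leading = elem.text or ""
    # tostring() serializes each child together with its tail text.
    return leading + "".join(ET.tostring(child, encoding="unicode") for child in elem)

# Hypothetical mixed-content input, not taken from the question.
mixed = ET.fromstring("<root><text>Intro: <p>This is some text</p> and more</text></root>")
mixed_result = inner_xml(mixed.find("text"))

# The example from the question.
root = ET.fromstring("<root><text><p>This is some text that I want to read</p></text></root>")
question_result = inner_xml(root.find("text"))
```

Keep in mind that the result is re-serialized XML, so it may not match the original characters of the file exactly.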

Answer 2

By using ET.tostring to append all the child elements of my text tag to a string, I was able to get what I wanted:

output_text = ""    
for child in root.find('text'):
    output_text += ET.tostring(child, encoding="unicode")

>>> output_text
'<p>This is some text that I want to read</p>'

How to parse HTML markup as raw text using ElementTree
