Question

Using Python 3.4, I am trying to extract all the text from an XML file. I used:

tree = etree.parse(xmlFile)
notags = etree.tostring(tree, encoding='utf8', method='text')

This strips all the XML tags and gives me only the text. But there are 3 problems with the result:

“almost square” turns into \xe2\x80\x9calmost square\xe2\x80\x9d
<title><tag close=" ">1</tag>Introduction</title> becomes 1Introduction, although I need a space between 1 and Introduction
References such as: In [<ref labelref="LABEL:C"/>] become In []

Is there a better way to get the text without tags, and without these problems?

Thanks

Answer 1

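You can also serialize to a Unicode string without an XML declaration by passing the unicode function as the encoding (or str in Py3), or the name 'unicode'. This changes the return value from a byte string to an unencoded Unicode string.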

If you want a Unicode string, pass:

encoding='unicode'
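
For illustration, here is a minimal sketch of the difference. It assumes the standard library's xml.etree.ElementTree rather than whichever etree the question uses, and the SAMPLE document is hypothetical, modelled on the snippets in the question. As far as I know, lxml's etree.tostring accepts the same encoding='unicode' and method='text' arguments, while the str-function form mentioned above is specific to lxml. The \xe2\x80\x9c and \xe2\x80\x9d bytes in the question are simply the UTF-8 encoding of the curly quotes, so nothing is lost; the result is just a bytes object instead of a str.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modelled on the snippets in the question.
SAMPLE = (
    '<doc>'
    '<title><tag close=" ">1</tag>Introduction</title>'
    '<p>\u201calmost square\u201d</p>'
    '</doc>'
)

root = ET.fromstring(SAMPLE)

# encoding='utf8' returns a bytes object; printing it shows escapes
# such as \xe2\x80\x9c for the opening curly quote.
as_bytes = ET.tostring(root, encoding='utf8', method='text')

# encoding='unicode' returns a str, so the curly quotes stay readable.
as_text = ET.tostring(root, encoding='unicode', method='text')

print(repr(as_bytes))
print(as_text)
```

This only addresses the first of the three problems. The missing space after "1" and the empty brackets are not affected by the encoding argument.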