Question

I have this XML file:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see documentation).

Here is the simple code I use to parse and explore the file:

import xml.etree.ElementTree as ET
tree = ET.parse(file_path)
root = tree.getroot()

def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)

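Everything works as expected, except that the <P> element does not come back with its complete text. In particular, I seem to lose "but then has some more stuff." (the text inside <P> that follows the <af> element).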

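The XML file is given to me, so I cannot improve it, even if there is a recommended better way to write it (and there are too many of them to fix by hand).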

Is there a way to get all the text?

The output my code produces (in case it helps) is this:

do
{'title': 'Example document', 'date': 'today'}

db
{'descr': 'First level'}

P 
{}
        Some text here that

af
{'d': 'reference 1'}
continues

Edit:

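The accepted answer made me realize that I had not read the documentation as carefully as I should have. Anyone with a related problem may also find .tail useful.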
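For reference, here is a minimal sketch of how .tail could be folded into the exploration function. The inline sample string and the name explore_with_tail are assumptions made for illustration and are not part of the original code.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the <P> fragment from the question, inlined as a string.
SAMPLE = """<P>
    Some text here that
    <af d='reference 1'>continues</af>
    but then has some more stuff.
</P>"""


def explore_with_tail(element, parts=None):
    """Collect non-empty text pieces in document order (hypothetical helper)."""
    if parts is None:
        parts = []
    # .text is the text between the start tag and the first child element.
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        explore_with_tail(child, parts)
        # .tail is the text that follows the child's end tag,
        # up to the next sibling or the parent's end tag.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


root = ET.fromstring(SAMPLE)
pieces = explore_with_tail(root)
print(pieces)
```

The text that went missing is stored as the .tail of <af>, not as the .text of <P>, which is why printing only element.text skips it.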

Answer 1

Using BeautifulSoup:

list_test.xml:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

Then:

from bs4 import BeautifulSoup

# html.parser lowercases tag names, so <P> is matched as 'p'.
with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")
    for line in soup.find_all('p'):
        print(line.text)

Output:

Some text here that
continues
but then has some more stuff.

Edit:

Using ElementTree:

import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

Output:

Some text here that continues but then has some more stuff.
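On the multi-line file from the question, itertext() also returns the indentation and newlines between the pieces. Below is a small sketch of collapsing that whitespace. It assumes the document from the question, closed with </do>, is inlined as a string, and element_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed input: the question's document, closed with </do>.
DOC = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""


def element_text(element):
    # itertext() yields every .text and .tail inside the element in
    # document order; split() + join collapses runs of whitespace.
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(DOC)
# Tag names are case-sensitive in ElementTree, so search for "P".
results = [element_text(p) for p in root.iter("P")]
for line in results:
    print(line)
```

For a file on disk, ET.parse(file_path).getroot() could replace ET.fromstring(DOC).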
