Parsing XML when an element sits in the middle of text

Time: 2019-01-31 11:41:06

Tags: python xml parsing xml-parsing

I have this XML file:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see the documentation).

Here is the simple code I use to parse and explore the file:

import xml.etree.ElementTree as ET
tree = ET.parse(file_path)
root = tree.getroot()

def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)

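Everything works as expected, except that the <P> element does not give me its complete text. In particular, I seem to lose "but then has some more stuff." (the text inside <P> that comes after the <af> element).

The XML file is given to me, so I cannot improve it, even if there are better recommended ways to write it (and there are too many files to fix by hand).

Is there a way to get all of the text?

The output my code produces (in case it helps) is: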

do
{'title': 'Example document', 'date': 'today'}

db
{'descr': 'First level'}

P 
{}
        Some text here that

af
{'d': 'reference 1'}
continues

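Edit

The accepted answer made me realize that I had not read the documentation as carefully as I should have. People with a related problem may also find .tail useful: in ElementTree, the text that follows an element's closing tag is stored on that element as .tail, not in the parent's .text.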
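As a minimal sketch of the .tail approach mentioned above, the recursive walk from the question can be extended to pick up each child's tail. The helper name collect_text and the inline XML string are illustrative assumptions (the string repeats the document from the question, with a closing </do> tag assumed so that it parses).

```python
import xml.etree.ElementTree as ET

# The document from the question; the closing </do> tag is an assumption.
XML = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""


def collect_text(element, parts=None):
    """Collect non-blank text in document order (hypothetical helper)."""
    if parts is None:
        parts = []
    # .text is only the text before the element's first child.
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        collect_text(child, parts)
        # .tail is the text between the child's end tag and the next tag.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


root = ET.fromstring(XML)
parts = collect_text(root)
print(parts)
```

Here the sentence that went missing in the question is recovered from the .tail of the <af> element rather than from <P> itself.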

1 answer:

Answer 0 (score: 2)

Using BeautifulSoup:

list_test.xml:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>

Then:

from bs4 import BeautifulSoup

with open('list_test.xml', 'r') as f:
    # html.parser lowercases tag names, so <P> is matched as 'p'
    soup = BeautifulSoup(f.read(), "html.parser")
    for line in soup.find_all('p'):
        print(line.text)

Output:

Some text here that
continues
but then has some more stuff.

Edit:

Using ElementTree:

import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

Output:

Some text here that continues but then has some more stuff.
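The same itertext() idea can be applied to the full document from the question rather than to a one-line string. The sketch below is one possible way to do it, not part of the original answer: it assumes the file ends with a closing </do> tag, and the variable names XML and paragraphs are illustrative. Because the real file spreads the text over several indented lines, the whitespace is collapsed with split() and join().

```python
import xml.etree.ElementTree as ET

# The document from the question; the closing </do> tag is an assumption.
XML = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""

root = ET.fromstring(XML)

# itertext() yields an element's text, its descendants' text, and their tails
# in document order; split()/join() collapses newlines and indentation.
paragraphs = [" ".join("".join(p.itertext()).split()) for p in root.iter("P")]

for paragraph in paragraphs:
    print(paragraph)
```

With a file on disk, ET.parse(path).getroot() could replace ET.fromstring(XML); ElementTree keeps tag names case-sensitive, so the lookup uses "P" rather than "p".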