I have this XML file:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see documentation).
This is the simple code I use to parse and explore the file:
import xml.etree.ElementTree as ET

tree = ET.parse(file_path)
root = tree.getroot()

def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)
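Everything works as expected, except that the <P> element does not come back with its complete text. In particular, I seem to be losing "but then has some more stuff." (the text inside <P> that follows the <af> element).
The XML file is given, so I cannot improve it, even if there is a better recommended way of writing it (and there are far too many files to try fixing by hand).
Is there a way to get all the text?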
The output my code produces (in case it helps) is this:
do
{'title': 'Example document', 'date': 'today'}
db
{'descr': 'First level'}
P
{}
Some text here that
af
{'d': 'reference 1'}
continues
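Edit:
The accepted answer made me realize that I had not read the documentation as carefully as I should have. People with related problems may also find .tail useful.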
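For readers wondering how .tail fits in, here is a minimal sketch rather than a definitive implementation. In ElementTree, an element's .text holds only the text before its first child, and the text that follows a child's closing tag is stored on that child as .tail. The helper name collect_text is hypothetical, the XML is inlined from the question with a closing </do> assumed, and the whitespace normalization at the end is an added convenience.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, inlined so the sketch is self-contained
# (the closing </do> tag is an assumption).
XML = """<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>"""


def collect_text(element):
    """Return the text chunks of element and its descendants in document order."""
    chunks = []
    if element.text:
        chunks.append(element.text)       # text before the first child
    for child in element:
        chunks.extend(collect_text(child))
        if child.tail:
            chunks.append(child.tail)     # text after the child's closing tag
    return chunks


root = ET.fromstring(XML)
p = root.find('./db/P')

# Join the chunks and collapse runs of whitespace and newlines.
full_text = ' '.join(''.join(collect_text(p)).split())
print(full_text)
```

The text after </af> is reachable as the .tail of the af element, which is why printing only element.text in the original explore_element skips it.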
Answer 0: (score: 2)
Using BeautifulSoup:
list_test.xml:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
Then:
from bs4 import BeautifulSoup

with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# html.parser lowercases tag names, so <P> is matched as 'p'
for line in soup.find_all('p'):
    print(line.text)
Output:
Some text here that
continues
but then has some more stuff.
Edit:
Using ElementTree:
import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
Output:
Some text here that continues but then has some more stuff.
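The itertext() approach can also be applied to the whole document from the question instead of a single hand-written string. The following is a sketch under a few assumptions: the XML is inlined with a closing </do> so the example is self-contained, and collapsing whitespace with split() and join() is an added choice, not part of the original answer. With a real file, ET.parse(file_path).getroot() would take the place of ET.fromstring.

```python
import xml.etree.ElementTree as ET

# The question's document, inlined for a self-contained example
# (the closing </do> tag is an assumption).
XML = """<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>"""

root = ET.fromstring(XML)

# iter('P') walks every <P> element at any depth; itertext() yields each
# element's text and tail strings in document order.
paragraphs = [' '.join(''.join(p.itertext()).split()) for p in root.iter('P')]

for paragraph in paragraphs:
    print(paragraph)
```

Note that ElementTree tag names are case-sensitive, so the element is looked up as 'P' here, unlike the lowercase 'p' used with BeautifulSoup's html.parser.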