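I'm completely new to programming and to Python. I'm trying to parse many XML files to extract some data and save it as a CSV file. My code (pieced together from various examples I found on Stack Overflow) only parses the last file in the directory (path). What am I doing wrong? Is there a problem with the indentation or with the order of the code?
The code is as follows: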
import xml.etree.ElementTree as ET
import csv
import os

fields = [
    ('300$a', 'volume'), ('300$b', 'numero'), ('300$c', 'parte'), ('300$d', 'pag'),
    ('245$a', 'title-group/article-title[1]'), ('242$a', 'title-group/article-title[2]'), ('242$y', 'lng'),
    ('024$a', 'article-id[@pub-id-type="doi"]'),
    ('041$a', 'lng'),
    ('590$a', 'Art'), ('590$b', 'focus'),
    ('546$a', 'lng_abstract'),
    ('520$a', 'abstract/p[1]'), ('520$a', 'abstract/p[2]'), ('520$a', 'abstract/p[3]'),
    ('Surname_1', 'contrib-group/contrib[1]/name/surname'),
    ('Given_1', 'contrib-group/contrib[1]/name/given-names'),
    ('Surname_2', 'contrib-group/contrib[2]/name/surname'),
    ('Given_2', 'contrib-group/contrib[2]/name/given-names'),
    ('Surname_3', 'contrib-group/contrib[3]/name/surname'),
    ('Given_3', 'contrib-group/contrib[3]/name/given-names'),
    ('Surname_4', 'contrib-group/contrib[4]/name/surname'),
    ('Given_4', 'contrib-group/contrib[4]/name/given-names')]

path = r'E:\Files\Nueva Carpeta'

for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
fullname = os.path.join(path, filename)
tree = ET.parse(fullname)
root = tree.getroot()
with open('article-meta.csv', 'w') as f_article:
    csv_article_meta = csv.DictWriter(f_article, fieldnames=[field for field, match in fields])
    csv_article_meta.writeheader()
    for node in tree.iter('article-meta'):
        row = {}
        for field_name, match in fields:
            try:
                row[field_name] = node.find(match).text
            except AttributeError as e:
                row[field_name] = ''
        csv_article_meta.writerow(row)
The XML looks like this:
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <front>
    .
    .
    .
    <article-meta>
      <article-id>S0104-40602017000700027</article-id>
      <article-id pub-id-type="doi">10.1590/0104-4060.52923</article-id>
      <title-group>
        <article-title xml:lang="pt">
          <![CDATA[
          A inclusão das pessoas com deficiência: panorama inclusivo no ensino superior no Brasil e em Portugal
          ]]>
        </article-title>
        <article-title xml:lang="en">
          <![CDATA[
          Inclusion of people with disabilities: Inclusive panorama in higher education in Brazil and Portugal
          ]]>
        </article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>
              <![CDATA[ Pereira ]]>
            </surname>
            <given-names>
              <![CDATA[ Carlos Eduardo Candido ]]>
            </given-names>
          </name>
          <xref ref-type="aff" rid="Aff"/>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>
              <![CDATA[ Albuquerque ]]>
            </surname>
            <given-names>
              <![CDATA[ Cristina Maria Pinto ]]>
            </given-names>
            .
            .
            .
    </article-meta>
  </front>
Sorry, I'm still learning English.
Answer 0 (score: 0)
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
fullname = os.path.join(path, filename)
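In the code above, you iterate over all the files and skip the ones that are not XML, but when a file is an XML file you do nothing with it! filename is the variable that holds the current item during the iteration, so by the time you later write os.path.join(path, filename), the loop has already finished and filename always holds the last entry returned by os.listdir(path).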
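To make that concrete, here is a minimal sketch of the same pattern with a plain list instead of a directory. The function name leftover_after_loop and the file names are made up for illustration; they are not part of the question's code.

```python
def leftover_after_loop(names):
    """Return the loop variable's value once the loop has finished."""
    for name in names:
        if not name.endswith('.xml'):
            continue  # skips non-XML names, but nothing is done with XML ones
    # The loop is over: name is still bound to the last item iterated over.
    return name


print(leftover_after_loop(['a.xml', 'b.xml', 'notes.txt']))  # notes.txt
```

Note that the leftover value is simply the last entry, whether or not it is an XML file, so any work that should happen for every file has to be indented inside the loop body.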
Here is a rough sketch of how to get the paths of all the XML files:
xml_file_paths = [os.path.join(path, curr_f_name) for curr_f_name in os.listdir(path) if curr_f_name.endswith('.xml')]
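Building on that list, one possible way to finish the job is sketched below. It assumes you want a single CSV with one row per article-meta element across all files. The function name xml_dir_to_csv, the shortened FIELDS list, and the choice to sort the paths and strip whitespace are my own assumptions, not something from the question.

```python
import csv
import os
import xml.etree.ElementTree as ET

# Shortened, illustrative subset of the field list from the question.
FIELDS = [
    ('024$a', 'article-id[@pub-id-type="doi"]'),
    ('245$a', 'title-group/article-title[1]'),
    ('Surname_1', 'contrib-group/contrib[1]/name/surname'),
]


def xml_dir_to_csv(xml_dir, csv_path, fields=FIELDS):
    """Write one CSV row per <article-meta> element found in xml_dir."""
    xml_file_paths = sorted(
        os.path.join(xml_dir, name)
        for name in os.listdir(xml_dir)
        if name.endswith('.xml')
    )
    # Open the CSV once, before the loop, so earlier rows are not overwritten.
    with open(csv_path, 'w', newline='', encoding='utf-8') as f_article:
        writer = csv.DictWriter(f_article, fieldnames=[name for name, _ in fields])
        writer.writeheader()
        for xml_path in xml_file_paths:
            tree = ET.parse(xml_path)  # parsing now happens inside the loop
            for node in tree.iter('article-meta'):
                row = {}
                for field_name, match in fields:
                    found = node.find(match)
                    # Missing elements or empty text become an empty string.
                    row[field_name] = (found.text or '').strip() if found is not None else ''
                writer.writerow(row)
    return len(xml_file_paths)
```

With the directory from the question, the call would look like xml_dir_to_csv(r'E:\Files\Nueva Carpeta', 'article-meta.csv', fields). One thing to watch if you pass the full list: it uses the name '520$a' three times, and since a row is a dict, only the last of those three matches would end up in the CSV, so those columns probably need distinct names.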