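Parsing multiple XML files in Python

Time: 2017-06-07 12:06:39

Tags: python-3.x function for-loop xml-parsing

I'm stuck on a problem. I want to parse multiple XML files that share the same structure. I have already been able to get the locations of all the files and save them into three different lists, since there are three different types of XML structure. Now I want to create three functions (one for each list) that loop through the list and parse the information I need. Somehow I can't get it to work. Can anyone here give me a hint on how to do it?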

import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys


#### Get the location of each XML file and save them into a list ####

all_xml_list =[]                                                                                                                                       

def locate(pattern,root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files,pattern):
            yield os.path.join(path,filename)

for files in locate('*.xml',r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)


#### Create lists by GameDay Events ####


xml_GameDay_Player   = [x for x in all_xml_list if 'Player' in x]                                                             
xml_GameDay_Team     = [x for x in all_xml_list if 'Team' in x]                                                             
xml_GameDay_Match    = [x for x in all_xml_list if 'Match' in x]  

The XML files look like this:

<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>

I want to extract everything inside the "player-metadata", "player-stats stats-coverage", and "player-stats-soccer" tags.

3 Answers:

Answer 0 (score: 2)

Improving on @Gnudiff's answer, here is a more resilient approach:

import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):        
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
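
The "# ..." placeholder above leaves the statistics out. As a rough sketch of how the attributes of player-metadata, player-stats, and player-stats-soccer could be collected into one flat record per player, the following uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The inline SAMPLE document is a cut-down version of the file from the question, and local_name and player_rows are hypothetical helper names, not part of any library. ElementTree reports a prefixed attribute such as imp:duels-won as {url}duels-won, so the sketch strips the namespace part.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for one of the Player files (assumption: real files
# follow the structure shown in the question).
SAMPLE = """<sports-content xmlns:imp="url">
  <sports-event>
    <team>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" />
        </player-metadata>
        <player-stats stats-coverage="standard" minutes-played="90" score="0">
          <player-stats-soccer imp:duels-won="1" imp:passes="32" dfl-max-speed="23.49" />
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>"""


def local_name(attr_name):
    """Drop the '{namespace}' part ElementTree puts on prefixed attributes."""
    return attr_name.rsplit('}', 1)[-1]


def player_rows(root):
    """Yield one flat dict of attributes per <player> element."""
    paths = ('player-metadata', 'player-stats',
             'player-stats/player-stats-soccer')
    for player in root.iter('player'):
        row = {}
        for path in paths:
            element = player.find(path)
            if element is not None:  # tolerate files where a block is missing
                row.update((local_name(k), v)
                           for k, v in element.attrib.items())
        yield row


rows = list(player_rows(ET.fromstring(SAMPLE)))
print(rows[0])
```

For real files, ET.parse(filename).getroot() would take the place of ET.fromstring(SAMPLE). Stripping the namespace keeps the keys short, but it would merge two attributes that differ only by prefix, so keep the full names if that can happen in your data.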

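XML files can come in a variety of byte encodings, and they are prefixed with the XML declaration, which declares the encoding of the rest of the file.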

<?xml version="1.0" encoding="UTF-8"?>

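UTF-8 is a common encoding for XML files (it is also the default), but in reality it can be anything. It is impossible to predict, and hard-coding your program to expect a certain encoding is very bad practice.

XML parsers are designed to handle this peculiarity transparently, so you don't have to worry about it, unless you do it wrong.

This is a good example of doing it wrong: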

# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))

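What happens here is:

  1. Python opens filename as a text file f
  2. f.read() returns a string
  3. etree.XML() parses that string and creates a DOM object tree

That doesn't sound so wrong, does it? But if the XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>

then, if Python decoded the file with a different default encoding (as it commonly does on Windows), the DOM you end up with will be:

Player
    @nickname="MÃ¤xchen"

You have just destroyed the data. And unless the XML contains an "extended" character like ä, you would not even notice that this approach is broken. It can easily slip through unnoticed.

There is exactly one correct way to open an XML file (and it is also simpler than the code above): give the file name to the parser.

tree = etree.parse('some_filename.xml')

This way the parser can figure out the file encoding before it reads the data, and you do not have to care about those details.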
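
To make the difference concrete, here is a minimal sketch that reproduces the problem. It uses the standard library's xml.etree.ElementTree rather than lxml so it runs without extra packages, writes a throwaway UTF-8 file to a temporary directory, and forces cp1252 (a common Windows default) as the wrong text encoding. The file name player.xml and the variable names are made up for the illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<Player nickname="Mäxchen" />'
).encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'player.xml')
    with open(path, 'wb') as f:
        f.write(xml_bytes)

    # Wrong: decode the file as text with an encoding that is not the
    # file's own, then hand the already-damaged string to the parser.
    with open(path, encoding='cp1252') as f:
        garbled = ET.fromstring(f.read()).get('nickname')  # 'MÃ¤xchen'

    # Right: give the file name to the parser and let it read the
    # XML declaration to pick the encoding.
    correct = ET.parse(path).getroot().get('nickname')  # 'Mäxchen'

print(garbled != correct)  # True: the two reads disagree
```

Passing the bytes unchanged, for example from a file opened in 'rb' mode, would also let the parser detect the encoding; the damage comes from decoding the text before the parser sees it.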

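Answer 1 (score: 0)

This is not a complete solution for your particular case, because that would be quite a bit of work and I am on a tablet without a keyboard.

In general, you can do this in several ways, depending on whether you really need all of the data or only a specific subset, and on whether you know all the possible structures in advance.

For example, one way: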

from lxml import etree

Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}  # a dict, so values can be stored by name
        # '@last' selects the attribute value; the result is a list of strings
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)

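This is adapted from an existing script of mine, but it is better suited to extracting only a specific subset of the XML. If you need all of the data, you are probably better off using some kind of XML tree walker.

file_get_contents is a small helper function: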

def file_get_contents(filename):
    # Read bytes, not text, so the XML parser can detect the encoding itself
    with open(filename, 'rb') as f:
        return f.read()

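XPath is a powerful language for locating nodes in XML. Note that, depending on the XPath you use, the result may be either XML nodes, as in the "for player in ..." statement, or strings, as in the "row['player'] =" statement.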
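
If all of the data is needed rather than a subset, the tree walker mentioned above could look roughly like the sketch below. It is only an illustration: it uses the standard library's xml.etree.ElementTree so it runs without lxml, the small SAMPLE document stands in for a real file, and walk is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Small stand-in document (assumption: real files are parsed from disk).
SAMPLE = """<team>
  <team-metadata team-key="17" />
  <player>
    <player-metadata player-key="33201" />
    <player-stats minutes-played="90" />
  </player>
</team>"""


def walk(element, path=''):
    """Yield (path, attributes) for an element and all of its descendants."""
    here = f'{path}/{element.tag}'
    yield here, dict(element.attrib)
    for child in element:
        yield from walk(child, here)


records = list(walk(ET.fromstring(SAMPLE)))
for path, attributes in records:
    print(path, attributes)
```

Each record pairs the element's position in the tree with its attributes, so nothing has to be known about the structure in advance. Filtering can then happen afterwards on the path string.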

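Answer 2 (score: 0)

You can use the ElementTree library. xml.etree.ElementTree is part of Python's standard library, so nothing needs to be installed for the code below (pip install lxml is only needed if you want to use lxml instead). Then follow this code structure: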

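import xml.etree.ElementTree as ET
import os

my_dir = "your_directory"
new_value = "new_value"  # placeholder: whatever text you want to put in the tag
for fn in os.listdir(my_dir):
    if not fn.endswith('.xml'):
        continue  # skip anything that is not an XML file
    path = os.path.join(my_dir, fn)
    tree = ET.parse(path)
    root = tree.getroot()
    btf = root.find('tag_name')
    if btf is not None:  # find() returns None when the tag is absent
        btf.text = new_value  # modify the value of the tag
        tree.write(path)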

If you need more detail, see this link: https://www.datacamp.com/community/tutorials/python-xml-elementtree