Parsing complex XML and writing it to CSV

Time: 2015-03-21 20:44:28

Tags: python xml python-2.7 csv minidom

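I am trying to parse a relatively complex (for me!) XML file. I have posted on a similar topic before and learned a fair amount from it. However, this file has me stuck. An excerpt of my XML file: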

<?xml version="1.0" ?>
<record number="1" type="custID" first-time="Wed Feb  4 19:22:57 2014" last-time="Fri Feb  7 10:11:02 2015">
    <Customer name="Bob Janotior" custID="4466851">
        <type>Monthly</type>
        <max-books>5</max-books>
        <rental status="false">overdue</essid>
    </Customer>
    <book title="All The Things" type="fiction" author="Jill Taylor" pubID="7744jh566lp">
      <cover>softback</cover>
      <pub>Penguin</pub>
    </book>
    <book title="Mellow Tides of War" type="non-fiction" author="Prof. Lambert et al" pubID="7744gd556se">
      <cover>hardback</cover>
      <pub>Penguin</pub>
    </book>
</record>   
<record number="2" type="custID" first-time="Wed Apr  8 15:23:54 2012" last-time="Fri Feb  7 10:11:02 2015">
    <Customer name="Jayne Wrikcek" custID="4466787">
        <type>Monthly</type>
        <max-books>5</max-books>
        <rental status="false">overdue</essid>
    </Customer>
    <book title="Kiss Me Hardy" type="fiction" author="AR Jones" pubID="766485gf66ki">
      <cover>softback</cover>
      <pub>/Kingsoft</pub>
    </book>
    <book title="Oskar Came Again" type="fiction" author="Johnathan Huphries" pubID="a5555qwd2">
      <cover>hardback</cover>
      <pub>Lofthouse</pub>
    </book>
</record>

Previously I was using this script, which I wrote in Python 2.7:

from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
    with open('output.csv', 'wb') as csvfile:
        writer = csv.writer(csvfile, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['title', 'author', 'author'])
        books = myLibrary.getElementsByTagName("book")
        for book in books:
            titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
            authors = [] # get all the authors in a vector
            for author in book.getElementsByTagName("author"):
                authors.append(author.childNodes[0].data)
            writer.writerow([titleValue] + authors) # write to csv

doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Print each book's title
writeToCSV(myLibrary)

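This script was actually written for a much simpler XML file. I am struggling to adapt it to this XML file, whose structure is more complex to me. I am slowly getting to grips with minidom and CSV writing, but it is all still new to me.

This is the kind of output I would like in the CSV file: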

record number,type,Customer name,CustID,type,max-books,rental status,book,title,type,author,
1,custID,Bob Janotior,4466851,Monthly,5,false,overdue,All The Things,fiction,Jill Taylor,
2,custID,Jayne Wrikcek,4466787,Monthly,5,false,overdue,Kiss Me Hardy,fiction,AR Jones,

1 Answer:

Answer 0 (score: 0)

Here is my version of an XML-to-CSV converter.

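I build a dictionary and recursively add the items of each XML record to it. The code accounts for XML children that share the same tag name by renaming them child, child2, child3, and so on.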
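To make the renaming rule concrete, here is a minimal standalone sketch of just that step. The helper name `numbered_keys` is hypothetical and is not part of the script in this answer; it only shows how repeated sibling tags pick up a numeric suffix.

```python
from collections import defaultdict

def numbered_keys(tags):
    """Return one key per tag; repeats of a tag get the suffix 2, 3, ..."""
    seen = defaultdict(int)  # how many times each tag has appeared so far
    keys = []
    for tag in tags:
        seen[tag] += 1
        # The first occurrence keeps the bare tag name
        keys.append(tag if seen[tag] == 1 else tag + str(seen[tag]))
    return keys

keys = numbered_keys(['Customer', 'book', 'book', 'book'])
print(keys)  # ['Customer', 'book', 'book2', 'book3']
```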

Hope this helps:

XML file (modified: added the root node "tree" and changed </essid> to </rental>):

<tree>
    <record number="1" type="custID" first-time="Wed Feb  4 19:22:57 2014" last-time="Fri Feb  7 10:11:02 2015">
        <Customer name="Bob Janotior" custID="4466851">
            <type>Monthly</type>
            <max-books>5</max-books>
            <rental status="false">overdue</rental>
        </Customer>
        <book title="All The Things" type="fiction" author="Jill Taylor" pubID="7744jh566lp">
          <cover>softback</cover>
          <pub>Penguin</pub>
        </book>
        <book title="Mellow Tides of War" type="non-fiction" author="Prof. Lambert et al" pubID="7744gd556se">
          <cover>hardback</cover>
          <pub>Penguin</pub>
        </book>
    </record>
    <record number="2" type="custID" first-time="Wed Apr  8 15:23:54 2012" last-time="Fri Feb  7 10:11:02 2015">
        <Customer name="Jayne Wrikcek" custID="4466787">
            <type>Monthly</type>
            <max-books>5</max-books>
            <rental status="false">overdue</rental>
        </Customer>
        <book title="Kiss Me Hardy" type="fiction" author="AR Jones" pubID="766485gf66ki">
          <cover>softback</cover>
          <pub>/Kingsoft</pub>
        </book>
        <book title="Oskar Came Again" type="fiction" author="Johnathan Huphries" pubID="a5555qwd2">
          <cover>hardback</cover>
          <pub>Lofthouse</pub>
        </book>
    </record>
</tree>
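As a side note on why the root node matters: a well-formed XML document has exactly one root element, so a file with several top-level record elements is rejected by the parser. The snippet below is a small sketch with made-up inline strings rather than the real file, showing the failure and the effect of wrapping.

```python
from xml.etree import ElementTree as etree

# Two top-level elements: not well-formed, because XML allows only one root
multi_root = '<record number="1"/><record number="2"/>'

try:
    etree.fromstring(multi_root)
    parsed_ok = True
except etree.ParseError:
    parsed_ok = False

# Wrapping the same content in a single root element makes it parseable
wrapped = etree.fromstring('<tree>' + multi_root + '</tree>')
record_numbers = [record.get('number') for record in wrapped]

print(parsed_ok)       # False
print(record_numbers)  # ['1', '2']
```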

Code:

from collections import defaultdict, OrderedDict
from xml.etree import ElementTree as etree
import csv

# Flatten an XML node into a dictionary, recursing into its children.
#   root:  the XML element to parse
#   store: the dictionary that collects the parsed values
#   index: 1-based position of this node among siblings with the same tag
def parse_node(root, store, index):
    tag_dict = OrderedDict()
    # Repeated sibling tags get a numeric suffix: book, book2, book3, ...
    name = root.tag + str(index) if index > 1 else root.tag
    # Store this node's attributes as "tag:attribute"
    for key, value in root.attrib.items():
        tag_dict[name + ':' + key] = value
    # list(root) replaces getchildren(), which was removed in Python 3.9
    children = list(root)
    if children:
        # Count how often each child tag has been seen so far
        tag_dict_id = defaultdict(int)
        for child in children:
            tag_dict_id[child.tag] += 1
            # Recurse, passing the child's position among same-tag siblings
            parse_node(child, tag_dict, tag_dict_id[child.tag])
    else:
        # Leaf node: store its text as "tag:text"
        tag_dict[name + ':text'] = root.text
    # Merge this node's values into the caller's dictionary
    store.update(tag_dict)
    return store

# Input: an XML root node whose children are the records. Output: 'output.csv'
def writeToCSV(records_lib):
    records_list = []  # one flattened dictionary per record
    header = OrderedDict()  # union of all keys, in first-seen order
    for record in records_lib:
        parsed_record = parse_node(record, OrderedDict(), 1)
        for key in parsed_record:
            header[key] = key
        records_list.append(parsed_record)
    # 'wb' is the Python 2.7 mode; on Python 3 use open('output.csv', 'w', newline='')
    with open('output.csv', 'wb') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(header))
        # header maps each key to itself, so this writes the header row
        writer.writerow(header)
        for record in records_list:
            writer.writerow(record)


doc = etree.parse('library.xml')
root = doc.getroot()
writeToCSV(root)
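The header is collected across all records because records may not share the same keys (one customer might have three books, another only one). The sketch below illustrates that idea in isolation. It assumes Python 3 and writes to an in-memory buffer instead of a file, and the two sample rows are invented for illustration; `csv.DictWriter` fills any key missing from a row with `restval`.

```python
import csv
import io

# Invented sample rows with different sets of keys
rows = [
    {'book:title': 'A'},
    {'book:title': 'B', 'book2:title': 'C'},
]

# Union of all keys, in first-seen order
header = []
for row in rows:
    for key in row:
        if key not in header:
            header.append(key)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, restval='')
writer.writeheader()
writer.writerows(rows)  # the first row gets an empty book2:title cell

csv_text = buffer.getvalue()
print(csv_text)
```

With a real file on Python 3, the equivalent would be `open('output.csv', 'w', newline='')` in place of the buffer.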

Output:

record:first-time,record:last-time,record:type,record:number,Customer:custID,Customer:name,type:text,max-books:text,rental:status,rental:text,book:title,book:pubID,book:type,book:author,cover:text,pub:text,book2:title,book2:pubID,book2:type,book2:author
Wed Feb  4 19:22:57 2014,Fri Feb  7 10:11:02 2015,custID,1,4466851,Bob Janotior,Monthly,5,false,overdue,All The Things,7744jh566lp,fiction,Jill Taylor,hardback,Penguin,Mellow Tides of War,7744gd556se,non-fiction,Prof. Lambert et al
Wed Apr  8 15:23:54 2012,Fri Feb  7 10:11:02 2015,custID,2,4466787,Jayne Wrikcek,Monthly,5,false,overdue,Kiss Me Hardy,766485gf66ki,fiction,AR Jones,hardback,Lofthouse,Oskar Came Again,a5555qwd2,fiction,Johnathan Huphries
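Note that the output has only one cover:text and one pub:text column per record, even though every record has two books: the numbering applies to siblings only, so the children of book and book2 end up under the same key and the later value wins. One possible way around this, sketched below as an assumption rather than as part of the script above, is to qualify each key with its parent path. The function name `flatten`, the `/` separator, and the inline sample are all hypothetical.

```python
from collections import OrderedDict, defaultdict
from xml.etree import ElementTree as etree

def flatten(node, path, row):
    """Store node's attributes and text in row, using path as the key prefix."""
    for key, value in node.attrib.items():
        row[path + ':' + key] = value
    children = list(node)
    if not children:
        row[path + ':text'] = node.text
    counts = defaultdict(int)  # occurrences of each child tag so far
    for child in children:
        counts[child.tag] += 1
        n = counts[child.tag]
        name = child.tag if n == 1 else child.tag + str(n)
        # The child's key prefix includes the numbered parent path
        flatten(child, path + '/' + name, row)
    return row

sample = etree.fromstring(
    '<record number="1">'
    '<book title="A"><cover>softback</cover></book>'
    '<book title="B"><cover>hardback</cover></book>'
    '</record>'
)
row = flatten(sample, sample.tag, OrderedDict())
for key, value in row.items():
    print(key, '=', value)
```

Here the two covers land under record/book/cover:text and record/book2/cover:text, so neither overwrites the other. The trade-off is longer column names.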

Kind regards,