I am parsing a simple XML document with a simple script I wrote (with a few tweaks). Here is the XML:
<?xml version="1.0" ?>
<library owner="James Wise">
<book>
<title>Sandman Volume 1: Preludes and Nocturnes</title>
<author>Neil Gaiman</author>
</book>
<book>
<title>Good Omens</title>
<author>Neil Gamain</author>
<author>Terry Pratchett</author>
</book>
<book>
<title>The Man And The Goat</title>
<author>Bubber Elderidge</author>
</book>
<book>
<title>Once Upon A Time in LA</title>
<author>Dr Dre</author>
</book>
<book>
<title>There Will Never Be Justice</title>
<author>IR Jury</author>
</book>
<book>
<title>Beginning Python</title>
<author>Peter Norton, et al</author>
</book>
</library>
Here is my Python script:
from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
    csvfile = open('output.csv', 'w')
    fieldnames = ['title', 'author', 'author']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    books = myLibrary.getElementsByTagName("book")
    for book in books:
        titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
        for author in book.getElementsByTagName("author"):
            authorValue = author.childNodes[0].data
            writer.writerow({'title': titleValue, 'author': authorValue})

doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Get book elements in Library
books = myLibrary.getElementsByTagName("book")
# Print each book's title
writeToCSV(myLibrary)
Here is my output:
title,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman
Good Omens,Neil Gamain
Good Omens,Terry Pratchett
The Man And The Goat,Bubber Elderidge
Once Upon A Time in LA,Dr Dre
There Will Never Be Justice,IR Jury
Beginning Python,"Peter Norton, et al"
Note that the book "Good Omens" has two authors, and they appear on two separate rows. What I actually want is this:
title,author,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman,,
Good Omens,Neil Gamain,Terry Pratchett
The Man And The Goat,Bubber Elderidge,,
Once Upon A Time in LA,Dr Dre,,
There Will Never Be Justice,IR Jury,,
Beginning Python,"Peter Norton, et al",,
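As you can see, there are three columns, so both authors appear on the same row. Books with only one author simply have an empty entry, so two commas sit next to each other.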
Answer 0: (score: 3)
A good way to solve this is to use lxml:
>>> from lxml import etree
>>> with open('doc.xml', 'rb') as f:
...     doc = etree.XML(f.read())
...
>>> for e in doc.xpath('book'):
...     print((e.xpath('author/text()'), e.xpath('title/text()')[0]))
...
(['Neil Gaiman'], 'Sandman Volume 1: Preludes and Nocturnes')
(['Neil Gamain', 'Terry Pratchett'], 'Good Omens')
(['Bubber Elderidge'], 'The Man And The Goat')
(['Dr Dre'], 'Once Upon A Time in LA')
(['IR Jury'], 'There Will Never Be Justice')
(['Peter Norton, et al'], 'Beginning Python')
Then, to generate your CSV, you could do something like this:
import csv

with open('output.csv', 'w', newline='') as fout:
    fieldnames = ['title', 'authors']
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    for e in doc.xpath('book'):
        title, authors = e.xpath('title/text()')[0], e.xpath('author/text()')
        # Join all authors of a book into a single field
        writer.writerow({'title': title, 'authors': ';'.join(authors)})
Or:
with open('output.csv', 'w', newline='') as fout:
    fieldnames = ['title', 'author1', 'author2']
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    for e in doc.xpath('book'):
        title, authors = e.xpath('title/text()')[0], e.xpath('author/text()')
        author1, author2 = '', ''
        if len(authors) >= 1:
            author1 = authors[0]
        if len(authors) >= 2:
            author2 = authors[1]
        writer.writerow({'title': title, 'author1': author1, 'author2': author2})
Answer 1: (score: 2)
Here is another possible solution:
CODE:
#! /usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
    # Python 3: text mode with newline='' (on Python 2, use mode 'wb' instead)
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['title', 'author', 'author'])
        books = myLibrary.getElementsByTagName("book")
        for book in books:
            titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
            authors = []  # collect all authors of this book in a list
            for author in book.getElementsByTagName("author"):
                authors.append(author.childNodes[0].data)
            writer.writerow([titleValue] + authors)  # write one row per book

doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]
# Write one CSV row per book
writeToCSV(myLibrary)
OUTPUT:
title,author,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman
Good Omens,Neil Gamain,Terry Pratchett
The Man And The Goat,Bubber Elderidge
Once Upon A Time in LA,Dr Dre
There Will Never Be Justice,IR Jury
Beginning Python,"Peter Norton, et al"
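The rows above have different lengths: a book with one author gets two columns instead of three. If every row should have the same number of columns, as in the question's desired output, the author list can be padded with empty strings. The following is only a minimal sketch under stated assumptions: the helper names `books_to_rows` and `rows_to_csv` and the shortened inline sample XML are hypothetical and not part of the original answer, and the CSV is written to an in-memory buffer instead of `output.csv`.

```python
import csv
import io
from xml.dom.minidom import parseString

# Hypothetical sample data, shortened from the question's library.xml.
SAMPLE_XML = """<?xml version="1.0" ?>
<library owner="James Wise">
<book>
<title>Good Omens</title>
<author>Neil Gamain</author>
<author>Terry Pratchett</author>
</book>
<book>
<title>Beginning Python</title>
<author>Peter Norton, et al</author>
</book>
</library>"""


def books_to_rows(library):
    """Return (title, [authors]) pairs for every <book> element."""
    rows = []
    for book in library.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0].childNodes[0].data
        authors = [a.childNodes[0].data
                   for a in book.getElementsByTagName("author")]
        rows.append((title, authors))
    return rows


def rows_to_csv(rows, out):
    """Write one line per book, padding short author lists with ''."""
    # The widest author list decides how many author columns are needed.
    width = max((len(authors) for _, authors in rows), default=0)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title"] + ["author"] * width)
    for title, authors in rows:
        padding = [""] * (width - len(authors))
        writer.writerow([title] + authors + padding)


doc = parseString(SAMPLE_XML)
buffer = io.StringIO()
rows_to_csv(books_to_rows(doc.documentElement), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Computing the column count from the data means a book with three or more authors would also fit, at the cost of reading all books before writing the header.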
Kind regards,
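Answer 2: (score: 0)
To get both authors on the same row, all you need is some basic loop programming. For each title, you iterate over the whole list to look for another author under the same title.
Alternatively, sort the list by title first, which puts the two authors in adjacent records. You can sort the XML structure directly using XML library calls.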
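The sort-then-merge idea could be sketched as follows. This is an illustration rather than the answerer's own code: it assumes the data is already available as (title, author) pairs, like the one-row-per-author output in the question, and the names `pairs` and `merge_authors` are hypothetical. `itertools.groupby` only merges adjacent records, which is why the list is sorted by title first.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (title, author) pairs, one per author, in arbitrary order.
pairs = [
    ("Good Omens", "Neil Gamain"),
    ("The Man And The Goat", "Bubber Elderidge"),
    ("Good Omens", "Terry Pratchett"),
]


def merge_authors(pairs):
    """Return one [title, author, ...] row per distinct title."""
    # Sorting by title makes records with the same title adjacent;
    # sorted() is stable, so authors keep their original relative order.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        [title] + [author for _, author in group]
        for title, group in groupby(ordered, key=itemgetter(0))
    ]


merged = merge_authors(pairs)
print(merged)
```

Note that sorting changes the order of the books. If the original order matters, collecting the authors per `<book>` element while walking the XML avoids the sort altogether.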