My input file looks like the following, with up to 100k records in a single file:
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId><CreDtTm>2012-09-28T14:07:00</CreDtTm><NbOfTxs>100000</NbOfTxs><CtrlSum>11500000</CtrlSum> <InitgPty><Nm>ABC Corporation</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></InitgPty></GrpHdr><PmtInf><PmtInfId>CARCORP/086</PmtInfId><PmtMtd>TRF</PmtMtd><BtchBookg>false</BtchBookg><ReqdExctnDt>2012-09-29</ReqdExctnDt><Dbtr><Nm>CARCORP INC</Nm><PstlAdr><StrtNm>Times Square</StrtNm><BldgNb>7</BldgNb><PstCd>NY 10036</PstCd><TwnNm>New York</TwnNm><Ctry>US</Ctry></PstlAdr></Dbtr><DbtrAcct><Id><Othr><Id>00125574999</Id></Othr></Id></DbtrAcct><DbtrAgt><FinInstnId><BICFI>BBBBUS33</BICFI></FinInstnId></DbtrAgt><CdtTrfTxInf><PmtId><InstrId>ABC/120928/CCT001/01</InstrId><EndToEndId>ABC/4562/4</EndToEndId></PmtId><Amt><InstdAmt Ccy="JPY">100</InstdAmt></Amt><ChrgBr>SHAR</ChrgBr><CdtrAgt><FinInstnId><BICFI>AAAAGB2L</BICFI></FinInstnId></CdtrAgt><Cdtr><Nm>DEF Electronics</Nm><PstlAdr><AdrLine>Corn Exchange 5th Floor</AdrLine><AdrLine>Mark Lane 55</AdrLine><AdrLine>EC3R7NE London</AdrLine><AdrLine>GB</AdrLine></PstlAdr></Cdtr><CdtrAcct><Id><Othr><Id>23683707994125</Id></Othr></Id></CdtrAcct><Purp><Cd>GDDS</Cd></Purp><RmtInf><Strd><RfrdDocInf><Tp><CdOrPrtry><Cd>CINV</Cd></CdOrPrtry></Tp><Nb>4562</Nb><RltdDt>2012-09-08</RltdDt></RfrdDocInf></Strd></RmtInf></CdtTrfTxInf></PmtInf></CstmrCdtTrfInitn></pain001>
I used list comprehensions and XPath as the logic for parsing the values:
import xml.etree.ElementTree as ET
def parsexml():
    net=[]
    tree = ET.parse('pain1.xml')
    root = tree.getroot()
    grp1x = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/MsgId')]
    grp1y = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
    grp1 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/Nm')]
    grp2 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CreDtTm')]
    grp3 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/NbOfTxs')]
    grp4 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/CtrlSum')]
    grp5 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/StrtNm')]
    grp6 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/BldgNb')]
    grp7 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/PstCd')]
    grp8 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/TwnNm')]
    grp9 = [e.text for e in root.findall('CstmrCdtTrfInitn/GrpHdr/InitgPty/PstlAdr/Ctry')]
    grp10 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtInfId')]
    grp11 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/PmtMtd')]
    grp12 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/BtchBookg')]
    grp13 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/ReqdExctnDt')]
    grp14 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/Nm')]
    grp15 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/StrtNm')]
    grp16 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/BldgNb')]
    grp17 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/PstCd')]
    grp18 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/TwnNm')]
    grp19 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/Dbtr/PstlAdr/Ctry')]
    grp20 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAcct/Id/Othr/Id')]
    grp21 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/DbtrAgt/FinInstnId/BICFI')]
    grp22 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/InstrId')]
    grp23 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/PmtId/EndToEndId')]
    grp24 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt')]
    grp25= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Amt/InstdAmt[@Ccy="JPY"]')]
    grp26 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/ChrgBr')]
    grp27 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAgt/FinInstnId/BICFI')]
    grp28 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/Nm')]
    grp29 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[1]')]
    grp30 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[2]')]
    grp31 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[3]')]
    grp32 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Cdtr/PstlAdr/AdrLine[4]')]
    grp33 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/CdtrAcct/Id/Othr/Id')]
    grp34 = [e.text for e in root.findall('pain001/CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/Purp/Cd')]
    grp35 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Tp/CdOrPrtry/Cd')]
    grp36 = [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/Nb')]
    grp37= [e.text for e in root.findall('CstmrCdtTrfInitn/PmtInf/CdtTrfTxInf/RmtInf/Strd/RfrdDocInf/RltdDt')]
    net = ",".join(grp1x+grp1y+grp1 + grp2 + grp3 + grp4 +grp5+grp6+grp7+grp8+grp9+grp10+grp11+grp12+grp13+grp14+grp15+grp16+grp17+grp18+grp19+grp20+grp21+grp22+grp23+grp24+grp25+grp26+grp27+grp28+grp29+grp30+grp31+grp32+grp33+grp34+grp35+grp36+grp37)
    return net
I get this error:
Traceback (most recent call last):
  File "C:\Python27\parsefunc.py", line 10, in <module>
    tree = ET.parse('pain1.xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
    parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: junk after document element: line 2, column 0
The output I need after parsing looks like this:
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
ABC/120928/CCT001,2012-09-28T14:07:00,ABC Corporation,2012-09-28T14:07:00,100000,11500000,Times Square,7,NY 10036,New York,US,CARCORP/086,TRF,false,2012-09-29,CARCORP INC,Times Square,7,NY 10036,New York,US,00125574999,BBBBUS33,ABC/120928/CCT001/01,ABC/4562/1,100,100,SHAR,AAAAGB2L,DEF Electronics,Corn Exchange 5th Floor,Mark Lane 55,EC3R7NE London,GB,CINV,4562,2012-09-08
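Is there a better approach than list comprehensions with ElementTree, or how can I parse the file and produce output in the above form for the other XML documents in the same file?
Update
I was able to parse one document and generate its output as a single line using the new approach suggested by Parfait, but I still get the same error when I try to apply the solution below to multiple XML documents:
import sys
import lxml.etree as ET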
net = []
tree = ET.parse('pain001.xml')
root = tree.getroot()
line= tree.xpath('//text()')
line = map(lambda line: line.strip(), line)
net = filter(bool, line)
#str_list = filter(None, str_list)
#net = root.xpath('//*')
net = ",".join(net)
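Answer 0 (score: 0)
This is not a good approach. If your file is too large, your process will run out of memory. If your file always has the same structure, you can process it line by line and write the output as you go. You can also build each output line directly instead of collecting lists.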
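The line-by-line idea could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree, the function name iter_csv_lines is hypothetical, and the two-record sample stands in for the real file. It also joins values with plain commas, so a value that itself contains a comma would need csv.writer instead.

```python
import io
import xml.etree.ElementTree as ET

def iter_csv_lines(lines):
    """Yield one comma-joined string per XML document (one document per line)."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = ET.fromstring(raw)  # only this one record is held in memory
        # itertext() walks the text nodes in document order
        values = (text.strip() for text in record.itertext())
        yield ",".join(v for v in values if v)

# Hypothetical two-record sample standing in for the real input file
sample = io.StringIO(
    "<pain001><GrpHdr><MsgId>A/1</MsgId><NbOfTxs>2</NbOfTxs></GrpHdr></pain001>\n"
    "<pain001><GrpHdr><MsgId>A/2</MsgId><NbOfTxs>3</NbOfTxs></GrpHdr></pain001>\n"
)
result = list(iter_csv_lines(sample))
```

With a real file, the same generator can be fed an open file object, so memory use stays at roughly one record at a time.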
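Answer 1 (score: 0)
Consider an XPath expression that selects every descendant element in the document; it returns a list of elements whose tags and text you can read: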
net = tree.xpath('//*')
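However, to iterate over each repeated child root <pain001> and migrate to a csv format of rows and columns, consider iterating over each occurrence of the child root's nodes and extracting the corresponding tag and text.
import os, sys
import csv
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# ITERATE THROUGH ALL XML FILES
for item in os.listdir(cd):
    if item.endswith(".xml"):
        tree = ET.parse(os.path.join(cd,item))
        subroot = tree.xpath("//CstmrCdtTrfInitn")
        # 'ab' is the Python 2 csv mode; on Python 3 use open(..., 'a', newline='')
        with open(os.path.join(cd,'MultipleXPaths.csv'), 'ab') as m:
            writer = csv.writer(m)
            for i in range(1,len(subroot)+1):
                # parenthesize so [i] indexes the i-th match in the whole document
                nodes = tree.xpath('(//CstmrCdtTrfInitn)[{0}]//*'.format(i))
                cols = []
                rows = []
                for elem in nodes:
                    cols.append(elem.tag)
                    # elem.text is None for elements that hold only child elements
                    rows.append((elem.text or '').replace('\n','').strip())
                if i == 1:
                    print(', '.join(cols)+"\n")
                    writer.writerow(cols)
                print(', '.join(rows)+"\n")
                writer.writerow(rows)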
CONSOLE PRINT OUTPUT (written as columns and rows in the csv file)
GrpHdr, MsgId, CreDtTm, NbOfTxs, CtrlSum, InitgPty, Nm, PstlAdr, StrtNm,
BldgNb, PstCd, TwnNm, Ctry, PmtInf, PmtInfId, PmtMtd, BtchBookg,
ReqdExctnDt, Dbtr, Nm, PstlAdr, StrtNm, BldgNb, PstCd, TwnNm, Ctry,
DbtrAcct, Id, Othr, Id, DbtrAgt, FinInstnId, BICFI, CdtTrfTxInf, PmtId,
InstrId, EndToEndId, Amt, InstdAmt, ChrgBr, CdtrAgt, FinInstnId, BICFI,
Cdtr, Nm, PstlAdr, AdrLine, AdrLine, AdrLine, AdrLine, CdtrAcct, Id,
Othr, Id, Purp, Cd, RmtInf, Strd, RfrdDocInf, Tp, CdOrPrtry, Cd, Nb, RltdDt
, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086,
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01,
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08
, ABC/120928/CCT001, 2012-09-28T14:07:00, 100000, 11500000, , ABC
Corporation, , Times Square, 7, NY 10036, New York, US, , CARCORP/086,
TRF, false, 2012-09-29, , CARCORP INC, , Times Square, 7, NY 10036, New
York, US, , , , 00125574999, , , BBBBUS33, , , ABC/120928/CCT001/01,
ABC/4562/4, , 100, SHAR, , , AAAAGB2L, , DEF Electronics, , Corn
Exchange 5th Floor, Mark Lane 55, EC3R7NE London, GB, , , ,
23683707994125, , GDDS, , , , , , CINV, 4562, 2012-09-08
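The empty fields in the rows above come from container elements such as GrpHdr or PstlAdr, which hold child elements but no text of their own. If only the value-bearing columns are wanted, one possible variation is to keep just the leaf elements. The sketch below assumes the standard-library ElementTree rather than lxml, and the helper name leaf_columns and the shortened sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

def leaf_columns(record):
    """Return (tags, values) for elements that have no child elements."""
    tags, values = [], []
    for elem in record.iter():
        if len(elem) == 0:  # no children, so this element carries a value
            tags.append(elem.tag)
            values.append((elem.text or "").strip())
    return tags, values

# Hypothetical shortened record for illustration
sample = ET.fromstring(
    "<CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId>"
    "<InitgPty><Nm>ABC Corporation</Nm></InitgPty></GrpHdr></CstmrCdtTrfInitn>"
)
tags, values = leaf_columns(sample)
```

Note that leaf tags such as Nm and Id still repeat across different parents, so the header row alone does not identify a column uniquely.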
Answer 2 (score: 0)
ET.parse('pain001.xml')
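fails because the file is not actually a single XML document: an XML document can have only one root element, which is why the parser reports "junk after document element" at line 2. But it does have one XML document per line, which is good, because it means you don't have to load the whole file into memory to process it.
You can keep doing what you are doing, but put it inside a for xmltext in open('somefile'): loop. You can also reduce the total amount of work you do along the way. I'm kicking myself a bit, because I wrote this in lxml while you are using ElementTree, but you can switch libraries or adapt the script. The idea is to write the XPath selector for each field into a list, then use that list to extract the data for each line. That certainly beats writing out a separate statement for every field.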
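A small demonstration of that behavior, as a sketch only: it uses the standard-library ElementTree, and the two-line string is a hypothetical miniature of the real input. Parsing both lines at once raises ParseError, while parsing each line on its own succeeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the input: two complete documents, one per line
two_docs = (
    "<pain001><MsgId>A/1</MsgId></pain001>\n"
    "<pain001><MsgId>A/2</MsgId></pain001>\n"
)

# Parsing everything at once fails: the second root is "junk" to the parser
try:
    ET.fromstring(two_docs)
    whole_file_ok = True
except ET.ParseError:
    whole_file_ok = False

# Parsing each line separately works, because each line is a complete document
msg_ids = [
    ET.fromstring(line).findtext("MsgId")
    for line in two_docs.splitlines()
    if line.strip()
]
```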
import sys
import lxml.etree
import csv
# compile xpath selectors for element text
selectors = ('GrpHdr/MsgId', 'GrpHdr/CreDtTm') # etc...
xpath = [lxml.etree.XPath('{}/text()'.format(s)) for s in selectors]
# open result csv file
with open('pain.csv', 'w') as paincsv:
    writer = csv.writer(paincsv)
    # read file with 1 'CstmrCdtTrfInitn' record per line
    with open('pain.xml') as painxml:
        # process each record
        for index, line in enumerate(painxml):
            if not line.strip(): # allow empty lines
                continue
            try:
                # each line is an xml doc
                pain001 = lxml.etree.fromstring(line)
                # move to the customer elem
                elem = pain001.find('CstmrCdtTrfInitn')
                # select each value and write to csv
                writer.writerow([xp(elem)[0].strip() for xp in xpath])
            except Exception as e:
                # give a hint where things go bad
                sys.stderr.write("Error line {}, {}\n".format(index, str(e)))
                raise
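For anyone staying with ElementTree, the same selector-list idea might look roughly like this. It is a sketch under assumptions, not a drop-in replacement: FIELDS lists only three of the paths, the helper name record_to_row is hypothetical, and the sample line is a shortened record. Unlike the script above, a missing field here becomes an empty string instead of raising an error.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Paths relative to CstmrCdtTrfInitn; extend the tuple with the remaining fields
FIELDS = ("GrpHdr/MsgId", "GrpHdr/CreDtTm", "PmtInf/CdtTrfTxInf/Amt/InstdAmt")

def record_to_row(xml_line, fields=FIELDS):
    """Parse one single-line document and return one value per field."""
    root = ET.fromstring(xml_line)
    elem = root.find("CstmrCdtTrfInitn")
    if elem is None:
        raise ValueError("no CstmrCdtTrfInitn element in record")
    # findtext returns the default when the path does not match anything
    return [(elem.findtext(path, default="") or "").strip() for path in fields]

# Hypothetical shortened record for illustration
sample_line = (
    "<pain001><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC/120928/CCT001</MsgId>"
    "<CreDtTm>2012-09-28T14:07:00</CreDtTm></GrpHdr><PmtInf><CdtTrfTxInf><Amt>"
    '<InstdAmt Ccy="JPY">100</InstdAmt></Amt></CdtTrfTxInf></PmtInf>'
    "</CstmrCdtTrfInitn></pain001>"
)

out = io.StringIO()
csv.writer(out).writerow(record_to_row(sample_line))
csv_text = out.getvalue()
```

Using csv.writer rather than ",".join means values that contain commas or quotes are escaped properly.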