I want to convert an XML file to CSV using AnyData 0.12. The XML file looks like this:
<FIXML r="20030618" s="20040109" v="4.4" xr="FIA" xv="1" xmlns="http://www.fixprotocol.org/FIXML-4-4">
<Batch>
<MktDataFull RptID="23520135" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20171215" MatDt="2017-12-15" CFI="OCASPS" StrkPx="100" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="5.7367" Ccy="USD" PxDelta="0.5" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="30818621" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20180615" MatDt="2018-06-15" CFI="OCASPS" StrkPx="100" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="7.3603" Ccy="USD" PxDelta="0.52" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165289" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="101" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.7973" Ccy="USD" PxDelta="0.46" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165443" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="102" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.2775" Ccy="USD" PxDelta="0.35" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165368" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="103" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.8861" Ccy="USD" PxDelta="0.25" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165483" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="104" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.5858" Ccy="USD" PxDelta="0.25" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="25807539" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170616" MatDt="2017-06-16" CFI="OCASPS" StrkPx="105" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.321" Ccy="USD" PxDelta="0.26" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="30818579" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20180615" MatDt="2018-06-15" CFI="OCASPS" StrkPx="105" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="4.7838" Ccy="USD" PxDelta="0.4" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="32444397" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170616" MatDt="2017-06-16" CFI="OCASPS" StrkPx="106" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.0134" Ccy="USD" PxDelta="0.26" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="32868839" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170120" MatDt="2017-01-20" CFI="OCASPS" StrkPx="107" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.0079" Ccy="USD" PxDelta="0" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="32444384" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170616" MatDt="2017-06-16" CFI="OCASPS" StrkPx="109" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.4888" Ccy="USD" PxDelta="0.11" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
....
....
</Batch>
</FIXML>
The CSV file should contain part of the XML. Its column headers should be the attribute names used in the XML file, like this:
RptID,BizDt,StrkMult,Sym,StrkValu,Mult,MatDt,CFI,StrkCcy,MMY,StrkPx
23520135,2016-12-09,1,OEF,100,100,2017-12-15,OCASPS,USD,20171215,100
30818621,2016-12-09,1,OEF,100,100,2018-06-15,OCASPS,USD,20180615,100
31165289,2016-12-09,1,OEF,100,100,2017-03-17,OCASPS,USD,20170317,101
31165443,2016-12-09,1,OEF,100,100,2017-03-17,OCASPS,USD,20170317,102
31165368,2016-12-09,1,OEF,100,100,2017-03-17,OCASPS,USD,20170317,103
31165483,2016-12-09,1,OEF,100,100,2017-03-17,OCASPS,USD,20170317,104
...
I am running this code:
use AnyData;
my $input_xml = "oc170120.xml"; #name of the XML file
my $output_csv = "test3.csv"; #name of the output file
$flags->{record_tag} = 'Instrmt';
my $table = adTie( 'XML', $input_xml, 'r', $flags );
....
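It runs, and with a small test file everything works fine. With the real file, however, after a while I get
Out of memory!
when adTie() tries to read the whole file into memory. The XML file has more than 400,000 records.
I am using Perl 5.24.1 on a 64-bit system.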
Answer 0 (score: 4)
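OK, so the problem with XML is that you can realistically assume its in-memory size is about 10 times its size on disk.
Reading the whole thing in and then throwing most of it away is therefore very memory-inefficient, and for larger files, as you have found, it becomes a big problem.
For this kind of task (and, to be fair, for most XML tasks; I am a huge fan) I like XML::Twig, because its twig_handlers let you parse the file and discard the already processed parts as you go, which reduces the memory footprint.
So for your example: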
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Columns to emit, in the order of the expected output shown in the question.
my @keys = qw( RptID BizDt StrkMult Sym StrkValu Mult MatDt CFI StrkCcy MMY StrkPx );

# Called once for each completely parsed <MktDataFull> element.
sub process_data {
    my ( $twig, $data ) = @_;

    # Merge the attributes of the element and of its direct children into one hash.
    my %atts = map { %{ $_->atts || {} } } $data, $data->children;
    print join( ",", map { $atts{$_} // '' } @keys ), "\n";

    # Discard everything parsed so far to keep memory use low.
    $twig->purge;
}

print join( ",", @keys ), "\n";
XML::Twig->new( twig_handlers => { 'MktDataFull' => \&process_data } )->parse( \*DATA );
__DATA__
<FIXML r="20030618" s="20040109" v="4.4" xr="FIA" xv="1" xmlns="http://www.fixprotocol.org/FIXML-4-4">
<Batch>
<MktDataFull RptID="23520135" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20171215" MatDt="2017-12-15" CFI="OCASPS" StrkPx="100" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="5.7367" Ccy="USD" PxDelta="0.5" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="30818621" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20180615" MatDt="2018-06-15" CFI="OCASPS" StrkPx="100" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="7.3603" Ccy="USD" PxDelta="0.52" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165289" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="101" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.7973" Ccy="USD" PxDelta="0.46" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165443" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="102" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.2775" Ccy="USD" PxDelta="0.35" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165368" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="103" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.8861" Ccy="USD" PxDelta="0.25" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="31165483" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170317" MatDt="2017-03-17" CFI="OCASPS" StrkPx="104" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.5858" Ccy="USD" PxDelta="0.25" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="25807539" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170616" MatDt="2017-06-16" CFI="OCASPS" StrkPx="105" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.321" Ccy="USD" PxDelta="0.26" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="30818579" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20180615" MatDt="2018-06-15" CFI="OCASPS" StrkPx="105" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="4.7838" Ccy="USD" PxDelta="0.4" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="32444397" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170616" MatDt="2017-06-16" CFI="OCASPS" StrkPx="106" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="1.0134" Ccy="USD" PxDelta="0.26" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="32868839" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170120" MatDt="2017-01-20" CFI="OCASPS" StrkPx="107" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.0079" Ccy="USD" PxDelta="0" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
<MktDataFull RptID="32444384" BizDt="2016-12-09"><Instrmt Sym="OEF" MMY="20170616" MatDt="2017-06-16" CFI="OCASPS" StrkPx="109" StrkMult="1" StrkValu="100" Mult="100" StrkCcy="USD"/><Full Typ="5" Px="0.4888" Ccy="USD" PxDelta="0.11" Dt="2016-12-09"/><Full Typ="D" Px="100.15" Dt="2016-12-09"/></MktDataFull>
</Batch>
</FIXML>
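Now, you will probably want to use:
XML::Twig->new( ... )->parsefile('your_xml_file');
and perhaps open a file handle for print to write to (at the moment the output goes to STDOUT, for illustration purposes).
The important point above, though, is the purge call, which tells XML::Twig that you are finished with what has been parsed so far and frees the 'processed' data from memory.
So it should do what you want with a much lower memory footprint.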
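Putting the two suggestions together (parsefile plus a dedicated output file handle), a minimal sketch might look like the following. The helper name xml_to_csv and the command-line handling are assumptions made for illustration, not part of the code above, and the column list assumes the attribute names shown in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Columns to write, in the order of the expected output in the question.
my @keys = qw( RptID BizDt StrkMult Sym StrkValu Mult MatDt CFI StrkCcy MMY StrkPx );

# Hypothetical helper: stream $input_xml and write one CSV row per <MktDataFull>.
sub xml_to_csv {
    my ( $input_xml, $output_csv ) = @_;

    open my $out, '>', $output_csv or die "Cannot open $output_csv: $!";
    print {$out} join( ",", @keys ), "\n";

    my $twig = XML::Twig->new(
        twig_handlers => {
            MktDataFull => sub {
                my ( $t, $elt ) = @_;

                # Collect attributes from the element and its direct children.
                my %atts = map { %{ $_->atts || {} } } $elt, $elt->children;
                print {$out} join( ",", map { $atts{$_} // '' } @keys ), "\n";

                # Release everything parsed so far.
                $t->purge;
            },
        },
    );
    $twig->parsefile($input_xml);

    close $out or die "Cannot close $output_csv: $!";
    return;
}

# Run only when an input and an output file name are given on the command line.
xml_to_csv(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice, it would be called with the input and output file names as its two arguments. Note that the rows are joined with plain commas; if attribute values could themselves contain commas or quotes, a CSV module such as Text::CSV would be a safer way to write them.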