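Novice Perl programmer here, trying to convert a simple XML string into a tab-delimited text file. I struggled with XML::Parser (as well as XML::Twig, XML::Simple, and even XSLT), but I couldn't figure out how to use the main data elements as the column headings.
Then I started trying XSLT, but I can't figure out how to get a delimiter between the elements (after which I would use split and/or join?); they all just run together as one string.
For now I am simply printing the column headings by hand. Is there an easy way to do that with a template?
I have looked at similar questions, but I still can't get any delimiter added to my file: "XML to Tab delimited Text" and "Modifying a XSLT for converting XML to tab delimited text file".
Questions:
In general, what is the simplest way to do this? Should I be using XSLT (which I have been struggling to understand)?
How can I fix the problem below?
It seems I am close and only need to get a delimiter into the XSLT output string, so that I can split it and then join it with "\t" for the tab-delimited text file.
Here is my XML (an SMS log from Twilio):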
<?xml version="1.0" encoding="UTF-8"?>
<TwilioResponse>
<SMSMessages end="49" firstpageuri="/2010-04-01/Accounts/ACcbaa0/SMS/Messages?Page=0&amp;PageSize=50" lastpageuri="/2010-04-01/Accounts/ACcbaa/SMS/Messages?Page=54&amp;PageSize=50" nextpageuri="/2010-04-01/Accounts/ACcbaa0103c/SMS/Messages?Page=1&amp;PageSize=50&amp;AfterSid=SMc20cf7" numpages="55" page="0" pagesize="50" previouspageuri="" start="0" total="2703" uri="/2010-04-01/Accounts/ACcbaa0103cf/SMS/Messages">
<SMSMessage>
<Sid>SMe24eb108b7eb6a3b</Sid>
<DateCreated>Fri, 09 Aug 2013 00:07:59 +0000</DateCreated>
<DateUpdated>Fri, 09 Aug 2013 00:07:59 +0000</DateUpdated>
<DateSent>Fri, 09 Aug 2013 00:07:59 +0000</DateSent>
<AccountSid>ACcbaa0103c4141e5cd754042cb424d4ff</AccountSid>
<To>+14444444444</To>
<From>+15555555555</From>
<Body>Hi there!</Body>
<Status>sent</Status>
<Direction>outbound-api</Direction>
<Price>-0.01000</Price>
<PriceUnit>USD</PriceUnit>
<ApiVersion>2010-04-01</ApiVersion>
<Uri>/2010-04-01/Accounts/ACcbaa01/SMS/Messages/SMe24eb108b</Uri>
</SMSMessage>
<SMSMessage>
... etc. ...
</SMSMessage>
</SMSMessages>
</TwilioResponse>
Here is the XSLT I tried to use:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:template match="//TwilioResponse">
<xsl:for-each select="SMSMessage">
<xsl:value-of select="Sid"/>
<!-- I tried all these, too: &#9; &#x9; even &#xA; -->
<xsl:text>	</xsl:text>
<!-- I also tried this from another SO question -->
<xsl:if test="position() != last()">, </xsl:if>
<xsl:value-of select="DateCreated"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="DateUpdated"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="DateSent"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="AccountSid"/>
<xsl:text>	</xsl:text>
<xsl:text>
</xsl:text>
<xsl:text> </xsl:text>
<xsl:text>	</xsl:text>
<xsl:value-of select="To"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="From"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="Body"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="Status"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="Direction"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="Price"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="PriceUnit"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="ApiVersion"/>
<xsl:text>	</xsl:text>
<xsl:value-of select="Uri"/>
<!-- I tried both of these: line feed char -->
<xsl:text>
</xsl:text>
<xsl:text> </xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Here is the relevant part of my Perl code:
use XML::XSLT;
my $logs = $twilio -> GET ('SMS/Messages');
my $string = $logs->{content};
my $xsl = 'xsl.txt';
my $xslt = XML::XSLT->new ($xsl);
$xslt->transform ($string);
my $xsltToString = $xslt->toString;
print $xsltToString;
my $columnHeadings = "Sid\tDateCreated\tDateUpdated\tDateSent\tAccountSid\tTo\tFrom\tBody\tStatus\tDirection\tPrice\tPriceUnit\tApiVersion\tUri\n";
open(my $fh, '>', 'textfile.txt') || die("Unable to open file. $!");
print $fh $columnHeadings;
foreach my $k (@split) {
print $fh join("\t", $xsltToString) . "\t";
}
#print $fh split("\t", $val). "\t"; ;
close($fh);
$xslt->dispose();
# P.S. I'm sure there's a better way to check and see how many lines were saved.
my $xmllines = 0;
open $fh, '<', 'textfile.txt' or die "Could not open file. $!";
while (<$fh>) {
$xmllines++;
}
print ("\n" . $xmllines . " lines saved to tab-delimited logs textfile. \n");
close $fh;
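My output comes out as one single run of text, with no separation between any of the elements.
Answer 0 (score: 2)
I think XSLT is the wrong tool for this problem: it is great for XML→XML transformations, but far too verbose for an XML→CSV conversion. Instead of applying an XSLT stylesheet, we can use Perl's XML::LibXML module (or something similar) to parse the XML and apply XPath queries, and use Text::CSV to write the data to a file.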
use strict; use warnings;
use autodie;
use XML::LibXML;
use Text::CSV;
# Parse the XML
my $xml = XML::LibXML->load_xml(string => ...);
# Prepare the CSV
open my $csv_fh, ">:utf8", "textfile.csv";
my $csv = Text::CSV->new({
binary => 1,
eol => "\n",
# sep_char => "\t", # for tab separation. Default is comma
# quote_space => 0, # makes tab-separated data look better.
});
my @columns = qw/
Sid
DateCreated DateUpdated DateSent
AccountSid
To From Body
Status
Direction
Price PriceUnit
ApiVersion
Uri
/;
$csv->print($csv_fh, \@columns); # print the header
# loop through all messages. Note that `print` wants an arrayref.
for my $sms ($xml->findnodes('//SMSMessage')) {
$csv->print($csv_fh, [ map { $sms->findvalue("./$_") } @columns ]);
}
Output:
Sid,DateCreated,DateUpdated,DateSent,AccountSid,To,From,Body,Status,Direction,Price,PriceUnit,ApiVersion,Uri
SMe24eb108b7eb6a3b,"Fri, 09 Aug 2013 00:07:59 +0000","Fri, 09 Aug 2013 00:07:59 +0000","Fri, 09 Aug 2013 00:07:59 +0000",ACcbaa0103c4141e5cd754042cb424d4ff,+14444444444,+15555555555,"Hi there!",sent,outbound-api,-0.01000,USD,2010-04-01,/2010-04-01/Accounts/ACcbaa01/SMS/Messages/SMe24eb108b
,,,,,,,,,,,,,
Or, as a tab-separated version:
Sid DateCreated DateUpdated DateSent AccountSid To From Body Status Direction Price PriceUnit ApiVersion Uri
SMe24eb108b7eb6a3b Fri, 09 Aug 2013 00:07:59 +0000 Fri, 09 Aug 2013 00:07:59 +0000 Fri, 09 Aug 2013 00:07:59 +0000 ACcbaa0103c4141e5cd754042cb424d4ff +14444444444 +15555555555 Hi there! sent outbound-api -0.01000 USD 2010-04-01 /2010-04-01/Accounts/ACcbaa01/SMS/Messages/SMe24eb108b
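(last line not shown)
Note that using CSV with any separator character may be a bad idea: what happens when a message contains a newline or a tab? The basic GSM 03.38 charset contains at least the LF and CR characters.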
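To make that concern concrete, here is a minimal sketch (the message body is invented for illustration) comparing a naive join/split on tabs with Text::CSV, which quotes fields that contain the separator or a line break. It assumes Text::CSV is installed and uses the same binary and sep_char options as above.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical row: the second field contains both a tab and a line feed.
my @row = ('SMe24eb108b7eb6a3b', "Hi\tthere!\nSecond line", 'sent');

# Naive approach: the embedded tab is indistinguishable from a separator.
my $naive        = join "\t", @row;
my @naive_fields = split /\t/, $naive;

# Text::CSV approach: binary => 1 permits embedded newlines, and fields
# containing the separator or a newline are quoted on output.
my $csv = Text::CSV->new({ binary => 1, sep_char => "\t" })
    or die "Cannot create Text::CSV object: " . Text::CSV->error_diag;

$csv->combine(@row) or die "combine failed: " . $csv->error_input;
my $line = $csv->string;

# Parsing the encoded line gives back the original fields.
$csv->parse($line) or die "parse failed: " . $csv->error_input;
my @back = $csv->fields;

print "naive split: ", scalar(@naive_fields), " fields\n";
print "Text::CSV:   ", scalar(@back), " fields\n";
```

The quoting only helps if whatever reads the file also understands CSV quoting; a consumer that splits blindly on tabs and newlines will still be confused by such a message.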
\ is the reference operator, so \@columns is an array reference that points to the @columns array.
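As a small illustration of what "points to" means here, the following sketch (with a shortened, made-up column list) shows that a reference does not copy the array: changes made through the reference are visible in the original.

```perl
use strict;
use warnings;

my @columns = qw(Sid To From);

# \@columns does not copy the array; it yields a scalar that refers to it.
my $ref = \@columns;

# Modifying through the reference changes @columns itself.
push @$ref, 'Body';

my $count = scalar @columns;   # now 4 elements
my $first = $ref->[0];         # element access through the reference

print "$count columns, first is $first\n";
```

This is why a whole array can be handed to $csv->print as a single argument: the method receives one scalar (the reference) rather than a flattened list.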
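The map function takes a block of code and a list. Like a foreach loop, it executes the block for each value in the list, and in each iteration the $_ variable is set to the current element. Unlike a foreach loop, map returns a list of values, which makes it well suited to transformations. For example, doubling some numbers: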
my @doubles = map { $_ * 2 } 1 .. 5; #=> 2, 4, 6, 8, 10
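The findvalue method of a DOM node applies an XPath expression in the context of that node and returns the text value of the elements found. The XPath expression ./foo is equivalent to foo and searches for child elements named foo. We use the $_ variable to supply the column name (which is also the tag name). So the map expression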
map { $sms->findvalue("./$_") } @columns
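turns the list of columns into a list of text values. I used the form ./foo for the XPath expression because I think it better conveys "give me the direct child (/) with the tag name foo of this SMS (.)", especially to anyone used to file-path notation.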
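A short sketch of that behaviour, assuming XML::LibXML is installed and using a cut-down, made-up message: both XPath forms give the same value, and a missing child yields an empty string, which is where the row of bare commas in the output above comes from.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample document (not the full Twilio response).
my $doc = XML::LibXML->load_xml(string => <<'XML');
<SMSMessages>
  <SMSMessage><To>+14444444444</To><Body>Hi there!</Body></SMSMessage>
</SMSMessages>
XML

my ($sms) = $doc->findnodes('//SMSMessage');

# Both expressions are evaluated relative to $sms and select the To child.
my $explicit = $sms->findvalue('./To');
my $bare     = $sms->findvalue('To');

# A child that does not exist yields an empty string, not an error.
my $missing = $sms->findvalue('./Price');

print "To: $explicit (same as bare form: ",
      ($explicit eq $bare ? 'yes' : 'no'), ")\n";
print "Price is empty\n" if $missing eq '';
```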
The [ ... ] operator creates an array reference from a list. For example, [1, 2, 3] is equivalent to
my @temp = (1, 2, 3);
\@temp;
(note the \ operator again).
Answer 1 (score: 0)
Here is an example using XML::Twig:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
use XML::Twig;
run({
csv => Text::CSV->new({
always_quote => 1,
binary => 1,
}),
in_fh => \*DATA,
out_fh => \*STDOUT,
wanted_fields => [
qw(
Sid
DateCreated
DateUpdated
DateSent
AccountSid
To
From
Body
Status
Direction
Price
PriceUnit
ApiVersion
Uri
)
],
});
sub run {
my $args = shift;
my $twig = XML::Twig->new(
twig_roots => {
SMSMessage => sub { print_csv($args, @_) },
}
);
$twig->parse($args->{in_fh});
}
sub print_csv {
my $args = shift;
my $twig = shift;
my $elt = shift;
my %fields = map { $_->name, $_->text } $elt->children;
my $csv = $args->{csv};
my $wanted = $args->{wanted_fields};
# Hash slice: pick the wanted fields, in the wanted order.
$csv->combine(@fields{ @$wanted });
print { $args->{out_fh} } $csv->string, "\n";
$twig->purge;
return;
}
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<TwilioResponse>
<SMSMessages end="49" firstpageuri="/2010-04-01/Accounts/ACcbaa0/SMS/Messages?Page=0&amp;PageSize=50" lastpageuri="/2010-04-01/Accounts/ACcbaa/SMS/Messages?Page=54&amp;PageSize=50" nextpageuri="/2010-04-01/Accounts/ACcbaa0103c/SMS/Messages?Page=1&amp;PageSize=50&amp;AfterSid=SMc20cf7" numpages="55" page="0" pagesize="50" previouspageuri="" start="0" total="2703" uri="/2010-04-01/Accounts/ACcbaa0103cf/SMS/Messages">
<SMSMessage>
<Sid>SMe24eb108b7eb6a3b</Sid>
<DateCreated>Fri, 09 Aug 2013 00:07:59 +0000</DateCreated>
<DateUpdated>Fri, 09 Aug 2013 00:07:59 +0000</DateUpdated>
<DateSent>Fri, 09 Aug 2013 00:07:59 +0000</DateSent>
<AccountSid>ACcbaa0103c4141e5cd754042cb424d4ff</AccountSid>
<To>+14444444444</To>
<From>+15555555555</From>
<Body>Hi there!</Body>
<Status>sent</Status>
<Direction>outbound-api</Direction>
<Price>-0.01000</Price>
<PriceUnit>USD</PriceUnit>
<ApiVersion>2010-04-01</ApiVersion>
<Uri>/2010-04-01/Accounts/ACcbaa01/SMS/Messages/SMe24eb108b</Uri>
</SMSMessage>
<SMSMessage>
... etc. ...
</SMSMessage>
</SMSMessages>
</TwilioResponse>
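The @fields{ ... } expression in print_csv is a hash slice, which may be unfamiliar. The following sketch uses only core Perl and a hand-written hash (standing in for the %fields built from one SMSMessage) to show how a slice returns values in the order of the requested keys and gives undef for keys that are absent.

```perl
use strict;
use warnings;

# Hypothetical parsed message: tag name => text, like %fields in print_csv.
my %fields = (
    Sid    => 'SMe24eb108b7eb6a3b',
    To     => '+14444444444',
    From   => '+15555555555',
    Status => 'sent',
);

# Desired column order, independent of the hash's internal ordering.
my $wanted = [qw(Sid From To Price)];

# A hash slice returns the values for several keys at once, in key-list order.
# Keys that are absent (Price here) yield undef rather than an error.
my @row = @fields{ @$wanted };

# Replace undef with an empty string before joining, to avoid warnings.
my $line = join "\t", map { defined $_ ? $_ : '' } @row;

print "$line\n";
```

Because the order comes from the wanted list rather than from the XML, the columns stay aligned even if a message omits an element or lists its children in a different order.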