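I want to parse a very long XML file and strip out everything except the subclass code and the city, so that I am left with something like the example below.
TEST SUBCLASS | MIAMI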
<?xml version="1.0" standalone="no"?>
<web-export>
<run-date>06/01/2010
<pub-code>TEST
<ad-type>TEST
<cat-code>Real Estate</cat-code>
<class-code>TEST</class-code>
<subclass-code>TEST SUBCLASS</subclass-code>
<placement-description></placement-description>
<position-description>Town House</position-description>
<subclass3-code></subclass3-code>
<subclass4-code></subclass4-code>
<ad-number>0000284708-01</ad-number>
<start-date>05/28/2010</start-date>
<end-date>06/09/2010</end-date>
<line-count>6</line-count>
<run-count>13</run-count>
<customer-type>Private Party</customer-type>
<account-number>100099237</account-number>
<account-name>DOE, JOHN</account-name>
<addr-1>207 CLARENCE STREET</addr-1>
<addr-2> </addr-2>
<city>MIAMI</city>
<state>FL</state>
<postal-code>02910</postal-code>
<country>USA</country>
<phone-number>4014612880</phone-number>
<fax-number></fax-number>
<url-addr> </url-addr>
<email-addr>noemail@ttest.com</email-addr>
<pay-flag>N</pay-flag>
<ad-description>DEANESTATES2BEDS2BATHSAPPLIANCED</ad-description>
<order-source>Import</order-source>
<order-status>Live</order-status>
<payor-acct>100099237</payor-acct>
<agency-flag>N</agency-flag>
<rate-note></rate-note>
<ad-content> MIAMI/Dean Estates: 2
beds, 2 baths. Applianced. Central air. Carpets. Laundry. 2 decks. Pool. Parking. Close to everything.No smoking. No utilities. $1275 mo. 401-578-1501. </ad-content>
</ad-type>
</pub-code>
</run-date>
</web-export>
So what I do is open the existing file, read its contents, and then use regular expressions to remove the unneeded XML tags.
open(READFILE, "FILENAME");
while(<READFILE>)
{
$_ =~ s/<\?xml version="(.*)" standalone="(.*)"\?>\n.*//g;
$_ =~ s/<subclass-code>//g;
$_ =~ s/<\/subclass-code>\n.*/|/g;
$_ =~ s/(.*)PJ RER Houses /PJ RER Houses/g;
$_ =~ s/\G //g;
$_ =~ s/<city>//g;
$_ =~ s/<\/city>\n.*//g;
$_ =~ s/<(\/?)web-export>(.*)\n.*//g;
$_ =~ s/<(\/?)run-date>(.*)\n.*//g;
$_ =~ s/<(\/?)pub-code>(.*)\n.*//g;
$_ =~ s/<(\/?)ad-type>(.*)\n.*//g;
$_ =~ s/<(\/?)cat-code>(.*)<(\/?)cat-code>\n.*//g;
$_ =~ s/<(\/?)class-code>(.*)<(\/?)class-code>\n.*//g;
$_ =~ s/<(\/?)placement-description>(.*)<(\/?)placement-description>\n.*//g;
$_ =~ s/<(\/?)position-description>(.*)<(\/?)position-description>\n.*//g;
$_ =~ s/<(\/?)subclass3-code>(.*)<(\/?)subclass3-code>\n.*//g;
$_ =~ s/<(\/?)subclass4-code>(.*)<(\/?)subclass4-code>\n.*//g;
$_ =~ s/<(\/?)ad-number>(.*)<(\/?)ad-number>\n.*//g;
$_ =~ s/<(\/?)start-date>(.*)<(\/?)start-date>\n.*//g;
$_ =~ s/<(\/?)end-date>(.*)<(\/?)end-date>\n.*//g;
$_ =~ s/<(\/?)line-count>(.*)<(\/?)line-count>\n.*//g;
$_ =~ s/<(\/?)run-count>(.*)<(\/?)run-count>\n.*//g;
$_ =~ s/<(\/?)customer-type>(.*)<(\/?)customer-type>\n.*//g;
$_ =~ s/<(\/?)account-number>(.*)<(\/?)account-number>\n.*//g;
$_ =~ s/<(\/?)account-name>(.*)<(\/?)account-name>\n.*//g;
$_ =~ s/<(\/?)addr-1>(.*)<(\/?)addr-1>\n.*//g;
$_ =~ s/<(\/?)addr-2>(.*)<(\/?)addr-2>\n.*//g;
$_ =~ s/<(\/?)state>(.*)<(\/?)state>\n.*//g;
$_ =~ s/<(\/?)postal-code>(.*)<(\/?)postal-code>\n.*//g;
$_ =~ s/<(\/?)country>(.*)<(\/?)country>\n.*//g;
$_ =~ s/<(\/?)phone-number>(.*)<(\/?)phone-number>\n.*//g;
$_ =~ s/<(\/?)fax-number>(.*)<(\/?)fax-number>\n.*//g;
$_ =~ s/<(\/?)url-addr>(.*)<(\/?)url-addr>\n.*//g;
$_ =~ s/<(\/?)email-addr>(.*)<(\/?)email-addr>\n.*//g;
$_ =~ s/<(\/?)pay-flag>(.*)<(\/?)pay-flag>\n.*//g;
$_ =~ s/<(\/?)ad-description>(.*)<(\/?)ad-description>\n.*//g;
$_ =~ s/<(\/?)order-source>(.*)<(\/?)order-source>\n.*//g;
$_ =~ s/<(\/?)order-status>(.*)<(\/?)order-status>\n.*//g;
$_ =~ s/<(\/?)payor-acct>(.*)<(\/?)payor-acct>\n.*//g;
$_ =~ s/<(\/?)agency-flag>(.*)<(\/?)agency-flag>\n.*//g;
$_ =~ s/<(\/?)rate-note>(.*)<(\/?)rate-note>\n.*//g;
$_ =~ s/<ad-content>(.*)\n.*//g;
$_ =~ s/\t(.*)\n.*//g;
$_ =~ s/<\/ad-content>(.*)\n.*//g;
}
close( READFILE );
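Is there an easier way? I don't want to use any modules. I know a module would probably make this easier, but the file I'm reading contains a large amount of data.
Answer 0 (score: 12)
This is horrible (sorry). Even with a large amount of data, regular expressions are not necessarily faster.
Why not use XSLT?
Your stylesheet would basically look like this (if you have only one subclass-code and one city element):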
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
<xsl:apply-templates select="//subclass-code|//city" />
</xsl:template>
<xsl:template match="subclass-code">
<xsl:value-of select="." /><xsl:text> | </xsl:text>
</xsl:template>
<xsl:template match="city">
<xsl:value-of select="." /><xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:stylesheet>
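(Updated the code to work with multiple elements. Probably not the best solution ;))
Answer 1 (score: 7)
If someone has already written an efficient (and, dare I say, feature-rich) module such as XML::Twig for parsing XML, why not use the library?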
use strict;
use warnings;
use XML::Twig;

die "Usage: give-me-the-elements.pl <xml_file>\n" unless ($ARGV[0]);

# Inside a handler, $_ is set to the element that was just parsed.
my $twig = XML::Twig->new( twig_handlers =>
    { 'subclass-code' => sub { print $_->text, "|"; },
      'city'          => sub { print $_->text, "\n"; },
    },
    pretty_print => 'indented');
$twig->parsefile($ARGV[0]);
$twig->purge;
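Answer 2 (score: 5)
If you need a general approach to parsing XML, don't use regular expressions. If you only need what you describe (removing everything except the subclass code and the city), and you are sure that nothing "weird" (XML entities, CDATA sections) appears inside those two tags and that the tags do not show up inside other CDATA sections and the like, you can do this: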
$/ = undef; # slurp mode: read the whole file in one go
open(my $fh, '<', "FILENAME") or die "Cannot open FILENAME: $!";
my $t = <$fh>;
close $fh;
$t =~ s#^.*<subclass-code>(.*?)</subclass-code>.*<city>(.*?)</city>.*$#$1 - $2#s;
# in case the tags can appear in the opposite order, uncomment the following
# $t =~ s#^.*<city>(.*?)</city>.*<subclass-code>(.*?)</subclass-code>.*$#$2 - $1#s;
print $t;
Edit: at the poster's request, something a bit (ahem) more robust:
while( $t =~ m#<pub-code>([^<\s]*).*?<subclass-code>(.*?)</subclass-code>.*?<city>(.*?)</city>#sg) {
print "$1 : $2 | $3 \n";
}
But please stop here and go no further; this way lies hell...
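One concrete reason for that warning: text pulled out by a regex is still raw markup, so a value such as `ST. JOHN &amp; CO` comes out with the entity intact. A minimal sketch of handling just that case, assuming only the five predefined XML entities occur (no CDATA, no numeric or custom entities); the helper name `decode_basic_entities` and the sample city are illustrative and not part of the original answer:

```perl
use strict;
use warnings;

# The five entities predefined by the XML specification.
my %ENTITY = (
    lt   => '<',
    gt   => '>',
    quot => '"',
    apos => "'",
    amp  => '&',
);

# Hypothetical helper: decode the predefined entities in a single pass, so
# that "&amp;lt;" becomes "&lt;" instead of being decoded twice into "<".
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&(lt|gt|quot|apos|amp);/$ENTITY{$1}/g;
    return $text;
}

# Made-up sample value; a regex capture returns the escaped form.
my $raw = '<city>ST. JOHN &amp; CO</city>';
if ( $raw =~ m{<city>(.*?)</city>}s ) {
    print decode_basic_entities($1), "\n";
}
```

Every further case (CDATA, character references, comments) needs another patch of this kind, which is the point at which a real parser becomes the simpler option.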
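Answer 3 (score: 5)
The simple way to do this is to use XML::Simple together with a dumper (I like XXX; most people use Data::Dumper). This loads the XML into a Perl data structure, from which you can pick out the attributes you want (or explicitly delete the ones you don't want, if you prefer).
Using the toolset I just suggested, here is a running example of what you want: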
use strict;
use warnings;
use XML::Simple;
my $data = XML::Simple::parse_fh( \*DATA );
my $sub = $data->{'run-date'}{'pub-code'}{'ad-type'};
foreach my $k ( keys %$sub ) {
delete $sub->{$k}
unless $k =~ /subclass-code|city/
;
}
use XXX;
XXX $data;
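Answer 4 (score: 1)
Take note of what the other posters have said: it is strongly recommended to stay away from regular expressions when parsing markup languages.
That said, here is a pure-Perl way to do what you need without any modules, assuming the tags above really exist: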
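# Each pattern matches the opening tag of one element we want to keep.
my $reg_subclass = '<subclass-code>';
my $reg_city     = '<city>';
open my $in,  '<', "input file"  or die "Cannot open input file: $!";
open my $out, '>', "output file" or die "Cannot open output file: $!";
while ( my $line = <$in> ) {
    # Copy only the lines that contain one of the two tags.
    if ( $line =~ /$reg_subclass|$reg_city/ ) {
        print $out $line;
    }
}
close $in;
close $out;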
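To get the `TEST SUBCLASS|MIAMI` form the question asks for, rather than whole tagged lines, the same line-by-line loop can capture the element text instead of copying the line. A minimal sketch, assuming each subclass-code and city element opens and closes on a single line (as in the sample) and that subclass-code comes before city within an ad; the function name `extract_pairs` and the in-memory sample are illustrative:

```perl
use strict;
use warnings;

# Hypothetical helper: read XML line by line from a filehandle and return
# a list of "subclass|city" strings. Only one line is held in memory at a
# time, which suits large files.
sub extract_pairs {
    my ($in) = @_;
    my @pairs;
    my $subclass;
    while ( my $line = <$in> ) {
        if ( $line =~ m{<subclass-code>(.*?)</subclass-code>} ) {
            $subclass = $1;    # remember it until the matching city shows up
        }
        elsif ( $line =~ m{<city>(.*?)</city>} ) {
            push @pairs, ( defined $subclass ? $subclass : '' ) . "|$1";
            undef $subclass;
        }
    }
    return @pairs;
}

# Demo on an in-memory sample; for a real file, open it with
# open(my $fh, '<', $path) and pass $fh instead.
my $sample = <<'XML';
<web-export>
<subclass-code>TEST SUBCLASS</subclass-code>
<subclass3-code></subclass3-code>
<city>MIAMI</city>
</web-export>
XML

open( my $fh, '<', \$sample ) or die "Cannot open in-memory sample: $!";
print "$_\n" for extract_pairs($fh);
close $fh;
```

The caveats from the other answers still apply: this relies on the layout of the sample file and is not a general XML parser.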
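Answer 5 (score: 0)
I'm no expert on Perl's XML support, but in general I think you want XPath here. (That may be what the Twig library above uses; I'm not sure.)
Pseudo-Perl example (please forgive the crudeness; it has been a while since I used Perl extensively):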
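# xml_dom_parse and xpath_evaluate are placeholders, not real functions.
# The elements are nested below web-export, hence the // (descendant) paths.
$subclassExpr = "//subclass-code/text()";
$cityExpr = "//city/text()";
$domObject = xml_dom_parse( $xml_doc );
$subClass = xpath_evaluate( $domObject, $subclassExpr );
$city = xpath_evaluate( $domObject, $cityExpr );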