I am using SimpleXml in Perl to extract the data in the following tag:
<description><strong>CUSIP:</strong> 912828UC2<br /><strong>Term and Type:</strong> 3-Year Note<br /><strong>Offering Amount:</strong> $32,000,000,000<br /><strong>Auction Date:</strong> 12/11/2012<br /><strong>Issue Date:</strong> 12/17/2012<br /><strong>Maturity Date:</strong> 12/15/2015<br /><a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf">PDF version of the announcement</a><br /><a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml">XML version of the announcement</a><br /></description>
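I am not able to extract the individual fields from it. For example, for the auction date I use
if ($desc =~ m/Auction\sDate:<\/strong>\s+(\d\d\/\d\d\/\d\d\d\d)<br/) {}
but I don't feel it is robust enough. What is the standard way to extract the fields?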
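Answer 0 (score: 2)
As Dan1111 points out in his answer, if you are already using an XML parser (Simple::XML?), you should stick with it to parse the data in the description tag. Trying to pick data out of an XML/HTML feed by hand is not a good idea; use a parser built for that purpose.
Because of how the data in your post is formatted, I assume you do not have valid HTML that a parser could help with. In that case there is no "standard" way to extract the fields, but here is how I would approach the problem: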
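print "$desc\n";

# Break the string into one record per field.
my @parts = split(/;br /, $desc);

# Collect every "<label> Date" / MM/DD/YYYY pair.
my %dates;
foreach my $part (@parts) {
    if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
        $dates{$1} = $2;
    }
}

# Print the label-date pairs in two aligned columns.
foreach my $label (keys %dates) {
    printf "%-16s%12s\n", "${label}:", $dates{$label};
}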
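Looking at the raw string, I can see that there are 3 dates and several other records, so the first thing to do is split them. I noticed that each record in the string is separated by the characters ';br ', so I split on that (this only matches when $desc holds the entity-escaped text, where each line break reads &lt;br /&gt;; the literal <br /> form contains no semicolon):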
my @parts = split(/;br /, $desc);
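After doing this, I have an array containing each distinct piece of data from the string. Now I just need to parse each part. Since your question is interested in the Auction Date value, I wrote a regular expression that captures the date. Anticipating that the other dates might be of value too, I modified my regex so that it also captures the label (Auction, Issue, Maturity) and stores each label-date pair in a hash (%dates):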
foreach my $part (@parts) {
    if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
        $dates{$1} = $2;
    }
}
Finally, I just print out my hash:
foreach my $label (keys %dates) {
    printf "%-16s%12s\n", "${label}:", $dates{$label};
}
Does that make sense?
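For anyone who wants to try the split-and-match steps above without the live feed, here is a minimal self-contained sketch. The sample string is a hypothetical, shortened description in the entity-escaped form, and the split pattern is widened (an assumption on my part, not the answer's original pattern) so that it accepts either the escaped or the literal break.

```perl
use strict;
use warnings;

# Hypothetical sample: a shortened description in entity-escaped form.
my $desc = '&lt;strong&gt;Auction Date:&lt;/strong&gt; 12/11/2012&lt;br /&gt;'
         . '&lt;strong&gt;Issue Date:&lt;/strong&gt; 12/17/2012&lt;br /&gt;'
         . '&lt;strong&gt;Maturity Date:&lt;/strong&gt; 12/15/2015&lt;br /&gt;';

# Split on the line break, whether it is escaped (&lt;br /&gt;) or literal (<br />).
my @parts = split m{(?:&lt;|<)br\s*/?(?:&gt;|>)}, $desc;

my %dates;
for my $part (@parts) {
    # Capture the label (e.g. "Auction Date") and the MM/DD/YYYY value.
    if ($part =~ m{(\w+\s+Date).+?(\d{2}/\d{2}/\d{4})}) {
        $dates{$1} = $2;
    }
}

# Sorted so the output order is stable from run to run.
for my $label (sort keys %dates) {
    printf "%-16s%12s\n", "${label}:", $dates{$label};
}
```

Sorting the keys is only there because hash order in Perl is not guaranteed; drop it if the order does not matter.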
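Answer 1 (score: 0)
What counts as more robust depends on your expected input and on what you are looking for. However, you might find something like this helpful.
I used XML::Twig. Because of its various quirks, XML::Simple (which I assume is what you are using now) is not recommended for new development.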
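use Modern::Perl;
use XML::Twig;

my $twig = XML::Twig->new();
$twig->parse(<DATA>);

my %params;
my $key;
# Walk the children of <description> (elements and text nodes) in document order.
for my $child (map {$_->text} $twig->root->children)
{
    if ($child =~ /(.*):/)
    {
        # Text containing a colon is treated as a key; remember it.
        $key = $1;
    }
    else
    {
        # Anything else becomes the value of the pending key, if there is one.
        $params{$key} = $child if (defined $key);
        undef $key;
    }
}
say "$_ is $params{$_}" foreach (keys %params);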
__DATA__
<description><strong>CUSIP:</strong> 912828UC2<br /><strong>Term and Type:</strong> 3-Year Note<br /><strong>Offering Amount:</strong> $32,000,000,000<br /><strong>Auction Date:</strong> 12/11/2012<br /><strong>Issue Date:</strong> 12/17/2012<br /><strong>Maturity Date:</strong> 12/15/2015<br /><a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf">PDF version of the announcement</a><br /><a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml">XML version of the announcement</a><br /></description>
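This treats any element whose text ends with a colon as a key, and then assumes that the next element in the tree is its value. Obviously this makes assumptions about the kind of input you will get, but as long as all of the "key" elements are wrapped in tags, it should be fairly robust.
Another approach would be to strip all of the tags first and then search the text for key-value pairs. You can do this with XML::Twig too; just call $twig->root->text to get the text of the whole element. With this approach, however, it will be tricky to work out where one value ends and the next key begins.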
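To illustrate the strip-the-tags idea, here is a rough sketch in plain Perl. It sidesteps the boundary problem by turning each line break tag into a newline before the other tags are removed. The sample string and the example.com link are hypothetical, and stripping tags with a regex is itself only safe for simple, predictable markup like this, so treat it as an illustration rather than a recommendation.

```perl
use strict;
use warnings;

# Hypothetical sample fragment in literal (already decoded) form.
my $desc = '<strong>CUSIP:</strong> 912828UC2<br />'
         . '<strong>Term and Type:</strong> 3-Year Note<br />'
         . '<strong>Auction Date:</strong> 12/11/2012<br />'
         . '<a href="http://example.com/a.pdf">PDF version of the announcement</a><br />';

(my $text = $desc) =~ s{<br\s*/?>}{\n}g;   # keep record boundaries as newlines
$text =~ s{<[^>]+>}{}g;                     # drop every remaining tag

my %field;
for my $line (split /\n/, $text) {
    # Expect "Label: value" on one line; lines without a colon are skipped.
    next unless $line =~ /^\s*([^:]+):\s*(.*\S)/;
    $field{$1} = $2;
}

print "$_ => $field{$_}\n" for sort keys %field;
```

The link text has no colon once its tag is gone, so it is ignored; a value that itself contained a line break would be cut short by this scheme.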
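Answer 2 (score: 0)
The <description> elements in the RSS feed you show contain valid XHTML fragments as PCDATA. This solution extracts those elements and decodes them, then parses each one in turn to access the text of the <strong> elements and their corresponding values.
Note that each XHTML fragment contains multiple elements, and because XHTML allows only a single root element, I have wrapped it in a dummy <root> element with $twig->parse("<root>$desc</root>").
Hopefully you will be able to extrapolate from this to get the data you need.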
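use strict;
use warnings;
use LWP::Simple;
use XML::Twig;

# get() returns undef on failure, so stop early rather than parse nothing.
my $xml = get('http://www.treasurydirect.gov/RI/TreasuryOfferingAnnouncements.rss')
    or die "Could not fetch the feed\n";
my $twig = XML::Twig->new;
$twig->parse($xml);

for my $desc ($twig->get_xpath('/rss/channel/item/description')) {
    # text() returns the decoded content, i.e. the XHTML fragment as a string.
    $desc = $desc->text;
    # Wrap the fragment in a dummy root so it parses as a single document.
    my $twig = XML::Twig->new;
    $twig->parse("<root>$desc</root>");
    for my $strong ($twig->get_xpath('/root/strong')) {
        # The label is the <strong> text; the value is the text node right after it.
        my ($key, $val) = ($strong->trimmed_text, $strong->next_sibling->trimmed_text);
        $key =~ s/:$//;
        print "$key -> $val\n";
    }
    print "\n";
}
Output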
CUSIP -> 912810QY7
Term and Type -> 29-Year 11-Month Bond
Offering Amount -> $13,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2042
CUSIP -> 912796DT3
Term and Type -> 3-Day Bill
Offering Amount -> $10,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/14/2012
Maturity Date -> 12/17/2012
CUSIP -> 912828UE8
Term and Type -> 5-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/18/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2017
CUSIP -> 912828UD0
Term and Type -> 2-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2014
CUSIP -> 912796AM1
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 06/20/2013
CUSIP -> 912828UF5
Term and Type -> 7-Year Note
Offering Amount -> $29,000,000,000
Auction Date -> 12/19/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2019
CUSIP -> 912828SQ4
Term and Type -> 4-Year 4-Month TIPS
Offering Amount -> $14,000,000,000
Auction Date -> 12/20/2012
Issue Date -> 12/31/2012
Maturity Date -> 04/15/2017
CUSIP -> 9127957M7
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 03/21/2013
CUSIP -> 912828TY6
Term and Type -> 9-Year 11-Month Note
Offering Amount -> $21,000,000,000
Auction Date -> 12/12/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2022
CUSIP -> 912828UC2
Term and Type -> 3-Year Note
Offering Amount -> $32,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/17/2012
Maturity Date -> 12/15/2015
CUSIP -> 912796AK5
Term and Type -> 52-Week Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 12/12/2013
CUSIP -> 9127955V9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 01/10/2013
CUSIP -> 912796AL3
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 06/13/2013
CUSIP -> 9127957L9
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 03/14/2013
CUSIP -> 912796DT3
Term and Type -> 11-Day Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 12/17/2012
CUSIP -> 9127956Z9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 01/03/2013
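If you want to look up a single field rather than print everything, the dummy-root technique from Answer 2 can be tried offline along these lines. This is only a sketch: the sample fragment is a hypothetical, shortened and already-decoded description, and it assumes XML::Twig is installed. The %field hash is my own addition, not part of the original answer.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: a shortened, already-decoded description fragment.
my $desc = '<strong>CUSIP:</strong> 912828UC2<br />'
         . '<strong>Auction Date:</strong> 12/11/2012<br />'
         . '<strong>Issue Date:</strong> 12/17/2012<br />';

my $twig = XML::Twig->new;
$twig->parse("<root>$desc</root>");   # the dummy root makes the fragment well-formed

my %field;
for my $strong ($twig->get_xpath('/root/strong')) {
    my $value = $strong->next_sibling;
    next unless defined $value;        # skip a label with nothing after it
    (my $key = $strong->trimmed_text) =~ s/:$//;
    $field{$key} = $value->trimmed_text;
}

print "Auction Date: $field{'Auction Date'}\n";
```

Keeping the pairs in a hash makes it easy to ask for one field by name, at the cost of losing the document order.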