Parsing an RSS feed: the description field

Time: 2012-12-12 13:54:55

Tags: perl parsing rss

I am using SimpleXml in Perl to extract the data in the <description> tag:
<description>&lt;strong&gt;CUSIP:&lt;/strong&gt; 912828UC2&lt;br /&gt;&lt;strong&gt;Term and Type:&lt;/strong&gt; 3-Year Note&lt;br /&gt;&lt;strong&gt;Offering Amount:&lt;/strong&gt; $32,000,000,000&lt;br /&gt;&lt;strong&gt;Auction Date:&lt;/strong&gt; 12/11/2012&lt;br /&gt;&lt;strong&gt;Issue Date:&lt;/strong&gt; 12/17/2012&lt;br /&gt;&lt;strong&gt;Maturity Date:&lt;/strong&gt; 12/15/2015&lt;br /&gt;&lt;a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf"&gt;PDF version of the announcement&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml"&gt;XML version of the announcement&lt;/a&gt;&lt;br /&gt;</description>

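I am now unable to extract the individual fields. For example, for the auction date I am using:

if ($desc =~ m/Auction\sDate:<\/strong>\s+(\d\d\/\d\d\/\d\d\d\d)<br/) {
    # $1 now holds the auction date
}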

But I don't feel this is robust enough. What is the standard way to extract the fields?

3 answers:

Answer 0: (score: 2)

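As Dan1111 points out in his answer, if you are already using an XML parser (XML::Simple?), you should stick with it to parse the data in the description tag. Trying to pick data out of an XML/HTML feed by hand is not a good idea; use a parser built for that purpose.

Because of the way the data in your post is formatted, I am assuming you do not have valid HTML that a parser could help you with. In that case there is no "standard" way to extract the fields, but here is how I would approach the problem: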

print "$desc\n";

my @parts = split(/;br /, $desc);
my %dates;

foreach my $part (@parts) {
  if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
    $dates{$1} = $2;
  }
}

foreach my $label (keys %dates) {
  printf "%-16s%12s\n", "${label}:", $dates{$label};
}

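Looking at the original string, I can see that there are 3 dates and several other records, so the first thing to do is split them apart. I noticed that each record in the string is delimited by the characters ';br ', so I split on that: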

my @parts = split(/;br /, $desc);
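To see what that split actually produces, here is a minimal sketch run on a shortened copy of the description string. The sample value is illustrative only, and it assumes the string is still entity-escaped as shown in the question.

```perl
use strict;
use warnings;

# Shortened, still entity-escaped sample of the <description> text (illustrative).
my $desc = '&lt;strong&gt;CUSIP:&lt;/strong&gt; 912828UC2&lt;br /&gt;'
         . '&lt;strong&gt;Auction Date:&lt;/strong&gt; 12/11/2012&lt;br /&gt;';

# ";br " occurs once inside every escaped "&lt;br /&gt;",
# so each record ends up in its own array element.
my @parts = split(/;br /, $desc);

print scalar(@parts), " parts\n";
print "[$_]\n" for @parts;
```

Because the split point lies inside the escaped `&lt;br /&gt;`, every part except the last keeps a stray `&lt` at its end, every part after the first starts with `/&gt;`, and the final part is only that leftover `/&gt;`. That debris is harmless here because the regex applied afterwards ignores it.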

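After doing this, I have an array containing each distinct piece of data from the string. Now I just need to parse each part. Since your question was about the Auction Date value, I wrote a regex that captures a date. Anticipating that the other dates might be valuable too, I modified the regex so that it also captures the label (Auction, Issue, Maturity), and I store each label/date pair in a hash (%dates):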

foreach my $part (@parts) {
  if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
    $dates{$1} = $2;
  }
}
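As a small illustration of what the two capture groups pick up, the sketch below runs the same regex over two hand-written parts, one with a date and one without. The sample parts are assumptions modelled on the split output, not data taken from a live feed.

```perl
use strict;
use warnings;

# Two illustrative parts shaped like the split output: one with a date, one without.
my @parts = (
    '/&gt;&lt;strong&gt;Auction Date:&lt;/strong&gt; 12/11/2012&lt',
    '&lt;strong&gt;CUSIP:&lt;/strong&gt; 912828UC2&lt',
);

my %dates;
for my $part (@parts) {
    # $1 is the label (e.g. "Auction Date"), $2 is the mm/dd/yyyy value.
    if ($part =~ m/(\w+\s+Date).+(\d{2}\/\d{2}\/\d{4})/) {
        $dates{$1} = $2;
    }
}

print "$_ => $dates{$_}\n" for sort keys %dates;
```

Only the first part matches, so the hash ends up with the single key "Auction Date". Parts without the word "Date" followed by a date, such as the CUSIP record, are skipped.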

Finally, I just print out my hash:

foreach my $label (keys %dates) {
  printf "%-16s%12s\n", "${label}:", $dates{$label};
}

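Does that make sense?

Answer 1: (score: 0)

What counts as more robust depends on the input you expect and on what you are looking for. However, you may find something along these lines helpful.

I used XML::Twig. XML::Simple (which I assume is what you are using now) is not recommended for new development because of its various quirks.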

use Modern::Perl;
use XML::Twig;

my $twig = XML::Twig->new();
$twig->parse(do { local $/; <DATA> });  # slurp all of DATA into one string

my %params;
my $key;
for my $child (map {$_->text} $twig->root->children)
{
    if ($child =~ /(.*):/)
    {
        $key = $1;  
    }
    else
    {
        $params{$key} = $child if (defined $key);
        undef $key;         
    }
}

say "$_ is $params{$_}" foreach (keys %params); 

__DATA__
<description><strong>CUSIP:</strong> 912828UC2<br /><strong>Term and Type:</strong> 3-Year Note<br /><strong>Offering Amount:</strong> $32,000,000,000<br /><strong>Auction Date:</strong> 12/11/2012<br /><strong>Issue Date:</strong> 12/17/2012<br /><strong>Maturity Date:</strong> 12/15/2015<br /><a href="http://www.treasurydirect.gov/instit/annceresult/press/preanre/2012/A_20121206_6.pdf">PDF version of the announcement</a><br /><a href="http://www.treasurydirect.gov/xml/A_20121206_6.xml">XML version of the announcement</a><br /></description>

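This takes any element whose text ends in a colon as a key, then assumes that the next element in the tree is the value. Obviously this makes assumptions about the kind of input you will get, but as long as all the "key" elements are wrapped in tags it should be quite robust.

An alternative would be to strip all of the tags first and then search the text for key/value pairs. You can do this with XML::Twig as well; just call $twig->root->text to get the text of the whole element. With this approach, however, working out where one value ends and the next key begins will be tricky.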
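One way to soften that boundary problem is to turn each `<br />` into a newline before dropping the remaining tags, so that every line holds one "Label: value" pair. The following is only a rough sketch using core Perl regexes. It assumes the markup has already been decoded into a plain string, and the sample data and variable names are illustrative.

```perl
use strict;
use warnings;

# Decoded sample of the description markup (shortened, illustrative).
my $html = '<strong>CUSIP:</strong> 912828UC2<br />'
         . '<strong>Term and Type:</strong> 3-Year Note<br />'
         . '<strong>Auction Date:</strong> 12/11/2012<br />';

# Turn each <br /> into a newline first so record boundaries survive,
# then drop every remaining tag.
(my $text = $html) =~ s{<br\s*/?>}{\n}gi;
$text =~ s{<[^>]+>}{}g;

my %fields;
for my $line (split /\n/, $text) {
    # Split on the first colon only, trimming surrounding whitespace.
    my ($key, $val) = $line =~ /^\s*([^:]+):\s*(.*?)\s*$/ or next;
    $fields{$key} = $val;
}

print "$_ = $fields{$_}\n" for sort keys %fields;
```

This relies on the feed keeping one field per `<br />` and on labels never containing a colon. If either assumption fails, the element-based approach above is the safer choice.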

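Answer 2: (score: 0)

The <description> elements in the RSS feed you show contain valid XHTML fragments as PCDATA. This solution extracts those elements and decodes them, then parses each one in turn to access the text of the <strong> elements and their corresponding values.

Note that each XHTML fragment contains multiple top-level elements. Because XML allows only a single root element, I wrap the fragment in a dummy <root> element in the call $twig->parse("<root>$desc</root>").

Hopefully you will be able to extrapolate from this to get the data you need.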

use strict;
use warnings;

use LWP::Simple;
use XML::Twig;

my $xml = get 'http://www.treasurydirect.gov/RI/TreasuryOfferingAnnouncements.rss';

my $twig = XML::Twig->new;
$twig->parse($xml);

for my $desc ($twig->get_xpath('/rss/channel/item/description')) {
  $desc = $desc->text;
  my $twig = XML::Twig->new;
  $twig->parse("<root>$desc</root>");
  for my $strong ($twig->get_xpath('/root/strong')) {
    my ($key, $val) = ($strong->trimmed_text, $strong->next_sibling->trimmed_text);
    $key =~ s/:$//;
    print "$key -> $val\n";
  }
  print "\n";
}

Output

CUSIP -> 912810QY7
Term and Type -> 29-Year 11-Month Bond
Offering Amount -> $13,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2042

CUSIP -> 912796DT3
Term and Type -> 3-Day Bill
Offering Amount -> $10,000,000,000
Auction Date -> 12/13/2012
Issue Date -> 12/14/2012
Maturity Date -> 12/17/2012

CUSIP -> 912828UE8
Term and Type -> 5-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/18/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2017

CUSIP -> 912828UD0
Term and Type -> 2-Year Note
Offering Amount -> $35,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2014

CUSIP -> 912796AM1
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 06/20/2013

CUSIP -> 912828UF5
Term and Type -> 7-Year Note
Offering Amount -> $29,000,000,000
Auction Date -> 12/19/2012
Issue Date -> 12/31/2012
Maturity Date -> 12/31/2019

CUSIP -> 912828SQ4
Term and Type -> 4-Year 4-Month TIPS
Offering Amount -> $14,000,000,000
Auction Date -> 12/20/2012
Issue Date -> 12/31/2012
Maturity Date -> 04/15/2017

CUSIP -> 9127957M7
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/17/2012
Issue Date -> 12/20/2012
Maturity Date -> 03/21/2013

CUSIP -> 912828TY6
Term and Type -> 9-Year 11-Month Note
Offering Amount -> $21,000,000,000
Auction Date -> 12/12/2012
Issue Date -> 12/17/2012
Maturity Date -> 11/15/2022

CUSIP -> 912828UC2
Term and Type -> 3-Year Note
Offering Amount -> $32,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/17/2012
Maturity Date -> 12/15/2015

CUSIP -> 912796AK5
Term and Type -> 52-Week Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 12/12/2013

CUSIP -> 9127955V9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/11/2012
Issue Date -> 12/13/2012
Maturity Date -> 01/10/2013

CUSIP -> 912796AL3
Term and Type -> 26-Week Bill
Offering Amount -> $28,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 06/13/2013

CUSIP -> 9127957L9
Term and Type -> 13-Week Bill
Offering Amount -> $32,000,000,000
Auction Date -> 12/10/2012
Issue Date -> 12/13/2012
Maturity Date -> 03/14/2013

CUSIP -> 912796DT3
Term and Type -> 11-Day Bill
Offering Amount -> $25,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 12/17/2012

CUSIP -> 9127956Z9
Term and Type -> 4-Week Bill
Offering Amount -> $40,000,000,000
Auction Date -> 12/04/2012
Issue Date -> 12/06/2012
Maturity Date -> 01/03/2013
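
To try the dummy-root technique without fetching the live feed, a minimal offline sketch might look like the following. It uses only XML::Twig calls that already appear in the code above; the fragment and the %fields hash are illustrative assumptions, and collecting into a hash instead of printing directly is a choice made for this sketch.

```perl
use strict;
use warnings;
use XML::Twig;

# Decoded fragment with several top-level nodes and no single root (illustrative).
my $desc = '<strong>CUSIP:</strong> 912828UC2<br />'
         . '<strong>Auction Date:</strong> 12/11/2012<br />';

my $twig = XML::Twig->new;
$twig->parse("<root>$desc</root>");   # the dummy root makes the fragment well-formed

my %fields;
for my $strong ($twig->get_xpath('/root/strong')) {
    my $value = $strong->next_sibling;    # the text node following </strong>
    next unless defined $value;           # skip a label that has no value after it
    (my $key = $strong->trimmed_text) =~ s/:$//;
    $fields{$key} = $value->trimmed_text;
}

print "$_ -> $fields{$_}\n" for sort keys %fields;
```

Keeping the pairs in a hash makes it easy to look up a single field such as $fields{'Auction Date'} per item, at the cost of losing the original field order.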