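Using Perl to extract Yahoo Finance corporate bond data into MySQL

Time: 2014-01-18 23:10:31

Tags: mysql perl text-extraction

I want to extract the profile information for every row listed in the table below, across all of its pages: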

http://reports.finance.yahoo.com/z1?b=1&so=a&sf=m&tc=1&stt=-&pr=0&cpl=-1&cpu=-1&yl=-1&yu=-1&ytl=-1&ytu=-1&mtl=-1&mtu=-1&rl=5&ru=-1&cll=0

Here is an example of one of the links from a row in the table (they are all in the "Issue" column):

http://reports.finance.yahoo.com/z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000

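I want to store all the information contained in each issue, for all rows and pages, in a MySQL database. I think Perl would be a good tool for this, but my experience with it is very limited.

I think I need to collect all the links in the Issue column across all pages of the table (more than 2,600 pages at the time) and then somehow extract the information from each linked page.

Any help is greatly appreciated.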

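2 answers:

Answer 0: (score: 0)

This will get you started and show you the general technique for doing this with regular expressions (it may be hard to follow if you are not familiar with Perl and regex matching).

I only did this for the first page, but I added as many comments to the code as I could to help you understand it. If you cannot work out what this code actually does, I suggest trying a different tool (or modules such as Web::Scraper or Mojo::DOM). If you really want to do this work in Perl, read some of the Perl documentation.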

http://perldoc.perl.org/perlre.html

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use feature 'say';

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $page_content = get($start_url);
die "Oops, something went wrong!" unless defined $page_content;

process_bond_results_page($page_content);

sub process_bond_results_page {
    my $content = shift;
    # iterates over $content as long as the /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g regex matches
    # puts the row content (the text between <tr ...> and </tr>) in the special $1 variable
    while($content =~ /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g) {
        # uncomment line below to see what $1 contains
        # say $1;

        # clean up HTML tags that are not needed
        my $tr_data = cleanup_html_tags($1);

        # match content in between <td> & </td> tags and put it on the @tds list
        my (@tds) = $tr_data =~ /<td>(.*?)<\/td>/g;

        # 2nd element of @tds list contains <a href="link_to_issue">ISSUE NAME</a> text
        # Line below extracts link_to_issue and $issue_name and assigns them to respective variables
        my ($link_to_issue, $issue_name) = $tds[1] =~ /<a[^>]*?href=\"([^\"]*?)\"[^>]*?>(.+?)<\/a>/g;

        # Replace 2nd element of list that contains data like <a href="link_to_issue">ISSUE NAME</a>
        # with just ISSUE NAME
        $tds[1] = $issue_name;

        # Append $link_to_issue at the end of @tds list
        push(@tds,$link_to_issue);

        # Print @tds array with values separated by TABs
        say join("\t", @tds);
    }

    # Does it have a Next link?
    my ($next_link) = $content =~ /<a[^>]*?href=\"([^\"]+?)\">Next<\/a><\/b>/g;
    say 'NEXT: ' . $next_link if $next_link;

    return;
}

sub cleanup_html_tags {
    my $html = shift;
    $html =~ s/<\/?(font|div)[^>]*?>//g; # remove <font...>, <div...>, </font>, </div>
    $html =~ s/<td[^>]*?>/<td>/g;        # replace all <td...> with just <td>
    $html =~ s/<\/?nobr>//g;             # remove <nobr> and </nobr>
    return $html;
}

The above prints:

Corp    MERRILL LYNCH CO INC MTN BE 100.63  5.000    3-Feb-2014 -19.649 4.969   A   No  /z2?ce=5314754150501796218050&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CME GROUP INC   100.84  5.750   15-Feb-2014 -8.334  5.702   AA  No  /z2?ce=5715449144561716016149&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CAPITAL ONE BK MTN BE   100.80  5.125   15-Feb-2014 -8.334  5.084   A   No  /z2?ce=5715254147581635317455&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HESS CORP   100.92  7.000   15-Feb-2014 -8.351  6.937   BBB No  /z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    PACCAR INC  100.90  6.875   15-Feb-2014 -8.295  6.813   A   No  /z2?ce=5214751144551836016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    WACHOVIA CORP NEW   100.78  4.875   15-Feb-2014 -8.337  4.837   A   No  /z2?ce=4915445142581546016054&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    CATERPILLAR FINL SVCS MTNS BE   100.89  6.125   17-Feb-2014 -7.597  6.071   A   No  /z2?ce=5715245150561764615951&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    KRAFT FOODS INC 100.97  6.750   19-Feb-2014 -6.921  6.685   BBB No  /z2?ce=5315654144531746017754&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    WESTERN UN CO   101.05  6.500   26-Feb-2014 -5.154  6.432   BBB No  /z2?ce=4915145143581556015548&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    AMERICA MOVIL SAB DE CV 101.06  5.500    1-Mar-2014 -4.615  5.443   A   No  /z2?ce=5815451145541816015954&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HARTFORD FINL SVCS GROUP INC    100.96  4.750    1-Mar-2014 -4.454  4.705   BBB No  /z2?ce=5415548146571526017250&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HEWLETT PACKARD CO  101.12  6.125    1-Mar-2014 -4.599  6.057   BBB No  /z2?ce=5415446149551516016556&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    RYDER SYS MTN BE    101.08  5.850    1-Mar-2014 -4.495  5.788   BBB No  /z2?ce=5114851146531605117352&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    HSBC FIN CORP HSBC FIN  100.72  2.000   15-Mar-2014 -3.011  1.986   A   No  /z2?ce=5415650149491807117451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp    SYSCO CORP  101.06  4.600   15-Mar-2014 -2.772  4.552   A   No  /z2?ce=5014953143561486015756&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
NEXT: z1?b=2&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000
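
To get these rows into MySQL, as the question asks, one option is DBI with a prepared statement. The following is only a minimal sketch: the column names are guesses inferred from the sample output above, and the `bonds` table, the `finance` database, and the credentials are hypothetical placeholders rather than anything defined in this thread.

```perl
use strict;
use warnings;
use DBI;

# Column names are assumptions inferred from the sample output above.
my @columns = qw(type issue price coupon maturity ytm current_yield rating callable link);

# Turn one tab-separated output line into a hash ref keyed by column name.
# Returns undef (empty list in list context) if the field count does not match.
sub row_to_hash {
    my ($line) = @_;
    my @fields = split /\t/, $line, -1;
    s/^\s+|\s+$//g for @fields;          # trim padding such as " 3-Feb-2014"
    return unless @fields == @columns;
    my %row;
    @row{@columns} = @fields;
    return \%row;
}

# Insert rows through a prepared statement; the table name `bonds` is hypothetical.
sub store_rows {
    my ($dbh, @rows) = @_;
    my $sql = sprintf 'INSERT INTO bonds (%s) VALUES (%s)',
        join(', ', map { "`$_`" } @columns),
        join(', ', ('?') x @columns);
    my $sth = $dbh->prepare($sql);
    $sth->execute(@{$_}{@columns}) for @rows;
    return scalar @rows;
}

# Usage (assumes DBD::mysql and an existing `bonds` table with these columns):
# my $dbh = DBI->connect('DBI:mysql:database=finance;host=localhost',
#                        'user', 'password', { RaiseError => 1 });
# store_rows($dbh, map { row_to_hash($_) } @lines);
```

Placeholders (`?`) leave the quoting to the driver, which matters here because issue names and links contain characters such as `%` and `&`.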

Answer 1: (score: 0)

Since user3195726 suggested it, here is a version using Mojo::UserAgent and Mojo::DOM:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';

my $dom = Mojo::UserAgent->new->get($start_url)->res->dom;
$dom->find('tr.yfnc_tabledata1')->each(sub{
  my $tds = $_->find('td');
  my $anchor = $tds->[1]->at('a');
  my $link = $anchor->{href};
  my $name = $anchor->all_text;
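  # map('all_text') calls all_text on every <td>; recent Mojolicious
  # releases no longer forward method calls made on the collection itself
  $tds = $tds->map('all_text');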
  $tds->[1] = $name;
  push @$tds, $link;
  say $tds->join("\t");
});

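# first() returns undef when no link matches, so guard before reading href
my $next = $dom->find('a')->first(sub{ $_->all_text eq 'Next'});
say 'Next: ' . $next->{href} if $next;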

The find calls all use CSS3 selectors; the rest is just transformation.
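
Both answers stop at printing the Next link, which is relative (`z1?b=2&...`). To walk all of the result pages, that link has to be resolved against the current page URL and fetched in turn. The sketch below shows one way to do that with the URI module; `next_page_url` and `crawl_pages` are hypothetical helper names, and the fetcher is passed in as a code ref so the loop does not depend on a particular HTTP client.

```perl
use strict;
use warnings;
use URI;

# Extract the "Next" link from a results page and make it absolute.
# Returns undef when the page has no Next link (for example the last page).
sub next_page_url {
    my ($current_url, $content) = @_;
    my ($next_link) = $content =~ /<a[^>]*?href="([^"]+?)">Next<\/a>/;
    return unless defined $next_link;
    return URI->new_abs($next_link, $current_url)->as_string;
}

# Walk the result pages, calling $handler->($url, $html) for each one.
# $fetch is a code ref that takes a URL and returns the page content or undef.
# $max_pages caps the crawl so a broken Next link cannot loop forever.
sub crawl_pages {
    my ($start_url, $fetch, $handler, $max_pages) = @_;
    my ($url, $count) = ($start_url, 0);
    while (defined $url && $count < $max_pages) {
        my $content = $fetch->($url);
        last unless defined $content;
        $handler->($url, $content);
        $count++;
        $url = next_page_url($url, $content);
    }
    return $count;
}
```

With LWP::Simple, the fetcher could be `\&LWP::Simple::get` and the handler a wrapper around `process_bond_results_page` from the first answer. With more than 2,600 pages it is probably worth pausing between requests as well.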