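I want to extract the profile information for every row listed in the table below, across all of its pages:
Here is an example of one of the links for one of the rows in the table (they are all in the "Issue" column):
I want to store all of the information contained in each issue, for every row and page, in a MySQL database. I think Perl would be a good tool for this, but my experience with it is very limited.
I think I need to collect all of the links in the Issue column across all pages of the table (more than 2,600 pages at the time) and then somehow extract the information from the page behind each link.
Any help is greatly appreciated.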
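Answer 0: (score: 0)
This should get you started and shows the general technique for doing this with regular expressions (it may be hard to follow if you are not familiar with Perl and regex matching).
I only handled the first page, and I added as many comments as I could to the code to help you understand it. If you cannot work out what this code actually does, I suggest trying another tool (or modules such as Web::Scraper or Mojo::DOM). If you really want to do this in Perl, read some of the Perl documentation: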
http://perldoc.perl.org/perlre.html
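As a warm-up for the full script, here is a minimal sketch of the three regex idioms it relies on, run against a tiny made-up HTML string instead of the live page. The sample markup, the `row` class name, and the `extract_rows` helper are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical miniature of the results table, for illustration only.
my $html = '<tr class="row"><td>Corp</td><td><a href="/z2?ce=1">HESS CORP</a></td></tr>'
         . '<tr class="row"><td>Corp</td><td><a href="/z2?ce=2">PACCAR INC</a></td></tr>';

sub extract_rows {
    my ($content) = @_;
    my @rows;
    # Scalar context with /g: each iteration resumes after the previous match.
    while ($content =~ /<tr class="row">(.+?)<\/tr>/g) {
        my $row_html = $1;
        # List context with /g: returns every captured cell at once.
        my @cells = $row_html =~ /<td>(.*?)<\/td>/g;
        # List context without /g: returns the two capture groups.
        my ($href, $text) = $cells[1] =~ /<a[^>]*href="([^"]*)"[^>]*>(.+?)<\/a>/;
        push @rows, { type => $cells[0], issue => $text, link => $href };
    }
    return @rows;
}

# Print one tab-separated line per row.
printf "%s\t%s\t%s\n", @{$_}{qw(type issue link)} for extract_rows($html);
```

The non-greedy quantifiers (`.+?`, `.*?`) matter here: without them a single match would swallow everything up to the last closing tag.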
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use feature 'say';
my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $page_content = get($start_url);
die "Oops, something went wrong!" unless defined $page_content;
process_bond_results_page($page_content);
sub process_bond_results_page {
    my $content = shift;

    # iterate over $content for as long as the /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g regex matches;
    # each match puts the row content (the text between <tr...> and </tr>) in the special $1 variable
    while ($content =~ /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g) {
        # uncomment the line below to see what $1 contains
        # say $1;

        # strip the HTML tags we do not need
        my $tr_data = cleanup_html_tags($1);

        # match the content between <td> and </td> tags and put it in the @tds list
        my (@tds) = $tr_data =~ /<td>(.*?)<\/td>/g;

        # the 2nd element of @tds contains the text <a href="link_to_issue">ISSUE NAME</a>;
        # the line below extracts the link and the issue name into their own variables
        my ($link_to_issue, $issue_name) = $tds[1] =~ /<a[^>]*?href=\"([^\"]*?)\"[^>]*?>(.+?)<\/a>/g;

        # replace the 2nd element, which holds <a href="link_to_issue">ISSUE NAME</a>,
        # with just ISSUE NAME
        $tds[1] = $issue_name;

        # append $link_to_issue to the end of the @tds list
        push(@tds, $link_to_issue);

        # print the @tds array with values separated by TABs
        say join("\t", @tds);
    }

    # does the page have a Next link?
    my ($next_link) = $content =~ /<a[^>]*?href=\"([^\"]+?)\">Next<\/a><\/b>/g;
    say 'NEXT: ' . $next_link if $next_link;
    return;
}
sub cleanup_html_tags {
    my $html = shift;
    $html =~ s/<\/?(font|div)[^>]*?>//g; # remove <font...>, <div...>, </font>, </div>
    $html =~ s/<td[^>]*?>/<td>/g;        # replace all <td...> with just <td>
    $html =~ s/<\/?nobr>//g;             # remove <nobr> and </nobr>
    return $html;
}
The above prints:
Corp MERRILL LYNCH CO INC MTN BE 100.63 5.000 3-Feb-2014 -19.649 4.969 A No /z2?ce=5314754150501796218050&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CME GROUP INC 100.84 5.750 15-Feb-2014 -8.334 5.702 AA No /z2?ce=5715449144561716016149&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CAPITAL ONE BK MTN BE 100.80 5.125 15-Feb-2014 -8.334 5.084 A No /z2?ce=5715254147581635317455&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HESS CORP 100.92 7.000 15-Feb-2014 -8.351 6.937 BBB No /z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp PACCAR INC 100.90 6.875 15-Feb-2014 -8.295 6.813 A No /z2?ce=5214751144551836016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp WACHOVIA CORP NEW 100.78 4.875 15-Feb-2014 -8.337 4.837 A No /z2?ce=4915445142581546016054&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CATERPILLAR FINL SVCS MTNS BE 100.89 6.125 17-Feb-2014 -7.597 6.071 A No /z2?ce=5715245150561764615951&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp KRAFT FOODS INC 100.97 6.750 19-Feb-2014 -6.921 6.685 BBB No /z2?ce=5315654144531746017754&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp WESTERN UN CO 101.05 6.500 26-Feb-2014 -5.154 6.432 BBB No /z2?ce=4915145143581556015548&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp AMERICA MOVIL SAB DE CV 101.06 5.500 1-Mar-2014 -4.615 5.443 A No /z2?ce=5815451145541816015954&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HARTFORD FINL SVCS GROUP INC 100.96 4.750 1-Mar-2014 -4.454 4.705 BBB No /z2?ce=5415548146571526017250&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HEWLETT PACKARD CO 101.12 6.125 1-Mar-2014 -4.599 6.057 BBB No /z2?ce=5415446149551516016556&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp RYDER SYS MTN BE 101.08 5.850 1-Mar-2014 -4.495 5.788 BBB No /z2?ce=5114851146531605117352&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HSBC FIN CORP HSBC FIN 100.72 2.000 15-Mar-2014 -3.011 1.986 A No /z2?ce=5415650149491807117451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp SYSCO CORP 101.06 4.600 15-Mar-2014 -2.772 4.552 A No /z2?ce=5014953143561486015756&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
NEXT: z1?b=2&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000
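The script above stops after the first page and only prints the relative `NEXT:` link. One possible way to keep going is sketched below: resolve that link against the URL of the page it came from and loop until no Next link is found. The helper names `next_page_url` and `crawl_pages`, the `$max_pages` cap, and the idea of passing the fetcher in as a code reference are all assumptions of this sketch, not part of the original answer. It uses the URI module, which is installed alongside LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: find the relative "Next" link with the same pattern the
# script above uses, and resolve it against the URL of the current page.
# Returns undef (in scalar context) when the page has no Next link.
sub next_page_url {
    my ($content, $current_url) = @_;
    my ($next_link) = $content =~ /<a[^>]*?href="([^"]+?)">Next<\/a><\/b>/;
    return unless defined $next_link;
    return URI->new_abs($next_link, $current_url)->as_string;
}

# Hypothetical driver: walk the result pages starting at $start_url.
# $fetch is a code ref that takes a URL and returns the page HTML, or undef
# on failure. $max_pages caps the crawl so a test run stays small.
sub crawl_pages {
    my ($start_url, $fetch, $max_pages) = @_;
    my @visited;
    my $url = $start_url;
    while (defined $url && @visited < $max_pages) {
        my $content = $fetch->($url);
        last unless defined $content;
        push @visited, $url;
        # process_bond_results_page($content) would be called here
        $url = next_page_url($content, $url);
    }
    return @visited;
}
```

With `use LWP::Simple;` loaded, a call such as `crawl_pages($start_url, \&LWP::Simple::get, 3)` would fetch at most three pages. For a table of more than 2,600 pages it would be sensible to add a short `sleep` between requests.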
Answer 1: (score: 0)
Since user3195726 suggested it, here is a version that uses Mojo::UserAgent and Mojo::DOM:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $dom = Mojo::UserAgent->new->get($start_url)->res->dom;
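$dom->find('tr.yfnc_tabledata1')->each(sub {
    my $tds    = $_->find('td');
    my $anchor = $tds->[1]->at('a');
    my $link   = $anchor->{href};
    my $name   = $anchor->all_text;
    # collect the text of every cell; the explicit callback is used because
    # newer Mojolicious releases no longer forward method calls on a collection
    $tds = $tds->map(sub { $_->all_text });
    $tds->[1] = $name;
    push @$tds, $link;
    say $tds->join("\t");
});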
say 'Next: ' . $dom->find('a')->first(sub{ $_->all_text eq 'Next'})->{href};
The find calls all use CSS3 selectors; the rest is just transformation.
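Both answers stop at printing tab-separated rows, while the question asks for MySQL storage. A minimal sketch of the step in between might look like the following: split one output line into named fields and build a parameterized INSERT. The column names and the `bonds` table are assumptions made up for this sketch (the answers do not show the table headers), as are the helper names `row_to_hash` and `build_insert`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Column names are assumptions for this sketch; the page's own headers are not
# shown in the answers. The last field is the link to the issue page.
my @COLUMNS = qw(type issue price coupon maturity ytm current_yield
                 rating callable issue_link);

# Split one tab-separated output line into a hash keyed by column name.
sub row_to_hash {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    die "expected " . scalar(@COLUMNS) . " fields, got " . scalar(@fields) . "\n"
        unless @fields == @COLUMNS;
    my %row;
    @row{@COLUMNS} = @fields;
    return \%row;
}

# Build a parameterized INSERT for a hypothetical table named "bonds".
# Returns the SQL text followed by the bind values in matching order.
sub build_insert {
    my ($row) = @_;
    my $sql = sprintf 'INSERT INTO bonds (%s) VALUES (%s)',
        join(', ', @COLUMNS),
        join(', ', ('?') x @COLUMNS);
    return ($sql, @{$row}{@COLUMNS});
}
```

With a DBI handle connected to MySQL, the result could then be executed as `my ($sql, @bind) = build_insert($row); $dbh->prepare($sql)->execute(@bind);`. Using placeholders rather than interpolating values avoids quoting problems with issue names.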