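I have the following script, which scrapes my school's CS department site to get a list of all courses. I want to extract the CRN (course reference number) and other important information into a database that users can browse through a web application.
Here is an example URL: http://courses.illinois.edu/cis/2011/spring/schedule/CS/411.html
I want to extract information from pages like this one. The first level of the scraper just builds the individual page URLs from the list of all courses. Once I am on a course-specific catalog page, I use a second scraper to try to get all the information I want. For some reason, although the CRN and the course instructor are both 'td' elements, my scraper does not seem to return anything when it scrapes them. I tried scraping for 'div' instead, and I got a bunch of information from each relevant page. So somehow I am not getting the 'td' elements, even though I am scraping the right pages.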
    use URI;
    use Web::Scraper;

    my $tweets = scraper {
        # Collect the text of every div.ws-row on the course index page
        # into the resulting array 'array'.
        # process "h4.ws-ds-name.detail-title", "array[]" => 'TEXT';
        process "div.ws-row", "array[]" => 'TEXT';
    };
    my $res = $tweets->scrape( URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169") );
    foreach my $elem (@{$res->{array}}) {
        my $coursenum = substr($elem,2,4);
        my $secondLevel = scraper {
            process "td.ws-row", "array2[]" => 'TEXT';
        };
        my $res2 = $secondLevel->scrape(URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/$coursenum.html"));
        my $num = @{$res2->{array2}};
        print $num;
        print "---------------------", "\n";
        my @curr = @{$res2->{array2}};
        foreach my $elem2 (@curr) {
            $num++;
            print $elem2, " ", "\n";
        }
        print "---------------------", "\n";
    }
Any ideas?
Thanks
Answer 0 (score: 1)
It looks to me like
    my $coursenum = substr($elem,2,4)
should be
    my $coursenum = substr($elem,3,3)
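To see why the offsets matter, here is a small sketch. It assumes the scraped text starts with something like "CS 411", which is a guess about the 'TEXT' of div.ws-row; the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical scraped text; the real 'TEXT' of div.ws-row may differ.
my $elem = "CS 411 Database Systems";

# Offset 2, length 4 starts at the space, so the result is " 411".
my $with_space = substr($elem, 2, 4);

# Offset 3, length 3 yields only the digits.
my $coursenum = substr($elem, 3, 3);

# A regex avoids depending on fixed offsets at all.
my ($by_regex) = $elem =~ /(\d{3})/;

print "[$with_space] [$coursenum] [$by_regex]\n";
```

With the leading space, the generated URL would contain " 411.html" rather than "411.html", which may be why the second-level pages come back without the expected data.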
Answer 1 (score: 1)
The easiest way in this case is to use HTML::TableExtract, if you are only looking for data from tables.
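A minimal sketch of that approach, run against an inline HTML string rather than the live page. The table markup, column names, and values below are invented for illustration; the real course pages may be laid out differently.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up HTML standing in for a course page; the real markup may differ.
my $html = <<'HTML';
<table>
  <tr><th>CRN</th><th>Type</th><th>Instructor</th></tr>
  <tr><td>12345</td><td>LEC</td><td>Doe, J</td></tr>
  <tr><td>67890</td><td>DIS</td><td>Roe, R</td></tr>
</table>
HTML

# Select the table by its column headers and keep only those columns.
my $te = HTML::TableExtract->new( headers => [ 'CRN', 'Instructor' ] );
$te->parse($html);

my @sections;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $crn, $instructor ) = @$row;
        push @sections, { crn => $crn, instructor => $instructor };
    }
}

printf "%s: %s\n", $_->{crn}, $_->{instructor} for @sections;
```

Matching on header text means the code does not depend on CSS classes, which helps when it is unclear which classes the 'td' elements actually carry.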
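Answer 2 (score: 1)
I played around with your problem a bit. You can get the course ID, the title, and the link to the individual course page in the initial scraper: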
    my $courses = scraper {
        process 'div.ws-row',
            'course[]' => scraper {
                process 'div.ws-course-number', 'id' => 'TEXT';
                process 'div.ws-course-title', 'title' => 'TEXT';
                process 'div.ws-course-title a', 'link' => '@href';
            };
        result 'course';
    };
The result of the scrape is an arrayref of hashrefs that looks like this:
    { id => "CS 103",
      title => "Introduction to Programming",
      link => bless(do{\(my $o = "http://courses.illinois.edu/cis/2011/spring/schedule/CS/103.html?skinId=2169")}, "URI::http"),
    },
    ....
You can then run an additional scrape on each course's own page and add that information to the original structure:
    for my $course (@$res) {
        my $crs_scraper = scraper {
            process 'div.ws-description', 'desc' => 'TEXT';
            # ... add more items here
        };
        my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
        # slice assignment to add them into course definition
        @{$course}{ keys %$additional_data } = values %$additional_data;
    }
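The slice assignment in the loop above can be tried in isolation. This is a sketch with placeholder data: the 'desc' and 'credit' values are hypothetical and only stand in for whatever the second scraper returns.

```perl
use strict;
use warnings;

# Hypothetical data standing in for one scraped course and its detail page.
my $course          = { id => 'CS 103', title => 'Introduction to Programming' };
my $additional_data = { desc => 'Placeholder description', credit => '3 hours' };

# An alternative that builds a new merged hash instead of modifying $course
# (pairs listed later win if a key appears in both).
my %merged = ( %$course, %$additional_data );

# Hash slice on a hashref: assign all values to the matching keys at once.
# keys and values return their lists in the same order for an unmodified hash.
@{$course}{ keys %$additional_data } = values %$additional_data;

print join( ', ', map { "$_=$course->{$_}" } sort keys %$course ), "\n";
```

The slice form updates each course hashref in place, so the arrayref returned by the first scraper ends up holding the combined data without any extra bookkeeping.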
The combined source is as follows:
    use strict; use warnings;
    use URI;
    use Web::Scraper;
    use Data::Dump qw(dump);

    my $url = 'http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169';

    my $courses = scraper {
        process 'div.ws-row',
            'course[]' => scraper {
                process 'div.ws-course-number', 'id' => 'TEXT';
                process 'div.ws-course-title', 'title' => 'TEXT';
                process 'div.ws-course-title a', 'link' => '@href';
            };
        result 'course';
    };

    my $res = $courses->scrape(URI->new($url));

    for my $course (@$res) {
        my $crs_scraper = scraper {
            process 'div.ws-description', 'desc' => 'TEXT';
            # ... add more items here
        };
        my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
        # slice assignment to add them into course definition
        @{$course}{ keys %$additional_data } = values %$additional_data;
    }

    dump $res;