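In this post, I learned that Mechanize in Ruby/Perl is easier to use than HTML::TreeBuilder 3 in that specific example.
Is Mechanize also better than HTML::TokeParser?
How would the following be written using Mechanize?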
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;

sub get_img_page_urls {
    my $url = shift;
    my $ua  = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");
    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');
    my $response_u = $ua->request($req);    # send request
    die "Error: ", $response_u->status_line unless $response_u->is_success;
    my $stream = HTML::TokeParser->new(\$response_u->content);
    my %urls;
    my $found_thumbnails = 0;
    my $found_thumb      = 0;
    while (my $token = $stream->get_token) {
        # <div class="thumb-box" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{class} // '') eq 'thumb-box') {
            $found_thumbnails = 1;
        }
        # <div class="thumb" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{class} // '') eq 'thumb') {
            $found_thumb = 1;
        }
        # <a ... >
        if ($found_thumbnails and $found_thumb and $token->[0] eq 'S' and $token->[1] eq 'a') {
            $urls{'http://example.com' . "$token->[2]{href}"} = 1;
            # One URL has been found. Now start over.
            $found_thumb      = 0;
            $found_thumbnails = 0;
        }
    }
    return %urls;
}
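Answer 0 (score: 5)
As far as the interface goes, anything is better than HTML::TokeParser. WWW::Mechanize shines with forms, but it too lacks a declarative way of finding particular elements. I like Web::Query and HTML::Query, which model their interfaces after jQuery; as I understand it, this style of programming is quite popular.
The program from the question then looks as follows. It raises exceptions automatically, so no explicit error handling is needed.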
use URI;
use LWP::UserAgent;
use Web::Query 'wq';

sub get_img_page_urls {
    my ($url) = @_;
    $Web::Query::UserAgent = LWP::UserAgent->new(agent => 'Mozilla/8.0');
    return map {
        URI->new($_)->abs('http://example.com')->as_string    # hash key
            => 1                                              # hash value
    } wq($url)->find('div.thumb-box div.thumb a')->attr('href');
}
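The answer above resolves each href against the site's base URL with `URI->new($_)->abs(...)` instead of concatenating strings as the question does. A minimal Ruby sketch of the same idea, using the standard library's `URI.join`; the helper name `absolutize` and the sample hrefs are made up for illustration and do not come from the original posts:

```ruby
require 'uri'

# Hypothetical helper: resolve each href against a base URL and drop duplicates.
def absolutize(base, hrefs)
  hrefs.map { |href| URI.join(base, href).to_s }.uniq
end

# Invented sample hrefs: root-relative, already absolute, and path-relative.
hrefs = ['/img/5.html', 'http://example.com/img/5.html', 'img/6.html']
puts absolutize('http://example.com/', hrefs)
```

With plain concatenation, the already absolute href would turn into `http://example.comhttp://example.com/img/5.html`; resolving against the base leaves it unchanged, and `uniq` plays the role of the hash keys in the Perl version.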
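Answer 1 (score: 2)
I'm not sure you need Mechanize here, since I think Nokogiri would be enough. I don't know Perl, so I'm not entirely sure how the HTML in your example is laid out, but I assume it looks something like this:
<div class="thumb-box">
  ...
  <div class="thumb">
    ...
    <a href="http://example.com/img/5.jpg">...</a>
  </div>
</div>
Here is the code with Nokogiri: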
require 'nokogiri'
require 'open-uri'

def get_img_page_urls(url)
  urls = []
  # URI.open comes from open-uri; fetch the url argument rather than a hard-coded address.
  doc = Nokogiri::HTML(URI.open(url, 'User-Agent' => 'Mozilla/8.0'))
  doc.css('div.thumb-box div.thumb a').each do |link|
    urls << link.attr("href")
  end
  urls
end
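Answer 2 (score: 2)
Mechanize is more than just a parser. It adds a simulated browser that lets you navigate sites, fill in forms, and so on. But it also includes a parser that makes web scraping very simple. Here is the method rewritten using Ruby Mechanize: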
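require 'mechanize'

def get_img_page_urls(url)
  agent = Mechanize.new
  agent.user_agent_alias = "Windows Mozilla"
  # search returns attribute nodes; map(&:value) turns them into plain strings.
  agent.get(url).search("//div[@class='thumb-box']/div[@class='thumb']/a/@href").map(&:value)
end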