Can using Mechanize make this easier?

Asked: 2011-12-10 23:32:20

Tags: ruby perl mechanize

In this post I learned that Mechanize in Ruby/Perl is easier to use than HTML::TreeBuilder 3 for that particular example.

Is Mechanize also superior to HTML::TokeParser?

Would the following also be easier to write in Ruby using Mechanize?
use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;

sub get_img_page_urls {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");

    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');

    my $response_u = $ua->request($req);  # send request

    die "Error: ", $response_u->status_line unless $response_u->is_success;

    my $stream = HTML::TokeParser->new(\$response_u->content);

    my %urls = ();

    my $found_thumbnails = 0;
    my $found_thumb = 0;

    while (my $token = $stream->get_token) {

        # <div class="thumb-box" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and $token->[2]{class} eq 'thumb-box') {
            $found_thumbnails = 1;
        }

        # <div class="thumb" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and $token->[2]{class} eq 'thumb') {
            $found_thumb = 1;
        }

        #                                          <a ... >
        if ($found_thumbnails and $found_thumb and $token->[0] eq 'S' and $token->[1] eq 'a') {
            $urls{'http://example.com' . "$token->[2]{href}"} = 1;

            # One URL has been found. Now start all over.
            $found_thumb = 0;
            $found_thumbnails = 0;
        }

    }

    return %urls;
}

3 answers:

Answer 0 (score: 5)

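Anything is better than HTML::TokeParser when it comes to the interface. WWW::Mechanize shines with forms, but it also has no declarative way of finding particular elements. I like Web::Query and HTML::Query, which model their interfaces after jQuery, which as far as I know is popular for this kind of programming.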

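The program from the question then looks like the following. It raises exceptions automatically, so no explicit error handling is needed.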

use URI;
use LWP::UserAgent;
use Web::Query 'wq';

sub get_img_page_urls {
    my ($url) = @_;
    $Web::Query::UserAgent = LWP::UserAgent->new(agent => 'Mozilla/8.0');

    return map {
        URI->new($_)->abs('http://example.com')->as_string   # hash key
        => 1                                                 # hash value
    } wq($url)->find('div.thumb-box div.thumb a')->attr('href');
}

Previously posted as a comment: https://stackoverflow.com/q/8274221#comment-10196381

Answer 1 (score: 2)

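I'm not sure you need Mechanize, since I think Nokogiri is enough. I don't know Perl, so I'm not entirely sure how the HTML in your example is laid out, but I assume it looks like this: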

<div class="thumb-box">
  ...
  <div class="thumb">
    ...
    <a href="http://example.com/img/5.jpg">...
  </div>
</div>

Here is the code using Nokogiri:

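require 'nokogiri'
require 'open-uri'

def get_img_page_urls(url)
  # Fetch the URL passed in (the original hard-coded http://www.example.com).
  # URI.open is used because Kernel#open no longer opens URLs as of Ruby 3.0.
  doc = Nokogiri::HTML(URI.open(url, 'User-Agent' => 'Mozilla/8.0'))

  # Collect the href of every link inside div.thumb-box div.thumb.
  doc.css('div.thumb-box div.thumb a').map { |link| link['href'] }
end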
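
The `href` values collected this way come back exactly as written in the page, so they may be relative, whereas the Perl versions produce absolute URLs. A minimal sketch of resolving them with Ruby's standard `URI.join`; the helper name `absolutize`, the base URL, and the sample paths are made up for illustration:

```ruby
require 'uri'

# Hypothetical helper (not part of the original answer): resolve each href
# against the URL of the page it was found on.
def absolutize(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }
end

# Invented sample values: a root-relative, a relative, and an absolute href.
urls = absolutize('http://example.com/gallery/',
                  ['/img/5.jpg', 'img/6.jpg', 'http://example.com/img/7.jpg'])
puts urls
```

Unlike plain string concatenation, `URI.join` leaves hrefs that are already absolute untouched.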

Answer 2 (score: 2)

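Mechanize is more than just a parser. It adds a simulated browser that lets you navigate sites, fill in forms, and so on. But it also includes a parser that makes web scraping very simple. Here is your method rewritten with Ruby Mechanize: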

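require 'mechanize'

def get_img_page_urls(url)
  agent = Mechanize.new
  agent.user_agent_alias = "Windows Mozilla"
  # The XPath selects href attribute nodes; map them to plain strings.
  agent.get(url).search("//div[@class='thumb-box']/div[@class='thumb']/a/@href").map(&:value)
end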