Question

I'm new to Ruby and am using Nokogiri to parse HTML pages. An error is thrown when a function reaches this line:

currentPage = Nokogiri::HTML(open(url))

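I have verified the function's input: url is a string containing a web address. The line above works exactly as expected when used outside the function, but not inside it. When execution reaches that line inside the function, the following error is thrown: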

WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'

The function containing the problematic line is pasted below.

def explore(url)
    if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
            return
    end
    CRAWLED_PAGES_COUNTER++

    currentPage = Nokogiri::HTML(open(url))
    links = currentPage.xpath('//@href').map(&:value)

    eval_page(currentPage)

    links.each do|link|
            puts link
            explore(link)
    end
end

Here is the complete program (it isn't much longer):

require 'nokogiri'
require 'open-uri'

#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

#Crawler Functions
def explore(url)
    if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
            return
    end
    CRAWLED_PAGES_COUNTER++

    currentPage = Nokogiri::HTML(open(url))
    links = currentPage.xpath('//@href').map(&:value)

    eval_page(currentPage)

    links.each do|link|
            puts link
            explore(link)
    end
end

def eval_page(page)
    puts page.title
end

#Start Crawling


explore(START_URL)

Answer 1

require 'nokogiri'
require 'open-uri'

#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

#Crawler Functions
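def explore(url)
    # Stop once the page limit has been exceeded.
    if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
        return
    end
    # Ruby has no ++ operator; use += 1. The counter is a global here because
    # a constant cannot be reassigned inside a method.
    $CRAWLED_PAGES_COUNTER += 1

    # URI.open comes from open-uri; plain open(url) no longer fetches URLs on Ruby 3.0+.
    currentPage = Nokogiri::HTML(URI.open(url))
    links = currentPage.xpath('//@href').map(&:value)

    eval_page(currentPage)

    links.each do |link|
        puts link
        explore(link)
    end
end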

def eval_page(page)
    puts page.title
end

#Start Crawling


explore($START_URL)
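
A note on why this change removes the error: Ruby has no `++` operator, so in the question's `CRAWLED_PAGES_COUNTER++` the first `+` is read as a binary plus and the second as a unary plus (`+@`) applied to the expression on the following line, which evaluates to the Nokogiri document. The sketch below reproduces that parse without any network access. `FakeDocument` is a hypothetical stand-in, not a Nokogiri class.

```ruby
# FakeDocument is a hypothetical stand-in for Nokogiri::HTML::Document.
# Like the real class, it does not define unary plus (+@).
class FakeDocument; end

counter = 0
error_message = nil

begin
  # The question's two lines, `COUNTER++` followed by `page = ...`,
  # are read by Ruby roughly as the single expression below.
  counter + +(page = FakeDocument.new)
rescue NoMethodError => e
  error_message = e.message
end

# The message names the missing +@ method, as in the question's backtrace.
puts error_message

# The working increment uses +=.
counter += 1
puts counter
```

This also suggests why the line seemed to work outside the function: there it was presumably not preceded by a dangling `++`.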

Answer 2

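Just to give you something to build on, here is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.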

require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href'] 
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
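
The link filter in the loop above calls `URI.parse` on every `href`, so a single malformed or non-HTTP link (a `mailto:` address, for example) can raise and stop the crawl. A more defensive version of the same extension check might look like the sketch below. `crawlable?` is a hypothetical helper name, and restricting links to http(s) is an added assumption that the original code does not make.

```ruby
require 'uri'

# crawlable? is a hypothetical helper; the spider above inlines this check.
def crawlable?(href)
  uri = URI.parse(href)
  # Relative links have no scheme; otherwise accept only http and https.
  return false unless uri.scheme.nil? || %w[http https].include?(uri.scheme.downcase)

  # path can be nil for opaque URIs, so coerce it to a string first.
  extension = File.extname(uri.path.to_s)
  extension.empty? || extension.match?(/\A\.html?\z/i)
rescue URI::InvalidURIError
  false # skip unparseable href values instead of crashing the crawl
end

puts crawlable?('/about.html')
puts crawlable?('/logo.png')
puts crawlable?('mailto:someone@example.com')
```

In the loop, this could replace the `select` block with `.select { |url| crawlable?(url) }`.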

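This spider:

Works on http://example.com. That site is designed and designated for experimenting.
Checks whether a page was visited previously and won't scan it again. This is a naive check that will be fooled by URLs containing queries, or by queries whose parameters appear in an inconsistent order.
Checks whether a site was visited previously and, if so, automatically throttles page retrieval. It can be fooled by aliases.
Checks whether a page ends with ".htm" or ".html", or has no extension. Anything else is ignored.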
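
One way to make the visited-page check less naive is to normalize each URL before storing it. The sketch below is an illustration only: `normalize_url` is a hypothetical helper, and it assumes that query-parameter order and the fragment do not change what the server returns, which is not true for every site.

```ruby
require 'uri'
require 'set'

# normalize_url is a hypothetical helper, not part of the spider above.
# It lowercases the scheme and host, drops the fragment, and sorts the
# query parameters so that equivalent URLs compare equal.
def normalize_url(url)
  uri = URI.parse(url).normalize
  uri.fragment = nil
  if uri.query
    sorted = URI.decode_www_form(uri.query).sort
    uri.query = sorted.empty? ? nil : URI.encode_www_form(sorted)
  end
  uri.to_s
end

visited = Set.new
visited << normalize_url('http://example.com/page?b=2&a=1')

# Same page with different parameter order, host case, and a fragment.
puts visited.include?(normalize_url('http://EXAMPLE.com/page?a=1&b=2#top'))
```

Storing normalized strings rather than URI objects keeps the membership test simple, at the cost of treating some genuinely different URLs as the same page.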

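The actual code for an industrial-strength spider is much more involved. Robots.txt files need to be honored; figuring out how to deal with pages that redirect to other pages, whether via HTTP timeouts or JavaScript redirects, is a fun task; and dealing with malformed pages is a challenge....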
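
To give a feel for the robots.txt part, here is a deliberately minimal sketch. It is not a complete robots.txt implementation: `disallowed_paths` and `allowed?` are hypothetical helpers, only `User-agent: *` groups with a single User-agent line are handled, and Allow lines, wildcards and Crawl-delay are ignored. The sample file is made up for the example.

```ruby
# disallowed_paths and allowed? are hypothetical helpers for illustration.
# Only "User-agent: *" groups and plain-prefix Disallow rules are handled.
def disallowed_paths(robots_txt)
  applies = false
  paths = []
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip # drop comments and surrounding whitespace
    next if line.empty?

    field, value = line.split(':', 2).map(&:strip)
    next if value.nil? # not a "field: value" line

    case field.downcase
    when 'user-agent'
      applies = (value == '*')
    when 'disallow'
      paths << value if applies && !value.empty?
    end
  end
  paths
end

def allowed?(path, disallowed)
  disallowed.none? { |prefix| path.start_with?(prefix) }
end

# Made-up robots.txt content for the example.
sample = <<~ROBOTS
  User-agent: *
  Disallow: /private/
  Disallow: /tmp

  User-agent: otherbot
  Disallow: /
ROBOTS

rules = disallowed_paths(sample)
puts allowed?('/index.html', rules)
puts allowed?('/private/data.html', rules)
```

A real spider would fetch /robots.txt once per host, cache the parsed rules, and check `this_uri.path` against them before opening the page.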

Nokogiri throws an exception inside a function but not outside it

2 answers: