I made my own web scraper for my final project and could use some help.
I'm using Nokogiri. The web scraper finds all of the words on a web page, counts each word's frequency in a hash, and then returns the top ten words on the site. I can pass in as many sites as I want and it still works, so I can pass it http://fox.com, http://cnbc.com, and so on. The program works for those sites, but for some sites I get an error. For example, http://facebook doesn't work; it says redirection is forbidden.
Here's my code so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

class Scraper
  attr_accessor :url, :words, :arguments

  def initialize(*args)
    @words = Hash.new("No Match Found")
    @arguments = args
    compiler
    print_results
  end

  # Fetches a single site and tallies word frequencies, skipping common filler words.
  def mechansim(site)
    boring_words = ["the", "to", "in", "if", "of", "all", "and", "the", "for", "news", "is", "on",
                    "a", "this", "with", "at", "continue", "more", "be", "from", "could", "as",
                    "by", "he", "she", "who", "what", "not", "newswidget", "newswidgetfooter", "pm"]
    page = Nokogiri::HTML(open(site))
    page.search('script').each { |el| el.unlink }
    links = page.css('body').inner_text.downcase.gsub(/[^0-9a-z ]/i, '').split(' ')
    links.each do |x|
      if @words.has_key?(x) === true && boring_words.include?(x) === false
        @words[x] += 1
      else
        @words[x] = 1
      end
    end
    if @arguments[0].length > 0
      compiler
    end
  end

  # Pulls the next URL off the argument list and scrapes it.
  def compiler
    @arguments.each do |argument|
      argument = argument[0]
      site = argument
      arguments[0].shift
      mechansim(site)
    end
  end

  def print_results
    puts "------------------------------------------------------------------"
    @words = @words.sort_by { |k, v| v }.reverse.to_h
    print @words.take(20)
    puts "------------------------------------------------------------------"
  end
end

Scraper.new(["http://foxnews.com"])
Answer 0 (score: 0)
Use the HTTPS version of the Facebook URL: https://facebook.com.
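For example, a minimal sketch assuming the Scraper class above is left unchanged; the upgrade_to_https helper is hypothetical and only rewrites the URL scheme before it reaches open-uri:

# Passing the HTTPS URL directly avoids the http -> https redirect
# that open-uri refuses to follow.
Scraper.new(["https://facebook.com", "https://foxnews.com"])

# Alternatively, upgrade plain-http URLs before passing them in.
# (upgrade_to_https is a hypothetical helper, not part of the original code.)
def upgrade_to_https(url)
  url.sub(%r{\Ahttp://}, "https://")
end

sites = ["http://facebook.com", "http://foxnews.com"].map { |u| upgrade_to_https(u) }
Scraper.new(sites)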