I made my own web scraper for my final project and could use some help.
I'm using Nokogiri. The web scraper finds all of the words on a web page, counts each word's frequency in a hash, and then returns the top ten words on the site. I can pass in as many sites as I want and it still works, so I can pass it http://fox.com, http://cnbc.com, and so on. The program works for those sites, but for some sites I get an error. For example, http://facebook doesn't work; it says redirection is forbidden.
Here's my code so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

class Scraper
  attr_accessor :url, :words, :arguments

  def initialize(*args)
    @words = Hash.new("No Match Found")
    @arguments = args
    compiler
    print_results
  end

  # Fetches a single site and tallies word frequencies, skipping common filler words.
  def mechansim(site)
    boring_words = ["the", "to", "in", "if", "of", "all", "and", "the", "for", "news", "is", "on",
                    "a", "this", "with", "at", "continue", "more", "be", "from", "could", "as",
                    "by", "he", "she", "who", "what", "not", "newswidget", "newswidgetfooter", "pm"]
    page = Nokogiri::HTML(open(site))
    page.search('script').each { |el| el.unlink }
    links = page.css('body').inner_text.downcase.gsub(/[^0-9a-z ]/i, '').split(' ')
    links.each do |x|
      if @words.has_key?(x) === true && boring_words.include?(x) === false
        @words[x] += 1
      else
        @words[x] = 1
      end
    end
    if @arguments[0].length > 0
      compiler
    end
  end

  # Pulls the next URL off the argument list and scrapes it.
  def compiler
    @arguments.each do |argument|
      argument = argument[0]
      site = argument
      arguments[0].shift
      mechansim(site)
    end
  end

  def print_results
    puts "------------------------------------------------------------------"
    @words = @words.sort_by { |k, v| v }.reverse.to_h
    print @words.take(20)
    puts "------------------------------------------------------------------"
  end
end

Scraper.new(["http://foxnews.com"])
Answer 0 (score: 0)
Use the HTTPS version of the Facebook URL: https://facebook.com.
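For example, a minimal sketch assuming the Scraper class above is left unchanged; the upgrade_to_https helper is hypothetical and only rewrites the URL scheme before it reaches open-uri:

# Passing the HTTPS URL directly avoids the http -> https redirect
# that open-uri refuses to follow.
Scraper.new(["https://facebook.com", "https://foxnews.com"])

# Alternatively, upgrade plain-http URLs before passing them in.
# (upgrade_to_https is a hypothetical helper, not part of the original code.)
def upgrade_to_https(url)
  url.sub(%r{\Ahttp://}, "https://")
end

sites = ["http://facebook.com", "http://foxnews.com"].map { |u| upgrade_to_https(u) }
Scraper.new(sites)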