I want to build a multithreaded web crawler, but after doing some research I found that the mechanize gem is not thread-safe. So my question is: is it possible to write a multithreaded crawler that scrapes several search engines at the same time? For example:
require 'nokogiri'
require 'rest-client'
require 'mechanize'

def site(url)
  Nokogiri::HTML(RestClient.get(url))
end

def parse(url, tag, i)
  parsing = site(url)
  parsing.css(tag)[i].to_s
end

t1 = Thread.new do
  agent = Mechanize.new
  # do some searching and start the search
  parse('https://google.com', 'html', 0)
end

t2 = Thread.new do
  agent = Mechanize.new
  # same thing and run them in tandem
  parse('https://duckduckgo.com', 'html', 0)
end

[t1, t2].each(&:join)  # wait for both threads, otherwise the script exits before they finish