Sinew (Ruby web scraper) example doesn't work on my machine

Asked: 2012-06-17 20:54:26

Tags: ruby web-crawler web-scraping nokogiri popen

I'm trying to run the example from the Sinew source code, but it doesn't run on my machine. Here is the sample (taken directly from their GitHub):

get "http://www.amazon.com/gp/bestsellers/books/ref=sv_b_3"
noko.css(".zg_itemRow").each do |item|
  row = { }
  row[:url] = item.css(".zg_title a").first[:href]
  row[:title] = item.css(".zg_title")
  row[:img] = item.css(".zg_itemImage_normal img").first[:src]
  csv_emit(row)
end

I'm using Ubuntu 12.04 with Ruby 1.9.3 and RVM. Here is what I typed, followed by the error.

jefferton@ubuntu:~/IdeaProjects/sinew_scrape$ sinew sell_list.sinew
curl http://www.amazon.com/gp/bestsellers/books/ref=sv_b_3
/home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/text_util.rb:48:in `popen': No such file or directory - tidy -asxml  -bare  -quiet  -utf8  -wrap 0 --doctype omit --hide-comments yes --force-output yes -f /dev/null (Errno::ENOENT)
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/text_util.rb:48:in `html_tidy'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:33:in `html'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:59:in `noko'
from sell_list.sinew:9:in `_run'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:121:in `instance_eval'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:121:in `_run'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:16:in `initialize'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:19:in `new'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:19:in `block in <top (required)>'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:18:in `each'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:18:in `<top (required)>'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/sinew:19:in `load'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/sinew:19:in `<main>'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/ruby_noexec_wrapper:14:in `eval'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/ruby_noexec_wrapper:14:in `<main>'

I wish I could describe the problem more specifically, but I don't know how to proceed.

Thanks.

2 answers:

Answer 0: (score: 1)

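You must install HTML Tidy and curl first; see https://github.com/gurgeous/sinew/wiki. The error you are getting (Errno::ENOENT raised from `popen`) occurs because the `tidy` executable cannot be found. Install it into a folder without spaces in its path (not Program Files) and add that folder to the system or user PATH variable. Do the same with curl. Then test both applications from the command line, from somewhere other than their own folders, to confirm that they work.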
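
To check this from Ruby rather than by hand, here is a minimal sketch that looks for `tidy` and `curl` on the PATH. The helper names `find_executable` and `missing_tools` are hypothetical (they are not part of Sinew), and the sketch only assumes that Sinew invokes the tools by those bare names, as the error message suggests.

```ruby
# Hypothetical helper (not part of Sinew): return the full path of an
# executable found in one of the PATH directories, or nil if none has it.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Hypothetical helper: list the tools that cannot be found on the PATH.
# The default names are the two external programs mentioned in this answer.
def missing_tools(names = %w[tidy curl], path = ENV.fetch("PATH", ""))
  names.reject { |name| find_executable(name, path) }
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_tools
  if missing.empty?
    puts "tidy and curl were both found on the PATH"
  else
    puts "Not found on the PATH: #{missing.join(', ')}"
  end
end
```

If `tidy` shows up as missing, that matches the Errno::ENOENT in the question. On Ubuntu, both tools are normally available as the `tidy` and `curl` packages through apt-get, which puts them on the PATH without any manual setup.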

Answer 1: (score: 1)

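The library may be worth looking into, but I can't imagine why they would use curl instead of Mechanize, or what HTML Tidy is supposed to be for. Shelling out to executables like this is just a poor approach. My advice is to avoid it and use Mechanize instead.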