Nokogiri: Unable to screen-scrape a page (taobao.com)

Time: 2013-12-20 13:12:44

Tags: ruby-on-rails screen-scraping nokogiri

I am using Nokogiri to fetch images from a Chinese website (Taobao):

  url = "http://item.taobao.com/item.htm?spm=a1z10.1.w137644-1960500098.43.d7Uwpx&id=36246359192"
  doc = Nokogiri::HTML(open(url) )
  puts doc.css("title").text
  puts doc.css("img")[0]['src']
  puts doc.css("img#J_ImgBooth")[0]['src']

I can get the title and doc.css("img")[0]['src'], but I cannot get the src of img#J_ImgBooth. What is the problem? Is the request being blocked somehow?

2 answers:

Answer 0: (score: 1)

Looking at the HTML source, img#J_ImgBooth has no src attribute, only a data-src attribute:

<img id="J_ImgBooth" data-src="http://img03.taobaocdn.com/bao/uploaded/i3/18513032853503639/T1z1ojXdNhXXXXXXXX_!!2-item_pic.png_310x310.jpg"  data-hasZoom="700" />

Using

doc.css("img#J_ImgBooth")[0]['data-src']

will work.

Answer 1: (score: 1)

This works for me:

doc.at_css("#J_ImgBooth")["data-src"]

You can inspect the element to confirm that the attribute name is data-src:

#(Element:0x3ffb5d3d9df0 {
  name = "img",
  attributes = [
    #(Attr:0x3ffb5d3d9b84 { name = "id", value = "J_ImgBooth" }),
    #(Attr:0x3ffb5d3d9b70 {
      name = "data-src",
      value = "http://img03.taobaocdn.com/bao/uploaded/i3/18513032853503639/T1z1ojXdNhXXXXXXXX_!!2-item_pic.png_310x310.jpg"
      }),
    #(Attr:0x3ffb5d3d9b5c { name = "data-haszoom", value = "700" })]
  })