Getting a list of Taobao product URLs from a search results page without the Taobao API

Time: 2018-01-30 19:28:16

Tags: ruby web-scraping nokogiri

I want to get a list of Taobao product URLs from a search results page without using the Taobao API.

I tried the following Ruby script.

  require "open-uri"
  require "rubygems"
  require "nokogiri"

  url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'

  charset = nil
  html = URI.open(url) do |f|
    charset = f.charset
    f.read
  end
  doc = Nokogiri::HTML.parse(html, nil, charset)

  p doc.xpath('//*[@id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/@href').each{|i| puts i.text}

  # => 0

I want to get a list of URLs like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D, but what I get is 0.

What should I do?

1 answer:

Answer 0: (score: 0)

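I am not familiar with Taobao, but the page appears to run a lot of JavaScript, so it may not be possible to retrieve the content with a client that cannot execute JavaScript. You could therefore try the selenium-webdriver gem instead of open-uri: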

https://rubygems.org/gems/selenium-webdriver/versions/2.53.4
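
As a rough sketch of that suggestion (not tested against Taobao itself), something like the following might work. It assumes a recent selenium-webdriver release (newer than the 2.53.4 linked above) plus a local Chrome and chromedriver. The method names `fetch_product_urls` and `absolutize`, the CSS selector (derived from the `list-itemList` id in the question's XPath), and the timeout are assumptions made for illustration. It also assumes you pass the ordinary search page URL, not the `json=on` / `callback=__jsonp_cb` variant from the question, which appears to request a JSONP payload rather than HTML.

```ruby
require "uri"

# Hypothetical helper: drop nil hrefs, resolve relative or protocol-relative
# links against the page URL, and remove duplicates.
def absolutize(hrefs, base)
  hrefs.compact.map { |href| URI.join(base, href).to_s }.uniq
end

# Hypothetical helper: load the search page in headless Chrome, wait until
# JavaScript has rendered at least one matching link, then collect the hrefs.
# The selector is an assumption based on the id used in the question.
def fetch_product_urls(search_url, selector: "#list-itemList a[href]", timeout: 15)
  # Required lazily so the file still loads where the gem is not installed.
  require "selenium-webdriver"

  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument("--headless")
  driver = Selenium::WebDriver.for(:chrome, options: options)

  begin
    driver.get(search_url)

    # Block until the JavaScript-rendered list contains at least one link.
    wait = Selenium::WebDriver::Wait.new(timeout: timeout)
    wait.until { driver.find_elements(css: selector).any? }

    hrefs = driver.find_elements(css: selector).map { |a| a.attribute("href") }
    absolutize(hrefs, search_url)
  ensure
    # Always close the browser, even if the wait times out.
    driver.quit
  end
end
```

Calling `fetch_product_urls` with the search page URL should return an array of strings that you can print with `puts`. If the wait times out, the selector probably does not match the rendered page and needs adjusting after inspecting it in a browser.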