Extracting the img src link from a string in Ruby

Time: 2012-06-10 16:51:17

Tags: ruby parsing web-scraping nokogiri

I have this string:

#<Fletcher::Model::Amazon alt="You Are Not a Gadget: A Manifesto (Vintage)" border="0" element="img" height="240" id="prodImage" onload="if (typeof uet == 'function') { if(typeof setCSMReq=='function'){setCSMReq('af');setCSMReq('cf');}else{uet('af');uet('cf');amznJQ.completedStage('amznJQ.AboveTheFold');} }" onmouseout="sitb_doHide('bookpopover'); return false;" onmouseover="sitb_showLayer('bookpopover'); return false;" src="http://ecx.images-amazon.com/images/I/51bpl1wA%2BaL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg" width="240">

I only want the link in the src attribute:

http://ecx.images-amazon.com/images/I/51bpl1wA%2BaL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg

How can I parse this string to get the link?

Here is the relevant code listing:

module Fletcher
  module Model
    class Amazon < Fletcher::Model::Base
      # A regular expression for determining if a url comes from a specific service/website
      def self.regexp
        /amazon\.com/
      end

      # Parse data and look for object attributes to give to object    
      def parse(data)
        super(data)

        case doc
        when Nokogiri::HTML::Document
          # Get Name
          self.name = doc.css("h1.parseasinTitle").first_string

          # Get Description
          self.description = doc.css("div#productDescriptionWrapper").first_string    

          # Get description from meta title if not found
          self.description = doc.xpath("//meta[@name='description']/@content").first_string if description.nil?

          # Get Price
          parse_price(doc.css("b.priceLarge").first_string)

          # Get Images
          self.images = doc.xpath("//table[@class='productImageGrid']//img").attribute_array
          self.image = images.first
        end            
      end
    end
  end
end

2 answers:

Answer 0 (score: 1)

require 'uri' # URI.extract is defined in the uri standard library

x = %Q{#<Fletcher::Model::Amazon alt="You Are Not a Gadget: A Manifesto (Vintage)" border="0" element="img" height="240" id="prodImage" onload="if (typeof uet == 'function') { if(typeof setCSMReq=='function'){setCSMReq('af');setCSMReq('cf');}else{uet('af');uet('cf');amznJQ.completedStage('amznJQ.AboveTheFold');} }" onmouseout="sitb_doHide('bookpopover'); return false;" onmouseover="sitb_showLayer('bookpopover'); return false;" src="http://ecx.images-amazon.com/images/I/51bpl1wA%2BaL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg" width="240">}

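# URI.extract returns an array of every URI-like substring found in x.
url = URI.extract(x)

# For this particular string, the image URL is the element at index 2.
puts url[2]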

Output:

http://ecx.images-amazon.com/images/I/51bpl1wA%2BaL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg

Hope this helps. I happened to need to do exactly this last week and looked it up.
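One caveat with the approach above: `url[2]` depends on how many URI-like tokens happen to appear before the image URL, and `URI.extract` is flagged as obsolete in newer Ruby versions. If the string always has the `src="..."` shape shown in the question, a regular expression that targets that attribute directly may be less fragile. A minimal sketch, assuming an abbreviated copy of the question's string and a hypothetical helper named `extract_src` (not part of Fletcher or Ruby itself):

```ruby
# Abbreviated copy of the inspect string from the question (some attributes omitted).
x = %q{#<Fletcher::Model::Amazon alt="You Are Not a Gadget: A Manifesto (Vintage)" element="img" id="prodImage" src="http://ecx.images-amazon.com/images/I/51bpl1wA%2BaL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg" width="240">}

# Hypothetical helper: returns the value of the first src="..." attribute
# found in the text, or nil when there is no such attribute.
def extract_src(text)
  # String#[] with a regexp and a capture index returns just that capture group.
  text[/\bsrc="([^"]*)"/, 1]
end

src = extract_src(x)
puts src
```

This only looks at the first double-quoted `src` attribute, so it would need adjusting for single-quoted or unquoted values.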

Answer 1 (score: 1)

In this case, I believe it would be: fletchedProduct.image[:src]
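The `[:src]` lookup suggests the image object can be read like a Hash keyed by attribute name, which would fit the attribute list in the question's inspect output. A minimal sketch of that access pattern, using a plain Hash as a hypothetical stand-in for `fletchedProduct.image` (the real object is a Fletcher model, and whether it supports every Hash method is an assumption):

```ruby
# Hypothetical stand-in for fletchedProduct.image: a plain Hash holding a few
# of the attributes visible in the question's inspect output.
image = {
  element: "img",
  id: "prodImage",
  src: "http://ecx.images-amazon.com/images/I/51bpl1wA%2BaL._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA240_SH20_OU01_.jpg",
  width: "240"
}

# Read the src attribute by key instead of parsing the inspect string.
puts image[:src]
```

If the real object does respond to `[]` this way, no string parsing is needed at all.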