Question

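I am scraping data from two sites. The first site scrapes fine, except that the price is repeated. The second site returns the correct data, but with a spacing problem I'm not sure how to fix.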


Here is the scraper code:

class DailyDealz::Deal
  attr_accessor :name, :price, :availability, :url

  def self.today
    # Scrape woot and meh and then return deals based on that data
    self.scrape_deals
  end

  def self.scrape_deals
    deals = []

    deals << self.scrape_woot
    deals << self.scrape_meh
    # deals << self.scrape_steepandcheap

    deals
  end

  def self.scrape_woot
    doc = Nokogiri::HTML(open("https://www.woot.com/"))

    deal = self.new
    deal.name = doc.search("h2.main-title").text.strip
    deal.price = doc.search("#todays-deal span.price").text.strip
    deal.url = doc.search("a.wantone").first.attr("href").strip
    deal.availability = true

    deal
  end

  def self.scrape_meh
    doc = Nokogiri::HTML(open("https://meh.com/"))

    deal = self.new
    deal.name = doc.search("section.features h2").text.strip
    deal.price = doc.search("#button.buy-button").text.gsub("Buy it.", "").strip
    deal.url = "https://meh.com/"
    deal.availability = true

    deal
  end
end

How do I remove the duplicated pricing from woot and the awkward spacing from meh?

Answer 1

There are two problems:

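1. #todays-deal span.price matches three elements. Make the selector more specific by changing it to
```
#todays-deal .price-holder > span.price
```
This selects only the span.price directly under the price-holder div.
2. The text contains newlines. Chain gsub(/\s+/, ' ') with strip to collapse them into single spaces.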

See this example.

One more note: #button.buy-button looks for an element whose ID is "button", not for an element of type button. Change it to button.buy-button.

Answer 2

Don't use Kernel's open, which open-uri has overridden and which is deprecated:

warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open

Use URI.open instead:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://www.woot.com/'))

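Never use search ... text. search returns a NodeSet, and if the set contains more than one node, text concatenates their output, which will make you sorry 99.9% of the time. See the documentation for text for the official statement. More on this below.

If you know you want the first or only matching node, use at. And, typically, you won't need strip if you're after a particular node: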

doc.search("h2.main-title").text.strip # => "Apple Watch Blowout!"
doc.at('h2.main-title').text           # => "Apple Watch Blowout!"

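search ... text is biting you here. Concatenating the text creates a string that forces you to jump through hoops to figure out what you've got. In this particular case it would be fairly easy to split and reassemble, but if the text didn't contain "$" and "-" it would be much harder. This particular problem is one we get asked about a lot.

The fix is simple: let map(&:text) walk the Nodes in the NodeSet and you'll get back an array of text values. And, again, you probably won't need strip afterwards.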

doc.search("#todays-deal span.price").text.strip  # => "$129.99–$279.99$129.99$279.99"
doc.search('#todays-deal span.price').map(&:text) # => ["$129.99–$279.99", "$129.99", "$279.99"]

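Again, the same situation applies. Also, Nokogiri makes it easy to access the value of an attribute using hash [] notation. It's shorter, sweeter and cleaner: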

doc.search("a.wantone").first.attr("href").strip # => "https://www.woot.com:443/plus/your-choice-apple-watch?ref=w_cnt_gw_dly_wobtn"
doc.at("a.wantone")['href'].strip                # => "https://www.woot.com:443/plus/your-choice-apple-watch?ref=w_cnt_gw_dly_wobtn"

For the second site, the problem is similar:

doc = Nokogiri::HTML(URI.open('https://meh.com/'))

Use tr to quickly remove the line endings from the text; squeeze and strip then clean up the remaining whitespace:

doc.search('section.features h2').text.strip                         # => "12-For-Tuesday: Fun Putty 1.8oz Tins\r\n                                \r\n                                    - 12 for $19"
doc.at('section.features h2').text.tr("\r\n", '').squeeze(' ').strip # => "12-For-Tuesday: Fun Putty 1.8oz Tins - 12 for $19"
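The cleanup needs no network access to try out. Here is a sketch in plain Ruby that applies the tr/squeeze/strip chain to the string shown above and compares it with the gsub(/\s+/, ' ') approach from the other answer.

```ruby
raw = "12-For-Tuesday: Fun Putty 1.8oz Tins\r\n                                \r\n                                    - 12 for $19"

# Delete CR and LF, squeeze runs of spaces to one, trim both ends.
with_tr = raw.tr("\r\n", '').squeeze(' ').strip

# Collapse any run of whitespace (spaces, tabs, newlines) into one space.
with_gsub = raw.gsub(/\s+/, ' ').strip

puts with_tr   # => 12-For-Tuesday: Fun Putty 1.8oz Tins - 12 for $19
puts with_gsub # => 12-For-Tuesday: Fun Putty 1.8oz Tins - 12 for $19

# The two differ when a newline is the only separator between words:
glued  = "a\nb".tr("\r\n", '')   # => "ab"
spaced = "a\nb".gsub(/\s+/, ' ') # => "a b"
```

Both give the same result for this input. tr deletes the line endings outright, so it can join words that were separated only by a newline, while the regex version keeps a single space there.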

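There is no "Buy it." in the HTML, so the gsub is wasted CPU time. And unless you are trying to replace or strip multiple occurrences of a string, use sub instead. It's faster.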

doc.search('button.buy-button').text.gsub('Buy it.', '').strip     # => "Sold out\r\n                                        There are no more"
doc.at('button.buy-button').text.tr("\r\n", '').squeeze(' ').strip # => "Sold out There are no more"
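For reference, a short plain-Ruby sketch of how sub and gsub differ. The input strings are made up for illustration.

```ruby
text = "Buy it. Buy it. $19"

# sub replaces only the first occurrence.
once = text.sub('Buy it.', '')
# gsub replaces every occurrence.
all = text.gsub('Buy it.', '')

puts once.strip # => Buy it. $19
puts all.strip  # => $19

# When the pattern is absent, both return the string unchanged.
unchanged = "Sold out".sub('Buy it.', '')
puts unchanged  # => Sold out
```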

How to access text nodes when scraping a website with Nokogiri
