Scrape IDs from three different divs and present them as an array?

Time: 2013-09-07 04:40:20

Tags: ruby nokogiri

I have something like this:

<div id="sub_div">
    <span class="subl">
      <div class="node">2204830011</div>
      <div class="node">1571827122</div>
      <div class="node">...</div>
      <div class="node">...</div>
      <div class="node">...</div>
    </span>
    <span class="subl">
      <div class="node">...</div>
      <div class="node">...</div>
      <div class="node">...</div>
      <div class="node">...</div>
      <div class="node">...</div>
    </span>
    <span class="subl">
      <div class="node">...</div>
      <div class="node">...</div>
      <div class="node">...</div>
    </span>
</div>

Right now, I'm doing this:

  def self.parse_nodes

    id       = @data.at_css("#n_info #clipnode").text unless @data.at_css("#n_info #clipnode").nil?
    name     = @data.at_css("#n_info .node_name").text unless @data.at_css("#n_info .node_name").nil?
    parent   = @data.at_css(".bc a").text unless @data.at_css(".bc a").nil?

    children_array = []
    children = @data.css('#sub_div')
    children.css('.subl').each do | child |
      child_id = child.css('.node').text[/[\d,]+/].to_i
      children_array ||= []
      children_array << child_id
    end 

    nodes_hash = "id: #{id}, name: #{name}, parent: #{parent}, children: #{children_array}"
    nodes_hash
  end

What I get is something like this:

[220483001115718271223064201115857511158575013463330111571879115709231157103512258019011157197311570657115706941,

220483001115718271223064201115857511158575013463330111571879115709231157103512258019011157197311570657115706941,
 220483001115718271223064201115857511158575013463330111571879115709231157103512258019011157197311570657115706941]

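I don't know why the code returns all the .node values three times. In any case, what I want to do is scrape the contents of every .node div inside each .subl and present them as an array: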

[2204830011, 1571827122, 3064201115, 8575111585, 7501346333,
0111571879, 1157092311, 5710351225, 8019011157, 1973115706,
57115706941]

Live site: http://www.findbrowsenodes.com/us/Apparel/1036682

3 Answers:

Answer 0: (score: 1)

Try the following:

children = @data.css('#sub_div')
children_array = children.css('.subl .node').map { |node| node.text.to_i }

OR

children = @data.css('#sub_div')
children_array = children.css('.subl .node').map(&:text).map(&:to_i)

Answer 1: (score: 1)

Your code produces the following output:

require 'nokogiri'

html =<<END_OF_HTML
<div id="sub_div">
    <span class="subl">
      <div class="node">2204830011</div>
      <div class="node">1571827122</div>     
      <div class="node">...</div>    
      <div class="node">...</div>     
      <div class="node">...</div>      
    </span>

    <span class="subl">
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
    </span>
    <span class="subl">
      <div class="node">1</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
   </span>
</div>
END_OF_HTML

doc = Nokogiri::HTML(html)

children_array = []
children = doc.css('#sub_div')

children.css('.subl').each do | child |
  child_id = child.css('.node').text[/[\d,]+/].to_i
  children_array ||= []
  children_array << child_id
end 

p children_array

--output:--
[22048300111571827122, 0, 1]

The numbers end up concatenated because when you write:

child.css('.node')

...you get a NodeSet containing all the divs with class="node". The text() method pulls the text out of every node in the NodeSet and joins it all together, with no separator:

require 'nokogiri'

html = "<div><span>hello</span><span>world</span></div>"
doc = Nokogiri::HTML(html)

spans = doc.css("span")
puts spans.text

--output:--
helloworld

So when you write:

child.css('.node').text

...you join many numbers into a single string.

Here is what you can do instead:

require 'nokogiri'

html =<<END_OF_HTML
<div id="sub_div">
    <span class="subl">
      <div class="node">2204830011</div>
      <div class="node">1571827122</div>     
      <div class="node">...</div>    
      <div class="node">...</div>     
      <div class="node">...</div>      
    </span>

    <span class="subl">
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
    </span>
    <span class="subl">
      <div class="node">3333333</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
   </span>
</div>
END_OF_HTML


doc = Nokogiri::HTML(html)
results = []

doc.css("#sub_div span.subl div.node").each do |div|
  if num = div.text[/[\d,]+/] 
    results << num.to_i
  end
end

p results

--output:--
[2204830011, 1571827122, 3333333]
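
The `if num = ...` guard matters because `String#[]` with a regex returns nil when nothing matches, and `nil.to_i` is 0; that is where the 0 in the earlier output came from. The sketch below isolates that step in plain Ruby. `parse_id` is a hypothetical helper, not part of the answer above, and stripping commas before `to_i` is an added assumption: the regex accepts commas, but `to_i` stops at the first one.

```ruby
# Hypothetical helper: returns an Integer, or nil when the text has no digits.
def parse_id(text)
  match = text[/[\d,]+/]      # matched substring, or nil if there is no match
  return nil if match.nil?

  match.delete(',').to_i      # drop thousands separators before converting
end

p "..."[/[\d,]+/]             # => nil
p nil.to_i                    # => 0
p "1,234".to_i                # => 1 (to_i stops at the comma)

p parse_id("2204830011")      # => 2204830011
p parse_id("...")             # => nil
p parse_id("1,234")           # => 1234
```

With a helper like this, `texts.map { |t| parse_id(t) }.compact` would drop the non-numeric entries instead of turning them into zeros.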

Answer 2: (score: 0)

Here is another approach:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-eotl
<div id="sub_div">
    <span class="subl">
      <div class="node">2204830011</div>
      <div class="node">1571827122</div>     
      <div class="node">...</div>    
      <div class="node">...</div>     
      <div class="node">...</div>      
    </span>

    <span class="subl">
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
    </span>
    <span class="subl">
      <div class="node">3333333</div>     
      <div class="node">...</div>     
      <div class="node">...</div>     
   </span>
</div>
   eotl

doc.xpath("//div[@id='sub_div']//div[@class='node'][boolean(number()) or . = 0]").map{|n| n.text.to_i}
# => [2204830011, 1571827122, 3333333]