How to get the correct values after parsing an HTML table with Ruby / Nokogiri

Time: 2011-09-21 09:13:43

Tags: html ruby html-table nokogiri

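I have spent 3 days searching and trying to get my data scraping to work. It looks like I have successfully parsed the HTML table, which looks like this: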

<tr class='ds'>
<td class='ds'>Length:</td>
<td class='ds'>1/8"</td>
</tr>
<tr class='ds'>
<td class='ds'>Width:</td>
<td class='ds'>3/4"</td>
</tr>
<tr class='ds'>
<td class='ds'>Color:</td>
<td class='ds'>Red</td>
</tr>

However, I can't seem to get it to print to CSV correctly.

The Ruby code is as follows:

Specifications = {
:length => ['Length:','length','Length'],       
:width => ['width:','width','Width','Width:'],  
:Color => ['Color:','color'], 
.......
}.freeze

def specifications
  @specifications ||= xml.css('tr.ds').map { |row|
    row.css('td.ds').map { |cell| cell.children.to_s }
  }.map { |record|
    specification = Specifications.detect { |key, value| value.include? record.first }
    [specification.to_s.titleize, record.last]
  }
end

The CSV prints as a single column (it appears to be the full array):

[["", nil], ["[:finishtype, [\"finish\", \"finish type:\", \"finish type\", \"finish type\", \"finish type:\"]]", "Metal"], ["", "1/4\""], ["[:length, [\"length:\", \"length\", \"length\"]]", "18\""], ["[:width, [\"width:\", \"width\", \"width\", \"width:\"]]", "1/2\""], ["[:styletype, [\"style:\", \"style\", \"style:\", \"style\"]]"........

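I think the problem is that I am not specifying which value to return, but when I try to specify the output, I have had no success. Any help would be greatly appreciated!

1 Answer:

Answer 0: (score: 0)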

Try changing

[specification.to_s.titleize, record.last]

to

[specification.last.first.titleize, record.last]

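detect returns the whole matching hash entry as a pair, e.g. [:length, ["Length:", "length", "Length"]], and calling to_s on that pair turns it into "[:length, [\"Length:\", \"length\", \"Length\"]]". With last.first you extract only the "Length:" part.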
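To make the point above concrete, here is a minimal sketch of what detect returns when called on a Hash. The `specs` table is a trimmed, hypothetical stand-in for the Specifications constant in the question, and titleize is left out because it comes from ActiveSupport rather than core Ruby.

```ruby
# Hypothetical, trimmed-down stand-in for the Specifications hash.
specs = {
  length: ['Length:', 'length', 'Length'],
  width:  ['width:', 'width', 'Width', 'Width:']
}.freeze

# On a Hash, detect passes each entry to the block as key and value
# and returns the first matching entry as a [key, value] array.
pair = specs.detect { |_key, aliases| aliases.include?('Length:') }

# Stringifying the whole pair gives the unwanted cell seen in the CSV output.
pair_as_string = pair.to_s

# last is the alias array, first is its first alias.
label = pair.last.first

# When nothing matches, detect returns nil, and nil.to_s is an empty string.
missing = specs.detect { |_key, aliases| aliases.include?('Depth:') }
```

The last line would also account for the rows with an empty first cell in the question's output: for labels with no entry in the table, detect returns nil, so the original code stringifies nil.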

If you run into attributes that do not match any entry in Specifications, you can drop them by changing the code to:

    xml.css('tr.ds').map{|row| row.css('td.ds').map{|cell| cell.children.to_s } }.map{|record|  
      specification = Specifications.detect{|key, value| value.include? record.first }
      [specification.last.first.titleize, record.last] if specification 
    }.compact
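As a rough illustration of the map / if / compact pattern above, the sketch below runs it on plain arrays instead of a parsed document. Both `specs` and `records` are made-up stand-ins (for the Specifications hash and for the pairs produced by the xml.css calls), and titleize is omitted since it needs ActiveSupport.

```ruby
# Stand-in for the lookup table in the question (trimmed for illustration).
specs = {
  length: ['Length:', 'length', 'Length'],
  width:  ['width:', 'width', 'Width', 'Width:'],
  color:  ['Color:', 'color']
}.freeze

# Stand-in for the [label, value] pairs extracted from the table rows.
records = [
  ['Length:', '1/8"'],
  ['Width:', '3/4"'],
  ['Weight:', '2 lb'], # no entry in specs, so it should be dropped
  ['Color:', 'Red']
]

rows = records.map { |record|
  match = specs.detect { |_key, aliases| aliases.include?(record.first) }
  # The trailing `if` makes the block return nil for unmatched records.
  [match.last.first, record.last] if match
}.compact # remove the nils left by unmatched records
```

Here `rows` ends up with three two-element arrays, one per matched record. Note that the label is the first alias stored for each key, so the width row comes out as 'width:' in this sketch; in the original code titleize is what normalizes the capitalization.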