Question

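I have this code that goes to a URL and tries to parse the 'li' elements into an array. However, I run into a problem when trying to parse anything that is not inside a 'b' tag.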

Code:

url = '(some URL)'
page = Nokogiri::HTML(open(url))
csv = CSV.open("/tmp/output.csv", 'w')

page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end

HTML:

<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>

I would like to parse all of the elements so that the output looks like this:

["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]

Answer 1

require 'nokogiri'

def main
  output = []
  page = File.open("parse.html") {|f| Nokogiri::HTML(f)}
  page.search("//li[not(@id) and not (@class)]").each do |row|
    arr = []
    result = row.text
    result.each_line { |l|
      if l.strip.length > 0
        arr << l.strip
      end
    }
    output << arr
  end
  print output
end

if __FILE__ == $PROGRAM_NAME
  main()
end

Answer 2

I eventually found the solution to my own question, so in case anyone is interested, I simply changed

row.search('b').each do |cell|

to:

row.search('text()').each do |cell|

I also changed

arr << cell.text

to:

arr << cell.text.gsub("\n", '').gsub("\r", '')

to remove all of the \n and \r characters present in the output.
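One thing to watch with the gsub approach: in the sample HTML the city, state and zip code are separated only by line breaks, so they most likely arrive as a single text node, and deleting the line breaks would merge them into one cell. The sketch below splits on the line breaks instead. The `raw_nodes` strings are assumed stand-ins for what `cell.text` returns for each text node, and `clean_cells` is a hypothetical helper name, not part of Nokogiri.

```ruby
# Assumed stand-ins for the text of each text node in the sample <li>
# (what cell.text would plausibly return; not fetched from a real page).
raw_nodes = [
  "The Company Name",
  "\r\nThe Street",
  "\r\nThe City, \r\nThe State \r\nThe Zipcode",
  "\r\n"
]

# Hypothetical helper: splits on line breaks rather than deleting them,
# so values that share one text node end up in separate cells.
def clean_cells(nodes)
  nodes
    .flat_map { |text| text.split(/\r?\n/) }  # one entry per source line
    .map { |line| line.strip.chomp(',') }     # trim whitespace and a trailing comma
    .reject(&:empty?)                         # drop blank leftovers
end

arr = clean_cells(raw_nodes)
p arr
```

Inside the original loop, the same idea would amount to collecting `cell.text` for every text node of the row and passing that list through the helper before writing it to the CSV.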

Answer 3

Based on your HTML, I would do it like this:

{{1}}

Nokogiri: parsing a table without HTML elements
