Question

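I'm new to Ruby, and I'm trying to scrape a table and write it to a CSV file. I want to scrape the table at this URL: http://www.inc.com/inc5000/list/2015/

I need to record all the information in the table, which is everything from td class="c1" through td class="c8". My while loop doesn't work properly, so I can't automate it.

I'll post my current code, but it's basically nothing.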

require 'watir'
require 'open-uri'
require 'net/http'
require 'csv'
require 'nokogiri'

b = Watir::Browser.new :firefox
b.goto 'http://www.inc.com/inc5000/list/2015/'
acount = 49
p = Nokogiri::HTML.parse(b.html)
company = p.css(css).text
company = []
puts css
#right > table > tbody > tr:nth-child(1) > td.c2
"#right > table > tbody > tr:nth-child(1) > td"
csscompany1 = ".cd2"
csscompany1 = ".cd"
css1 = "#right > table > tbody > tr:nth-child"
css2 = "(#count)"
css3 = " > td.c2"
while count != 49 do
acss = "#{css1}#{css2}#{css3}
company.push(p.css(acss).text)
count += 1
end

Answer 1

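It looks like you're over-specifying the CSS selectors:

You want the td elements with classes c1 through c8, but those are simply all of the cells in each row.
The script iterates over 50 rows of the table, but the table actually has more rows than that. Every 10th row is followed by a blank row.

The script can be greatly simplified by:

Noticing that the rows you care about have the class data_row. Selecting on that class takes care of ignoring the blank rows.
Using Ruby's built-in enumerable methods, which iterate over a collection without you having to track indices.

Applying these principles: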
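
To make the second point concrete, here is a minimal sketch on a plain Ruby array rather than on the real page. The `rows` array is a hypothetical stand-in for the table, with an empty array playing the role of a blank spacer row; it is not how Nokogiri represents rows.

```ruby
# Hypothetical stand-in for the table: an empty array marks a blank spacer row.
rows = [
  ["1", "Ultra Mobile"],
  [],
  ["2", "TRYFACTA"]
]

# Index-based loop: the counter has to be initialised, compared,
# and incremented by hand, which is where off-by-one bugs creep in.
names_with_index = []
i = 0
while i < rows.length
  names_with_index.push(rows[i][1]) unless rows[i].empty?
  i += 1
end

# Enumerable version: reject drops the blank rows,
# and map collects one value per remaining row.
names = rows.reject(&:empty?).map { |row| row[1] }

puts names.inspect
```

Both approaches collect the same values, but the enumerable version has no counter to get wrong.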

require 'watir'
require 'nokogiri'

b = Watir::Browser.new :firefox
b.goto 'http://www.inc.com/inc5000/list/2015/'
p = Nokogiri::HTML.parse(b.html)

# Get the relevant rows
data_rows = p.css('tr.data_row')

# Iterate over each row
data = data_rows.map { |data_row|
  # For each row, collect the text of each td element
  data_row.css('td').map(&:text)
}

# data will be 2D matrix of the table
data
#=> [
#=>    ["1", "Ultra Mobile", "100,849%", "$118.2m", "Telecommunications", "California", "Los Angeles", "105"],
#=>    ["2", "TRYFACTA", "28,365%", "$34.4m", "IT Services", "California", "San Francisco", "221"],
#=>    etc.
#=> ]
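
Since the goal was to get the table into a CSV file, one possible next step is to hand the 2D array to Ruby's standard csv library. This is only a sketch: the file name `inc5000.csv` and the `c1` to `c8` header labels are placeholders chosen here (the labels just reuse the cell class names from the question), and `data` is filled with the two sample rows shown above instead of a live scrape.

```ruby
require 'csv'

# Sample rows copied from the output above; in the real script this would be
# the `data` array built from the Nokogiri rows.
data = [
  ["1", "Ultra Mobile", "100,849%", "$118.2m", "Telecommunications", "California", "Los Angeles", "105"],
  ["2", "TRYFACTA", "28,365%", "$34.4m", "IT Services", "California", "San Francisco", "221"]
]

# Placeholder header labels (assumption): c1..c8, after the td class names.
headers = (1..8).map { |n| "c#{n}" }

# Hypothetical output path.
csv_path = 'inc5000.csv'

# CSV.open yields a writer; each `<<` appends one row and
# quotes fields that contain commas, such as "100,849%".
CSV.open(csv_path, 'w') do |csv|
  csv << headers
  data.each { |row| csv << row }
end
```

Reading the file back with `CSV.read(csv_path)` should return the header row followed by the data rows, which is a quick way to check that values containing commas were quoted correctly.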

Trying to scrape a table with Watir and Nokogiri
