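I'm new to Ruby, and I'm trying to scrape a table and write it to a CSV file. I want to scrape the table from this URL: http://www.inc.com/inc5000/list/2015/
I need to record all of the information in the table, which would be td class="c1" through td class="c8". My while loop doesn't work properly, so I can't automate it.
I'll post my current code, but there's basically nothing to it.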
require 'watir'
require 'open-uri'
require 'net/http'
require 'csv'
require 'nokogiri'
b = Watir::Browser.new :firefox
b.goto 'http://www.inc.com/inc5000/list/2015/'
acount = 49
p = Nokogiri::HTML.parse(b.html)
company = p.css(css).text
company = []
puts css
#right > table > tbody > tr:nth-child(1) > td.c2
"#right > table > tbody > tr:nth-child(1) > td"
csscompany1 = ".cd2"
csscompany1 = ".cd"
css1 = "#right > table > tbody > tr:nth-child"
css2 = "(#count)"
css3 = " > td.c2"
while count != 49 do
acss = "#{css1}#{css2}#{css3}
company.push(p.css(acss).text)
count += 1
end
Answer 0 (score: 0)
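It looks like you are over-specifying the CSS selectors. You want the td elements with classes c1 through c8, but those are all of the cells in a row, so there is no need to address each cell individually. The script can be simplified considerably by selecting the rows by their data_row class and then collecting every td in each row. Selecting on data_row also takes care of ignoring the empty rows. Applying these principles: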
require 'watir'
require 'nokogiri'
b = Watir::Browser.new :firefox
b.goto 'http://www.inc.com/inc5000/list/2015/'
# Parse the rendered page; named doc so it does not shadow Kernel#p
doc = Nokogiri::HTML.parse(b.html)
# Get the relevant rows
data_rows = doc.css('tr.data_row')
# Iterate over each row
data = data_rows.map { |data_row|
# For each row, collect the text of each td element
data_row.css('td').map(&:text)
}
# data is now a 2D array: one inner array of cell text per table row
data
#=> [
#=> ["1", "Ultra Mobile", "100,849%", "$118.2m", "Telecommunications", "California", "Los Angeles", "105"],
#=> ["2", "TRYFACTA", "28,365%", "$34.4m", "IT Services", "California", "San Francisco", "221"],
#=> etc.
#=> ]
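
The question also asks for the result in a CSV file, which the code above stops short of. A minimal sketch of that last step, using Ruby's standard csv library, might look like the following. To keep it self-contained, data is filled in here with the two sample rows shown above; in the real script it would be the array built from the page. The file name inc5000_2015.csv is a placeholder chosen for illustration, not something taken from the site.

```ruby
require 'csv'

# Sample rows copied from the output above. In the real script, `data`
# would be the array produced by mapping over the tr.data_row elements.
data = [
  ["1", "Ultra Mobile", "100,849%", "$118.2m", "Telecommunications", "California", "Los Angeles", "105"],
  ["2", "TRYFACTA", "28,365%", "$34.4m", "IT Services", "California", "San Francisco", "221"]
]

# Hypothetical output path, used only for this example.
csv_path = 'inc5000_2015.csv'

# Build the CSV text: each inner array becomes one line. The CSV library
# quotes fields that contain commas, such as "100,849%".
csv_text = CSV.generate do |csv|
  data.each { |row| csv << row }
end

# Write the text to disk in one call.
File.write(csv_path, csv_text)
```

Building the text with CSV.generate keeps the formatting step separate from the file write. If the table is large, CSV.open(csv_path, 'w') with the same csv << row loop writes rows straight to the file instead of building the whole string first.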