How to fix a nokogiri (Yahoo) table scraper?

Time: 2015-02-26 20:09:25

Tags: ruby csv xpath

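18 months ago we built a small table scraper using ruby and nokogiri that writes its output to a csv file. Changes to the page structure have since made the output less than optimal. Here is a simplified version of what we use: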

#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600"#mar
doc = Nokogiri::HTML(open(url))
csv = CSV.open("output.csv", 'w')
doc.xpath('//table//tr').each do |row|
  tarray = [] # temporary array
  row.xpath('td').each do |cell|
    tarray << cell.text # build the array for this row of data
  end
  csv << tarray # write the row out to the csv file
  # puts "#{row}"
end

csv.close

Current output:

"^M

^M

^M

✕^M

[modify]^M

                    ^M

                "

"^M

        50.00^M

    ","^M

        FISV150320C00050000^M

    ","^M

        19.70^M

Needless to say, output like this is hard to work with.

After trying many combinations of xpath and the csv library, we finally realized it was time to ask for help.

Consider the following snippet, which does not use csv:

#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'
url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600"#mar
#url = "http://finance.yahoo.com/q/op?s=FISV&date=1434672000"#jun
doc = Nokogiri::HTML(open(url))

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s)   {2,}/m, '\1'), "\", "
  end
  print "\n"
end

It produces output similar to:

" 50.00 ", " FISV150320C00050000 ", " 19.70 ", " 26.90 ", " 30.50 ", " 0.00 ", " 0.00% ", " 5 ", " 0 ", " 83.20% ", 

What changes are needed in the top version (the one that outputs to csv) to make it work better?

1 Answer:

Answer 0: (score: 0)

Assuming you want to dump the rows of the "Calls" and "Puts" tables to CSV, you can do this:

require 'csv'
require 'nokogiri'
require 'open-uri'

def options_to_csv(url)
  CSV.generate do |csv|
    doc = Nokogiri::HTML(URI.open(url)) # URI.open: plain open(url) no longer fetches URLs on Ruby 3.0+
    doc.xpath('//tr[@data-row]').each do |tr|
      csv << tr.xpath('td').map { |td| td.text.strip }
    end
  end
end

url = 'http://finance.yahoo.com/q/op?s=FISV&date=1426809600'
options_to_csv(url) # =>
# 50.00,FISV150320C00050000,19.70,26.90,29.00,0.00,0.00%,5,0,110.06%
# 55.00,FISV150320C00055000,11.91,22.00,24.00,0.00,0.00%,21,21,90.33%
# 60.00,FISV150320C00060000,17.48,18.30,19.00,0.00,0.00%,5,22,71.97%
# 65.00,FISV150320C00065000,10.70,13.30,14.00,0.00,0.00%,26,85,54.49%
# 70.00,FISV150320C00070000,8.90,8.40,8.90,0.00,0.00%,1,504,34.42%
# 75.00,FISV150320C00075000,3.80,3.70,4.10,0.00,0.00%,1,318,22.07%
# 80.00,FISV150320C00080000,0.55,0.45,0.60,0.00,0.00%,24,1435,14.55%
# 50.00,FISV150320P00050000,0.55,0.00,0.15,0.00,0.00%,6,10,83.98%
# 55.00,FISV150320P00055000,0.05,0.00,0.15,0.00,0.00%,3,14,68.16%
# 60.00,FISV150320P00060000,0.15,0.00,0.20,0.00,0.00%,1,84,56.06%
# 65.00,FISV150320P00065000,0.20,0.00,0.20,0.00,0.00%,3,166,47.56%
# 70.00,FISV150320P00070000,0.10,0.00,0.20,0.00,0.00%,14,472,32.13%
# 75.00,FISV150320P00075000,0.20,0.15,0.30,0.00,0.00%,42,557,18.80%
# 80.00,FISV150320P00080000,1.60,1.75,2.00,0.00,0.00%,22,91,15.06%
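
The change that does most of the cleanup is probably `td.text.strip`. The ^M markers in the question's output are carriage returns, and CSV has to quote any field that contains line breaks, so unstripped cells spill across several lines. A minimal sketch of that effect, using only the standard library; `raw_cells` is made-up sample data modeled on the output in the question, not something fetched from Yahoo:

```ruby
require 'csv'

# Hypothetical cell text modeled on the question's output: values padded
# with carriage returns, newlines and spaces, as td.text would return them.
raw_cells = [
  "\r\n        50.00\r\n    ",
  "\r\n        FISV150320C00050000\r\n    "
]

# Written as-is, every field contains line breaks, so CSV quotes it and
# a single row spills across several lines of the output.
messy = CSV.generate { |csv| csv << raw_cells }

# Stripping each cell first yields one compact line per row.
clean = CSV.generate { |csv| csv << raw_cells.map(&:strip) }

puts messy.inspect
puts clean # prints: 50.00,FISV150320C00050000
```

The same `map(&:strip)` idea could be applied to `tarray` in the original script before the `csv << tarray` line.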

Note that these tables also have the ids "optionsCallsTable" and "optionsPutsTable", so you can use that information to separate the rows easily.