Merging CSV files on a common field with ruby / fastercsv

Time: 2011-10-30 18:50:54

Tags: ruby fastercsv

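I have a 'master' file with several columns: 1 2 3 4 5. I have a few other files, each with fewer rows than the master, and each with columns: 1 6. I would like to merge these files by matching on the column 1 field and add column 6 to the master. I have seen some python / UNIX solutions, but I would prefer to use ruby / fastercsv if it is a good fit. I would appreciate any help.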

3 answers:

Answer 0 (score: 2)

FasterCSV is now the default CSV implementation in Ruby 1.9. This code is untested, but it should work.

require 'csv'

master = CSV.read('master.csv')      # Reads the master file into an array of rows
master.each { |row| row.push('') }   # Adds an empty extra column to every row
Dir.glob('*.csv').each do |path|     # Goes through all CSV files in the directory
  # Skips the master file, and the output file left over from an earlier run
  next if path == 'master.csv' || path == 'output.csv'
  CSV.foreach(path) do |line|        # Goes through each line of the file
    row = master.assoc(line[0])      # Finds the master row with the same first field
    row[-1] = line[1] if row         # Updates the last column if a match was found
  end
end

# Writes the modified master; the block form closes the file when it finishes
CSV.open('output.csv', 'wb') do |csv|
  master.each { |row| csv << row }
end

Answer 1 (score: 1)

$ cat j4.csv
how, now, brown, cow, f1
now, is, the, time, f2
one, two, three, four, five
xhow, now, brown, cow, f1
xnow, is, the, time, f2
xone, two, three, four, five
$ cat j4a.csv
how, b
one, d
$ cat hj.rb
require 'pp'
require 'rubygems'
require 'fastercsv'

pp(
  FasterCSV.read('j4a.csv').inject(
    FasterCSV.read('j4.csv').inject({}) do |m, e|
      m[e[0]] = e
      m
    end) do |m, e|
    k = e[0]
    m[k] << e.last if m[k]
    m
  end.values)
$ ruby hj.rb
[["now", " is", " the", " time", " f2"],
 ["xhow", " now", " brown", " cow", " f1"],
 ["xone", " two", " three", " four", " five"],
 ["how", " now", " brown", " cow", " f1", " b"],
 ["one", " two", " three", " four", " five", " d"],
 ["xnow", " is", " the", " time", " f2"]]

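This works by mapping the master file into a hash keyed on the first column; after that it is just a matter of looking up keys from the other files. As written, the code appends the last column when the keys match. Since you have several non-master files, you can adapt the idea in one of two ways: replace FasterCSV.read('j4a.csv') with a method that reads each file and concatenates them all into a single array of arrays, or save the result of the inner inject (the master hash) and apply each of the other files to it in a loop.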
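The second adaptation (keep the master hash and apply each other file to it in a loop) could look roughly like the sketch below. It uses the standard csv library rather than the fastercsv gem, and the helper names index_by_key and apply_rows! are made up for illustration. The data is parsed from inline strings so the sketch runs on its own; the second small table is hypothetical and stands in for an additional non-master file.

```ruby
require 'csv'

# Index the master rows by their first column (the join key).
# Rows are duplicated so the original array is left untouched.
def index_by_key(master_rows)
  master_rows.each_with_object({}) { |row, index| index[row[0]] = row.dup }
end

# Append the last column of every matching row to the indexed master row.
# Rows whose key is not in the master are ignored.
def apply_rows!(index, other_rows)
  other_rows.each do |row|
    target = index[row[0]]
    target << row.last if target
  end
  index
end

master = CSV.parse("how,now,brown,cow,f1\nnow,is,the,time,f2\none,two,three,four,five\n")
others = [
  CSV.parse("how,b\none,d\n"), # same shape as j4a.csv
  CSV.parse("now,c\n")         # hypothetical second non-master file
]

index = index_by_key(master)
others.each { |rows| apply_rows!(index, rows) }

merged = index.values
merged.each { |row| puts row.to_csv }
```

With real files, CSV.read(path) returns the same array-of-arrays shape as CSV.parse, so the two inline tables can be swapped for file reads. On Ruby 1.9 and later a Hash keeps insertion order, so the merged rows come out in master order.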

Answer 2 (score: 0)

temp = master.assoc(line[0]) 

The line above is very slow. Array#assoc scans master linearly on every call, so the overall complexity is at least O(n^2).

I would use the following procedure instead:

  1. For the 1 6 csv, convert it into one big hash with column 1 as the key and column 6 as the value, named 1_to_6_hash
  2. Loop over the 1 2 3 4 5 csv row by row, setting row[6] = 1_to_6_hash[row[1]]

This brings the complexity down to O(n).
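A minimal sketch of those two steps follows, assuming the join key is the first field of every row. The names build_lookup and merge_master are invented here (1_to_6_hash itself is not a legal Ruby identifier, since it starts with a digit), and the sample rows are made up.

```ruby
require 'csv'

# Step 1: build one lookup hash (column 1 => column 6) from all the small tables.
def build_lookup(small_tables)
  small_tables.each_with_object({}) do |rows, lookup|
    rows.each { |key, value| lookup[key] = value }
  end
end

# Step 2: walk the master once, appending the looked-up value to each row.
# Rows without a match get nil, which CSV writes as an empty field.
def merge_master(master_rows, lookup)
  master_rows.map { |row| row + [lookup[row[0]]] }
end

# Made-up sample data standing in for the master file and two 1 6 files.
master_rows  = CSV.parse("a,2,3,4,5\nb,2,3,4,5\nc,2,3,4,5\n")
small_tables = [CSV.parse("a,x\n"), CSV.parse("c,z\n")]

lookup = build_lookup(small_tables)
merged = merge_master(master_rows, lookup)
puts merged.map(&:to_csv).join
```

Each Hash lookup takes constant time on average, so the total work grows with the number of rows read rather than with the product of the file sizes.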