How do I find similar lines in two CSV files?

Asked: 2012-02-01 19:07:56

Tags: ruby-on-rails ruby csv

Here is my code, but it takes forever on large files:

require 'rubygems'
require "faster_csv"

fname1 =ARGV[0]
fname2 =ARGV[1]
if ARGV.size!=2
    puts "Display common lines in the two files \n Usage : ruby user_in_both_files.rb <file1> <file2> "
    exit 0
end

puts "loading the CSV files ..."
file1=FasterCSV.read(fname1, :headers => :first_row)
file2=FasterCSV.read(fname2, :headers => :first_row)
puts "CSV files loaded"

#puts file2[219808].to_s.strip.gsub(/\s+/,'')

lineN1=0
lineN2=0
# count how many common lines
similarLines=0
file1.each do |line1|
    lineN1=lineN1+1
    #compare line 1 to all line from file 2
    lineN2=0
    file2.each do |line2|
        puts "file1:l#{lineN1}|file2:l#{lineN2}"
        lineN2=lineN2+1
        if ( line1.to_s.strip.gsub(/\s+/,'') == line2.to_s.strip.gsub(/\s+/,'') ) 
            puts "file1:l#{line1}|file2:l#{line2}->#{line1}\n"
            similarLines=similarLines+1
        end
    end 
end
puts "#{similarLines} similar lines."

2 answers:

Answer 0 (score: 2)

Ruby already provides set operations on arrays:

a_ary = [1,2,3]
b_ary = [3,4,5]
a_ary & b_ary # => [3]

So you should try:

puts "loading the CSV files ..."
file1 = FasterCSV.read(fname1, :headers => :first_row)
file2 = FasterCSV.read(fname2, :headers => :first_row)
puts "CSV files loaded"

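# With :headers set, FasterCSV.read returns a table of rows rather than a plain Array,
# so convert each row to an array of its fields before intersecting.
common_lines = file1.map { |row| row.fields } & file2.map { |row| row.fields }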
puts common_lines.size

If you need to preprocess the arrays, do it while loading them:

file1 = FasterCSV.read(fname1, :headers => :first_row).map{ |l| l.to_s.strip.gsub(/\s+/, '') }
file2 = FasterCSV.read(fname2, :headers => :first_row).map{ |l| l.to_s.strip.gsub(/\s+/, '') }
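To illustrate the intersection idea end to end, here is a minimal sketch. It assumes Ruby 1.9 or later, where FasterCSV became the standard `csv` library, and it uses inline sample data instead of real files. The helper `normalized_rows` and the sample strings are hypothetical. It joins the parsed fields with commas rather than calling `to_s` on the row, so quoted fields are normalized slightly differently than in the snippets above.

```ruby
require 'csv'

# Hypothetical helper: returns every data row of a CSV string as a
# whitespace-free string, similar to the gsub(/\s+/, '') step above.
def normalized_rows(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    row.fields.join(',').gsub(/\s+/, '')
  end
end

# Made-up sample data standing in for the two files.
file1_text = "id,name\n1, Alice\n2,Bob\n2,Bob\n"
file2_text = "id,name\n2, Bob \n3,Carol\n1,Alice\n"

rows1 = normalized_rows(file1_text)
rows2 = normalized_rows(file2_text)

# Array#& keeps only the elements present in both arrays, without duplicates.
common = rows1 & rows2
puts common.inspect
puts "#{common.size} similar lines."
```

Note that `&` returns each common line once, so when lines repeat within a file the count can be lower than the one produced by the nested loop in the question, which counts every matching pair.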

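Answer 1 (score: 1)

You are running gsub on every line of file2 each time you loop through file1. I would do that first, and then do the comparison.

Modify it to something like this (untested):

file1lines = []
file1.each do |line1|
    # append the normalized line (<<), rather than overwriting the array
    file1lines << line1.to_s.strip.gsub(/\s+/, '')
end

# Do the same for `file2lines`

file1lines.each do |line1|
    lineN1=lineN1+1
    #compare line 1 to all line from file 2
    lineN2=0
    file2lines.each do |line2|
        puts "file1:l#{lineN1}|file2:l#{lineN2}"
        lineN2=lineN2+1
        if ( line1 == line2 )
            puts "file1:l#{line1}|file2:l#{line2}->#{line1}\n"
            similarLines=similarLines+1
        end
    end
end

Unless you really need them, I would also get rid of all the puts calls in the loop. If the files are large, they are probably slowing it down.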
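Normalizing first still leaves a nested loop, so the number of comparisons grows with the product of the two file sizes. One possible way around that, which is not part of the answer above, is to index the normalized lines of one file in a Hash and look each line of the other file up in it. The sketch below assumes Ruby 1.9 or later with the standard `csv` library. The names `normalize` and `count_similar` and the sample data are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: joins a parsed row's fields and strips all whitespace.
def normalize(row)
  row.fields.join(',').gsub(/\s+/, '')
end

# Counts matching (file 1 row, file 2 row) pairs, as the nested loop does,
# but with a single pass over each input.
def count_similar(text1, text2)
  # Map each normalized row of file 2 to the number of times it occurs.
  counts = Hash.new(0)
  CSV.parse(text2, headers: true).each { |row| counts[normalize(row)] += 1 }

  # Each row of file 1 contributes one match per identical row in file 2.
  similar = 0
  CSV.parse(text1, headers: true).each { |row| similar += counts[normalize(row)] }
  similar
end

# Made-up sample data standing in for the two files.
file1_text = "id,name\n1, Alice\n2,Bob\n2,Bob\n"
file2_text = "id,name\n2, Bob \n3,Carol\n1,Alice\n"

similar_lines = count_similar(file1_text, file2_text)
puts "#{similar_lines} similar lines."
```

For real files, `CSV.foreach(path, headers: true)` could replace `CSV.parse` so that rows are read one at a time instead of loading each file fully into memory.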