Question

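I'm an absolute newcomer to Ruby (and using 1.9.1), so any help is appreciated. Everything I've learned about Ruby has come from Google. I'm trying to compare two arrays of hashes, and because of their size it takes far too long and flirts with running out of memory. Any help would be appreciated.

I have a class (ParseCSV) with several methods (initialize, open, compare, strip, output). The way I'm using it right now is as follows (this does pass the tests I've written, just with a much smaller data set):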


file1 = ParseCSV.new("some_file")
file2 = ParseCSV.new("some_other_file")

file1.open  # reads the file contents into an array of hashes through the CSV library
file1.strip # removes the extra entries from each array element; normally there are fifty in each one, so this just helps reduce memory consumption

file2.open
file2.compare(file1.storage) # @storage is the array of hashes built by the open method

file2.output

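What I'm struggling with is the compare method. Working on smaller data sets is no big deal at all, and it runs fast enough. But in this case I'm comparing about 400,000 records (all read into an array of hashes) against a file that has about 450,000 records. I'm trying to speed this up. Also, I can't run the strip method on file2. Here is how I do it now: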


def compare(x)
    #obviously just a verbose message
    puts "Comparing and leaving behind non matching entries"

    x.each do |row|
        #@storage is the array of hashes
        @storage.each_index do |y|
            if row[@opts[:field]] == @storage[y][@opts[:field]]
                @storage.delete_at(y)
            end
        end
    end
end

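Hopefully that makes sense. I know this is going to be a slow process, simply because it has to iterate over 400,000 rows 440,000 times each. But do you have any other ideas on how to speed it up and possibly reduce memory consumption?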

Answer 1

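Yikes, that's going to be O(n²) running time. Nasty. With roughly 400,000 and 450,000 records, that is on the order of 180 billion comparisons.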

A better bet is to use the built-in Set class.

The code would look something like:

require 'set'

file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")

# Set.new makes each row a member; Set[file1_content] would build a one-element set
file1_set = Set.new(file1_content)

unique_elements = file1_set - file2_content

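This assumes the files themselves have unique content. It should work in the general case, but there may be quirks depending on what your data looks like and how you parse it. As long as the rows can be compared with ==, it should help you out (strictly speaking, Set matches members through hash and eql?, which for hashes in Ruby 1.9 compare by content).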
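
To make the comparison rule concrete, here is a minimal sketch with two tiny made-up arrays of hashes. The :id and :name keys and their values are assumptions for illustration, not the asker's data.

```ruby
require 'set'

# Hypothetical sample rows; the real data would come from the CSV files.
file1_content = [
  { :id => 1, :name => "a" },
  { :id => 2, :name => "b" },
  { :id => 3, :name => "c" }
]
file2_content = [
  { :id => 2, :name => "b" },
  { :id => 3, :name => "changed" }
]

# Each row becomes one member of the set.
file1_set = Set.new(file1_content)

# Set#- accepts any enumerable and drops the members found in it.
unique_elements = file1_set - file2_content

unique_elements.each { |row| p row }
```

Rows are matched on their entire content here, so the row with :id 3 survives because its :name differs between the two arrays. Matching on a single field needs a set of just that field's values.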

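Using a set will be much faster than doing nested loops to iterate over the file contents.

(And yes, I have actually done this to process files with around 2 million lines, so it should be able to handle your case, eventually. If you're doing heavy data munging, though, Ruby may not be the best choice of tool.)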
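
Since the question also worries about memory, one option is to avoid holding both files in memory as arrays of hashes. The sketch below streams a file with CSV.foreach and keeps only the values of the comparison column in a Set. The helper names keys_from_csv and each_unmatched_row, the "id" column, and the assumption that both files have a header row are all hypothetical.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: collect one column's values without keeping whole rows.
# Assumes the CSV file has a header row that contains `field`.
def keys_from_csv(path, field)
  keys = Set.new
  CSV.foreach(path, :headers => true) do |row|
    keys << row[field]
  end
  keys
end

# Hypothetical helper: yield only the rows of `path` whose `field` value
# is not in `keys`, one row at a time.
def each_unmatched_row(path, field, keys)
  CSV.foreach(path, :headers => true) do |row|
    yield row unless keys.include?(row[field])
  end
end

# Example use (file names taken from the question, column name assumed):
#   keys = keys_from_csv("some_file", "id")
#   each_unmatched_row("some_other_file", "id", keys) { |row| puts row.to_csv }
```

Only the key set and one row at a time are held in memory this way, at the cost of reading each file sequentially from disk.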

Answer 2

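Here is a script that compares two ways of doing it: the original compare() and new_compare(). new_compare uses more of the built-in Enumerable methods. Since they are implemented in C, they'll be faster.

I created a constant called Test::SIZE so I could run the benchmarks with different numbers of hashes. The results are at the bottom. The difference is huge.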

require 'benchmark'

class Test
  SIZE = 20000
  attr_accessor :storage
  def initialize
    file1 = []
    SIZE.times { |x| file1 << { :field => x, :foo => x } }
    @storage = file1
    @opts = {}
    @opts[:field] = :field
  end

  def compare(x)
    x.each do |row|
      @storage.each_index do |y|
        if row[@opts[:field]] == @storage[y][@opts[:field]]
          @storage.delete_at(y)
        end
      end
    end
  end

  def new_compare(other)
    other_keys = other.map { |x| x[@opts[:field]] }
    @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
  end

end

storage2 = []
# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }
# And the rest won't
(Test::SIZE-10).times { |x| storage2 << { :field => x+100000000, :foo => x} }

Benchmark.bm do |b|
  b.report("original compare") do
    t1 = Test.new
    t1.compare(storage2)
  end
end

Benchmark.bm do |b|
  b.report("new compare") do
    t1 = Test.new
    t1.new_compare(storage2)
  end
end

Results:

Test::SIZE = 500
                       user     system      total        real
original compare   0.280000   0.000000   0.280000 (  0.285366)
new compare        0.020000   0.000000   0.020000 (  0.020458)

Test::SIZE = 1000
                       user     system      total        real
original compare  28.140000   0.110000  28.250000 ( 28.618907)
new compare        1.930000   0.010000   1.940000 (  1.956868)

Test::SIZE = 5000
                       user     system      total        real
original compare 113.100000   0.440000 113.540000 (115.041267)
new compare        7.680000   0.020000   7.700000 (  7.739120)

Test::SIZE = 10000
                       user     system      total        real
original compare 453.320000   1.760000 455.080000 (460.549246)
new compare       30.840000   0.110000  30.950000 ( 31.226218)
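
One further variation on new_compare, offered as a sketch rather than something measured here: other_keys is an Array, so each include? call still scans it linearly. Putting the keys in a Set turns each lookup into a hash lookup instead. The method name set_compare and its standalone arguments are assumptions for illustration.

```ruby
require 'set'

# Hypothetical variant of new_compare: remove from `storage` every row
# whose `field` value also appears in `other`. Mutates and returns `storage`.
def set_compare(storage, other, field)
  other_keys = Set.new(other.map { |row| row[field] })
  storage.reject! { |row| other_keys.include?(row[field]) }
  storage
end

storage = [{ :field => 1 }, { :field => 2 }, { :field => 3 }]
other   = [{ :field => 2 }, { :field => 99 }]

p set_compare(storage, other, :field)
```

The method returns storage explicitly because reject! returns nil when nothing was removed.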

Ruby: comparing two arrays of hashes
