Question

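I wrote a quick Python script to compare two files, each containing unordered hashes, to verify that both files are identical aside from order. Then I rewrote it in Ruby for educational purposes.

The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.

I have a feeling this is most likely due to my lack of Ruby knowledge. Any ideas on what I'm doing wrong?

Environment is Windows XP x64, Python 2.6, Ruby 1.8.6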

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if not line in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if not line in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key{ |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

Edit: To give an idea of scale, each file is pretty big, over 900,000 hashes.

Progress

Using nathanvda's suggestion, here is the optimized Ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.to_a.select { |z| z != 0}.size   

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

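On Windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223 seconds.

On a Linux VM (!) running Ruby 1.9.1, the original version ran in 81 seconds, about a third of the time taken on Windows with 1.8.7. Interestingly, the optimized version took longer, at 89 seconds. Note that the while line = ... form was necessary because of memory constraints.

On Windows with Ruby 1.9.1, the original version took 457 seconds and the optimized version took 543 seconds.

On Windows with jRuby, the original version took 45 seconds and the optimized version took 43 seconds.

I'm somewhat surprised by the results; I was expecting 1.9.1 to be an improvement over 1.8.7.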

Answer 1

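It's probably because dicts in Python are much faster than hashes in Ruby.

I just ran a quick test: building a hash of 12345678 items took about 3 times as long in Ruby 1.8.7 as in Python. Ruby 1.9 took about twice as long as Python.

Here is how I tested it.

Python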

$ time python -c "d={}
for i in xrange(12345678):d[i]=1"

Ruby

$ time ruby -e "d={};12345678.times{|i|d[i]=1}"

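But that is not enough to account for your discrepancy.

Perhaps the file I/O is worth looking into: comment out all the hash code and see how long the empty loops take to run over the files.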
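A minimal sketch of that I/O-only check in Ruby, assuming a generated temporary file stands in for the real input (the helper names make_sample_file and count_lines are made up for illustration). It reads every line and does no hash work, so the time reflects file reading alone:

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical stand-in for file1.txt: a temp file of fake 32-character hash lines.
def make_sample_file(line_count)
  file = Tempfile.new('hashes')
  line_count.times { |i| file.puts(format('%032x', i)) }
  file.flush
  file
end

# Read every line but do no hash work, so only the I/O cost is measured.
def count_lines(path)
  count = 0
  File.open(path, 'r') do |f|
    f.each_line { |_line| count += 1 }
  end
  count
end

sample = make_sample_file(10_000)
lines_read = nil
elapsed = Benchmark.realtime { lines_read = count_lines(sample.path) }
puts "Read #{lines_read} lines in #{elapsed} seconds"
sample.close! # close and delete the temp file
```

If this loop alone accounts for most of the run time on the real files, the hash code is not the bottleneck.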

Here is another version in Python using defaultdict and context managers:

from collections import defaultdict
hashes = defaultdict(int)

with open('c:\\file1.txt', 'r') as f:
    for line in f:
        hashes[line] += 1

print "Done file 1"

with open('c:\\file2.txt', 'r') as f:
    for line in f:
        if line in hashes:
            hashes[line] -= 1
        else:
            print "Hash not found!"

print "Done file 2"

num_errors = 0
for key,value in hashes.items():  # hashes.iteritems() might be better here
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Answer 2

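I've found Ruby's reference implementation (well, Ruby) to be, unscientifically speaking, dog slow.

If you have the opportunity, try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up dramatically.

I'd be most interested in your results.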

Answer 3

On the Python side, you could iterate over the dictionary items like this:

for key, value in hashes.iteritems():
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

Also:

for line in f.readlines():
    hashes[line] = hashes.setdefault(line, 0) + 1

...but I can't help you with the Ruby side, other than to suggest you hunt down a profiler.

Answer 4

I'm not a Ruby expert, so someone please correct me if I'm wrong:

I see a small optimization potential.

If you say

hashes = Hash.new(0)

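then references to undefined hash keys will return 0 and store the key, and you can just do

hashes[line] += 1

every time, without the enclosing if and else.

Caveat: Untested!

If storing the key does not happen automatically, there is another Hash constructor that takes a block, where you can do it explicitly.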
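The two constructor forms can be checked directly. The sketch below uses made-up keys and variable names. With Hash.new(0), a plain read returns the default without storing the key, and it is the += assignment that stores it. The block form stores the key on the first read:

```ruby
# Hash.new(0): missing keys read as 0, but a plain read stores nothing.
counts = Hash.new(0)
first_read = counts['abc']
stored_after_read = counts.has_key?('abc')

# counts[k] += 1 expands to counts[k] = counts[k] + 1,
# so the assignment is what stores the key.
counts['abc'] += 1
stored_after_increment = counts.has_key?('abc')

# Block form: the block runs on a missing key and can store it explicitly.
auto = Hash.new { |hash, key| hash[key] = 0 }
auto['xyz']
stored_by_block = auto.has_key?('xyz')

puts first_read              # 0
puts stored_after_read       # false
puts stored_after_increment  # true
puts stored_by_block         # true
```

Because reads do not store keys under Hash.new(0), a later has_key? check against the second file still reports lines that never appeared in the first file.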

Answer 5

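Python's dictionaries are extremely fast. See How are Python's Built In Dictionaries Implemented. Perhaps Ruby's are not quite as good.

I doubt it is the hash function. There is no way the Ruby developers would use a hash function that is far worse than Python's.

Perhaps Ruby 1.8 is slow at dynamically resizing large hash tables? How does your problem scale with smaller files?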
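One way to look at scaling is to time the hash-building step at several input sizes. This is only a sketch: the sizes and the synthetic keys are arbitrary choices, and build_counts is a hypothetical helper rather than code from the question:

```ruby
require 'benchmark'

# Build a count hash from n synthetic lines (stand-ins for real hash lines).
def build_counts(n)
  counts = Hash.new(0)
  n.times { |i| counts["%032x\n" % i] += 1 }
  counts
end

sizes = [1_000, 10_000, 100_000]
timings = sizes.map do |n|
  seconds = Benchmark.realtime { build_counts(n) }
  [n, seconds]
end

timings.each do |n, seconds|
  puts "#{n} entries: #{seconds} s"
end
```

If the time grows much faster than the number of entries, table resizing or memory pressure is a more likely culprit than the per-item hashing cost.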

Answer 6

I was able to speed up your Ruby code a bit, as follows:

require 'benchmark'

Benchmark.bm(10) do |x|

  x.report("original version") do
    file = File.open("c:\\file1.txt", "r")
    hashes = {}

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] += 1
      else
        hashes[line] = 1
      end
    }

    file.close()

    #puts "Done file 1"

    not_founds = 0

    file = File.open("c:\\file2.txt", "r")

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] -= 1
      else
        not_founds += 1        
      end
    }

    file.close()

    #puts "Done file 2"

    num_errors = 0
    hashes.each_key{ |key|
      if hashes[key] != 0
        num_errors += 1
      end
    }

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end


  x.report("my speedup") do
    hashes = {}
    File.open("c:\\file1.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] += 1
        else
          hashes[line] = 1
        end
      }
    end  

    not_founds = 0

    File.open("c:\\file2.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] -= 1
        else
          not_founds += 1
        end
      }
    end

    num_errors = hashes.values.to_a.select { |z| z != 0}.size   

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end

end

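So I read the files in one big chunk, which is a bit faster in my case (I tested on Windows XP, Ruby 1.8.6, with a file of 100000 lines). I benchmarked all the different ways to read files that I could think of, and this was the fastest. I also sped up the counting of the values in the hash, but that is only visible with very large numbers :)

So I get a very small speed increase here. The output on my machine is as follows: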

                user     system      total        real
original versionTotal of 16 lines not found in file2
Total of 4 mismatches found
   1.000000   0.015000   1.015000 (  1.016000)
my speedup v1Total of 16 lines not found in file2
Total of 4 mismatches found
   0.812000   0.047000   0.859000 (  0.859000)

Does anyone have any ideas for improving this further?

If f.readlines gets slower because of the file size, I found that

File.open("c:\\file2.txt", "r") do |f|
  while (line=f.gets)
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

is only slightly slower for me.
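A comparison of the reading strategies mentioned here (each_line, readlines, and a gets loop) might be set up as follows. This is a sketch under assumptions: it runs against a generated temporary file, not the real data, so the relative timings will not necessarily match those on a 900,000-line file or on Windows:

```ruby
require 'benchmark'
require 'tempfile'

# Generated sample input; the real files would be used in practice.
sample = Tempfile.new('hashes')
20_000.times { |i| sample.puts("%032x" % i) }
sample.flush
path = sample.path

# Each reader returns the number of lines it saw, as a sanity check.
readers = {
  'each_line' => lambda {
    n = 0
    File.open(path, 'r') { |f| f.each_line { |_line| n += 1 } }
    n
  },
  'readlines' => lambda {
    File.open(path, 'r') { |f| f.readlines.size }
  },
  'gets' => lambda {
    n = 0
    File.open(path, 'r') { |f| n += 1 while f.gets }
    n
  }
}

line_counts = {}
Benchmark.bm(10) do |x|
  readers.each do |name, reader|
    x.report(name) { line_counts[name] = reader.call }
  end
end
sample.close! # close and delete the temp file
```

Note that readlines holds the whole file in memory at once, while each_line and gets keep only one line at a time.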

I was trying to think of a way to improve the

if hashes.has_key?(line) ...

code, but could not think of anything.

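Have you tried using Ruby 1.9?

I have a Windows 7 virtual machine with Ruby 1.9.1. There, f.readlines was slower, and I needed to use while (line=f.gets) because of memory limitations :)

Since Ruby users mostly test on Unix-related platforms, I guess that could explain why the code is suboptimal on Windows. Has anybody compared the performance mentioned above on Unix? Is this a Ruby vs. Python problem, or Ruby-Windows vs. Ruby-Unix?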

Answer 7

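I'd bet that the results for Ruby 1.9.x, which is faster than or on par with Python in most areas, are caused by the extra overhead required in the implementation of hashes/dictionaries, because they are ordered in Ruby, as opposed to Python.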
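The ordering property in question is that, from Ruby 1.9 onward, a Hash iterates in insertion order. A small illustration with made-up keys:

```ruby
counts = {}
%w[delta alpha charlie bravo].each { |key| counts[key] = key.length }

# Keys come back in the order they were inserted, not sorted or arbitrary.
insertion_order = counts.keys
puts insertion_order.inspect
```

Whether maintaining this order explains the timings reported in the question is this answer's conjecture. The snippet only shows the behavior itself.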

Answer 8

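I'll try to run a benchmark in my copious free time, but try using group_by. Not only is it more in the style of functional programming, I have also found it to be much faster than the procedural version in MRI.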

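def convert_to_hash(file)
  # Group identical lines together, then count the size of each group
  values_hash = file.each_line.group_by { |line| line }
  # Hash.[] converts an array of pairs into a hash
  Hash[values_hash.map { |line, lines| [line, lines.length] }]
end

# Paths taken from the question; each file is closed when its block returns
hash1 = File.open("c:\\file1.txt", "r") { |file| convert_to_hash(file) }
hash2 = File.open("c:\\file2.txt", "r") { |file2| convert_to_hash(file2) }
# The files match, ignoring order, if the two count hashes are equal
puts hash1 == hash2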
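The final comparison step can be illustrated on small in-memory arrays in place of files. The sample data and the count_lines helper are made up for this sketch. Hash equality in Ruby ignores insertion order, so two count hashes are equal exactly when every line occurs the same number of times in both inputs:

```ruby
# Count occurrences of each line by grouping identical lines together.
def count_lines(lines)
  counts = {}
  lines.group_by { |line| line }.each do |line, group|
    counts[line] = group.length
  end
  counts
end

first  = ["aa\n", "bb\n", "aa\n", "cc\n"]
second = ["cc\n", "aa\n", "bb\n", "aa\n"] # same lines, different order
third  = ["aa\n", "bb\n", "cc\n"]         # "aa" appears only once

same_contents = count_lines(first) == count_lines(second)
third_matches = count_lines(first) == count_lines(third)

puts same_contents # true
puts third_matches # false
```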
