Ruby: chunk and compare two large files

Question

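I'm looking for direction on how to compare two large text files using Ruby, only 100 lines at a time. Any help is appreciated.

What I have tried:

File.foreach(file1).each_slice(100) do |lines|
  pp lines
end

I'm confused about how to include the second file in this loop.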

Answer 1

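CHUNK_SIZE = 256 # bytes

def same?(path1, path2)
  # Files of different sizes cannot have identical content.
  return false unless File.size(path1) == File.size(path2)

  f1, f2 = [path1, path2].map { |f| File.new(f) }

  loop do
    s1, s2 = [f1, f2].map { |f| f.read(CHUNK_SIZE) }
    break false if s1 != s2
    # A nil or short read means both files have reached EOF.
    break true if s1.nil? || s1.length < CHUNK_SIZE
  end
ensure
  # f1 and f2 are still nil if we returned before opening the files.
  [f1, f2].compact.each(&:close)
end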

UPD: credit for the typo fixes and the file size comparison goes to @tadman.

Answer 2

Just process the two files at the same time (see "Process two files at the same time in Ruby") and compare them slice by slice, like this:

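f1 = File.open('file1.txt', 'r')
f2 = File.open('file2.txt', 'r')

# Wrapped in a method so that the early `return` is valid.
def same?(f1, f2)
  # Each block argument is a slice of up to 10 lines.
  f1.each_slice(10).zip(f2.each_slice(10)).each do |slice1, slice2|
    return false unless slice1 == slice2
  end
  true
end

same?(f1, f2)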

Or, as @meagar suggested (line by line in this case):

f1.each_line.zip(f2.each_line).all? { |a,b| a == b }

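This returns true if the files are identical. Because zip stops once f1 is exhausted, check that the file sizes match first; otherwise it can also return true when f2 is f1 followed by extra lines.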
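Since the question asks for 100 lines at a time, here is a minimal sketch of the same zip idea that avoids holding every slice in memory. Calling zip on a plain enumerator builds the full array of pairs before any comparison happens, whereas a lazy enumerator pairs the slices on demand. The method name same_in_slices? and the slice_size parameter are assumptions made for illustration, not part of the answer above.

```ruby
# Hypothetical helper, not taken from the answers in this thread.
def same_in_slices?(path1, path2, slice_size = 100)
  # zip stops when the first enumerator is exhausted, so rule out
  # files of different sizes up front.
  return false unless File.size(path1) == File.size(path2)

  slices1 = File.foreach(path1).each_slice(slice_size)
  slices2 = File.foreach(path2).each_slice(slice_size)

  # lazy.zip pairs the slices on demand instead of building one big array;
  # all? stops at the first pair that differs.
  slices1.lazy.zip(slices2).all? { |a, b| a == b }
end
```

Calling same_in_slices?('file1.txt', 'file2.txt') would then compare the two files 100 lines at a time and stop at the first differing slice.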

Answer 3

Just compare the files line by line:

def same_file?(path1, path2)
  # The same path is trivially the same file; no need to open anything.
  return true if File.absolute_path(path1) == File.absolute_path(path2)

  file1 = File.open(path1)
  file2 = File.open(path2)
  return false unless file1.size == file2.size

  enum1 = file1.each
  enum2 = file2.each

  loop do
    # It's a mystery that the loop really ends
    # when any of the 2 files has nothing to read
    return false unless enum1.next == enum2.next
  end

  true
ensure
  # Either variable is nil if we returned or raised before opening it.
  file1&.close
  file2&.close
end

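I did my homework and found this in the Kernel#loop documentation:

StopIteration raised in the block breaks the loop. In this case, loop returns the "result" value stored in the exception.

And in the Enumerator#next documentation:

When the position reaches the end, StopIteration is raised.

So the mystery is no longer a mystery to me.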
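The behaviour quoted from the documentation can be seen in isolation with a small sketch. The enumerator below is made up for illustration; it yields two strings and finishes with the value :ok, which is the result stored in the StopIteration exception.

```ruby
# An enumerator whose block finishes with the value :ok.
enum = Enumerator.new do |y|
  y << 'one'
  y << 'two'
  :ok
end

seen = []
# enum.next raises StopIteration after 'two'; loop rescues it and
# returns the result stored in the exception.
result = loop { seen << enum.next }

p seen   # => ["one", "two"]
p result # => :ok
```

In same_file? the same mechanism ends the loop as soon as either enum1.next or enum2.next runs out of lines.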

Answer 4

Here's another one, similar in approach to mudasobwa's answer:

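def same?(file_1, file_2)
  # Quick checks: same underlying file, or sizes that cannot match.
  return true if File.identical?(file_1, file_2)
  return false unless File.size(file_1) == File.size(file_2)

  buf_size = 2 ** 15 # 32 K
  # Mutable buffers that read fills in place on every call.
  buf_1 = String.new
  buf_2 = String.new

  File.open(file_1) do |f1|
    File.open(file_2) do |f2|
      # read returns nil at EOF, which ends the loop.
      while f1.read(buf_size, buf_1) && f2.read(buf_size, buf_2)
        return false unless buf_1 == buf_2
      end
    end
  end
  true
end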

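The first two lines perform quick checks using File.identical? (same file, e.g. hard and soft links) and File.size (same size).

File.open opens each file in read-only mode. The while loop then keeps calling read to fetch 32 K chunks from each file into the buffers buf_1 and buf_2 until EOF. If the buffers differ, false is returned. Otherwise, i.e. if no difference was encountered, true is returned.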
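To make the buffer handling concrete, here is a small sketch of how read behaves when it is given a length and an output buffer. The temporary file, its ten-byte content and the chunk size of 4 are arbitrary choices for the demonstration.

```ruby
require 'tempfile'

chunks = []
Tempfile.create('read_demo') do |f|
  f.write('abcdefghij')
  f.rewind

  buf = String.new # mutable buffer reused by every read call
  # read(length, outbuf) fills buf in place and returns nil at EOF.
  while f.read(4, buf)
    chunks << buf.dup # dup because the next read overwrites buf
  end
end

p chunks # => ["abcd", "efgh", "ij"]
```

Reusing one buffer per file means the comparison loop does not allocate a new string for every chunk.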

Answer 5

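To determine whether two files have exactly the same content without comparing the actual content of each corresponding chunk, you can use a checksum function that deterministically converts data into a hash string. You still have to read the content to checksum it, but you can compute a checksum for each slice and end up with an array of checksums for each file.

You can then compare the collections of checksums. If the two files have exactly the same content, the two collections will be identical.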

require 'digest/md5'

# Checksum each 100-line slice. hexdigest needs a String, so join the lines.
hashes1 = File.foreach('./path_to_file').each_slice(100).map do |slice|
  Digest::MD5.hexdigest(slice.join)
end
hashes2 = File.foreach('./path_to_duplicate').each_slice(100).map do |slice|
  Digest::MD5.hexdigest(slice.join)
end

hashes1 == hashes2
#=> true if the two files contain the same content
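If per-slice checksums are not needed, one digest per file is a simpler variant of the same idea. The sketch below is an assumption on my part rather than part of the answer: the method name same_digest? is hypothetical, and SHA-256 is swapped in for MD5.

```ruby
require 'digest'

# Hypothetical helper: compares one SHA-256 digest per file.
def same_digest?(path1, path2)
  # Different sizes can never hash the same content.
  return false unless File.size(path1) == File.size(path2)

  # Digest::SHA256.file reads the file in chunks, so memory use stays small.
  Digest::SHA256.file(path1).hexdigest == Digest::SHA256.file(path2).hexdigest
end
```

Both files are always read to the end, even if they differ in the first byte, so this mostly pays off when a digest can be stored and reused for later comparisons.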

Answer 6

Benchmark time

(Matt's answer is not included because I couldn't get it to work)

Results for 1 KB file size (N = 10000)

                          user     system      total        real
aetherus              0.510000   0.300000   0.810000 (  0.823201)
meagar                0.350000   0.160000   0.510000 (  0.512755)
mudasobwa             0.290000   0.200000   0.490000 (  0.500831)
stefan                0.150000   0.160000   0.310000 (  0.312743)
yevgeniy_anfilofyev   0.320000   0.170000   0.490000 (  0.497157)

Results for 1 MB file size (N = 100)

                          user     system      total        real
aetherus              1.540000   0.110000   1.650000 (  1.667937)
meagar                1.170000   0.130000   1.300000 (  1.310278)
mudasobwa             1.470000   0.830000   2.300000 (  2.313481)
stefan                0.010000   0.030000   0.040000 (  0.045577)
yevgeniy_anfilofyev   0.570000   0.100000   0.670000 (  0.677226)

Results for 1 GB file size (N = 1)

                          user     system      total        real
aetherus             15.570000   0.920000  16.490000 ( 16.525826)
meagar               24.170000   1.910000  26.080000 ( 26.190057)
mudasobwa            16.260000   8.160000  24.420000 ( 24.471977)
stefan                0.120000   0.330000   0.450000 (  0.443074)
yevgeniy_anfilofyev  12.940000   1.310000  14.250000 ( 14.295736)

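Notes

Use a larger CHUNK_SIZE
With the same chunk size, stefan's code seems to be about 2x faster than mudasobwa's code
The "fastest" chunk size is somewhere between 16 K and 512 K
I couldn't use fruity because the 1 GB test took too long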
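To explore the effect of the chunk size on your own files, something along these lines could be used. It is only a sketch: same_with_chunk? and time_chunk_sizes are hypothetical names, the candidate sizes are arbitrary, and the timings will depend on the machine and on file system caching.

```ruby
require 'benchmark'

# Hypothetical variant of the buffer-based comparison with a configurable chunk size.
def same_with_chunk?(path1, path2, chunk_size)
  return false unless File.size(path1) == File.size(path2)

  buf1 = String.new
  buf2 = String.new
  File.open(path1, 'rb') do |f1|
    File.open(path2, 'rb') do |f2|
      while f1.read(chunk_size, buf1) && f2.read(chunk_size, buf2)
        return false unless buf1 == buf2
      end
    end
  end
  true
end

# Times one comparison per chunk size; returns { chunk_size => seconds }.
def time_chunk_sizes(path1, path2, sizes = [256, 2**12, 2**15, 2**19])
  sizes.map do |size|
    [size, Benchmark.realtime { same_with_chunk?(path1, path2, size) }]
  end.to_h
end
```

Calling time_chunk_sizes with two equally sized files returns a hash that maps each chunk size to the elapsed wall-clock time in seconds.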

Code

def aetherus_same?(f1, f2)
  enum1 = f1.each
  enum2 = f2.each
  loop do
    return false unless enum1.next == enum2.next
  end
  return true
end

def meagar_same?(f1, f2)
  f1.each_line.zip(f2.each_line).all? { |a,b| a == b }
end

CHUNK_SIZE = 256 # bytes
def mudasobwa_same?(f1, f2)
  loop do
    s1, s2 = [f1, f2].map { |f| f.read(CHUNK_SIZE) }
    break false if s1 != s2
    break true if s1.nil? || s1.length < CHUNK_SIZE
  end
end

def stefan_same?(f1, f2)
  buf_size = 2 ** 15 # 32 K
  buf_1 = ''
  buf_2 = ''
  while f1.read(buf_size, buf_1) && f2.read(buf_size, buf_2)
    return false unless buf_1 == buf_2
  end
  true
end

def yevgeniy_anfilofyev_same?(f1, f2)
  f1.each_slice(10).zip(f2.each_slice(10)).each do |line1, line2|
    return false unless line1 == line2
  end
  return true
end

FILE1 = ARGV[0]
FILE2 = ARGV[1]
N     = ARGV[2].to_i

def with_files
  File.open(FILE1) { |f1| File.open(FILE2) { |f2| yield f1, f2 } }
end

require 'benchmark'

Benchmark.bm(19) do |x|
  x.report('aetherus')             { N.times { with_files { |f1, f2| aetherus_same?(f1, f2) } } }
  x.report('meagar')               { N.times { with_files { |f1, f2| meagar_same?(f1, f2) } } }
  x.report('mudasobwa')            { N.times { with_files { |f1, f2| mudasobwa_same?(f1, f2) } } }
  x.report('stefan')               { N.times { with_files { |f1, f2| stefan_same?(f1, f2) } } }
  x.report('yevgeniy_anfilofyev')  { N.times { with_files { |f1, f2| yevgeniy_anfilofyev_same?(f1, f2) } } }
end
