Question

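Hello, I am using the code below, but for files with more than 8 million lines (the two files are passed in as text input) it runs out of memory. How can I compare two text files that each have more than 30 million lines?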

fileA1 = ARGV[0]
fileA2 = ARGV[1]

if ARGV.length != 2
  raise 'Send Two files pls'
end
cmd = "sort #{fileA1} > Sorted.txt"
`#{cmd}`
aFile = File.open("Sorted.txt", "r");
bFile = File.open(fileA2, "r").readlines;
fileR = File.open("result.txt", "w")

p aFile.class
p bFile.class
p bFile.length

aFile.each do |e|
  if(! bFile.include?(e) )
    p 'Able to get differences:' + e.to_s
    fileR.write('Does not Include:' + e)
  end
end

Additional code I tried, without any luck:

counterA = counterB = 0
aFile = File.open("Sample1 - Copy.txt", "r");
bFile = File.open("Sample2.txt", "r");
file1lines = aFile.readlines
file2lines = bFile.readlines


file1lines.each do |e|
if(!file2lines.include?(e))
puts e
 else
p "Files include these lines:"
end
end
stopTime = Time.now

Answer 1

As a starting point, I would use the Unix diff command (available on Windows as part of Cygwin, among others) and see whether it meets your needs:

#!/usr/bin/env ruby

raise "Syntax is comp_files file1 file2" unless ARGV.length == 2

file1, file2 = ARGV

`sort #{file1} > file1_sorted.txt`
`sort #{file2} > file2_sorted.txt`

# Redirect stdout first, then stderr, so both end up in diff.txt.
`diff file1_sorted.txt file2_sorted.txt > diff.txt 2>&1`
puts 'Created diff.txt.'  # After running the script, view it w/less, etc.

Here is a similar script that uses temporary files, which are deleted automatically before the script exits:

#!/usr/bin/env ruby

raise "Syntax is comp_files file1 file2" unless ARGV.length == 2

require 'tempfile'

input_file1, input_file2 = ARGV
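# Keep references to the Tempfile objects so they are not garbage-collected
# (which would delete their files) while the script is still running.
tempfile1 = Tempfile.new('comp_files_sorted_1')
tempfile2 = Tempfile.new('comp_files_sorted_2')
sorted_file1 = tempfile1.path
sorted_file2 = tempfile2.path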

puts [sorted_file1, sorted_file2]

`sort #{input_file1} > #{sorted_file1}`
`sort #{input_file2} > #{sorted_file2}`

# Redirect stdout first, then stderr, so both end up in diff.txt.
`diff #{sorted_file1} #{sorted_file2} > diff.txt 2>&1`
puts 'Created diff.txt.'  # After running the script, view it w/less, etc.

# The code below can be used to create sample input files
# File.write('input1.txt', (('a'..'j').to_a.shuffle + %w(s  y)).join("\n"))
# File.write('input2.txt', (('a'..'j').to_a.shuffle + %w(s  t  z)).join("\n"))
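
If what you actually need is the list of lines in the first file that are missing from the second (which is what the question's include? loop computes), the sorted files can also be compared in a single pass in Ruby while holding only one line from each file in memory. The following is a minimal sketch, not a tested production tool. The method name each_line_missing_from_b is made up for illustration, and it assumes both inputs were sorted in plain byte order (for example with LC_ALL=C sort), since Ruby compares strings byte by byte and a locale-aware sort may order lines differently.

```ruby
# Hypothetical helper: yields every line of sorted_a_path that does not
# occur in sorted_b_path. Assumes BOTH files are sorted in byte order.
def each_line_missing_from_b(sorted_a_path, sorted_b_path)
  return enum_for(__method__, sorted_a_path, sorted_b_path) unless block_given?

  File.open(sorted_a_path, 'r') do |a|
    File.open(sorted_b_path, 'r') do |b|
      line_b = b.gets&.chomp
      a.each_line do |raw|
        line_a = raw.chomp
        # Skip ahead in b until its current line is >= the current line of a.
        line_b = b.gets&.chomp while line_b && line_b < line_a
        # If b is exhausted or its current line is larger, line_a is missing.
        yield line_a unless line_b == line_a
      end
    end
  end
end
```

It could be called as each_line_missing_from_b(sorted_file1, sorted_file2) { |line| puts line } after the two sort commands have run.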

Answer 2

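I believe your problem is readlines. This method reads the entire file and returns all of its lines as an array of strings. Since your files are huge, you risk running out of memory.

To process large files, do not read the whole content at once; read it as you need it.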
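
As a small illustration of reading on demand, here is a sketch that walks a file one line at a time with File.foreach instead of readlines. The method name count_lines is only an example; the point is that just the current line is held in memory, however large the file is.

```ruby
# Hypothetical helper: counts the lines of a file without loading it whole.
def count_lines(path)
  count = 0
  # File.foreach yields one line at a time, so memory use stays flat.
  File.foreach(path) { count += 1 }
  count
end
```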

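Also, your algorithm has another issue: the comparison checks whether every line of aFile is contained somewhere in bFile, without checking the order at all. I am not sure whether that is your intention.

If you really want a line-by-line comparison in which order matters, then compare the files one line at a time; you do not need to read a whole file into memory for that. Use the gets method instead, which by default returns the next line of the file, or nil at EOF.

Something like this (here bFile must be the open File object, not the result of readlines):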

aFile.each do |e|
  if e != bFile.gets
    p 'Able to get differences:' + e.to_s
    fileR.write('Does not Include:' + e)
  end
end
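
The loop above stops when aFile ends, so extra lines at the end of bFile would go unnoticed. A self-contained variant that also covers that case might look like the sketch below. The method name first_difference is hypothetical, and chomping the lines (so that a missing final newline does not count as a difference) is my assumption about what is wanted.

```ruby
# Hypothetical helper: returns the 1-based number of the first line at which
# the two files differ, or nil if they match line for line.
def first_difference(path_a, path_b)
  File.open(path_a, 'r') do |a|
    File.open(path_b, 'r') do |b|
      line_number = 0
      loop do
        line_a = a.gets&.chomp   # nil once a is at EOF
        line_b = b.gets&.chomp   # nil once b is at EOF
        line_number += 1
        return nil if line_a.nil? && line_b.nil?  # both ended together
        return line_number if line_a != line_b    # also true if one ended early
      end
    end
  end
end
```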

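On the other hand, if you really want to find out whether every line of a is present in b regardless of order, you can use a nested loop: for each line of a, iterate over the lines of b. Make sure you return on the first match to speed things up, because this will be a very expensive operation. The include? call is expensive too, though, so apart from the file IO overhead the cost is probably comparable, IMO.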
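
A rough sketch of that nested loop follows. The method name lines_missing_from_b is invented for the example. It never loads either file fully into memory, but it rescans b for every line of a, so the running time grows with the product of the two line counts and is unlikely to be practical for tens of millions of lines; it also collects the missing lines in an array, which assumes there are not too many of them.

```ruby
# Hypothetical helper: returns the lines of path_a that occur nowhere in
# path_b, ignoring order. Quadratic time, but only one line of each file
# is held in memory at any moment.
def lines_missing_from_b(path_a, path_b)
  missing = []
  File.open(path_b, 'r') do |b|
    File.foreach(path_a) do |raw|
      line = raw.chomp
      b.rewind  # start scanning b from the top for each line of a
      # any? stops at the first match, which is the early exit described above.
      found = b.each_line.any? { |candidate| candidate.chomp == line }
      missing << line unless found
    end
  end
  missing
end
```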

Answer 3

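Here is a script that analyzes 2 text files and reports the first difference, or a difference in line count, or success.

Note: the code shown here has been truncated. Please go to https://gist.github.com/keithrbennett/1d043fdf7b685d9692f0181ad68c6307 for the complete script!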

#!/usr/bin/env ruby

raise "Syntax is first_diff  file1  file2" unless ARGV.size == 2

FILE1, FILE2 = ARGV

ENUM1 = File.new(FILE1).to_enum
ENUM2 = File.new(FILE2).to_enum


def build_unequal_error_message(line_num, line1, line2)
"Difference found at line #{line_num}:
#{FILE1}: #{line1}
#{FILE2}: #{line2}"
end


def build_unequal_line_count_error_message(line_count, file_exhausted)
  "All lines up to line #{line_count} were identical, " \
  "but #{file_exhausted} has no more text lines."
end


def get_line(file_enumerator)
  file_enumerator.next.chomp
end


def has_next(enumerator)
  begin
    enumerator.peek
    true
  rescue StopIteration
    false
  end
end


# Returns an analysis of the results in the form of a string
# if a compare error occurred, else returns nil.
def error_text_or_nil

  line_num = 0

  loop do
    has1 = has_next(ENUM1)
    has2 = has_next(ENUM2)

    case
      when has1 && has2
        line1 = get_line(ENUM1)
        line2 = get_line(ENUM2)

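        if line1 != line2
          # line_num counts the lines already processed, so the differing
          # line is line_num + 1 when counting from 1.
          return build_unequal_error_message(line_num + 1, line1, line2)
        end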
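      when !has1 && !has2
        # If both have no more values, we're done. Report here, because the
        # loop only ever exits through a return.
        puts "Lines processed successfully: #{line_num}"
        return nil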
      else # only 1 enum has been exhausted
        exhausted_file = has1 ? FILE2 : FILE1
        return build_unequal_line_count_error_message(line_num, exhausted_file)
    end

    line_num += 1
  end

end


result = error_text_or_nil

if result
  puts result
  exit -1
else
  puts "Compare successful"
  exit 0
end

File comparison runs out of memory
