Question

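I am trying to import a large text file (about 2 million lines of numbers, 260 MB) into an array, edit the array, and then write the result to a new text file, using: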

file_data = File.readlines("massive_file.txt")
file_data = file_data.map!(&:strip)
file_data.each do |s|
    s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
    f.write(file_data.map(&:strip).uniq.join("\n"))
end

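However, I get the error failed to allocate memory (NoMemoryError). How can I allocate more memory to complete the task? Or, ideally, is there another approach I could use that avoids having to reallocate memory?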

Answer 1

You can read the file line by line:

require 'set'
require 'digest/md5'
file_data = File.new('massive_file.txt', 'r')
file_output = File.new('smaller_file.txt', 'w')
unique_lines_set = Set.new

while (line = file_data.gets)
    line.strip!
    line.gsub!(/,.*\z/, "")
    # Check if the line is unique
    line_hash = Digest::MD5.hexdigest(line)
    if not unique_lines_set.include? line_hash
      # It is unique so add its hash to the set
      unique_lines_set.add(line_hash)

      # Write the line in the output file
      file_output.puts(line)
    end
end

file_data.close
file_output.close
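
A side note on the digest step above: hashing each line mainly saves memory when lines are long. For short numeric lines the hex digest may well be larger than the line itself. The sketch below compares the sizes; `key_sizes` and the sample value are made up for illustration, since the question does not show the real file contents.

# Hypothetical helper, not part of the original answer.
require 'digest/md5'

def key_sizes(line)
  {
    raw: line.bytesize,                         # the cleaned line itself
    hex: Digest::MD5.hexdigest(line).bytesize,  # hex digest, always 32 bytes
    binary: Digest::MD5.digest(line).bytesize   # raw digest, always 16 bytes
  }
end

# Assumed sample: a short numeric line.
p key_sizes("1234567")

If the cleaned lines are usually shorter than 16 bytes, storing them directly in the set is probably the cheaper option; otherwise `Digest::MD5.digest` halves the key size compared with `hexdigest`.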

Answer 2

You can try reading and writing one line at a time:

new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close

The only thing left to handle is detecting duplicate lines.
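
One possible way to handle the duplicates while still streaming is to remember each cleaned line in a Set and write it only the first time it appears. This is a minimal sketch, not the answerer's code: `write_unique_lines` is a hypothetical helper name, and it assumes the set of distinct cleaned lines fits in memory.

require 'set'

# Hypothetical helper: stream src line by line, drop everything from the
# first comma onward, and write each distinct result to dst once.
def write_unique_lines(src, dst)
  seen = Set.new
  File.open(dst, 'w') do |out|
    File.foreach(src) do |line|
      cleaned = line.strip.sub(/,.*\z/, "")
      # Set#add? returns nil when the element is already in the set
      out.puts cleaned if seen.add?(cleaned)
    end
  end
end

# File names taken from the question; skipped if the input is not present.
if File.exist?('massive_file.txt')
  write_unique_lines('massive_file.txt', 'smaller_file.txt')
end

Unlike `uniq` on a fully loaded array, only the distinct values are kept in memory, and the output preserves first-occurrence order.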

Answer 3

Alternatively, you can read the file in chunks, which should be faster than reading it line by line:

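FILENAME = "massive_file.txt"
MEGABYTE = 1024 * 1024

class File
  # Yield the file contents in successive chunks of chunk_size bytes.
  def each_chunk(chunk_size = MEGABYTE) # or n * MEGABYTE
    yield read(chunk_size) until eof?
  end
end

File.open("smaller_file.txt", "w") do |out|
  leftover = "" # partial line carried over from the previous chunk
  File.open(FILENAME, "rb") do |f|
    f.each_chunk do |chunk|
      lines = (leftover + chunk).lines
      # A chunk can end mid-line; hold that fragment back until the next chunk.
      leftover = lines.last.end_with?("\n") ? "" : lines.pop
      # Clean each complete line and write it out instead of accumulating it.
      lines.each { |line| out.puts line.strip.sub(/,.*\z/, "") }
    end
  end
  out.puts leftover.strip.sub(/,.*\z/, "") unless leftover.empty?
end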

Reference: https://stackoverflow.com/a/1682400/3035830

'Failed to allocate memory' error with large array
