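"Failed to allocate memory" error with a large array

Time: 2015-01-21 13:16:58

Tags: ruby arrays memory-management

I am trying to import a large text file (about 2 million lines of numbers, 260 MB) into an array, edit the array, and then write the result to a new text file, using: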

file_data = File.readlines("massive_file.txt")
file_data = file_data.map!(&:strip)
file_data.each do |s|
    s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
    f.write(file_data.map(&:strip).uniq.join("\n"))
end

However, I get the error failed to allocate memory (NoMemoryError). How can I allocate more memory to complete the task? Or, ideally, is there another approach that avoids reallocating memory?

3 answers:

Answer 0 (score: 2)

You can read the file line by line:

require 'set'
require 'digest/md5'

# MD5 digests of the lines that have already been written
unique_lines_set = Set.new

# The block form of File.open closes the output file even if an error is raised
File.open('smaller_file.txt', 'w') do |file_output|
  # File.foreach yields one line at a time, so the whole file is never held in memory
  File.foreach('massive_file.txt') do |line|
    line.strip!
    line.gsub!(/,.*\z/, "")
    line_hash = Digest::MD5.hexdigest(line)
    # Set#add? returns nil when the hash is already present, i.e. the line is a duplicate
    next unless unique_lines_set.add?(line_hash)

    # The line is unique, so write it to the output file
    file_output.puts(line)
  end
end
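
If the set of digests itself becomes a memory concern, one possible tweak is to store the raw 16-byte digest rather than the 32-character hex string. The following is only a sketch: `line_key` is a hypothetical helper name, not part of the answer. As with any hash-based check, two different lines that happened to share a digest would be treated as duplicates, which is extremely unlikely by accident but not impossible.

```ruby
require 'set'
require 'digest/md5'

# Hypothetical helper (not from the answer): returns the raw 16-byte MD5
# digest of a line, which can be used as a Set key in place of the hex string.
def line_key(line)
  Digest::MD5.digest(line)
end

seen = Set.new
seen.add(line_key("12345"))

puts Digest::MD5.hexdigest("12345").bytesize # 32
puts line_key("12345").bytesize              # 16
puts seen.include?(line_key("12345"))        # true
```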

Answer 1 (score: 0)

You can try reading and writing one line at a time:

new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close

The only thing left to handle is removing the duplicate lines.
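
One way to cover that remaining step is to remember the lines already written in a Set and skip repeats. This is a minimal sketch rather than the answerer's code: `write_unique_lines` is a hypothetical helper name, and it assumes the distinct trimmed lines fit in memory even though the full file does not.

```ruby
require 'set'

# Hypothetical helper (not from the answer): copies input_path to output_path,
# trimming each line and skipping any trimmed line that was already written.
# Returns the number of unique lines written.
def write_unique_lines(input_path, output_path)
  seen = Set.new # holds every distinct trimmed line, so memory grows with uniqueness
  File.open(output_path, 'w') do |out|
    File.foreach(input_path) do |line|
      trimmed = line.strip.gsub(/,.*\z/, "")
      # Set#add? returns nil when the element is already in the set
      out.puts(trimmed) if seen.add?(trimmed)
    end
  end
  seen.size
end
```

It would be called as `write_unique_lines('massive_file.txt', 'smaller_file.txt')`. Unlike `uniq` on a full array, only the distinct lines are retained, and each line is written as soon as it is first seen.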

Answer 2 (score: -1)

Alternatively, you can read the file in chunks, which should be faster than reading it line by line:

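FILENAME = "massive_file.txt"
MEGABYTE = 1024 * 1024

class File
  # Yields the file's contents in pieces of at most chunk_size bytes
  def each_chunk(chunk_size = MEGABYTE) # or n * MEGABYTE
    yield read(chunk_size) until eof?
  end
end

leftover = "" # partial line carried over from the previous chunk
File.open("smaller_file.txt", "wb") do |out|
  File.open(FILENAME, "rb") do |f|
    f.each_chunk do |chunk|
      chunk = leftover + chunk
      # Hold back the text after the last newline: it may be an incomplete line
      last_newline = chunk.rindex("\n")
      if last_newline.nil?
        leftover = chunk
        next
      end
      leftover = chunk[(last_newline + 1)..-1]
      # $ matches at the end of every line, so each line is trimmed, not only the last one
      out.write(chunk[0..last_newline].gsub(/,.*$/, ""))
    end
  end
  # Write whatever remains if the file does not end with a newline
  out.write(leftover.gsub(/,.*$/, "")) unless leftover.empty?
end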

Reference: https://stackoverflow.com/a/1682400/3035830