Question

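So far I have this code, which reads a file and sorts it with Ruby. It does not sort the numbers correctly, though, and I don't think it will be efficient, given that the file can be as large as 200GB and contains one number per line. Can you suggest another approach?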

File.open("topN.txt", "w") do |file|
  File.readlines("N.txt").sort.reverse.each do |line|
    file.write(line.chomp<<"\n")
  end
end
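The numbers come out in the wrong order because File.readlines returns strings, and strings are compared character by character rather than by numeric value. A minimal sketch of the difference, using made-up sample lines and two hypothetical helper names (string_order, numeric_order):

```ruby
# Order lines as plain strings, highest first (what sort.reverse does).
def string_order(lines)
  lines.sort.reverse
end

# Order lines by their integer value, highest first.
def numeric_order(lines)
  lines.sort_by { |line| -line.to_i }
end

sample = ["9\n", "10\n", "100\n", "25\n"] # made-up sample lines
p string_order(sample)  #=> ["9\n", "25\n", "100\n", "10\n"]
p numeric_order(sample) #=> ["100\n", "25\n", "10\n", "9\n"]
```

Both versions still load every line into memory, so this only addresses the ordering problem, not the file size.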

After all the help from everyone here, this is what my code looks like so far...

begin

  puts "What is the file name?"
  file = gets.chomp

  puts "What is the N number?"
  myN = Integer(gets.chomp)

rescue ArgumentError

  puts "That's not a number, try again"
  retry
end

topN = File.open(file).each_line.max(myN){|a,b| a.to_i <=> b.to_i}
puts topN

Answer 1

Suppose

str = File.read(in_filename)
  #=> "117\n106\n143\n147\n63\n118\n146\n93\n"

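You can convert that string to an enumerator over its lines, use Enumerable#sort_by to sort the lines in descending order, and join the resulting lines (which end in newlines) to form a string that can be written to a file: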

str.each_line.sort_by { |line| -line.to_i }.join
  #=> "147\n146\n143\n118\n117\n106\n93\n63\n"

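Another way is to convert the string to an array of integers, sort the array with Array#sort, reverse the result, and then join the elements back into a string that can be written to a file: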

str.each_line.map(&:to_i).sort.reverse.join("\n") << "\n"
  #=> "147\n146\n143\n118\n117\n106\n93\n63\n"

Let's do a quick benchmark.

require 'benchmark/ips'

(str = 1_000_000.times.map { rand(10_000) }.join("\n") << "\n").size

Benchmark.ips do |x|
  x.report("sort_by") { str.each_line.sort_by { |line| -line.to_i }.join }
  x.report("sort")    { str.each_line.map(&:to_i).sort.reverse.join("\n") << "\n" }
  x.compare!
end

Comparison:
                sort:        0.4 i/s
             sort_by:        0.3 i/s - 1.30x  slower

The mighty sort wins again!

Answer 2

Sorting 200GB of data in memory will not perform well. I would write a small helper class that only remembers the N largest elements added so far.

class SortedList
  attr_reader :list

  def initialize(size)
    @list = []
    @size = size
  end

  def add(element)
    return if @min && @min > element

    list.push(element)
    reorganize_list
  end

  private

  def reorganize_list
    @list = list.sort.reverse.first(@size)
    @min = list.last
  end
end

Initialize an instance with the required N, then add the value parsed from each line to that instance.

sorted_list = SortedList.new(n)

# File.foreach yields one line at a time instead of loading the whole file
File.foreach("N.txt") do |line|
  sorted_list.add(line.to_i)
end

puts sorted_list.list
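The class above re-sorts its list every time an element is accepted. One possible refinement, which is not part of this answer, is to keep the list in descending order and insert each new element at the position found by a binary search. The sketch below uses a hypothetical class name, TopNList, and made-up input values:

```ruby
# Hypothetical variant of SortedList: keeps at most `size` elements,
# always in descending order, without re-sorting on every add.
class TopNList
  attr_reader :list

  def initialize(size)
    @size = size
    @list = []
  end

  def add(element)
    return if @size.zero?
    # Skip values that cannot make it into a full list.
    return if @list.size == @size && element <= @list.last

    # Index of the first stored value smaller than element (list is descending).
    index = @list.bsearch_index { |x| x < element } || @list.size
    @list.insert(index, element)
    @list.pop if @list.size > @size
  end
end

top = TopNList.new(3)
[117, 106, 143, 147, 63, 118, 146, 93].each { |value| top.add(value) }
p top.list #=> [147, 146, 143]
```

Array#insert still shifts elements, so each accepted value costs time proportional to N, but the repeated full sort is avoided.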

Answer 3

Enumerable#max takes an argument that specifies how many elements to return, and a block that specifies how elements are compared:

N = 5
p File.open("test.txt").each_line.max(N){|a,b| a.to_i <=> b.to_i}

This does not read the whole file into memory; the file is read line by line.
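As a small variation on the same idea, the lines can be converted to integers while streaming, so that the result is an array of numbers rather than strings. This is only a sketch: the helper name largest_numbers and the temporary file contents are assumptions made for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: returns the n largest integers found in the file
# at `path`, highest first, reading the file one line at a time.
def largest_numbers(path, n)
  File.foreach(path).lazy.map(&:to_i).max(n)
end

# Demonstrate on a small temporary file with made-up values.
Tempfile.create('numbers') do |tmp|
  tmp.write("117\n106\n143\n147\n63\n118\n146\n93\n")
  tmp.flush
  p largest_numbers(tmp.path, 3) #=> [147, 146, 143]
end
```

File.foreach without a block returns an enumerator, and it closes the file once iteration finishes.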

Answer 4

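You left this comment on the question:

"Write a program, topN, that given a number N and an arbitrarily large file that contains a large number on each line (e.g. a 200Gb file), will output the largest N numbers, highest to lowest."

That seems to me to be a somewhat different problem from the one stated in the question, and also a more interesting one. It is the problem I have addressed in this answer.

Code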

def topN(fname, n, m=n)
  raise ArgumentError, "m cannot be smaller than n" if m < n
  f = File.open(fname)
  best = Array.new(n)
  n.times do |i|
    break best.replace(best[0,i]) if f.eof?
    best[i] = f.readline.to_i
  end
  best.sort!.reverse!
  return best if f.eof?
  new_best = Array.new(n)
  cand = Array.new(m)
  until f.eof?
    rd(f, cand)
    merge_arrays(best, new_best, cand)
  end
  f.close
  best
end

def rd(f, cand)
  cand.size.times { |i| cand[i] = (f.eof? ? -Float::INFINITY : f.readline.to_i) }
  cand.sort!.reverse!
end

def merge_arrays(best, new_best, cand)
  cand_largest = cand.first
  best_idx = best.bsearch_index { |n| cand_largest > n }
  return if best_idx.nil?
  bi = best_idx
  cand_idx = 0
  nbr_to_compare = best.size-best_idx
  nbr_to_compare.times do |i|
    if cand[cand_idx] > best[bi]
      new_best[i] = cand[cand_idx]
      cand_idx += 1
    else 
      new_best[i] = best[bi]
      bi += 1
    end
  end
  best[best_idx..-1] = new_best[0, nbr_to_compare]
end

Example

Let's create a file containing the string representations of 10 million integers, one per line.

require 'time'

FName = 'test'

(s = 10_000_000.times.with_object('') { |_,s| s << rand(100_000_000).to_s << "\n" }).size
s[0,27]
  #=> "86752031\n84524374\n29347072\n"
File.write(FName, s)
  #=> 88_888_701

Next, create a simple method that calls topN with different arguments and displays the execution time.

def try_one(n, m=n)
  t = Time.now
  a = topN(FName, n, m)
  puts "#{(Time.new-t).round(2)} seconds"
  puts "top 5: #{a.first(5)}"
  puts "bot 5: #{a[n-5..n-1]}"
end

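In testing I found that setting m smaller than n was never desirable in terms of computation time. Requiring m >= n allowed a small simplification of the code and a small efficiency gain, so I made m >= n a requirement.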

try_one 100, 100
9.44 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99999136, 99999127, 99999125, 99999109, 99999078]

try_one 100, 1000
9.53 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99999136, 99999127, 99999125, 99999109, 99999078]

try_one 100, 10_000
9.95 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99999136, 99999127, 99999125, 99999109, 99999078]

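Here I tested returning the 100 largest values with different numbers of lines, m, read from the file at a time. As can be seen, the method is not sensitive to the latter value. As expected, the largest 5 values and the smallest 5 values (of the 100 returned) are the same in all cases.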

try_one 1_000
9.31 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99990425, 99990423, 99990415, 99990406, 99990399]

try_one 1000, 10_000
9.24 seconds

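In fact, returning the 1,000 largest values took slightly less time than returning the 100 largest. I expect that is not reproducible. The top 5 are of course the same as when the 100 largest values are returned, so I will not show that line below. The smallest 5 of the 1,000 values returned are of course smaller than when the 100 largest values are returned.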

try_one 10_000
12.15 seconds
bot 5: [99898951, 99898950, 99898946, 99898932, 99898922]

try_one 100_000
13.2 seconds
bot 5: [98995266, 98995259, 98995258, 98995254, 98995252]

try_one 1_000_000
14.34 seconds
bot 5: [89999305, 89999302, 89999301, 89999301, 89999287]

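Explanation

Note that three arrays, best, cand and new_best, are reused. Specifically, I replace the contents of these arrays many times rather than repeatedly creating new (potentially very large) arrays and leaving the orphaned ones to be garbage-collected. A little testing indicated that this improved performance.

We can create a small example and then step through the calculations.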

fname = 'temp'

File.write(fname, 20.times.map { rand(100) }.join("\n") << "\n")
  #=> 58

This file contains the string representations of the integers in the following array.

arr = File.read(fname).lines.map(&:to_i)
  #=> [9, 66, 80, 64, 67, 67, 89, 10, 62, 94, 41, 16, 0, 22, 68, 72, 41, 64, 87, 24]

Sorted in descending order, this is:

arr.sort_by! { |n| -n }
  #=> [94, 89, 87, 80, 72, 68, 67, 67, 66, 64, 64, 62, 41, 41, 24, 22, 16, 10, 9, 0]

Suppose we want the 5 largest values.

arr[0,5]
  #=> [94, 89, 87, 80, 72]

First, set the two parameters: n (the number of largest values to return) and m (the number of lines to read from the file at a time).

n = 5
m = 5

The calculations are as follows.

m < n
  #=> false, so do not raise ArgumentError 
f = File.open(fname)
  #=> #<File:temp> 
best = Array.new(n)
  #=> [nil, nil, nil, nil, nil] 
n.times { |i| f.eof? ? (return best.replace(best[0,i])) : best[i] = f.readline.to_i }
best
  #=> [9, 66, 80, 64, 67]
best.sort!.reverse!
  #=> [80, 67, 66, 64, 9] 
f.eof?
  #=> false, so do not return 
new_best = Array.new(n)
  #=> [nil, nil, nil, nil, nil] 
cand = Array.new(m)
  #=> [nil, nil, nil, nil, nil]
puts "best=#{best}".rjust(52) 
until f.eof?
  rd(f, cand)
  merge_arrays(best, new_best, cand)
  puts "cand=#{cand}, best=#{best}"
end
f.close
best
  #=> [94, 89, 87, 80, 72]

The following is displayed.

                           best=[80, 67, 66, 64,  9]
cand=[94, 89, 67, 62, 10], best=[94, 89, 80, 67, 67]
cand=[68, 41, 22, 16,  0], best=[94, 89, 80, 68, 67]
cand=[87, 72, 64, 41, 24], best=[94, 89, 87, 80, 72]
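To cross-check a result like the one traced above, a much simpler chunked version can be written with each_slice. This is not the topN method from this answer: it allocates new arrays for every chunk, which is exactly what topN avoids, so it is expected to be slower on large files. The name top_n_chunked is hypothetical, and the sample file holds the 20 values from the walkthrough.

```ruby
require 'tempfile'

# Hypothetical simplified variant: reads m lines at a time and keeps the
# n largest integers seen so far, highest first. New arrays are created
# for every chunk, unlike topN, which reuses its arrays.
def top_n_chunked(fname, n, m = n)
  best = []
  File.foreach(fname).each_slice(m) do |chunk|
    best = (best + chunk.map(&:to_i)).max(n)
  end
  best
end

values = [9, 66, 80, 64, 67, 67, 89, 10, 62, 94,
          41, 16, 0, 22, 68, 72, 41, 64, 87, 24]

Tempfile.create('temp') do |tmp|
  tmp.write(values.join("\n") << "\n")
  tmp.flush
  p top_n_chunked(tmp.path, 5, 5) #=> [94, 89, 87, 80, 72]
end
```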

I have to write a program that outputs the largest X numbers, given a number X and a huge file
