Question

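I'm using this method to process a single text file of about 220,000 lines. Processing one file takes a few minutes, but I have a lot of them. Are there any recommendations for making this process faster?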

def parse_list(file_path,import=false)
# Parse the fixed-length fields
   if File.exist?(file_path)
     result=[]
     File.readlines(file_path)[5..-1].each do |rs|
        if rs.length > 140
          r=rs.strip
          unless r=='' 
            filing={
                  'name' => r[0..50].strip,
                  'form' => r[51..70].strip,
                  'type'  => r[71..80].strip,
                  'date' => r[81..90].strip,
                  'location' => r[91..-1].strip
                  }     
              result.push(filing)
          end
        end
     end
     return result
   else
     return false
   end
end

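Update

Originally, I thought there was a massive time savings from using Nex's and thetinman's methods, so I went on to test them while keeping the parsing method consistent.

Using my original r[].strip parsing method, but with Nex's each_line block method and thetinman's foreach method: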

Rehearsal ---------------------------------------------
Nex         8.260000   0.130000   8.390000 (  8.394067)
Thetinman   9.740000   0.120000   9.860000 (  9.862880)
----------------------------------- total: 18.250000sec

                user     system      total        real
Nex        14.270000   0.140000  14.410000 ( 14.397286)
Thetinman  19.030000   0.080000  19.110000 ( 19.118621)

Running again using thetinman's unpack.map parsing method:

Rehearsal ---------------------------------------------
Nex         9.580000   0.120000   9.700000 (  9.694327)
Thetinman  11.470000   0.090000  11.560000 ( 11.567294)
----------------------------------- total: 21.260000sec

                user     system      total        real
Nex        15.480000   0.120000  15.600000 ( 15.599319)
Thetinman  18.150000   0.070000  18.220000 ( 18.217744)

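unpack.map(&:strip) vs r[].strip: unpack with map doesn't seem to increase the speed, but it is an interesting method to use in the future.

I found a different issue: having apparently found substantial time savings, I went on to run Nex's and thetinman's methods manually using pry. That is where I found my computer hanging, just as it did with my original code. So I went on to test again, this time including my original code:

Rehearsal ---------------------------------------------
Original    7.980000   0.140000   8.120000 (  8.118340)
Nex         9.460000   0.080000   9.540000 (  9.546889)
Thetinman  10.980000   0.070000  11.050000 ( 11.042459)
----------------------------------- total: 28.710000sec

                user     system      total        real
Original   16.280000   0.140000  16.420000 ( 16.414070)
Nex        15.370000   0.080000  15.450000 ( 15.454174)
Thetinman  20.100000   0.090000  20.190000 ( 20.195533)

My code and Nex's and thetinman's methods seem comparable, with Nex's being the fastest under Benchmark. However, Benchmark doesn't seem to tell the whole story, because testing the code manually in pry makes all of the methods take substantially longer, so long that I cancel before getting the result back.

I have some remaining questions:

Is there something specific about running something like this in IRB / Pry that would produce these odd results and make the code run much slower?
If I run original_method.count, nex_method.count, or thetinmans_method.count, they all seem to return quickly.
Because of memory issues and scalability, the original method is not recommended. In the future, though, is there a way to test memory usage with something like Benchmark?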

Update for NEX, using activerecord-import:

@nex, is this what you mean? This still seems to run slowly for me, but when you say:

import one set of data inside that block.

how do you recommend modifying it?

def parse_line(line)
  filing={
    'name' => line[0..50].strip,
    'form' => line[51..70].strip,
    'type' => line[71..80].strip,
    'date' => line[81..90].strip,
    'location' => line[91..-1].strip
  }
end

def import_files
  result=[]
  parse_list_nix(file_path){|line|
    filing=parse_line(line)
    result.push(Filing.new(filing))
  }
  Filing.import result #result is an array of new records that are all imported at once
end

As you can see, the results of the activerecord-import method are substantially slower:

Rehearsal ------------------------------------------
import 534.840000   1.860000 536.700000 (553.507644)
------------------------------- total: 536.700000sec

             user     system      total        real
import 263.220000   1.320000 264.540000 (282.751891)

Does this slow import process seem normal?

It just seems super slow to me. I'm trying to figure out how to speed this up, but I'm out of ideas.

Answer 1

It's hard to confirm this without sample data, but, based on the original code, I'd probably write something like this:

require 'English' # defines $INPUT_LINE_NUMBER as an alias for $.

# Parse the fixed-length fields
def parse_list(file_path,import=false)

  return false unless File.exist?(file_path)

  result=[]
  File.foreach(file_path) do |rs|
    next unless $INPUT_LINE_NUMBER > 5
    next unless rs.length > 140

    r = rs.strip
    if r > '' 
      name, form, type, date, location = r.unpack('A51 A20 A10 A10 A*').map(&:strip)
      result << {
        'name'     => name,
        'form'     => form,
        'type'     => type,
        'date'     => date,
        'location' => location
      }
    end
  end

  result
end
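
Since no sample data was posted, here is a small sketch of what the unpack format string does, using a made-up record. The sample values and the helper names (sample_line, split_fields, slice_fields) are assumptions for illustration only. It checks that 'A51 A20 A10 A10 A*' picks out the same columns as the range slices in the question.

```ruby
# Build one hypothetical fixed-width record: 51 + 20 + 10 + 10 columns, then the rest.
def sample_line
  'ACME HOLDINGS INC'.ljust(51) + '10-K'.ljust(20) + 'annual'.ljust(10) +
    '2013-01-15'.ljust(10) + '  edgar/data/0000001/example.txt  '
end

# Split a record into its five fields with String#unpack.
# 'A51' reads 51 bytes and drops trailing spaces; 'A*' takes whatever is left.
# The extra strip removes any leading spaces as well.
def split_fields(line)
  line.unpack('A51 A20 A10 A10 A*').map(&:strip)
end

# The same split written with the range slices from the question.
def slice_fields(line)
  [line[0..50], line[51..70], line[71..80], line[81..90], line[91..-1]].map(&:strip)
end

p split_fields(sample_line)
p split_fields(sample_line) == slice_fields(sample_line)
```

One caveat: unpack counts bytes while the range slices count characters, so the two only agree when the data is single-byte (plain ASCII, for example).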

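220,000 lines isn't a big file where I come from. We get log files three times that size by mid-morning, so any file I/O that slurps the whole file is out. Ruby's IO class has two methods for line-by-line I/O and a number that return arrays. You want the former because they're scalable. Unless you can guarantee that the file being read will fit comfortably in Ruby's memory, avoid the latter.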
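
To make the difference concrete, here is a minimal sketch that does the same job both ways. The method names and the generated sample file are made up for the example. File.foreach hands the block one line at a time, while File.readlines builds an array holding every line before any of them is examined.

```ruby
require 'tempfile'

# Count lines longer than min_length while holding only one line at a time.
def count_long_lines_streaming(path, min_length)
  count = 0
  File.foreach(path) { |line| count += 1 if line.length > min_length }
  count
end

# Same result, but File.readlines first loads every line into an array.
def count_long_lines_slurping(path, min_length)
  File.readlines(path).count { |line| line.length > min_length }
end

# Hypothetical data: 1,000 short lines interleaved with 1,000 long ones.
sample = Tempfile.new('filings')
1_000.times do
  sample.puts('x' * 10)
  sample.puts('y' * 150)
end
sample.close

puts count_long_lines_streaming(sample.path, 140)
puts count_long_lines_slurping(sample.path, 140)
sample.unlink
```

Both calls return the same count; the difference is only in how much of the file is resident at once, which is what matters as the files grow.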

Answer 2

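The problem is that you're filling up your memory. What are you going to do with the result? Does it have to stay in memory as a whole, or would it be an option to process it one line at a time with a block?

Also, you shouldn't use readlines here. Do something like this instead, since it uses an Enumerator: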

def parse_list(file_path, import=false)
  i = 0
  # The block form of File.open closes the file when the block returns
  File.open(file_path, 'r') do |file|
    file.each_line do |line|
      line.strip!
      # Skip the five header lines and any line too short to hold every field
      next if (i += 1) <= 5 || line.length < 141
      filing = { 'name' => line[0..50].strip,
                 'form' => line[51..70].strip,
                 'type' => line[71..80].strip,
                 'date' => line[81..90].strip,
                 'location' => line[91..-1].strip }
      yield(filing) if block_given?
    end
  end
end

# and calling it like this:
parse_list('/tmp/foobar') { |filing|
  Filing.new(filing).import
}
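
If one record per block call turns out to be too fine-grained, a possible middle ground is to yield the records in fixed-size batches, so that only one batch is in memory at a time. This is just a sketch under assumptions: each_batch is a hypothetical helper, parse_line mirrors the field layout from the question, and the sample file is generated for the demonstration. Inside the block, each batch could then be handed to whatever bulk insert is in use.

```ruby
require 'tempfile'

# Field layout taken from the question.
def parse_line(line)
  { 'name' => line[0..50].strip,
    'form' => line[51..70].strip,
    'type' => line[71..80].strip,
    'date' => line[81..90].strip,
    'location' => line[91..-1].strip }
end

# Yield arrays of at most batch_size parsed records.
# each_batch is a hypothetical helper name, not part of any library.
def each_batch(file_path, batch_size = 1_000)
  batch = []
  File.foreach(file_path).with_index do |line, index|
    # Skip the five header lines and lines too short to hold every field
    next if index < 5 || line.length <= 140

    batch << parse_line(line.strip)
    if batch.size >= batch_size
      yield batch
      batch = []
    end
  end
  yield batch unless batch.empty? # flush the final, partial batch
end

# Hypothetical sample file: 5 header lines followed by 2,500 records.
sample = Tempfile.new('filings')
5.times { sample.puts('header') }
2_500.times do |n|
  sample.puts("COMPANY #{n}".ljust(51) + '10-K'.ljust(20) + 'annual'.ljust(10) +
              '2013-01-15'.ljust(10) + 'somewhere'.ljust(60))
end
sample.close

each_batch(sample.path, 1_000) { |batch| puts batch.size }
sample.unlink
```

The batch size is a trade-off: larger batches mean fewer round trips for a bulk insert, smaller ones keep less in memory.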

How can I improve the performance of the process that converts a file into an array of hashes?

2 answers: