Question

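I already use the gettextfile method to fetch records from an FTP server, process each record in the given block, and finally put the result somewhere else.

The file is a CSV file, and I need to use CSV to get the headers and the data, do some work on them, and then store them in a database. Since I have many different files, I need a generic approach. I don't want to load all the records into memory or onto disk, because the files can be very large! So a stream would be good.

One idea is to give CSV an IO object, but I don't know how to do that with Net::FTP.

I have seen "http://stackoverflow.com/questions/5223763/how-to-ftp-in-ruby-without-first-saving-the-text-file", but that approach applies to PUT (uploading), not to retrieving a file.

Any help?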

Answer 1

The technique Justin mentioned creates a temporary file.

You can use retrlines instead:

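   filedata = ''
   ftp.retrlines("RETR " + filename) do |line|
      # retrlines yields each line without its line terminator,
      # so add the newline back to keep the rows separate.
      filedata << line << "\n"
   end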

Or retrbinary:

   filedata = ''
   ftp.retrbinary("RETR " + filename, Net::FTP::DEFAULT_BLOCKSIZE) do |block|
      filedata << block
   end
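
Once the data is in filedata, it can be handed to CSV directly. The following is only a minimal sketch: it assumes the whole file fits in memory (which the question says is not guaranteed), and it uses a made-up sample string in place of the FTP download.

```ruby
require 'csv'

# Stand-in for the string accumulated by retrlines/retrbinary above;
# the column names and values are made up for illustration.
filedata = "id,name\n1,Alice\n2,Bob\n"

# headers: true treats the first row as the header row.
table = CSV.parse(filedata, headers: true)

headers = table.headers   # column names taken from the first row
rows = table.map(&:to_h)  # one Hash per data row, keyed by header
```

Each element of rows can then be written to the database one at a time.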

Answer 2

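I think you would normally use gettextfile for this. You can accumulate part of the file into an Array and process it with CSV once a threshold is reached. Here is some untested code that processes ten lines at a time: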

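# Defined first so it exists by the time the block below calls it.
def process_chunk!(lines_in_chunk)
  # process partial chunk of lines as if it were the whole file
  lines_in_chunk.clear # empty the caller's array in place
end

chunk = []

# Passing nil as the local file name keeps gettextfile from also
# writing a copy of the file to disk.
ftp.gettextfile('file.csv', nil) do |line|
  chunk << line
  process_chunk!(chunk) if chunk.size >= 10
end

process_chunk!(chunk) unless chunk.empty? # Any remaining lines in final partial chunk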

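This seems like the simpler solution to me, but you could also work with multiple Unix processes (writing to and reading from STDOUT) or with Ruby threads in a producer-consumer model.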
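
To make the chunk idea concrete, here is one possible shape for the processing step. ChunkParser is a hypothetical helper, not part of Net::FTP or CSV, and the sample lines are made up. It keeps the header line from the first chunk so that later chunks can be parsed on their own. Note the assumption it relies on: a quoted field containing a line break must not straddle a chunk boundary, otherwise the chunk will not parse.

```ruby
require 'csv'

# Hypothetical helper: parses chunks of raw lines, remembering the header
# line from the first chunk so later chunks can be parsed independently.
class ChunkParser
  def initialize
    @header_line = nil
  end

  # lines: Array of Strings without trailing newlines.
  # Returns an Array of Hashes, one per data row in this chunk.
  def parse(lines)
    lines = lines.dup
    # Only the very first chunk carries the header line.
    @header_line ||= lines.shift
    return [] if lines.empty?

    text = ([@header_line] + lines).join("\n") + "\n"
    CSV.parse(text, headers: true).map(&:to_h)
  end
end

parser = ChunkParser.new
all_lines = ["id,name", "1,Alice", "2,Bob", "3,Carol"] # made-up data

# each_slice stands in for the chunks collected inside the gettextfile block.
rows = all_lines.each_slice(2).flat_map { |chunk| parser.parse(chunk) }
```

In the code above, the body of process_chunk! could call parser.parse(lines_in_chunk) before clearing the array.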

Answer 3

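The solution I came up with uses IO.pipe and a thread that iterates over the text lines of the FTP file (some of which may be fragments of a row, inside quotes) and puts each line to the IO writer.

In the main thread, I create a CSV instance on top of the IO reader and iterate over the parsed rows it produces.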

require 'csv'
require 'net/ftp'

def stream_ftp_csv_test(ftp, filename)
  read_io, write_io = IO.pipe

  fetcher = Thread.new do
    begin
      # nil as the local file name: do not also write a copy to disk.
      ftp.gettextfile(filename, nil) do |line|
        write_io.puts line
      end
    ensure
      write_io.close
    end
  end

  csv = CSV.new(read_io, headers: :first_row)
  csv.each do |row|
    # Printing the row hashes here as an example.
    # You could yield each one to a given block
    # argument or whatever else makes sense.
    p row.to_h
  end

  fetcher.join
ensure
  read_io.close if read_io
end
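
As the comment in the method suggests, the rows can be yielded to a block instead of printed. The sketch below shows that variant. So that it runs without a server, it uses FakeFtp, a hypothetical stand-in that only mimics how gettextfile yields lines; each_ftp_csv_row is likewise a name chosen for illustration. With a real connection you would pass a Net::FTP object instead.

```ruby
require 'csv'

# Hypothetical stand-in for Net::FTP so the sketch runs without a server.
# It only mimics the part of gettextfile used here: yielding each line
# with its line terminator stripped.
class FakeFtp
  def initialize(lines)
    @lines = lines
  end

  def gettextfile(_remotefile, _localfile = nil)
    @lines.each { |line| yield line }
  end
end

# Yields each parsed CSV row (as a Hash) to the caller's block.
def each_ftp_csv_row(ftp, filename)
  read_io, write_io = IO.pipe

  # Producer: copy the remote lines into the write end of the pipe.
  fetcher = Thread.new do
    begin
      ftp.gettextfile(filename, nil) { |line| write_io.puts line }
    ensure
      write_io.close # signals end of input to the CSV reader
    end
  end

  # Consumer: CSV reads from the pipe and reassembles quoted multi-line fields.
  CSV.new(read_io, headers: true).each { |row| yield row.to_h }
  fetcher.join
ensure
  read_io.close if read_io
end

# Made-up data; the second record has a quoted field spanning two lines.
ftp = FakeFtp.new(["id,note", '1,"first', 'second"', "2,plain"])
rows = []
each_ftp_csv_row(ftp, "file.csv") { |row| rows << row }
```

Because CSV parses the stream as a whole, a quoted field that spans several FTP lines still arrives as a single value.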

How to get FTP records in Ruby without first saving the text file, and feed CSV with them
