Question

I need to read in a very large file and write out the parsed results chunk by chunk, rather than building one complete list at the end.

So far I can collect the unique results in a MapSet, but I can't figure out how to write them to a file per chunk_size.

I use this function to get a unique file name:

def new_file_name do
  # Hash the current time in milliseconds and hex-encode the digest
  :crypto.hash(:md5, Integer.to_string(:os.system_time(:millisecond)))
  |> Base.encode16()
end

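The best I have so far gives me a list of MapSets, one per chunk, each holding that chunk's unique results. Since it is a list of MapSets, it could end up using too much memory.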

def parse(file_path, chunk_size) do
  file_path
  |> File.stream!(read_ahead: chunk_size)
  |> Stream.drop(1)  # remove header
  |> Stream.map(&(String.trim_trailing(&1) <> "\n"))  # one row per line, ready to be written as CSV
  |> Stream.chunk_every(chunk_size)  # break up into chunks
  |> method()  # placeholder: should write each chunk to a file
end

What I had before was:

|> Stream.map(&MapSet.new(&1))  # Create MapSet of unique values from each chunk

But I can't seem to find any examples of writing a MapSet to a file.

Answer 1

You can use Enum.reduce/3 with the file handle as the accumulator, so the file is opened once and then written to one chunk at a time:

def parse(file_path, chunk_size) do
  file_path
  |> File.stream!(read_ahead: chunk_size)
  |> Stream.drop(1)  # remove header
  |> Stream.map(&(String.trim_trailing(&1) <> "\n"))  # one row per line, ready to be written as CSV
  |> Stream.chunk_every(chunk_size)  # break up into chunks
  |> Enum.reduce(File.open!("output.txt", [:write]), fn chunk, file ->
    :ok = IO.write(file, chunk)  # chunk is a list of strings, written as-is
    file
  end)
  |> File.close()  # the reduce returns the file handle; close it when done
end

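You may want to adjust how each chunk is written to the file. The code above treats chunk as iodata, which effectively concatenates the strings in the chunk and writes them out.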
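To make the iodata point concrete, here is a minimal sketch. The module name IodataSketch, the helper write_chunk/2, and the temp file name are made up for illustration; only IO.write/2, IO.iodata_to_binary/1, and File.open!/3 come from Elixir itself. It shows that a list of strings can be handed to IO.write/2 directly, with no Enum.join/1 first.

```elixir
defmodule IodataSketch do
  # Hypothetical helper: writes one chunk (a list of strings) to an open file.
  def write_chunk(device, chunk) when is_list(chunk) do
    # A list of binaries is valid iodata, so no explicit join is needed.
    IO.write(device, chunk)
  end
end

chunk = ["a,1\n", "b,2\n", "c,3\n"]

# Flattening the iodata gives the same bytes as joining the strings.
true = IO.iodata_to_binary(chunk) == Enum.join(chunk)

# Assumed output location, for demonstration only.
path = Path.join(System.tmp_dir!(), "iodata_sketch.txt")
:ok = File.open!(path, [:write], fn file -> IodataSketch.write_chunk(file, chunk) end)
```

If each item needs a separator or other formatting, map over the chunk before writing instead of changing the write call.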

If you only want to write the unique items of each chunk, you can add:

|> Stream.map(fn chunk -> chunk |> MapSet.new |> MapSet.to_list end)

before piping into Enum.reduce/3.
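If the goal is instead a separate file per chunk, as the question title suggests, something along these lines might work. This is only a sketch under assumptions: the module name ChunkFilesSketch, the function write_chunks/3, and the "chunk_N.csv" naming scheme are all hypothetical and not taken from the question. Each chunk is deduplicated with a MapSet and written right away, so only one chunk is held in memory at a time.

```elixir
defmodule ChunkFilesSketch do
  # Hypothetical: writes the unique lines of every chunk to its own file
  # ("chunk_0.csv", "chunk_1.csv", ...) inside out_dir and returns the paths.
  def write_chunks(file_path, out_dir, chunk_size) do
    File.mkdir_p!(out_dir)

    file_path
    |> File.stream!()
    |> Stream.drop(1)                  # remove header
    |> Stream.chunk_every(chunk_size)  # the last chunk may be shorter
    |> Stream.with_index()
    |> Enum.map(fn {chunk, index} ->
      out_path = Path.join(out_dir, "chunk_#{index}.csv")
      # MapSet.to_list/1 yields a plain list of strings, which
      # File.write!/2 accepts as iodata. Line order is not preserved.
      File.write!(out_path, chunk |> MapSet.new() |> MapSet.to_list())
      out_path
    end)
  end
end
```

A call could look like ChunkFilesSketch.write_chunks("big.csv", "out", 10_000). Lines coming from File.stream! keep their trailing newline, so they are written as-is. Uniqueness here is per chunk only; the same value can still appear in several output files.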

Answer 2

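With @Dogbert's help I found an interesting approach. Using Stream capped me at 100% CPU usage. With this I was able to reach up to 256% CPU usage. It was run on several 300MB files and took 30 minutes to parse them.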

def alt_flow_parse_dir(path, out_file, chunk_size) do
  concat_unique =  File.open!(path <> "/" <> out_file, [:read, :utf8, :write])

  Path.wildcard(path <> "/*.csv")
    |> Flow.from_enumerable
    |> Flow.map(&append_to_file(&1, path, concat_unique, chunk_size))
    |> Flow.run

  File.close(concat_unique)
end

# I just want the unique items of the first column
def append_to_file(filename, path, out_file, chunk_size) do
  path
  |> Path.join(Path.basename(filename))  # rebuild the full path with a proper separator
  |> File.stream!()
  |> Stream.drop(1)  # remove header
  |> Flow.from_enumerable()
  |> Flow.map(&(&1 |> String.split(",") |> List.first()))  # keep the first column
  |> Flow.map(&String.trim(&1, "\n"))
  |> Flow.partition()
  |> Stream.chunk_every(chunk_size)
  |> Flow.from_enumerable()
  |> Flow.map(fn chunk ->
    # unique items within this chunk
    chunk
    |> MapSet.new()
    |> MapSet.to_list()
  end)
  |> Flow.map(fn items ->
    Enum.each(items, &IO.puts(out_file, &1))
  end)
  |> Flow.run()
end

How to write a file per chunk in an Elixir Stream
