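I am trying to take one large file and split it into many smaller files. Where each split occurs is determined by a predicate that examines the content of each line (the isNextObject function).
I read the large file with File.ReadLines so that I can iterate over it one line at a time without holding the whole file in memory. My approach is to group the sequence into a series of smaller subsequences, one per file to be written out.
I found a useful function called groupWhen that Tomas Petricek posted on fssnip. It worked well in my initial tests on a small portion of the file, but it throws a StackOverflowException when run against the real file. I am not sure how to adapt groupWhen to prevent this (I am still an F# greenie).
Here is a simplified version of the code, showing only the relevant parts needed to reproduce the StackOverflowException:
// This is the function created by Tomas Petricek where the StackOverflowException is occurring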
module Seq =
    /// Iterates over elements of the input sequence and groups adjacent elements.
    /// A new group is started when the specified predicate holds about the element
    /// of the sequence (and at the beginning of the iteration).
    ///
    /// For example:
    ///    Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
    let groupWhen f (input:seq<_>) = seq {
        use en = input.GetEnumerator()
        let running = ref true

        // Generate a group starting with the current element. Stops generating
        // when it finds an element such that 'f en.Current' is 'true'
        let rec group() =
            [ yield en.Current
              if en.MoveNext() then
                  if not (f en.Current) then yield! group() // *** Exception occurs here ***
              else running := false ]

        if en.MoveNext() then
            // While there are still elements, start a new group
            while running.Value do
                yield group() |> Seq.ofList }
Here is the gist of the code that uses Tomas's function:
module Extractor =
    open System
    open System.IO
    open Microsoft.FSharp.Reflection

    // ... elided a few functions, including "isNextObject", which is
    //     a string -> bool (examines the line and returns true
    //     if the string meets the criteria indicating that we are at the
    //     start of the next inner file)

    let writeFile outputDir file =
        // ... write out "file" to the file system
        // NOTE: file is a seq<string>

    let writeFiles outputDir (files : seq<seq<_>>) =
        files
        |> Seq.iter (fun file -> writeFile outputDir file)
Here is the relevant code from the console application that uses these functions:
let lines = inputFile |> File.ReadLines
writeFiles outputDir (lines |> Seq.groupWhen isNextObject)
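Any ideas on the proper way to stop groupWhen from blowing the stack? I am not sure how I would convert the function to use an accumulator (or to use a continuation instead, which I think is the correct terminology).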
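Answer 0 (score: 7)
The problem is that the group() function returns a list, which is an eagerly evaluated data structure. Every call to group() must therefore run to completion, collect all of its results into a list, and return that list. This means the recursive call happens within that same evaluation, i.e. it is truly recursive, which creates stack pressure.
To mitigate this, you can simply replace the list with a lazy sequence: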
let rec group() = seq {
    yield en.Current
    if en.MoveNext() then
        if not (f en.Current) then yield! group()
    else running := false }
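For reference, here is a self-contained sketch of the whole function with that change applied. The name groupWhenLazy is hypothetical, and running.Value <- false replaces := only because recent F# versions suggest .Value over the older operator; otherwise the logic follows the original. One consequence of lazy groups, as far as I can tell, is that all groups share a single enumerator, so each group has to be fully enumerated before the outer sequence moves on. The writeFiles usage in the question satisfies this, since it writes each file before asking for the next.

```fsharp
/// Hypothetical variant of groupWhen in which every group is a lazy seq.
let groupWhenLazy (f: 'T -> bool) (input: seq<'T>) : seq<seq<'T>> = seq {
    use en = input.GetEnumerator()
    let running = ref true

    // Lazily yields the current element, then continues until the
    // predicate holds for the next element or the input ends.
    let rec group () = seq {
        yield en.Current
        if en.MoveNext() then
            if not (f en.Current) then yield! group ()
        else running.Value <- false }

    if en.MoveNext() then
        while running.Value do
            yield group () }

let isOdd n = n % 2 <> 0

// Each group is materialized before the next one is requested.
let groups =
    groupWhenLazy isOdd [3; 3; 2; 4; 1; 2]
    |> Seq.map List.ofSeq
    |> List.ofSeq
// groups = [[3]; [3; 2; 4]; [1; 2]]
```

Something like Seq.length applied directly to the result would never finish, because no group is ever consumed and the enumerator never advances.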
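However, I would consider less drastic approaches. This example is a good illustration of why you should avoid writing recursion yourself and reach for a ready-made fold instead.
For example, judging by your description, it seems that Seq.windowed may work for you.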
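To illustrate the general idea of avoiding hand-written recursion, here is a minimal sketch of the same grouping done with a plain loop and a buffer. This is not Tomas's code and not Seq.windowed; groupWhenIter is a hypothetical name, and the sketch assumes each individual group is small enough to hold in memory, which seems reasonable when every group becomes one output file.

```fsharp
/// Hypothetical loop-based grouping: no recursion, so stack depth does not
/// grow with the number of elements in a group.
let groupWhenIter (f: 'T -> bool) (input: seq<'T>) : seq<'T list> = seq {
    use en = input.GetEnumerator()
    let buffer = ResizeArray<'T>()
    while en.MoveNext() do
        // A matching element closes the current group (if any) and opens a new one.
        if f en.Current && buffer.Count > 0 then
            yield List.ofSeq buffer
            buffer.Clear()
        buffer.Add en.Current
    // Emit whatever is left once the input ends.
    if buffer.Count > 0 then
        yield List.ofSeq buffer }

let isOddNumber n = n % 2 <> 0

let grouped = groupWhenIter isOddNumber [3; 3; 2; 4; 1; 2] |> List.ofSeq
// grouped = [[3]; [3; 2; 4]; [1; 2]]
```

Because each group is copied into a list before it is yielded, the result can be consumed in any order, at the cost of buffering one group at a time.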
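Answer 1 (score: 6)
It is easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, and they are slow.
So (not actually answering your question), I personally would just fold over the sequence of lines using something like this: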
let isNextObject line =
    line = "---"

type State = {
    fileIndex : int
    filename: string
    writer: System.IO.TextWriter
    }

let makeFilename index =
    sprintf "File%i" index

let closeFile (state:State) =
    //state.writer.Close() // would use this in real code
    state.writer.WriteLine("=== Closing {0} ===",state.filename)

let createFile index =
    let newFilename = makeFilename index
    let newWriter = System.Console.Out // dummy
    newWriter.WriteLine("=== Creating {0} ===",newFilename)
    // create new state with new writer
    {fileIndex=index + 1; writer = newWriter; filename=newFilename }

let writeLine (state:State) line =
    if isNextObject line then
        /// finish old file here
        closeFile state
        /// create new file here and return updated state
        createFile state.fileIndex
    else
        //write the line to the current file
        state.writer.WriteLine(line)
        // return the unchanged state
        state

let processLines (lines: string seq) =
    //setup
    let initialState = createFile 1
    // process the file
    let finalState = lines |> Seq.fold writeLine initialState
    // tidy up
    closeFile finalState
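(Obviously, a real version would use files rather than the console.)
Yes, it is crude, but it is easy to reason about, with no unpleasant surprises.
Here is a test: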
processLines [
    "a"; "b"
    "---";"c"; "d"
    "---";"e"; "f"
    ]
Here is the output:
=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
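As a rough sketch of what the file-based version might look like, the same fold can carry a real writer in its state. Everything named here is an assumption for illustration: the FileState record, the openFile and splitFile helpers, and the File1.txt, File2.txt naming scheme. As in the console version, the "---" delimiter line itself is not written to any file.

```fsharp
open System.IO

// Assumption: same delimiter as in the sample above.
let isDelimiter (line: string) = line = "---"

// Hypothetical state: index of the next file plus the writer for the current one.
type FileState = { nextIndex: int; writer: TextWriter }

// Hypothetical naming scheme: File1.txt, File2.txt, ... inside outputDir.
let openFile (outputDir: string) (index: int) : FileState =
    let path = Path.Combine(outputDir, sprintf "File%i.txt" index)
    { nextIndex = index + 1; writer = File.CreateText path :> TextWriter }

let splitFile (outputDir: string) (lines: seq<string>) : unit =
    let step (state: FileState) (line: string) =
        if isDelimiter line then
            // Close the finished file before opening the next one.
            state.writer.Dispose()
            openFile outputDir state.nextIndex
        else
            state.writer.WriteLine line
            state
    let finalState = lines |> Seq.fold step (openFile outputDir 1)
    // Tidy up the last file.
    finalState.writer.Dispose()
```

With the names from the question, usage would be along the lines of File.ReadLines inputFile |> splitFile outputDir. Seq.fold pulls one line at a time, so the whole input is never held in memory. The sketch does not close the current writer if an exception is thrown partway through; real code would probably want a try/finally around the fold.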