Performance test results

Question

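I'm using Get-Content to read a fairly large file (252 MB), but when I do, the PowerShell process consumes around 10 GB of memory. Is this normal behavior?

The array only has about 6 million items. That doesn't seem proportional to the amount of memory used.

Maybe I'm just going about this the wrong way entirely.

I want to write each line that matches a string, plus the line that follows it, to a new text file.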

$mytext = get-content $inpath
$search = "*tacos*"
$myindex = 0..($mytext.count - 1) | Where {$mytext[$_] -like $search}
$outtext = @()
foreach ($i in $myindex){
    $outtext = $outtext + $mytext[$i] + $mytext[$i+1]
    }
$outtext | out-file -filepath $outpath

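Performance test results

I timed the different scripts suggested in the answers to get a rough performance comparison.

My original script

(highly sensitive to the number of lines written out)

10k lines - 1.8s
100k lines - 38s
100k lines - 21s (search string rarely occurs)
5000k lines - took too long to measure (aborted after a few hours)

Select-String without Get-Content (adapted from whatever)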
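For anyone who wants to repeat this kind of comparison, here is a minimal sketch of how one variant might be timed with Measure-Command. The temp file names and the generated sample data are assumptions made up for illustration. They are not the files behind the numbers in this post.

```powershell
# Hypothetical test fixture: a small temp file with a known number of matches.
$inpath  = Join-Path ([System.IO.Path]::GetTempPath()) 'timing-sample-in.txt'
$outpath = Join-Path ([System.IO.Path]::GetTempPath()) 'timing-sample-out.txt'
$search  = 'tacos'

# Every 100th line contains the search term.
1..10000 | ForEach-Object {
    if ($_ % 100 -eq 0) { "line $_ tacos" } else { "line $_" }
} | Set-Content -Path $inpath

# Measure-Command runs the script block and returns a TimeSpan.
$elapsed = Measure-Command {
    Select-String -Path $inpath -Pattern $search -Context 0,1 -SimpleMatch |
        Out-File -FilePath $outpath
}

'{0:N1}s' -f $elapsed.TotalSeconds
```

Swapping the body of the script block for each candidate script, against the same input file, gives comparable figures.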

Select-String -path $inpath -pattern $search -Context 0,1 -SimpleMatch | Out-File $outpath

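10k lines - 1.2s
100k lines - 4s
1000k lines - 107s

Note that going from 10k to 100k lines (10x the input) only increased the processing time by about 4x. The more data you try to process at once, the better this solution does relative to the others.

Eliminating array resizing (from Mathias)

10k lines - 2.0s
100k lines - 25s
1000k lines - 1533s (used 1.7 GB of memory, the same as running gc on the 1000k-line file outside the script)

Using the pipeline (from Chris Dent)

100k lines - 26s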

Answer 1

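"the PowerShell process consumes around 10 GB of memory. [...] The array only has about 6 million items. That doesn't seem proportional to the amount of memory used."

Running Get-Content against a file with 6 million lines produces 6 million string objects - and allocating a string object allocates memory not only for the characters themselves, but also for the object header and other overhead.

That only accounts for about 5-10% of what you're seeing, though - the real problem is this construct: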

$outtext = @() # this
foreach ($i in $myindex){
    $outtext = $outtext + $mytext[$i] + $mytext[$i+1] # and this
}

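Every time you assign a new value to the array this way, the underlying array has to be resized. .NET arrays have a fixed size, so each + allocates a new array and copies the existing contents into it.

Change it to: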

$outtext = foreach ($i in $myindex){
    $mytext[$i],$mytext[$i+1]
}
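If the results have to be added from more than one place, so that assigning the loop output directly is awkward, a generic list is another way to avoid the repeated copying. The following is a minimal sketch, not part of the original answer. The sample array is made up for illustration, and `::new()` assumes PowerShell 5 or later.

```powershell
# Sample data standing in for $mytext (hypothetical, for illustration only).
$mytext  = 'a', 'tacos 1', 'b', 'c', 'tacos 2', 'd'
$search  = '*tacos*'
$myindex = 0..($mytext.Count - 1) | Where-Object { $mytext[$_] -like $search }

# List[string] grows in place, so Add does not copy the whole collection each time.
$outtext = [System.Collections.Generic.List[string]]::new()
foreach ($i in $myindex) {
    $outtext.Add($mytext[$i])
    # Guard against a match on the very last line, which has no following line.
    if ($i + 1 -lt $mytext.Count) {
        $outtext.Add($mytext[$i + 1])
    }
}

$outtext
```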

Answer 2

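The pipeline is your friend. The indexing step gains you nothing; it only takes longer and puts more into memory.

This gets the lines you're searching for, plus the one line of context you need (going by your example). Nothing is loaded into memory except the items that match your search and the line after each of them.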

$getNext = $false
$outtext = Get-Content $inPath | ForEach-Object {
    if ($_ -like $search) {
        $_
        $getNext = $true
    }
    elseif ($getNext) { #reads the following line on next iteration
        $_
        $getNext = $false
    }
}
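The same flag-based idea can also be written against the .NET file APIs, so that neither the input nor the output is held in memory. This is a swapped-in technique, not the answer's own code. The function name `Copy-MatchWithNextLine` and its parameters are hypothetical, and `::new()` assumes PowerShell 5 or later. Note that StreamWriter writes UTF-8 by default, which may differ from what Out-File produces.

```powershell
function Copy-MatchWithNextLine {
    param(
        # Use full paths: .NET methods do not follow PowerShell's current location.
        [Parameter(Mandatory)] [string] $InPath,
        [Parameter(Mandatory)] [string] $OutPath,
        # Wildcard pattern, for example '*tacos*'.
        [Parameter(Mandatory)] [string] $Search
    )

    $writer = [System.IO.StreamWriter]::new($OutPath)
    try {
        $getNext = $false
        # ReadLines enumerates the file lazily, one line at a time.
        foreach ($line in [System.IO.File]::ReadLines($InPath)) {
            if ($line -like $Search) {
                $writer.WriteLine($line)
                $getNext = $true
            }
            elseif ($getNext) {
                # The line right after a match.
                $writer.WriteLine($line)
                $getNext = $false
            }
        }
    }
    finally {
        # Flushes and closes the output file even if reading fails.
        $writer.Dispose()
    }
}
```

It might be called as `Copy-MatchWithNextLine -InPath $inpath -OutPath $outpath -Search '*tacos*'`, with both paths fully qualified.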

Answer 3

Another option is Select-String:

$search = "tacos"
Get-Content $inpath | Select-String $search -Context 0,1 | Out-File $OutputFile -Append

However, this produces slightly altered output:

match
following line

becomes

> match
  following line

If you want the exact lines from the file:

Get-Content $inpath | Select-String $search -Context 0,1 | foreach {$_.Line | Out-File $OutputFile -Append ; $_.Context.Postcontext |  Out-File $OutputFile -Append}
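The same extraction can be sketched with a single write at the end of the pipeline instead of two Out-File -Append calls per match. This version also lets Select-String read the file itself via -Path, as in the timing test in the question. The temp file names and sample lines are assumptions for illustration only.

```powershell
# Hypothetical sample input.
$inpath     = Join-Path ([System.IO.Path]::GetTempPath()) 'select-string-in.txt'
$OutputFile = Join-Path ([System.IO.Path]::GetTempPath()) 'select-string-out.txt'
$search     = 'tacos'
'first', 'tacos here', 'after', 'last' | Set-Content -Path $inpath

Select-String -Path $inpath -Pattern $search -SimpleMatch -Context 0,1 |
    ForEach-Object {
        $_.Line                  # the matching line itself
        $_.Context.PostContext   # the line after it; empty when the match is the last line
    } |
    Set-Content -Path $OutputFile
```

Unlike the -Append version, Set-Content overwrites $OutputFile on each run.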

By the way: once files get really large, Get-Content gets a bit slow. When that happens, you're better off doing this:

$TMPVar = Get-Content $inpath -Readcount 0
$TMPVar | Select-String....

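This makes Get-Content read the entire file in one go instead of line by line, which is much faster but needs more memory than piping it straight into the next cmdlet.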
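A small sketch of what -ReadCount 0 changes, using a made-up five-line temp file: by default each line travels down the pipeline as its own object, while with -ReadCount 0 all lines arrive as one array.

```powershell
# Hypothetical sample file.
$inpath = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-sample.txt'
1..5 | ForEach-Object { "line $_" } | Set-Content -Path $inpath

# Default: one pipeline object per line.
$perLine = @(Get-Content $inpath | ForEach-Object { $_.GetType().Name })

# -ReadCount 0: the whole file is sent as a single array object.
$allAtOnce = @(Get-Content $inpath -ReadCount 0 | ForEach-Object { $_.GetType().Name })

"default:      $($perLine.Count) pipeline objects of type $($perLine[0])"
"-ReadCount 0: $($allAtOnce.Count) pipeline object of type $($allAtOnce[0])"
```

Assigning that single array to a variable and then piping the variable, as above, enumerates the lines again for Select-String.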

PS Get-Content high memory usage - is there a more efficient way to filter a file?