Question

I need to parse a large pipe-delimited file and count the records whose 5th column does, and does not, meet my criteria.

PS C:\temp> gc .\items.txt -readcount 1000 | `
  ? { $_ -notlike "HEAD" } | `
  % { foreach ($s in $_) { $s.split("|")[4] } } | `
  group -property {$_ -ge 256} -noelement | `
  ft -autosize

This command does what I want and returns output like this:

  Count Name
  ----- ----
1129339 True
2013703 False

However, for a 500 MB test file this command takes about 5.5 minutes to run, as measured by Measure-Command. A typical file is over 2 GB, and waiting 20+ minutes is undesirably long.

Do you see a way to improve the performance of this command?

For example, is there a way to determine an optimal value for Get-Content's ReadCount? Without it, the same file takes 8.8 minutes to complete.

Answer 1

Have you tried StreamReader? I think Get-Content loads the whole file into memory before doing anything with it.

StreamReader class

Answer 2

Using @Gisli's hint, here is the script I ended up with:

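param($file = $(Read-Host -Prompt "File"))
# Resolve to a full path: .NET APIs do not follow PowerShell's current location.
$fullName = (Get-Item "$file").FullName
$sr = New-Object System.IO.StreamReader("$fullName")
$trueCount = 0
$falseCount = 0
# Read one line at a time so the whole file never has to sit in memory.
while (($line = $sr.ReadLine()) -ne $null) {
      # Skip the header row.
      if ($line -like 'HEAD|*') { continue }
      # The left operand is a string, so -ge compares as strings here;
      # cast with [int] if a numeric comparison is intended.
      if ($line.split("|")[4] -ge 256) {
            $trueCount++
      }
      else {
            $falseCount++
      }
}
$sr.Dispose()
write "True count:   $trueCount"
write "False count: $falseCount"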

It produces the same results in about a minute, which meets my performance requirements.
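One thing to watch in a script like this: PowerShell comparison operators convert the right-hand operand to the type of the left-hand one, so a string on the left of `-ge 256` is compared as text ("1000" sorts before "256"). If the column holds plain integers, a variant along the following lines compares numerically and also releases the reader when a line fails to parse. This is only a sketch: the function name `Get-ThresholdCount` and its parameters are made up for illustration, it assumes the column always contains an integer, and it needs PowerShell 3.0 or later for `[pscustomobject]`.

```powershell
# Hypothetical helper (not from the original script): counts rows whose
# column at a 0-based index is >= a threshold, comparing numerically.
function Get-ThresholdCount {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [int] $ColumnIndex = 4,          # 5th column, as in the question
        [int] $Threshold = 256,
        [string] $HeaderPrefix = 'HEAD|' # assumed header marker
    )

    # .NET needs a full path; it does not follow PowerShell's current location.
    $fullName = (Get-Item -LiteralPath $Path).FullName
    $reader = New-Object System.IO.StreamReader($fullName)
    $trueCount = 0
    $falseCount = 0
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.StartsWith($HeaderPrefix)) { continue }
            # Cast to [int] so -ge compares numbers rather than strings.
            $value = [int]($line.Split('|')[$ColumnIndex])
            if ($value -ge $Threshold) { $trueCount++ } else { $falseCount++ }
        }
    }
    finally {
        # Runs even if a cast throws, so the file handle is always released.
        $reader.Dispose()
    }

    [pscustomobject]@{ TrueCount = $trueCount; FalseCount = $falseCount }
}
```

It could be called as `Get-ThresholdCount -Path .\items.txt`. Note that the counts can differ from the string-comparison version whenever the values have different numbers of digits.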

Answer 3

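Adding another example that uses StreamReader, this time to read a very large IIS log file and output all unique client IP addresses, along with some performance metrics.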

$path = 'A_245MB_IIS_Log_File.txt'
$r = [IO.File]::OpenText($path)

$clients = @{}

while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()

    # String processing here...
    if (-not $line.StartsWith('#')) {
        $split = $line.Split()
        $client = $split[-5]
        if (-not $clients.ContainsKey($client)){
            $clients.Add($client, $null)
        }
    }
}

$r.Dispose()
$clients.Keys | Sort

A quick performance comparison against Get-Content:

StreamReader: completed in 5.5 seconds; PowerShell.exe: 35,328 KB RAM.

Get-Content: completed in 23.6 seconds; PowerShell.exe: 1,110,524 KB RAM.
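A comparison of this kind can be sketched roughly as follows. The helper name `Measure-Run` is made up for illustration, and working-set growth of the current process is only an approximation of the PowerShell.exe RAM figures quoted above, so expect different absolute numbers. It assumes PowerShell 3.0 or later for `[pscustomobject]`.

```powershell
# Hypothetical helper: times a script block and reports how much the
# current process's working set grew while it ran.
function Measure-Run {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $Action
    )

    $proc = [System.Diagnostics.Process]::GetCurrentProcess()
    $before = $proc.WorkingSet64
    $elapsed = Measure-Command -Expression $Action
    $proc.Refresh()   # re-read process counters after the run

    [pscustomobject]@{
        Seconds           = [math]::Round($elapsed.TotalSeconds, 1)
        WorkingSetDeltaKB = [long](($proc.WorkingSet64 - $before) / 1KB)
    }
}

# Example usage; the file name is the placeholder from the script above.
$path = 'A_245MB_IIS_Log_File.txt'
if (Test-Path -LiteralPath $path) {
    # .NET needs a full path; it does not follow PowerShell's current location.
    $full = (Resolve-Path -LiteralPath $path).Path

    # StreamReader: read every line without keeping it.
    Measure-Run {
        $r = [IO.File]::OpenText($full)
        try { while ($r.Peek() -ge 0) { $null = $r.ReadLine() } }
        finally { $r.Dispose() }
    }

    # Get-Content: stream every line through the pipeline.
    Measure-Run { Get-Content -LiteralPath $full | ForEach-Object { $null = $_ } }
}
```

Running each variant in a fresh PowerShell session gives a fairer memory reading, since memory held from an earlier run would otherwise skew the second measurement.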
