Speed of a PowerShell script. Seeking optimization

Date: 2017-01-27 10:28:46

Tags: performance powershell csv if-statement

I have a working script whose purpose is to parse a data file for badly formed rows before it is imported into Oracle. Processing a 450MB csv file with over 1 million rows and 8 columns takes 2.5 hours and maxes out a single CPU core. Small files finish quickly (in seconds).

Oddly, a 350MB file with a similar number of rows and 40 columns takes only 30 minutes.

My problem is that these files will grow over time, and 2.5 hours ties up a CPU for too long. Can anyone recommend code optimizations? A similarly titled post recommended local paths - I am already doing that.

$file = "\Your.csv"

$path = "C:\Folder"

$csv  = Get-Content "$path$file"

# Count number of file headers
$count = ($csv[0] -split ',').count

# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"

# 2 validation steps: (1) the row's field count must be >= the header field count; (2) split the row after the first column - the remaining columns must total at least 40 characters.
$csv | Select -Skip 1 | % {
  if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
     $stream1.WriteLine($_)
  } else {
     $stream2.WriteLine($_) 
  }
}
$stream1.close()
$stream2.close()

Sample data file:

C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10

3 Answers:

Answer 0 (score: 6):

  • Get-Content is terribly slow in its default mode, which produces an array, when a file contains millions of lines, on all PowerShell versions including 5.1. Worse still, you assign it to a variable, so nothing else happens until the entire file has been read and split into lines. On an Intel i7 3770K CPU at 3.9GHz, $csv = Get-Content $path takes more than 2 minutes to read a 350MB file with 8 million lines.

    Solution: use IO.StreamReader to read one line at a time and process it immediately. In PowerShell 2, StreamReader is less optimized than in PS3+, but it is still faster than Get-Content.

  • Pipelining via | is at least several times slower than direct enumeration via flow-control statements such as foreach or while (the statements, not the cmdlets). Solution: use the statements. (A micro-benchmark sketch follows this list.)
  • Splitting each line into an array of strings is slower than manipulating the single string. Solution: use the IndexOf and Replace methods (not the operators) to count character occurrences.
  • PowerShell always creates an internal pipeline when loops are used. Solution: wrap the loop in Invoke-Command { }; in this case the trick gives a 2-3x speedup!
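
To make the pipelining and string-splitting points concrete, here is a minimal micro-benchmark sketch. It is not part of the original answer: the $lines array and the $fieldCount variable are illustrative, the sample row is copied from the question's data, and the exact timings will vary by machine and PowerShell version.

# Build an in-memory test set: 1 million copies of a sample row from the question
$lines = @('ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10') * 1000000

# (a) pipeline + -split operator: allocates an 8-element string array per row
(Measure-Command {
    $lines | ForEach-Object { $fieldCount = ($_ -split ',').Count }
}).TotalSeconds

# (b) foreach statement + string methods: counts commas without splitting the row
(Measure-Command {
    foreach ($s in $lines) { $fieldCount = $s.Length - $s.Replace(',','').Length + 1 }
}).TotalSeconds

On a typical machine variant (b) should come out noticeably faster; that gap is what the StreamReader/while-loop code below exploits.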

Below is PS2-compatible code. It runs faster in PS3+ (30 seconds for 8 million rows of a 350MB csv on my PC).

$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count

$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)

Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0

Invoke-Command { # the speed-up trick: disables internal pipeline
while (!$reader.EndOfStream) {
    $s = $reader.ReadLine()
    $slen = $s.length
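    # pass only if (a) everything after the first comma is at least 40 chars long
    # and (b) the field count (comma count + 1) equals the header column count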
    if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
        $writer1.WriteLine($s)
    } else {
        $writer2.WriteLine($s)
    }
    if (++$currLine % 10000 -eq 0) {
        $pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
        Write-Progress 'Filtering...' -status "Line: $currLine" `
            -PercentComplete ($pctDone * 100) `
            -SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
    }
}
} #Invoke-Command end

Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"

$reader.close()
$writer1.close()
$writer2.close()

Another approach is to use a regex in two passes (it is slower than the code above, though).
PowerShell 3 or newer is required because of the shorthand syntax for array element properties:

$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count

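# matches complete lines that contain exactly $numCol comma-separated fields (field-count check only)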
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
                                    ($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))

Answer 1 (score: 3):

If you are happy to install awk, you can process 1,000,000 records in under a second - seems like a good optimization to me :-)

awk -F, '
   NR==1                    {f=NF; printf("Expecting: %d fields\n",f)}  # First record, get expected number of fields
   NF!=f                    {print > "Fail.txt"; next}                  # Fail for wrong field count
   length($0)-length($1)<40 {print > "Fail.txt"; next}                  # Fail for wrong length
                            {print > "Pass.txt"}                        # Pass
   ' MillionRecord.csv

You can get gawk for Windows from here.

Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:

Save the following in a file called commands.awk:

NR==1                    {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f                    {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
                         {print > "Pass.txt"}

Then run:

awk -F, -f commands.awk Your.csv

The rest of this answer relates to the "Beat hadoop with shell" challenge mentioned in the comments, and I wanted somewhere to save my code, so here it is.... It runs in 6.002 seconds on my iMac over 1543 files totalling 3.5GB and around 104 million records:

#!/bin/bash
doit(){
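   # count results in PGN files: w = games won by White ("1-0"), b = games won by Black ("0-1"); draws are ignored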
   awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' $@
}

export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}

Answer 2 (score: 1):

Try experimenting with different looping strategies; for example, switching to a for loop cuts the processing time by more than 50%, e.g.:

[String]                 $Local:file           = 'Your.csv';
[String]                 $Local:path           = 'C:\temp';
[System.Array]           $Local:csv            = $null;
[System.IO.StreamWriter] $Local:objPassStream  = $null;
[System.IO.StreamWriter] $Local:objFailStream  = $null; 
[Int32]                  $Local:intHeaderCount = 0;
[Int32]                  $Local:intRow         = 0;
[String]                 $Local:strRow         = '';
[TimeSpan]               $Local:objMeasure     = 0;

try {
    # Load.
    $objMeasure = Measure-Command {
        $csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
        $intHeaderCount = ($csv[0] -split ',').count;
        } #measure-command
    'Load took {0}ms' -f $objMeasure.TotalMilliseconds;

    # Create stream writers.
    try {
        $objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
        $objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );

        # Process CSV (v1).
        $objMeasure = Measure-Command {
            $csv | Select-Object -Skip 1 | Foreach-Object { 
                if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $_ );   
                } else {
                    $objFailStream.WriteLine( $_ );
                } #else-if
                } #foreach-object
            } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;

        # Process CSV (v2).
        $objMeasure = Measure-Command {
            for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
                if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $csv[$intRow] );   
                } else {
                    $objFailStream.WriteLine( $csv[$intRow] );
                } #else-if
                } #for
            } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;

        } #try
    catch [System.Exception] {
        'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
         } #catch
    finally {
        $objFailStream.close();
        $objPassStream.close();    
        } #finally

   } #try
catch [System.Exception] {
    'ERROR : Failed to load CSV.';
    } #catch

exit 0;