Formatting a large text file in Windows PowerShell

Time: 2015-05-15 21:57:26

Tags: shell powershell

I am trying to format a large text file (~300 MB) in which each line has between 0 and 3 columns:

12345|123 Main St, New York|91110
23456|234 Main St, New York
34567|345 Main St, New York|91110

The output should be:

000000000012345,"123 Main St, New York",91110,,,,,,,,,,,,
000000000023456,"234 Main St, New York",,,,,,,,,,,,,
000000000034567,"345 Main St, New York",91110,,,,,,,,,,,,

I'm new to PowerShell, but I've read that I should avoid Get-Content, so I'm using a StreamReader. It is still far too slow:

function append-comma{} #helper function to append the correct amount of commas to each line


$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"

$target_file_in = New-Object System.IO.StreamReader -Arg $infile

If ($header -eq 'TRUE') {
    $firstline = $target_file_in.ReadLine() #skip header if exists
}

while (!$target_file_in.EndOfStream ) {

    $line = $target_file_in.ReadLine() 
    $a = $line.split($separator)[0].trim()
    $b = ""
    $c = ""
    if ($dataType -eq 'ECN'){$a = $a.padleft(15,'0')}
    if ($line.split($separator)[1].length -gt 0){$b = $line.split($separator)[1].trim()}
    if ($line.split($separator)[2].length -gt 0){$c = $line.split($separator)[2].trim()}
    $line = $a +',"'+$b+'","'+$c +'"'
    $line -replace '(?m)"([^,]*?)"(?=,|$)', '$1' |append-comma >> $outfile
}

$target_file_in.close()

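I'm building this for other people on my team and would like to add a GUI using this guide: http://blogs.technet.com/b/heyscriptingguy/archive/2014/08/01/i-39-ve-got-a-powershell-secret-adding-a-gui-to-scripts.aspx

Is there a faster way to do this in PowerShell? I wrote one script in Linux bash (Cygwin64 on Windows) and a separate one in Python. Both run much faster, but I'm trying to write something that is "approved" on the Windows platform.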

4 Answers:

Answer 0: (score: 2)

All that splitting and replacing costs you far more time than you gain from the StreamReader. The code below cuts execution time to about 20%:

$separator = '|'
$infile    = "\large_data.csv"
$outfile   = "new_file.csv"

if ($header -eq 'TRUE') {
  $linesToSkip = 1
} else {
  $linesToSkip = 0
}

Get-Content $infile | select -Skip $linesToSkip | % {
  # Split once per line; [long] instead of [int] so IDs above 2,147,483,647 do not overflow
  [long]$a, [string]$b, [string]$c = $_.split($separator)
  '{0:d15},"{1}",{2},,,,,,,,,,,,,' -f $a, $b.Trim(), $c.Trim()
} | Set-Content $outfile
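
To illustrate the point about repeated splitting, here is a minimal sketch that splits each line exactly once and reuses the resulting array. The function name `Format-Record` and its parameters are hypothetical (they do not appear in the question), and the default of 12 trailing commas is an assumption about the target layout that you should adjust to your own column count.

```powershell
# Hypothetical helper, not part of the original script: formats one record.
function Format-Record {
    param(
        [string]$Line,
        [char]$Separator = '|',
        [int]$TrailingCommas = 12   # assumption: number of empty trailing fields
    )
    # Split once and reuse the array instead of calling Split() for every field.
    $fields = $Line.Split($Separator)
    $id   = $fields[0].Trim().PadLeft(15, '0')
    $addr = if ($fields.Length -gt 1) { $fields[1].Trim() } else { '' }
    $zip  = if ($fields.Length -gt 2) { $fields[2].Trim() } else { '' }
    # Only the address is quoted, because it may contain commas.
    '{0},"{1}",{2}{3}' -f $id, $addr, $zip, (',' * $TrailingCommas)
}

Format-Record '23456|234 Main St, New York'
```

Padding the ID as a string also sidesteps the integer cast, which may matter if some IDs are not purely numeric.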

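Answer 1: (score: 1)

How does this work for you? I was able to read and process a 35 MB file in about 40 seconds on a cheap workstation.

File size: 36,548,820 bytes

Processed in: 39.7259722 seconds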

Function CheckPath {
[CmdletBinding()]
    param(
        [Parameter(Mandatory=$True,
        ValueFromPipeline=$True)]
        [string[]]$Path
    )
    BEGIN {}
    PROCESS {
        IF ((Test-Path -LiteralPath $Path) -EQ $False) {Write-host "Invalid File Path $Path"}
    }
    END {}
}

$startDTM = (Get-Date) #Get Script Start Time For Measurement

$infile = "infile.txt"
$outfile = "restult5.txt"

#Check File Path
CheckPath $InFile

#Initiate StreamReader
$Reader = New-Object -TypeName System.IO.StreamReader($InFile);

#Create New File Stream Object For StreamWriter
$WriterStream = New-Object -TypeName System.IO.FileStream(
 $outfile,
 [System.IO.FileMode]::Create,
 [System.IO.FileAccess]::Write);

#Initiate StreamWriter
$Writer = New-Object -TypeName System.IO.StreamWriter(
 $WriterStream,
 [System.Text.Encoding]::ASCII);

If ($header -eq $True) {
    $Reader.ReadLine() |Out-Null #Skip First Line In File
}

while ($Reader.Peek() -ge 0) {
    $line = $Reader.ReadLine() #Read Line
    $Line = $Line.split('|') #Split Line
    $OutPut = "$($($line[0]).PadLeft(15,'0')),`"$($Line[1])`",$($Line[2]),,,,,,,,,,,,"
    $Writer.WriteLine($OutPut)
}

$Reader.Close();
$Reader.Dispose();
$Writer.Flush();

$Writer.Close();
$Writer.Dispose();

$endDTM = (Get-Date) #Get Script End Time For Measurement

Write-Host "Elapsed Time: $(($endDTM-$startDTM).totalseconds) seconds" #Echo Time elapsed
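
A variation on the same StreamReader/StreamWriter idea, sketched under a few assumptions: the paths `$inPath` and `$outPath` are hypothetical temp files that stand in for the real data, and 12 trailing commas is assumed for the target layout. It wraps the streams in try/finally so they are released even if a line fails to parse, and times the run with a Stopwatch. Note that .NET classes resolve relative paths against the process working directory, which can differ from PowerShell's current location, so full paths are the safer choice here.

```powershell
# Sketch only: temp files stand in for the real input and output files.
$inPath  = [System.IO.Path]::GetTempFileName()
$outPath = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $inPath -Value @(
    '12345|123 Main St, New York|91110'
    '23456|234 Main St, New York'
)

$pad    = ',' * 12   # assumption: number of empty trailing fields
$timer  = [System.Diagnostics.Stopwatch]::StartNew()
$reader = $null
$writer = $null
try {
    $reader = New-Object System.IO.StreamReader($inPath)
    $writer = New-Object System.IO.StreamWriter($outPath)
    # ReadLine() returns $null at end of file.
    while ($null -ne ($line = $reader.ReadLine())) {
        $f    = $line.Split('|')
        $addr = if ($f.Length -gt 1) { $f[1].Trim() } else { '' }
        $zip  = if ($f.Length -gt 2) { $f[2].Trim() } else { '' }
        $writer.WriteLine(('{0},"{1}",{2}{3}' -f $f[0].Trim().PadLeft(15, '0'), $addr, $zip, $pad))
    }
}
finally {
    # Dispose flushes and closes; the guards cover a failed constructor.
    if ($writer) { $writer.Dispose() }
    if ($reader) { $reader.Dispose() }
}
$timer.Stop()
Write-Host "Elapsed Time: $($timer.Elapsed.TotalSeconds) seconds"
```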

Answer 2: (score: 0)

Regular expressions are fast:

$infile = ".\large_data.csv"
gc $infile|%{ 
    $x=if($_.indexof('|')-ne$_.lastindexof('|')){
        $_-replace'(.+)\|(.+)\|(.+)',('$1,"$2",$3'+','*12)
    }else{
        $_-replace'(.+)\|(.+)',('$1,"$2"'+','*14)
    }
    ('0'*(15-($x-replace'([^,]),.+','$1').length))+$x
}

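Answer 3: (score: 0)

I have another approach. Let PowerShell read the input file as a CSV file, using the pipe character as the delimiter. Then format the output the way you want it. I have not tested the speed of this with large files.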

$infile = "\large-data.csv"
$outfile = "new-file.csv"

import-csv $infile -header id,addr,zip -delimiter "|" |
% {'{0},"{1}",{2},,,,,,,,,,,,,' -f $_.id.padleft(15,'0'), $_.addr.trim(), $_.zip} |
set-content $outfile
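
The same idea can be tried on an in-memory sample before running it against the real file. This is a minimal sketch, assuming 12 empty trailing fields; `$sample`, `$rows`, and `$result` are illustrative names. It uses ConvertFrom-Csv, which parses strings the way Import-Csv parses a file, and shows that a row with a missing third column yields a null `zip` property, which the format operator renders as an empty field.

```powershell
# Illustrative sample data taken from the question.
$sample = @(
    '12345|123 Main St, New York|91110'
    '23456|234 Main St, New York'
)

# Parse with the pipe character as delimiter and explicit column names.
$rows = $sample | ConvertFrom-Csv -Delimiter '|' -Header id, addr, zip

$result = $rows | ForEach-Object {
    # A missing zip is $null here and formats as an empty string.
    '{0},"{1}",{2}{3}' -f $_.id.PadLeft(15, '0'), $_.addr.Trim(), $_.zip, (',' * 12)
}
$result
```

One caveat with this approach: because the input is parsed as CSV, a double quote inside a field is treated as CSV quoting, so data containing stray quotes may need a different parser.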