I am trying to format large text files (~300 MB) whose lines contain between 0 and 3 columns:
12345|123 Main St, New York|91110
23456|234 Main St, New York
34567|345 Main St, New York|91110
The output should be:
000000000012345,"123 Main St, New York",91110,,,,,,,,,,,,
000000000023456,"234 Main St, New York",,,,,,,,,,,,,
000000000034567,"345 Main St, New York",91110,,,,,,,,,,,,
I am new to PowerShell, but I have read that I should avoid Get-Content, so I am using StreamReader. It is still far too slow:
function append-comma{} #helper function to append the correct amount of commas to each line
$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"
$target_file_in = New-Object System.IO.StreamReader -Arg $infile
If ($header -eq 'TRUE') {
$firstline = $target_file_in.ReadLine() #skip header if exists
}
while (!$target_file_in.EndOfStream ) {
$line = $target_file_in.ReadLine()
$a = $line.split($separator)[0].trim()
$b = ""
$c = ""
if ($dataType -eq 'ECN'){$a = $a.padleft(15,'0')}
if ($line.split($separator)[1].length -gt 0){$b = $line.split($separator)[1].trim()}
if ($line.split($separator)[2].length -gt 0){$c = $line.split($separator)[2].trim()}
$line = $a +',"'+$b+'","'+$c +'"'
$line -replace '(?m)"([^,]*?)"(?=,|$)', '$1' |append-comma >> $outfile
}
$target_file_in.close()
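I am building this for other people on my team and would like to add a GUI using this guide: http://blogs.technet.com/b/heyscriptingguy/archive/2014/08/01/i-39-ve-got-a-powershell-secret-adding-a-gui-to-scripts.aspx
Is there a faster way to do this in PowerShell? I wrote one script in Linux bash (Cygwin64 on Windows) and a separate one in Python. Both ran much faster, but I am trying to write something that would be "approved" on a Windows platform.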
Answer 0 (score: 2)
All that splitting and replacing costs you far more time than you gain from the StreamReader. The code below cut the execution time to roughly 20%:
$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"
if ($header -eq 'TRUE') {
$linesToSkip = 1
} else {
$linesToSkip = 0
}
Get-Content $infile | select -Skip $linesToSkip | % {
[int]$a, [string]$b, [string]$c = $_.split($separator)
'{0:d15},"{1}",{2},,,,,,,,,,,,,' -f $a, $b.Trim(), $c.Trim()
} | Set-Content $outfile
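To make the point about repeated splitting concrete, here is a minimal sketch that splits each line exactly once and then pads the row with commas. Convert-PipeLine and its TotalFields parameter are hypothetical names that do not appear in the posts above, and the figure of 15 fields per row is an assumption read off the sample output in the question.

```powershell
# Hypothetical helper (not from the original posts): format one input line,
# splitting it exactly once instead of once per field.
function Convert-PipeLine {
    param(
        [string]$Line,
        [int]$TotalFields = 15   # assumption: 15 fields per row, as in the sample output
    )
    $parts = $Line.Split('|')    # the only split performed for this line
    $id    = $parts[0].Trim().PadLeft(15, '0')
    $addr  = if ($parts.Length -gt 1) { $parts[1].Trim() } else { '' }
    $zip   = if ($parts.Length -gt 2) { $parts[2].Trim() } else { '' }
    # Three fields are written explicitly; the rest are empty, so add commas only.
    $id + ',"' + $addr + '",' + $zip + (',' * ($TotalFields - 3))
}

# Sample rows taken from the question.
Convert-PipeLine '12345|123 Main St, New York|91110'
Convert-PipeLine '23456|234 Main St, New York'
```

Because the function works on a single string, it can be called from a Get-Content pipeline or from a StreamReader loop without changes.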
Answer 1 (score: 1)
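How does this work for you? I was able to read and process a 35 MB file in about 40 seconds on a cheap workstation.
File size: 36,548,820 bytes
Processed in: 39.7259722 seconds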
Function CheckPath {
[CmdletBinding()]
param(
[Parameter(Mandatory=$True,
ValueFromPipeline=$True)]
[string[]]$Path
)
BEGIN {}
PROCESS {
IF ((Test-Path -LiteralPath $Path) -EQ $False) {Write-host "Invalid File Path $Path"}
}
END {}
}
$startDTM = (Get-Date) #Get Script Start Time For Measurement
$infile = "infile.txt"
$outfile = "restult5.txt"
#Check File Path
CheckPath $InFile
#Initiate StreamReader
$Reader = New-Object -TypeName System.IO.StreamReader($InFile);
#Create New File Stream Object For StreamWriter
$WriterStream = New-Object -TypeName System.IO.FileStream(
$outfile,
[System.IO.FileMode]::Create,
[System.IO.FileAccess]::Write);
#Initiate StreamWriter
$Writer = New-Object -TypeName System.IO.StreamWriter(
$WriterStream,
[System.Text.Encoding]::ASCII);
If ($header -eq $True) {
$Reader.ReadLine() |Out-Null #Skip First Line In File
}
while ($Reader.Peek() -ge 0) {
$line = $Reader.ReadLine() #Read Line
$Line = $Line.split('|') #Split Line
$OutPut = "$($($line[0]).PadLeft(15,'0')),`"$($Line[1])`",$($Line[2]),,,,,,,,,,,,"
$Writer.WriteLine($OutPut)
}
$Reader.Close();
$Reader.Dispose();
$Writer.Flush();
$Writer.Close();
$Writer.Dispose();
$endDTM = (Get-Date) #Get Script End Time For Measurement
Write-Host "Elapsed Time: $(($endDTM-$startDTM).totalseconds) seconds" #Echo Time elapsed
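A variation on the same reader/writer loop, offered only as a sketch: it wraps the work in try/finally so the files are closed even when a line fails to parse, and it times the run with a Stopwatch. The temp-file names and the two sample rows are made up for the example, and the trailing 12 commas are taken from the sample output in the question.

```powershell
# Sketch only: the temp-file names and sample rows are made up for illustration.
$tmp     = [System.IO.Path]::GetTempPath()
$inPath  = Join-Path $tmp 'sketch_in.txt'
$outPath = Join-Path $tmp 'sketch_out.csv'
[System.IO.File]::WriteAllLines($inPath, [string[]]@(
    '12345|123 Main St, New York|91110',
    '23456|234 Main St, New York'))

$timer  = [System.Diagnostics.Stopwatch]::StartNew()
$reader = $null
$writer = $null
try {
    $reader = New-Object System.IO.StreamReader($inPath)
    $writer = New-Object System.IO.StreamWriter($outPath)
    # ReadLine() returns $null once the end of the file is reached.
    while ($null -ne ($line = $reader.ReadLine())) {
        $f   = $line.Split('|')
        $zip = if ($f.Length -gt 2) { $f[2] } else { '' }
        $writer.WriteLine($f[0].PadLeft(15, '0') + ',"' + $f[1] + '",' + $zip + (',' * 12))
    }
}
finally {
    # Runs even if the loop throws, so both file handles are released.
    if ($writer) { $writer.Dispose() }
    if ($reader) { $reader.Dispose() }
}
$timer.Stop()
Write-Host ('Elapsed: {0:N3} seconds' -f $timer.Elapsed.TotalSeconds)
```

Dispose() flushes and closes the stream, so separate Flush() and Close() calls are not needed in this form.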
Answer 2 (score: 0)
Regular expressions are fast:
$infile = ".\large_data.csv"
gc $infile|%{
$x=if($_.indexof('|')-ne$_.lastindexof('|')){
$_-replace'(.+)\|(.+)\|(.+)',('$1,"$2",$3'+','*12)
}else{
$_-replace'(.+)\|(.+)',('$1,"$2"'+','*14)
}
('0'*(15-($x-replace'([^,]),.+','$1').length))+$x
}
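Since the one-liner above is dense, here is the three-column branch applied step by step to a single sample row from the question. The variable names are mine and exist only for this walkthrough; the patterns are the ones used in the answer.

```powershell
# One sample row from the question, processed step by step.
$line = '12345|123 Main St, New York|91110'

# Step 1: swap the pipes for commas, quote the address, append 12 empty fields.
$converted = $line -replace '(.+)\|(.+)\|(.+)', ('$1,"$2",$3' + (',' * 12))

# Step 2: isolate the first field (everything before the first comma).
$id = $converted -replace '([^,]),.+', '$1'

# Step 3: prepend as many zeros as the ID needs to reach 15 characters.
$padded = ('0' * (15 - $id.Length)) + $converted

$id      # 12345
$padded  # the row with a 15-character, zero-padded ID
```

Note that (.+) requires at least one character, so a row whose last column is empty (a trailing pipe) would not match the three-column pattern and would pass through step 1 unchanged.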
Answer 3 (score: 0)
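I have another approach. Let PowerShell read the input file as a CSV file, using the pipe character as the delimiter, and then format the output the way you want it. I have not tested the speed of this with large files.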
$infile = "\large-data.csv"
$outfile = "new-file.csv"
import-csv $infile -header id,addr,zip -delimiter "|" |
% {'{0},"{1}",{2},,,,,,,,,,,,,' -f $_.id.padleft(15,'0'), $_.addr.trim(), $_.zip} |
set-content $outfile
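To see how this approach treats a row that is missing its last column, here is a small in-memory sketch. It uses ConvertFrom-Csv in place of Import-Csv so that no input file is needed; the header names are the ones from the code above, and the 12 trailing commas are an assumption taken from the sample output in the question.

```powershell
# Sample rows copied from the question; ConvertFrom-Csv parses in-memory
# strings, so the example runs without an input file.
$sample = @(
    '12345|123 Main St, New York|91110',
    '23456|234 Main St, New York'
)
$rows = $sample | ConvertFrom-Csv -Header id, addr, zip -Delimiter '|'

# The second row has no third column, so its zip property holds no value
# and the -f operator renders it as an empty string.
$formatted = $rows | ForEach-Object {
    '{0},"{1}",{2}{3}' -f $_.id.PadLeft(15, '0'), $_.addr.Trim(), $_.zip, (',' * 12)
}
$formatted
```

Building an object per row is convenient, but it is extra work compared with plain string handling, so it is worth timing on the real file before relying on it.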