Hi, I have a big text file like this:
BIGFILE.TXT
COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
11/24/2013,50.67,51.22,50.67,51.12,17,0,FILE1
11/25/2013,51.34,51.91,51.09,51.87,23,0,FILE1
12/30/2013,51.76,51.82,50.86,51.15,13,0,FILE1
12/31/2013,51.15,51.33,50.45,50.76,18,0,FILE1
1/1/2014,50.92,51.58,50.84,51.1,19,0,FILE2
1/4/2014,51.39,51.46,50.95,51.21,14,0,FILE2
1/7/2014,51.08,51.2,49.84,50.05,35,0,FILE2
1/8/2014,50.14,50.94,50.01,50.78,100,0,FILE3
1/11/2014,50.63,51.41,50.52,51.3,190,0,FILE3
1/15/2014,54.03,55.74,53.69,54.93,110,0,FILE4
1/19/2014,53.67,54.19,53.55,53.82,24,0,FILE4
1/20/2014,53.83,54.26,53.47,53.53,23,0,FILE4
1/21/2014,53.8,54.55,53.7,54.1,24,0,FILE4
1/26/2014,53.26,53.93,53.23,53.65,31,0,FILE5
1/27/2014,53.78,54,53.64,53.81,110,0,FILE5
I'm looking for a way to split this file into multiple text files. In this case, the one file would be split into 5 text files, and each text file's name is taken from column 8. The big file is comma-delimited. So the output would be:
FILE1.txt
COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
11/24/2013,50.67,51.22,50.67,51.12,17,0,FILE1
11/25/2013,51.34,51.91,51.09,51.87,23,0,FILE1
12/30/2013,51.76,51.82,50.86,51.15,13,0,FILE1
12/31/2013,51.15,51.33,50.45,50.76,18,0,FILE1
FILE2.TXT
COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
1/1/2014,50.92,51.58,50.84,51.1,19,0,FILE2
1/4/2014,51.39,51.46,50.95,51.21,14,0,FILE2
1/7/2014,51.08,51.2,49.84,50.05,35,0,FILE2
FILE3.TXT
COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
1/8/2014,50.14,50.94,50.01,50.78,100,0,FILE3
1/11/2014,50.63,51.41,50.52,51.3,190,0,FILE3
.
.
.
The big text file has several thousand rows. Does anyone know how to do this?
Thanks for your help.
Answer 0 (Score: 3)
If the big file only has a few thousand rows, it isn't really that big, and you can use Import-Csv and Export-Csv to process the contents:
$big = Import-Csv big.csv
$big | ? { $_.column8 -eq "file1" } | Export-Csv -NoTypeInformation file1.csv
# Output
cat .\file1.csv
"COLUMN1","COLUMN2","COLUMN3","COLUMN4","COLUMN5","COLUMN6","COLUMN7","COLUMN8"
"11/24/2013","50.67","51.22","50.67","51.12","17","0","FILE1"
"11/25/2013","51.34","51.91","51.09","51.87","23","0","FILE1"
"12/30/2013","51.76","51.82","50.86","51.15","13","0","FILE1"
"12/31/2013","51.15","51.33","50.45","50.76","18","0","FILE1"
On the other hand, if the file is so big that the system chokes on Import-Csv, use an IO.StreamReader to read the file and process it line by line.
Edit:
Oh well, several thousand output files is a bit trickier to handle. Heavy disk I/O from lots of Add-Content calls is a performance killer, but for a one-off operation something like this should work:
$src = "c:\temp\reallybig.csv" # Source file
$dst = "c:\temp\file{0}.csv" # Output file(s)
$reader = new-object IO.StreamReader($src) # Reader for input
while(($line = $reader.ReadLine()) -ne $null){ # Loop the input
$match = [regex]::match($line, "(?i)file(\d)") # Look for row that ends with file-and-number
if($match.Success){
# Add the line to respective output file. SLOW!
add-content $($dst -f $match.Groups[0].value) $line
}
}
$reader.Close() # Close the input file
For better performance, per-output-file buffering with a StringBuilder works well.
Edit 2:
Here is another version. It keeps a hashtable of StringBuilder objects: each output file name from the last column is used as a key, and its value is a StringBuilder holding that file's text data. This approach stores all the output data in memory, so an x64 process and a few gigabytes of RAM help with considerably large input files. The buffers could be flushed to disk every now and then instead of being kept entirely in memory, but that would require some extra bookkeeping.
$src = "c:\temp\reallybig.csv" # Source file
$dst = "c:\temp\file_{0}.csv" # Output file(s)
$reader = new-object IO.StreamReader($src) # Reader for input
$header = Get-Content -Path $src | select -First 1 # Get the header row
$ht = @{}
$line = $reader.ReadLine() # Skip the first line, it's alread in $header
while(($line = $reader.ReadLine()) -ne $null){ # Loop the input
$match = [regex]::match($line, '(?i)(\w+\d)$') # Look for row that ends with file-and-number
if($match.Success){
$outFileName = $match.Groups[0].value # What filename output is sent to?
if(-not $ht.ContainsKey($outFileName)) { # Output file is not yet in hashtable
$ht.Add($outFileName, (new-object Text.StringBuilder) )
[void]$ht[$outFileName].Append($header)
[void]$ht[$outFileName].Append([Environment]::NewLine)
} else { # Append data to existing file
[void]$ht[$outFileName].Append($line)
[void]$ht[$outFileName].Append([Environment]::NewLine)
}
}
}
$reader.Close() # Close the input file
# Dump the hashtable contents to individual files
$ht.GetEnumerator() | % {
    set-content $($dst -f $_.Name) ($_.Value).ToString()
}
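If holding all the output data in memory is a concern, a middle ground between StringBuilders and raw Add-Content is a hashtable of StreamWriter objects: each writer has its own internal buffer and flushes to disk on its own, so memory use stays flat. A minimal sketch of that variant (the paths and the last-column split are assumptions carried over from the code above):
$src = "c:\temp\reallybig.csv"
$reader = new-object IO.StreamReader($src)
$header = $reader.ReadLine()                    # First line is the header
$writers = @{}                                  # One StreamWriter per output file
while(($line = $reader.ReadLine()) -ne $null){
    $key = ($line -split ',')[-1]               # Last column names the output file
    if(-not $writers.ContainsKey($key)){
        $writers[$key] = new-object IO.StreamWriter("c:\temp\$key.csv")
        $writers[$key].WriteLine($header)
    }
    $writers[$key].WriteLine($line)
}
$reader.Close()
$writers.Values | % { $_.Close() }              # Flush and close every output
Note that with several thousand output files this keeps several thousand file handles open at once, which is usually tolerable on a modern system but worth keeping in mind.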
Answer 1 (Score: 0)
With help from Bob McCoy, this is exactly what I was looking for.
# Split-File.ps1
$src = "C:\Ephemeral\bigfile.csv"
$dstDir = "C:\Ephemeral\files\"

# Delete previous output files
Remove-Item -Path "$dstDir*"

# Read input and create subordinate files based on column 8 content
$header = Get-Content -Path $src | select -First 1
Get-Content -Path $src | select -Skip 1 | foreach {
    $file = "$(($_ -split ",")[7]).txt"
    Write-Verbose "Writing to $file"
    if (-not (Test-Path -Path $dstDir\$file))
    {
        Out-File -FilePath $dstDir\$file -InputObject $header -Encoding ascii
    }
    Out-File -FilePath $dstDir\$file -InputObject $_ -Encoding ascii -Append
}
There's one small problem with this code: it took almost 80 minutes to split my big file into 1800 small files, so any suggestions for improving its performance would be greatly appreciated. Maybe it helps that "bigfile" is sorted on column #8; all the small files' names are stored in that same column.
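If the input really is sorted on column #8, that ordering can be exploited: only one output file needs to be open at a time, and a new writer is created only when the column #8 value changes. A hypothetical sketch along those lines (paths reused from the script above):
$src = "C:\Ephemeral\bigfile.csv"
$dstDir = "C:\Ephemeral\files\"
$reader = new-object IO.StreamReader($src)
$header = $reader.ReadLine()                 # Keep the header for each output file
$writer = $null
$current = $null
while(($line = $reader.ReadLine()) -ne $null){
    $key = ($line -split ',')[7]             # Column 8 names the output file
    if($key -ne $current){                   # Sorted input: a key change means a new file
        if($writer){ $writer.Close() }
        $writer = new-object IO.StreamWriter("$dstDir$key.txt")
        $writer.WriteLine($header)
        $current = $key
    }
    $writer.WriteLine($line)
}
if($writer){ $writer.Close() }
$reader.Close()
A single buffered writer avoids both the per-line Out-File overhead and the repeated open/append/close cycles, which are likely where most of those 80 minutes went.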