How to split a large text file into multiple files in PowerShell

Time: 2014-08-20 04:35:45

Tags: file powershell split

Hello, I have a large text file like this:

BIGFILE.TXT

COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
11/24/2013,50.67,51.22,50.67,51.12,17,0,FILE1
11/25/2013,51.34,51.91,51.09,51.87,23,0,FILE1
12/30/2013,51.76,51.82,50.86,51.15,13,0,FILE1
12/31/2013,51.15,51.33,50.45,50.76,18,0,FILE1
1/1/2014,50.92,51.58,50.84,51.1,19,0,FILE2
1/4/2014,51.39,51.46,50.95,51.21,14,0,FILE2
1/7/2014,51.08,51.2,49.84,50.05,35,0,FILE2
1/8/2014,50.14,50.94,50.01,50.78,100,0,FILE3
1/11/2014,50.63,51.41,50.52,51.3,190,0,FILE3
1/15/2014,54.03,55.74,53.69,54.93,110,0,FILE4
1/19/2014,53.67,54.19,53.55,53.82,24,0,FILE4
1/20/2014,53.83,54.26,53.47,53.53,23,0,FILE4
1/21/2014,53.8,54.55,53.7,54.1,24,0,FILE4
1/26/2014,53.26,53.93,53.23,53.65,31,0,FILE5
1/27/2014,53.78,54,53.64,53.81,110,0,FILE5

I am looking for a way to split this file into multiple text files. In this case it would be split into 5 text files, and each text file's name is taken from column 8. The big file is comma-delimited. So the output would be:

FILE1.txt

COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
11/24/2013,50.67,51.22,50.67,51.12,17,0,FILE1
11/25/2013,51.34,51.91,51.09,51.87,23,0,FILE1
12/30/2013,51.76,51.82,50.86,51.15,13,0,FILE1
12/31/2013,51.15,51.33,50.45,50.76,18,0,FILE1

FILE2.TXT

COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
1/1/2014,50.92,51.58,50.84,51.1,19,0,FILE2
1/4/2014,51.39,51.46,50.95,51.21,14,0,FILE2
1/7/2014,51.08,51.2,49.84,50.05,35,0,FILE2

FILE3.TXT

COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8
1/8/2014,50.14,50.94,50.01,50.78,100,0,FILE3
1/11/2014,50.63,51.41,50.52,51.3,190,0,FILE3
.
.
.

The big text file has several thousand rows. Does anyone know how to do this?

Thanks for your help.

2 Answers:

Answer 0 (score: 3)

If the big file only has a few thousand rows, it is not that big after all, and you can use Import-Csv and Export-Csv to process the content.

$big = Import-Csv big.csv
$big | ? { $_.column8 -eq "file1" } | Export-Csv -NoTypeInformation file1.csv

# Output
cat .\file1.csv
"COLUMN1","COLUMN2","COLUMN3","COLUMN4","COLUMN5","COLUMN6","COLUMN7","COLUMN8"
"11/24/2013","50.67","51.22","50.67","51.12","17","0","FILE1"
"11/25/2013","51.34","51.91","51.09","51.87","23","0","FILE1"
"12/30/2013","51.76","51.82","50.86","51.15","13","0","FILE1"
"12/31/2013","51.15","51.33","50.45","50.76","18","0","FILE1"

On the other hand, if the file is so large that the system chokes on Import-Csv, read it with IO.StreamReader and process it line by line.

Edit:

Oh, a few thousand output files are a bit trickier to handle. Disk I/O with that many Add-Content calls is a performance killer, but for a one-off job something like this should work:

$src = "c:\temp\reallybig.csv"  # Source file
$dst = "c:\temp\file{0}.csv"    # Output file(s)
$reader = new-object IO.StreamReader($src)  # Reader for input

while(($line = $reader.ReadLine()) -ne $null){ # Loop the input
    $match = [regex]::match($line, "(?i)file(\d)") # Look for row that ends with file-and-number

    if($match.Success){
     # Add the line to respective output file. SLOW! 
     add-content $($dst -f $match.Groups[1].Value) $line
    }
}
$reader.Close() # Close the input file

To improve performance, buffering per output file with a StringBuilder works very well.

EDIT2:

Here is another version. It keeps a hashtable of StringBuilder objects: each output file name from the last column is used as a key, and its value is a StringBuilder holding that file's text data. This approach keeps all of the output data in memory, so an x64 process and a few gigabytes of RAM help for fairly large input files. The buffers could instead be flushed to disk every now and then rather than kept entirely in memory, but that requires some extra bookkeeping (a sketch of that follows after the code below).

$src = "c:\temp\reallybig.csv"   # Source file
$dst = "c:\temp\file_{0}.csv"    # Output file(s)
$reader = new-object IO.StreamReader($src)  # Reader for input

$header = Get-Content -Path $src | select -First 1 # Get the header row

$ht = @{}
$line = $reader.ReadLine() # Skip the first line, it's already in $header

while(($line = $reader.ReadLine()) -ne $null){ # Loop the input
    $match = [regex]::match($line, '(?i)(\w+\d)$') # Look for row that ends with file-and-number

    if($match.Success){

      $outFileName = $match.Groups[0].value # Which output file this row goes to

      if(-not $ht.ContainsKey($outFileName)) { # Output file not yet in hashtable: start its buffer with the header row
        $ht.Add($outFileName, (new-object Text.StringBuilder) )
        [void]$ht[$outFileName].Append($header)
        [void]$ht[$outFileName].Append([Environment]::NewLine)
      }
      # Append the data row to the buffer for this output file
      [void]$ht[$outFileName].Append($line)
      [void]$ht[$outFileName].Append([Environment]::NewLine)
    }
}
$reader.Close() # Close the input file

# Dump the hashtable contents to individual files
$ht.GetEnumerator() | % { 
    set-content $($dst -f $_.Name) ($_.Value).ToString() 
} 
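
If memory becomes an issue, the extra bookkeeping mentioned above could look roughly like this (a sketch only; the 1MB threshold is an arbitrary assumption): flush a buffer to its file whenever it grows too large, and replace the final Set-Content dump with the same append call.

# Inside the read loop, right after appending $line to the buffer:
$flushLimit = 1MB
if($ht[$outFileName].Length -gt $flushLimit){
    # Append the buffered text to its output file and empty the buffer
    [IO.File]::AppendAllText(($dst -f $outFileName), $ht[$outFileName].ToString())
    $ht[$outFileName].Length = 0
}

# After the loop, flush whatever is still left in each buffer
$ht.GetEnumerator() | % {
    [IO.File]::AppendAllText(($dst -f $_.Name), $_.Value.ToString())
}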

Answer 1 (score: 0)

With help from Bob McCoy, this is exactly what I was looking for.

#  Split-File.ps1

$src = "C:\Ephemeral\bigfile.csv"
$dstDir = "C:\Ephemeral\files\"

# Delete previous output files
Remove-Item -Path "$dstDir*"

# Read input and create subordinate files based on column 8 content
$header = Get-Content -Path $src | select -First 1

Get-Content -Path $src | select -Skip 1 | foreach {
    $file = "$(($_ -split ",")[7]).txt"
    Write-Verbose "Wrting to $file"
    if (-not (Test-Path -Path $dstDir\$file))
    {
        Out-File -FilePath $dstDir\$file -InputObject $header -Encoding ascii
    }
    Out-File -FilePath $dstDir\$file -InputObject $_ -Encoding ascii -Append
}

There is one small problem with this code: it took almost 80 minutes to split my big file into 1800 small files, so any suggestions on how to improve its performance would be much appreciated. Maybe it helps that the big file is sorted by column 8; all of the small-file names are also stored in column 8.
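
Most of that time likely goes into opening and closing an output file for every single row (one Out-File -Append per line). A rough sketch of a faster variant (assuming the same paths and column layout as the script above; untested against your data) keeps one StreamWriter per output file in a hashtable and reuses it, closing them all at the end:

# Reuse one StreamWriter per output file instead of calling Out-File for every row
$src    = "C:\Ephemeral\bigfile.csv"
$dstDir = "C:\Ephemeral\files"

$header  = Get-Content -Path $src -TotalCount 1
$writers = @{}

Get-Content -Path $src | Select-Object -Skip 1 | ForEach-Object {
    $name = ($_ -split ",")[7]
    if (-not $writers.ContainsKey($name)) {
        # First row for this file: open a writer and emit the header
        $writers[$name] = New-Object IO.StreamWriter (Join-Path $dstDir "$name.txt")
        $writers[$name].WriteLine($header)
    }
    $writers[$name].WriteLine($_)
}

# Close every writer so buffered data is flushed to disk
$writers.Values | ForEach-Object { $_.Close() }

Even with 1800 output files this only keeps 1800 writers open at once, which should be well within normal handle limits, and each file is opened exactly once.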