How do I split a text file using PowerShell?

Date: 2009-06-16 14:15:44

Tags: powershell

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks; for example, 100 files of 5 MB each would be fine.

I would think this should be a walk in the park for PowerShell. How do I go about it?

14 Answers:

Answer 0 (Score: 49)

A word of warning about some of the existing answers - they run extremely slowly for very large files. For a 1.6 GB log file, I gave up after a couple of hours, realising it would not finish before I returned to work the next day.

Two problems: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the new lines also slows things down, but my guess is that Add-Content is the main culprit.

The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in under a minute:

$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB


# open the source for raw byte reads; each chunk is at most $upperBound bytes
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
            $toFile = [io.file]::OpenWrite($to)
            try {
                "Writing $count to $to"
                $toFile.Write($buff, 0, $count)
            } finally {
                $toFile.Close()
            }
        }
        }
        $idx ++
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
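
Since this variant can split in the middle of a line, the chunks are mostly useful when they will be re-joined or processed as raw bytes later. As a hedged sketch (assuming the chunk naming pattern above; the rejoined file name is an assumption), the pieces could be reassembled byte-for-byte like this:

# sort the chunks by their numeric index, then append them in order
$parts = Get-ChildItem "C:\temp\large_log_chunk.*.txt" |
    Sort-Object { [int]($_.Name -split '\.')[-2] }
$outFile = [io.file]::OpenWrite("C:\temp\large_log_rejoined.txt")
try {
    foreach ($part in $parts) {
        $bytes = [io.file]::ReadAllBytes($part.FullName)
        $outFile.Write($bytes, 0, $bytes.Length)
    }
} finally {
    $outFile.Close()
}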

Answer 1 (Score: 40)

This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"

$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if((Get-ChildItem -path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}

$reader.Close()

Answer 2 (Score: 34)

A simple one-liner to split based on the number of lines (100 in this case):

$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
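
For what it's worth, -ReadCount 100 sends lines down the pipeline in arrays of 100, so each output file receives 100 lines. A hedged variation of the same one-liner (the input name big.log is an assumption) that zero-pads the index so the output files sort correctly:

$i=0; Get-Content .\big.log -ReadCount 100 | %{$i++; $_ | Out-File ("out_{0:0000}.txt" -f $i)}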

Answer 3 (Score: 28)

Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (reading line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.

Note: I do very little error checking, so I can't guarantee it will work smoothly for your case. It did for mine (a 1.7 GB TXT file of 4 million lines, split into files of 100,000 lines each, in 95 seconds).

#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = ".txt"

$linesperFile = 100000 # 100k
$filecount = 1
$reader = $null
try{
    $reader = [io.file]::OpenText($filename)
    try{
        "Creating file number $filecount"
        $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
        $filecount++
        $linecount = 0

        while($reader.EndOfStream -ne $true) {
            "Reading $linesperFile"
            while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
                $writer.WriteLine($reader.ReadLine());
                $linecount++
            }

            if($reader.EndOfStream -ne $true) {
                "Closing file"
                $writer.Dispose();

                "Creating file number $filecount"
                $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
                $filecount++
                $linecount = 0
            }
        }
    } finally {
        $writer.Dispose();
    }
} finally {
    $reader.Dispose();
}
$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Output splitting a 1.7 GB file:

...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in  95.6308289 seconds

Answer 4 (Score: 15)

I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('c')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Count,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks

            $Part = 1
            Get-Content $_ -ReadCount:$Count | %{

                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # write this chunk to the output file
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $_

                $Part += 1

            }

        }

    }

}
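
For clarity, a couple of hedged usage sketches (the .\chunks destination folder is an assumption and must already exist):

# positional form, as in the .EXAMPLE above: 3000 lines per part, repeating 1 header line
Split-File bigfile.csv 3000 -rc 1

# the same call with full parameter names and an explicit destination
Split-File -Path .\bigfile.csv -Count 3000 -RepeatCount 1 -Destination .\chunks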

Answer 5 (Score: 14)

I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here's what I did, based on Lee's code. I had to look up how to create a new StreamReader object, and to change null to $null.

$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count) 

while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line

    if($line -eq "END:VCARD")
    {
        ++$count
        $filename = "C:\Contacts\{0}.vcf" -f ($count)
    }
}

$reader.Close()

Answer 6 (Score: 6)

Many of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.

I found some of the previous answers that use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.

I didn't try Typhlosaurus's answer, but it looks to only do splits by file size, not line count.

The following suited my purposes.

$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length

Write-Host "Total Lines :" $totalLines

$skip = 0
$count = 100000; # Number of lines per file

# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")

while ($skip -le $totalLines) {
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }

    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])

    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}

$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get the output...

Reading source file...
Total Lines : 910030
Split complete in  1.7056578 seconds

Hopefully others looking for a simple, line-based splitting script that matches my requirements will find this useful.

Answer 7 (Score: 3)

There's also this quick (and somewhat dirty) one-liner:

$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }

You can adjust the number of lines per batch by changing the hard-coded 3000 value. (PowerShell variable names are case-insensitive, so the $I in the one-liner is the same variable as $i.)

Answer 8 (Score: 2)

I made a small modification to split files based on the size of each part instead.

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Size
# (Or -s) The maximum size of each file. Size must be expressed in MB.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv -s 20 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('s')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Size,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks

            $Part = 1
            $buffer = ""
            Get-Content $_ -ReadCount:1 | %{

                # accumulate lines until the buffer exceeds the requested part size
                $buffer += $_ + "`r"

                if ($buffer.Length -gt ($Size * 1MB)) {

                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                    # In the first iteration the header will be
                    # copied to the output file as usual
                    # on subsequent iterations we have to do it
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }

                    # write this chunk to the output file
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer

                    $Part += 1
                    $buffer = ""
                }
            }

            # flush whatever remains in the buffer to a final part
            if ($buffer.Length -gt 0) {
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $buffer
            }

        }

    }
}

Answer 9 (Score: 2)

Do this:

File 1

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII

File 2

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII

File 3

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII

And so on...

Answer 10 (Score: 1)

Sounds like a job for the UNIX command split:

split MyBigFile.csv

It took just under 10 minutes to split my 55 GB csv file into 21k chunks.

It's not native to PowerShell, but it comes with, for instance, the Git for Windows package: https://git-scm.com/download/win
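
By default, split writes 1000-line pieces named xaa, xab, and so on. As a hedged sketch of the more commonly useful options (these are standard GNU coreutils flags, though the build that ships with Git for Windows may differ):

split -l 500000 MyBigFile.csv part_    # 500,000 lines per piece, named part_aa, part_ab, ...
split -b 100M MyBigFile.csv part_      # 100 MB per piece (may split mid-line)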

Answer 11 (Score: 0)

My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files, where a single line is a single record of data, and they are very large, so I need to split them into manageable parts (while preserving the header row).

So, I reverted to my classic VBScript approach and bashed together a small .vbs script that can be run on any Windows computer (it gets executed automatically by the WScript.exe script host engine on Windows).

The benefit of this method is that it uses text streams, so the underlying data isn't loaded into memory (or at least not all at once). The result is that it's exceptionally fast and doesn't need much memory to run. The test file I split using this script on my i7 was about 1 GB in size, had about 12 million lines of text, and was split into 25 part files (each with about 500k lines) - the processing took about 2 minutes, and it never went over 3 MB of memory used at any point.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF), as the text stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

Option Explicit

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"  ' the source file to split
Private Const REPEAT_HEADER_ROW = True                ' repeat the first line of the source in each part file
Private Const LINES_PER_PART = 500000                 ' maximum number of data lines per part file

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

sStart = Now()

sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1

Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

If REPEAT_HEADER_ROW Then
    iLineCounter = 1
    sHeaderLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sHeaderLine)
End If

Do While Not oInputFile.AtEndOfStream
    sLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sLine)
    iLineCounter = iLineCounter + 1
    If iLineCounter Mod LINES_PER_PART = 0 Then
        iOutputFile = iOutputFile + 1
        Call oOutputFile.Close()
        Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
        If REPEAT_HEADER_ROW Then
            Call oOutputFile.WriteLine(sHeaderLine)
        End If
    End If
Loop

Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
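
As a hedged usage note (the file name SplitFile.vbs is an assumption), the script can also be run from a PowerShell or cmd console instead of being double-clicked:

# edit the Const values at the top of the script first, then:
cscript //nologo .\SplitFile.vbs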

Answer 12 (Score: 0)

Since lines can vary in length in logs, I thought it best to take a number-of-lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83 seconds), splitting it into 500,000-line chunks:

$sourceFile = "c:\myfolder\mylargeTextyFile.csv"
$partNumber = 1
$batchSize = 500000
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"

[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001)  # utf8 this one

$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None") 
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename

$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
    $streamout.writeline($line)
    $counter +=1
    if ($counter -eq $batchsize)
    {
        $partNumber+=1
        $counter =0
        $streamOut.close()
        $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
        $streamout = new-object System.IO.StreamWriter $pathAndFilename

    }
    $line = $streamIn.readline()
}
$streamin.close()
$streamout.close()

This could easily be turned into a function or a script file with parameters to make it more versatile. It uses a StreamReader and StreamWriter to achieve its speed and tiny memory footprint.
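
As a hedged sketch of that suggestion (the function name Split-TextFile and its parameter names are assumptions, not from the original), the same logic wrapped in a parameterized function might look like this:

function Split-TextFile {
    param(
        [Parameter(Mandatory=$true)][string]$SourceFile,
        [int]$BatchSize = 500000
    )
    $streamIn = New-Object System.IO.StreamReader($SourceFile)
    $partNumber = 1
    # "$SourceFile.part$partNumber" appends a literal ".part<n>" suffix to the source name
    $streamOut = New-Object System.IO.StreamWriter("$SourceFile.part$partNumber")
    $counter = 0
    $line = $streamIn.ReadLine()
    while ($line -ne $null)
    {
        $streamOut.WriteLine($line)
        $counter++
        if ($counter -eq $BatchSize)
        {
            $counter = 0
            $streamOut.Close()
            $partNumber++
            $streamOut = New-Object System.IO.StreamWriter("$SourceFile.part$partNumber")
        }
        $line = $streamIn.ReadLine()
    }
    $streamIn.Close()
    $streamOut.Close()
}

# e.g. Split-TextFile -SourceFile "c:\myfolder\mylargeTextyFile.csv" -BatchSize 250000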

Answer 13 (Score: 0)

Here's my solution for splitting a file called patch6.txt (about 32,000 lines) into separate files of 1,000 lines each. It's not quick, but it does the job.

$infile = "D:\Malcolm\Test\patch6.txt"
$path = "D:\Malcolm\Test\"
$lineCount = 0
$fileCount = 1

foreach ($computername in get-content $infile)
{
    # ${path} keeps the variable name from swallowing the rest of the file name
    write $computername | out-file -Append "${path}$fileCount.txt"
    $lineCount++

    if ($lineCount -eq 1000)
    {
        $fileCount++
        $lineCount = 0
    }
}