Question

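So my current task is dealing with recurring reports that are more than 1 million lines long.

My last question did not explain everything, so I am trying to ask a better one.

I receive a dozen or more daily reports, which arrive as CSV files. I do not know in advance what the headers are or anything like that.

They are massive. I cannot open them in Excel.

I basically want to break each one up into pieces of the same report, just with each piece maybe 100,000 lines long.

The code below does not work, because I keep getting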

Exception of type 'System.OutOfMemoryException' was thrown.

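I am guessing I need a better way to do this.

I just need this file broken down into more manageable sizes. It does not matter how long it takes, since I can run it overnight.

I found this on the internet and tried to adapt it, but I cannot get it to work.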

$PSScriptRoot

write-host $PSScriptRoot

$loc = $PSScriptRoot

$location = $loc

# how many rows per CSV?
$rowsMax = 10000; 

# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"


# Read and split all of them
$allCSVs | ForEach-Object {
    Write-Host $_.Name;
    $content = Import-Csv "$location\Split.csv"
    $insertLocation = ($_.Name.Length - 4);
    for($i=1; $i -le $content.length ;$i+=$rowsMax){
    $newName = $_.Name.Insert($insertLocation, "splitted_"+$i)
    $content|select -first $i|select -last $rowsMax | convertto-csv -NoTypeInformation | % { $_ -replace '"', ""} | out-file $location\$newName -fo -en ascii
    }
}

Answer 1

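The key is not to read large files into memory in full, which is what you are doing by capturing the output of Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").

That said, while using a single pipeline would solve your memory problem, performance will likely be poor, because you are converting from CSV and back to CSV, which incurs a lot of overhead.

Even reading and writing the files as text with Get-Content and Set-Content is slow.
Therefore, I suggest a .NET-based approach that processes the files as text, which should speed things up substantially.

The following code demonstrates this technique: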
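To make the memory point concrete, here is a minimal sketch of the difference between capturing and streaming. The sample file name and its contents are made up for illustration; only the pattern matters.

# Build a tiny sample file; the path and contents are hypothetical.
$csvPath = Join-Path ([IO.Path]::GetTempPath()) 'stream_demo.csv'
Set-Content -LiteralPath $csvPath -Value 'id,name', '1,a', '2,b', '3,c'

# Anti-pattern: this would hold every row in memory at once.
#   $content = Import-Csv -LiteralPath $csvPath

# Streaming: each row object is passed to ForEach-Object as soon as it is
# parsed, so the whole file never has to be held in memory.
$rowCount = 0
Import-Csv -LiteralPath $csvPath | ForEach-Object { $rowCount++ }
Write-Host "Rows seen: $rowCount"

The streaming form keeps memory use flat, but every row is still parsed into an object, which is where the overhead mentioned above comes from.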

Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {

    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., "...\file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    # Read the file lazily and save every chunk of $chunkLineCount
    # lines to a new file.
    $i = 0; $chunkNdx = 0
    foreach ($line in [IO.File]::ReadLines($csvFile)) {
        if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Close previous file, if any.
            if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }

            # Construct the file path for the next chunk, by
            # instantiating the template with the next sequence number.
            $csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"

            # Create the next chunk file and write the header.
            $fileWriter = [IO.File]::CreateText($csvFileChunk)
            $fileWriter.WriteLine($header)
        }
        # Write a data row to the current chunk file.
        $fileWriter.WriteLine($line)
    }
    $fileWriter.Dispose() # Close the last file.

}

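Note that the code above creates BOM-less UTF-8 files; if your input contains only ASCII-range characters, these files will effectively be ASCII files.

Here is the equivalent single-pipeline solution, which is likely to be substantially slower: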
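If the chunks need a different encoding, one option is to create the writer with an explicit encoding instead of calling [IO.File]::CreateText. This is only a sketch: the output path is hypothetical, and it assumes UTF-8 with a BOM is what you want and that you are on PowerShell 5 or later (for the ::new() syntax).

# Hypothetical output path, for illustration only.
$chunkPath = Join-Path ([IO.Path]::GetTempPath()) 'chunk_demo.csv'

# $true asks UTF8Encoding to emit a BOM; $false would omit it.
$encoding = [Text.UTF8Encoding]::new($true)

# StreamWriter(path, append, encoding): $false overwrites an existing file.
$writer = [IO.StreamWriter]::new($chunkPath, $false, $encoding)
try {
    $writer.WriteLine('id,name')
    $writer.WriteLine('1,example')
}
finally {
    # Dispose flushes and closes the file even if a write fails.
    $writer.Dispose()
}

In the chunking loop, the same constructor call could stand in for [IO.File]::CreateText($csvFileChunk).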

Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {

    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., ".../file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    $i = 0; $chunkNdx = 0
    Get-Content -LiteralPath $csvFile | ForEach-Object {
        if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Construct the file path for the next chunk.
            $csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
        }
        # Write data row to the current chunk file.
        Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
    }

}

Answer 2

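Another option from the Linux world: the split command. To get it on Windows, just install Git Bash; you will then be able to use many Linux tools from CMD / PowerShell. Below is the syntax to achieve your goal: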

split -l 100000 --numeric-suffixes --suffix-length 3 --additional-suffix=.csv sourceFile.csv outputfile

It is very fast. If needed, you can wrap split.exe as a cmdlet.
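A wrapper along those lines might look like the sketch below. The function name Split-CsvByLine and its parameters are hypothetical, and it assumes GNU split is reachable on PATH (for example from a Git for Windows installation).

# Hypothetical wrapper; assumes GNU split is on PATH.
function Split-CsvByLine {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $Path,   # file to split
        [int] $LineCount = 100000,               # lines per chunk
        [string] $OutputPrefix = 'chunk_'        # prefix for the chunk files
    )

    # Produces <prefix>000.csv, <prefix>001.csv, and so on.
    & split -l $LineCount --numeric-suffixes --suffix-length=3 --additional-suffix=.csv $Path $OutputPrefix

    # Native commands report failure through $LASTEXITCODE, not exceptions.
    if ($LASTEXITCODE -ne 0) { throw "split exited with code $LASTEXITCODE" }
}

Usage would be something like: Split-CsvByLine -Path .\Split.csv -LineCount 100000. Keep in mind that split treats the file as plain lines, so only the first chunk contains the header row.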

PowerShell: split a CSV by line count
