Question

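I need a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-Csv and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code: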

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()
    $details = [ordered]@{
                            "Plugin ID" = $line[0]
                            CVE = $line[1]
                            CVSS = $line[2]
                            Risk = $line[3]     
                         }                        
    $export = New-Object PSObject -Property $details
    $export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"    
}

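This little loop took nearly 2 minutes to process a 20 MB file. Scaling up at this speed means over an hour for the smallest CSV file I'm currently working with.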

I've tried this as well:

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()  

    $details = [ordered]@{
                             # Same data as before
                         }

    $export.Add($details) | Out-Null        
}

$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"

This is faster, but it doesn't put the right information in the new CSV. Instead I get rows and rows of this:

"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"

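So, two questions:

1) Can the first block of code be made faster?
2) How can I unwrap the ArrayList in the second example to get to the actual data?

EDIT: Sample data here - http://pastebin.com/6L98jGNg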

Answer 1

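This is simple text processing, so the bottleneck should be disk read speed: about 1 second per 100 MB, or 10 seconds per 1 GB, for the OP's sample (repeated to reach the sizes mentioned above), as measured on an i7. The results would be worse for files with many or all small quoted fields.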

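The algorithm is simple:

1. Read the file in big string chunks, e.g. 1 MB.
   This is much faster than reading millions of lines separated by CR/LF because:
   - fewer checks are performed, as we mostly look only for double quotes;
   - fewer iterations of our code are executed by the interpreter.
2. Find the next double quote.
3. Depending on the current $inQuotedField flag, decide whether the found double quote starts a quoted field (it should be preceded by , plus optionally some spaces) or ends the current quoted field (it should be followed by any even number of double quotes, optionally spaces, then ,).
4. Replace delimiters in the preceding span, or up to the end of the 1 MB chunk if no quote was found.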

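The code makes some reasonable assumptions, but it may fail to detect an escaped field if its double quote is separated from the field delimiter by more than 3 spaces, before or after. Those checks wouldn't be too hard to add, and I might have missed some other edge case, but I'm not that interested.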

$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM

$delim = [char]','
$newDelim = [char]'|'

$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
    $sourcePath,
    [IO.FileMode]::open,
    [IO.FileAccess]::read,
    [IO.FileShare]::read,
    $buf.length,  # let OS prefetch the next chunk in background
    [IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)

$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)

while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
    $s = [string]::new($buf, 0, $nRead+$bufStart)
    $len = $s.length
    $pos = 0
    $out.Clear() >$null

    do {
        $iQuote = $s.IndexOf([char]'"', $pos)
        if ($inQuotedField) {
            $iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
            if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
                # no closing quote in buffer safezone
                $out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
                break
            }
            if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
                # even number of quotes are just quoted quotes
                $inQuotedField = $matches[1].length % 2 -eq 0
            }
            $out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
            $pos = $iDelim + 1
            continue
        }
        if ($iQuote -ge 0) {
            $iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
            if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
                $inQuotedField = $true
            }
            $replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
        } elseif ($pos -gt 0) {
            $replaced = $s.Substring($pos).Replace($delim, $newDelim)
        } else {
            $replaced = $s.Replace($delim, $newDelim)
        }
        $out.Append($replaced) >$null
        $pos = $iQuote + 1
    } while ($iQuote -ge 0)

    $target.Write($out)

    $bufStart = 0
    for ($i = $out.length; $i -lt $s.length; $i++) {
        $buf[$bufStart++] = $buf[$i]
    }
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
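
To make the core idea easier to follow, here is a minimal sketch of the quote-aware replacement applied to a single in-memory string, without the chunked reading or the buffer-boundary handling above. The function name Convert-DelimiterOutsideQuotes is hypothetical, and the sketch assumes well-formed CSV in which double quotes appear only as field wrappers or as doubled ("") escapes, so simply toggling the flag on every quote is enough. It does not perform the whitespace checks the full script does.

```powershell
# Hypothetical helper for illustration; not part of the script above.
# Assumes well-formed CSV: quotes only wrap fields or appear doubled ("").
function Convert-DelimiterOutsideQuotes {
    param(
        [string]$Text,
        [char]$Delim = ',',
        [char]$NewDelim = '|'
    )
    $out = [Text.StringBuilder]::new($Text.Length)
    $inQuotedField = $false
    $pos = 0
    while ($pos -lt $Text.Length) {
        $iQuote = $Text.IndexOf([char]'"', $pos)
        # The span runs up to and including the next quote, or to the end of the text.
        $end = if ($iQuote -ge 0) { $iQuote + 1 } else { $Text.Length }
        $span = $Text.Substring($pos, $end - $pos)
        if (-not $inQuotedField) {
            # Only delimiters outside quoted fields are replaced.
            $span = $span.Replace($Delim, $NewDelim)
        }
        $out.Append($span) > $null
        # Every quote toggles the state; a doubled quote toggles twice and cancels out.
        if ($iQuote -ge 0) { $inQuotedField = -not $inQuotedField }
        $pos = $end
    }
    $out.ToString()
}

Convert-DelimiterOutsideQuotes 'a,"b,c",d'   # -> a|"b,c"|d
```

The full script needs the extra bookkeeping ($bufPadding, $bufStart) because a quoted field can straddle two 1 MB chunks, which a single-string sketch never has to deal with.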

Answer 2

Still not what I'd call fast, but this is considerably faster than what you have listed, by using the -Join operator:

$reader = New-Object Microsoft.VisualBasic.fileio.textfieldparser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    $line -join '|' | Add-Content C:\Temp\TestOutput.csv
}

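This finished processing a 20 MB file in 32 seconds. At that rate your 750 MB file would be done in about 20 minutes, and bigger files should take about 26 minutes per gigabyte.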
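
One likely remaining cost in the loop above is that Add-Content is invoked once per line. A possible variation, sketched here and not timed, keeps a single System.IO.StreamWriter open for the whole run. The function name Convert-CsvDelimiter and its parameters are hypothetical, and the sketch assumes the Microsoft.VisualBasic assembly can be loaded with Add-Type in your PowerShell version.

```powershell
# Assumption: Microsoft.VisualBasic is available (it is in Windows PowerShell).
Add-Type -AssemblyName Microsoft.VisualBasic

# Hypothetical helper for illustration; names and parameters are not from the answer.
function Convert-CsvDelimiter {
    param(
        [string]$Source,
        [string]$Target,
        [string]$NewDelim = '|'
    )
    $reader = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new($Source)
    $reader.SetDelimiters(',')
    $writer = $null
    try {
        # One writer for the whole file, UTF-8 without BOM.
        $writer = [IO.StreamWriter]::new($Target, $false, [Text.UTF8Encoding]::new($false))
        while (-not $reader.EndOfData) {
            $fields = $reader.ReadFields()
            $writer.WriteLine($fields -join $NewDelim)
        }
    }
    finally {
        $reader.Close()
        if ($writer) { $writer.Close() }
    }
}
```

As with the original loop, -join does not re-quote anything, so a field that itself contains the new delimiter, a double quote, or a line break would need to be quoted again before being written.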

Changing the delimiter in a large CSV file using PowerShell
