A faster way to merge two CSV files by matching on the first column

Time: 2016-01-04 23:51:02

Tags: powershell powershell-v3.0

I am currently trying to merge two CSV files. The first file has a little over 3,000 rows. The second file has about 400,000 rows.

To test this, I am using these two files:

First CSV file:

Csv1ColumnOne,Csv1ColumnTwo,Csv1ColumnThree,Csv1ColumnFour
1234,Value1,Value1,Value1
2345,Value2,Value1,Value1
3456,Value1,Value2,Value1
4567,Value1,Value1,Value2
7645,Value3,Value3,Value3

Second CSV file:

Csv2ColumnOne,Csv2ColumnTwo,Csv2ColumnThree
1234,abc,Value1
2345,asd,Value1
3456,qwe,Value1
4567,mnb,Value1

The final result file should look like this:

"Csv1ColumnOne","Csv1ColumnTwo","Csv1ColumnThree","Csv1ColumnFour","Csv2ColumnOne"
"1234","Value1","Value1","Value1","abc"
"2345","Value2","Value1","Value1","asd"
"3456","Value1","Value2","Value1","qwe"
"4567","Value1","Value1","Value2","mnb"
"7645","Value3","Value3","Value3","Not Found"

Here is the code I have now (it currently works):

Function GetFirstColumnNameFromFile
{
    Param ($CsvFileWithPath)

    $FirstFileFirstColumnTitle = ((Get-Content $CsvFileWithPath -TotalCount 2 | ConvertFrom-Csv).psobject.properties | ForEach-Object {$_.name})[0]
    Write-Output $FirstFileFirstColumnTitle
}

Function CreateMergedFileWithCsv2ColumnOneColumn
{
    Param ($firstColumnFirstFile, $FirstFileFirstColumnTitle, $firstFile, $secondFile, $resultsFile)

    Write-Host "Creating hash table with columns values `"Csv2ColumnOne`" `"Csv2ColumnTwo`" From $secondFile"
    $hashColumnOneColumnTwo2ndFile = @{}
    Import-Csv $secondFile | Where-Object {$firstColumnFirstFile -contains $_.'Csv2ColumnOne'} | ForEach-Object {$hashColumnOneColumnTwo2ndFile[$_.'Csv2ColumnOne'] = $_.Csv2ColumnTwo}
    Write-Host "Complete."

    Write-Host "Creating Merge file with file $firstFile
    and column `"Csv2ColumnTwo`" from file $secondFile"
    Import-Csv $firstFile | Select-Object *, @{n='Csv2ColumnOne'; e={
    if ($hashColumnOneColumnTwo2ndFile.ContainsKey($_.$FirstFileFirstColumnTitle)) {
        $hashColumnOneColumnTwo2ndFile[$_.$FirstFileFirstColumnTitle]
    } Else {
        'Not Found'
    }}} | Export-Csv $resultsFile -NoType
    Write-Host "Complete."
}

Function MatchFirstTwoColumnsTwoFilesAndCombineOtherColumnsOneFile
{
    Param ($firstFile, $secondFile, $resultsFile)

    [string]$FirstFileFirstColumnTitle = GetFirstColumnNameFromFile $firstFile

    $FirstFileFirstColumn = Import-Csv $firstFile | Where-Object {$_.$FirstFileFirstColumnTitle} | Select-Object -ExpandProperty $FirstFileFirstColumnTitle

    CreateMergedFileWithCsv2ColumnOneColumn $FirstFileFirstColumn $FirstFileFirstColumnTitle $firstFile $secondFile $resultsFile
}

Function Main
{
    $firstFile = 'C:\Scripts\Tests\test1.csv'
    $secondFile = 'C:\Scripts\Tests\test2.csv'
    $resultsFile = 'C:\Scripts\Tests\testResults.csv'

    MatchFirstTwoColumnsTwoFilesAndCombineOtherColumnsOneFile $firstFile $secondFile $resultsFile
}

Main

For the following line:

Import-Csv $secondFile | Where-Object {$firstColumnFirstFile -contains $_.'Csv2ColumnOne'} | ForEach-Object {$hashColumnOneColumnTwo2ndFile[$_.'Csv2ColumnOne'] = $_.Csv2ColumnTwo}

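it takes about 30 minutes per column, and there are 10 columns. That means merging 3,000 rows between the two CSV files would take roughly 5-7 hours (once I add the code for the other columns in the final result file). Is there a faster way to create a hash table from the second file, which has over 400,000 rows?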

2 answers:

Answer 0: (score: 2)

See if this builds the hash table any faster:

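$ht = @{}
# Read the file in batches of 1000 lines, so each pipeline item is an array of strings.
Get-Content test1.csv -ReadCount 1000 |
ForEach-Object {
    # Rewrite each line as "column1=column2" and parse the whole batch in one call.
    # Note that the header line is converted into a key/value pair as well.
    $ht += ConvertFrom-StringData $($_ -replace '"?(.+?)"?,"?(.+?)"?,.+', '$1=$2' | Out-String)
}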
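
Separately from the approach above, one likely reason the line in the question is slow is that `-contains` scans the whole array of first-file keys (about 3,000 entries) for each of the 400,000 rows in the second file. The following is a minimal sketch, not taken from the question or this answer, that swaps the array for a `HashSet[string]` so each membership test is a hash lookup. The temporary directory, the sample rows, and the `$keys` and `$lookup` variable names are assumptions made for illustration; the column names come from the question's test files.

```powershell
# Hypothetical sample files standing in for the question's test1.csv and test2.csv.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'csv-merge-sketch'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$firstFile  = Join-Path $dir 'test1.csv'
$secondFile = Join-Path $dir 'test2.csv'
@'
Csv1ColumnOne,Csv1ColumnTwo,Csv1ColumnThree,Csv1ColumnFour
1234,Value1,Value1,Value1
7645,Value3,Value3,Value3
'@ | Set-Content $firstFile
@'
Csv2ColumnOne,Csv2ColumnTwo,Csv2ColumnThree
1234,abc,Value1
9999,zzz,Value1
'@ | Set-Content $secondFile

# Collect the first file's keys in a HashSet: Contains() is a hash lookup,
# whereas -contains walks the array element by element.
$keys = New-Object 'System.Collections.Generic.HashSet[string]'
foreach ($row in Import-Csv $firstFile) { [void]$keys.Add($row.Csv1ColumnOne) }

# Build the lookup table from the second file, keeping only keys that
# also appear in the first file.
$lookup = @{}
foreach ($row in Import-Csv $secondFile) {
    if ($keys.Contains($row.Csv2ColumnOne)) {
        $lookup[$row.Csv2ColumnOne] = $row.Csv2ColumnTwo
    }
}
```

One caveat with this sketch: `HashSet[string]` compares case-sensitively by default, while `-contains` does not. That makes no difference for numeric IDs like the ones in the test files, but for textual keys a case-insensitive comparer such as `[StringComparer]::OrdinalIgnoreCase` could be passed to the constructor.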

Answer 1: (score: 0)

I'm not 100% sure I follow your question, but I ran the following against your test files:

$file1 = Import-Csv .\file1.csv
$file2 = Import-Csv .\file2.csv

$file1 | ForEach-Object {
    $f1 = $_
    $f1 | Add-Member -MemberType NoteProperty -Name csv2columnone -Value "" 
    $file2 | ForEach-Object {
        if($f1.csv1columnone -eq $_.csv2columnone) {
            if($_.csv2columntwo -ne $null) {
                $f1.csv2columnone = $_.csv2columntwo
            }
        } 
    }
    if([String]::IsNullOrEmpty($f1.csv2columnone)) {
        $f1.csv2columnone = "Not found"
    }
    Write-Output $f1
} | ft

And got this result:

Csv1ColumnOne Csv1ColumnTwo Csv1ColumnThree Csv1ColumnFour csv2columnone
------------- ------------- --------------- -------------- -------------
1234          Value1        Value1          Value1         abc          
2345          Value2        Value1          Value1         asd          
3456          Value1        Value2          Value1         qwe          
4567          Value1        Value1          Value2         mnb          
7645          Value3        Value3          Value3         Not found    

Timing it with Measure-Command gave a run time of 20 milliseconds.
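
Keep in mind that the 20 ms figure comes from the small test files. The nested loop walks all of `$file2` once per row of `$file1`, so with roughly 3,000 and 400,000 rows it would perform on the order of 1.2 billion comparisons. A possible variation, sketched here under the assumption that the first column of the second file is unique, indexes the second file in a hash table once and then makes a single pass over the first file. The inline sample data and the `$index` and `$merged` names are illustrative only.

```powershell
# Inline sample data mirroring the question's test files (hypothetical stand-ins for the real CSVs).
$file1 = @'
Csv1ColumnOne,Csv1ColumnTwo,Csv1ColumnThree,Csv1ColumnFour
1234,Value1,Value1,Value1
7645,Value3,Value3,Value3
'@ | ConvertFrom-Csv
$file2 = @'
Csv2ColumnOne,Csv2ColumnTwo,Csv2ColumnThree
1234,abc,Value1
2345,asd,Value1
'@ | ConvertFrom-Csv

# Index the second file once: key = first column, value = second column.
# If a key repeats, the last row wins.
$index = @{}
foreach ($row in $file2) { $index[$row.Csv2ColumnOne] = $row.Csv2ColumnTwo }

# One pass over the first file, with a hash lookup per row instead of an inner loop.
$merged = foreach ($row in $file1) {
    $value = if ($index.ContainsKey($row.Csv1ColumnOne)) { $index[$row.Csv1ColumnOne] } else { 'Not Found' }
    $row | Add-Member -MemberType NoteProperty -Name Csv2ColumnOne -Value $value -PassThru
}
$merged | Format-Table
```

To write a result file instead of printing a table, `$merged` could be piped to `Export-Csv` with `-NoTypeInformation`, as in the question's own code.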