Question

I have a large file of 10 million rows (currently a CSV). I need to read through the file and remove duplicates based on multiple columns.

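A sample line of data looks something like this:

ComputerName, IPAddress, MacAddress, CurrentDate, FirstSeenDate

I want to check MacAddress and ComputerName for duplicates and, if a duplicate is found, keep only the single entry with the earliest FirstSeenDate.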

I have read the CSV into a variable with Import-Csv and then processed that variable with Sort-Object ... etc., but it is very slow.

I'm thinking I could use a stream reader and read the CSV line by line, building up a unique array based on array-contains logic.

Thoughts?

Answer 1

You could import the data into a database (SQLite, for example) and then query it:

SELECT 
  MIN(FirstSeenDate) AS FirstSeenDate, 
  ComputerName, 
  IPAddress, 
  MacAddress
FROM importedData
GROUP BY ComputerName, IPAddress, MacAddress

Answer 2

If performance is the main concern, I would probably use Python. Or LogParser.

However, if I had to use PowerShell, I would probably try something like this:

$CultureInfo = [CultureInfo]::InvariantCulture
$DateFormat = 'M/d/yyyy' # Use whatever date format is appropriate

# We need to convert the strings that represent dates. You can skip the ParseExact() calls if the dates are already in a string sortable format (e.g., yyyy-MM-dd).
$Data = Import-Csv $InputFile | Select-Object -Property ComputerName, IPAddress, MacAddress, @{n = 'CurrentDate'; e = {[DateTime]::ParseExact($_.CurrentDate, $DateFormat, $CultureInfo)}}, @{n = 'FirstSeenDate'; e = {[DateTime]::ParseExact($_.FirstSeenDate, $DateFormat, $CultureInfo)}}

$Results = @{}
foreach ($Record in $Data) {
    $Key = $Record.ComputerName + ';' + $Record.MacAddress
    if (!$Results.ContainsKey($Key)) {
        $Results[$Key] = $Record
    }
    elseif ($Record.FirstSeenDate -lt $Results[$Key].FirstSeenDate) {
        $Results[$Key] = $Record
    }
}

$Results.Values | Sort-Object -Property ComputerName, MacAddress | Export-Csv $OutputFile -NoTypeInformation

This is likely to be faster, because Group-Object, powerful as it is, is often the bottleneck.
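For comparison, the Group-Object style of solution that the hashtable loop is meant to avoid might look roughly like the sketch below. The sample rows are hypothetical, and the dates are assumed to be yyyy-MM-dd strings so that sorting them as text gives chronological order.

```powershell
# Hypothetical sample rows; ISO-formatted dates sort correctly as plain strings.
$Data = @(
    [pscustomobject]@{ ComputerName = 'PC1'; MacAddress = 'AA-BB'; FirstSeenDate = '2020-03-01' }
    [pscustomobject]@{ ComputerName = 'PC1'; MacAddress = 'AA-BB'; FirstSeenDate = '2020-01-15' }
    [pscustomobject]@{ ComputerName = 'PC2'; MacAddress = 'CC-DD'; FirstSeenDate = '2020-02-10' }
)

# Group on both key columns, then keep the earliest row from each group.
$Grouped = $Data |
    Group-Object -Property ComputerName, MacAddress |
    ForEach-Object {
        $_.Group | Sort-Object -Property FirstSeenDate | Select-Object -First 1
    }

$Grouped | Format-Table
```

This gives the same rows as the hashtable loop, but it has to build every group and sort each one, whereas the loop makes a single pass with one lookup per row.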

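If you really want to try a stream reader, try the Microsoft.VisualBasic.FileIO.TextFieldParser class, which, despite its somewhat misleading name, is part of the .NET Framework. You can access it by running Add-Type -AssemblyName Microsoft.VisualBasic.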
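A minimal sketch of the TextFieldParser idea, combined with the same keep-the-earliest hashtable logic, could look like the following. It is an illustration under stated assumptions rather than a tuned implementation: the sample file, its path, the fixed column order, and the yyyy-MM-dd date format (which lets dates be compared as strings) are all assumed here.

```powershell
Add-Type -AssemblyName Microsoft.VisualBasic

# Hypothetical sample file so the sketch runs on its own; dates assumed to be yyyy-MM-dd.
$InputFile = Join-Path ([System.IO.Path]::GetTempPath()) 'dedupe-sample.csv'
@(
    'ComputerName,IPAddress,MacAddress,CurrentDate,FirstSeenDate'
    'PC1,10.0.0.1,AA-BB,2020-05-01,2020-03-01'
    'PC1,10.0.0.7,AA-BB,2020-05-01,2020-01-15'
    'PC2,10.0.0.2,CC-DD,2020-05-01,2020-02-10'
) | Set-Content -Path $InputFile

$Parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($InputFile)
$Parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
$Parser.SetDelimiters(',')
$Parser.HasFieldsEnclosedInQuotes = $true

$Results = @{}
try {
    # Skip the header row; the column order is assumed to match the sample above.
    $null = $Parser.ReadFields()
    while (-not $Parser.EndOfData) {
        $Fields = $Parser.ReadFields()
        # Key on ComputerName (index 0) and MacAddress (index 2).
        $Key = $Fields[0] + ';' + $Fields[2]
        # Keep the row if the key is new or its FirstSeenDate (index 4) is earlier.
        if (-not $Results.ContainsKey($Key) -or
            [string]::CompareOrdinal($Fields[4], $Results[$Key][4]) -lt 0) {
            $Results[$Key] = $Fields
        }
    }
}
finally {
    $Parser.Close()
}

# Turn the surviving field arrays back into objects, ready for Export-Csv.
$Deduped = foreach ($Row in $Results.Values) {
    [pscustomobject]@{
        ComputerName  = $Row[0]
        IPAddress     = $Row[1]
        MacAddress    = $Row[2]
        CurrentDate   = $Row[3]
        FirstSeenDate = $Row[4]
    }
}
$Deduped | Sort-Object -Property ComputerName, MacAddress | Format-Table
```

Only one row per ComputerName/MacAddress pair is held in memory at a time, which is the main attraction over loading all 10 million rows with Import-Csv first. If the dates are not in a string-sortable format, the comparison would need a [DateTime]::ParseExact() call as in the code above.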

Read a large CSV in PowerShell, parse multiple columns for unique values, and save the results based on the earliest value in a column