Question

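This code returns the unique and shared lines between two files. Unfortunately, it runs forever if the files have 1 million lines. Is there a faster way (e.g. -eq, -match, wildcards, Compare-Object), or are the containment operators the best approach?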

$afile = Get-Content (Read-Host "Enter 'A' file")
$bfile = Get-Content (Read-Host "Enter 'B' file")

$afile |
  ? { $bfile -notcontains $_ } |
  Set-Content lines_ONLY_in_A.txt

$bfile |
  ? { $afile -notcontains $_ } |
  Set-Content lines_ONLY_in_B.txt

$afile |
  ? { $bfile -contains $_ } |
  Set-Content lines_in_BOTH_A_and_B.txt

Answer 1

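As I mentioned in my answer to your previous question, -contains is a slow operation, particularly with large arrays.

For exact matches you could use Compare-Object and discriminate the output by side indicator: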

Compare-Object $afile $bfile -IncludeEqual | ForEach-Object {
    switch ($_.SideIndicator) {
        '<=' { $_.InputObject | Add-Content 'lines_ONLY_in_A.txt' }
        '=>' { $_.InputObject | Add-Content 'lines_ONLY_in_B.txt' }
        '==' { $_.InputObject | Add-Content 'lines_in_BOTH_A_and_B.txt' }
    }
}
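
To make the side indicators concrete, here is a minimal sketch on two tiny in-memory arrays. The sample data and the variable names $a, $b, $onlyA, $onlyB and $both are made up for illustration; they are not part of the original answer.

```powershell
# Hypothetical sample data standing in for the lines of the two files.
$a = 'apple', 'banana', 'cherry'
$b = 'banana', 'cherry', 'date'

# -IncludeEqual adds the '==' entries; without it only the differences are returned.
$result = Compare-Object -ReferenceObject $a -DifferenceObject $b -IncludeEqual

# Split the result by side indicator in memory.
$onlyA = @($result | Where-Object { $_.SideIndicator -eq '<=' } | ForEach-Object { $_.InputObject })
$onlyB = @($result | Where-Object { $_.SideIndicator -eq '=>' } | ForEach-Object { $_.InputObject })
$both  = @($result | Where-Object { $_.SideIndicator -eq '==' } | ForEach-Object { $_.InputObject })

"Only in A: $($onlyA -join ', ')"
"Only in B: $($onlyB -join ', ')"
"In both:   $($both -join ', ')"
```

Here apple ends up only in A, date only in B, and banana and cherry in both. Collecting each group first also makes it possible to write each output file with a single Set-Content call instead of one Add-Content call per line, which may help with large inputs.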

If that's still too slow, try reading each file into a hashtable:

$afile = Get-Content (Read-Host "Enter 'A' file")
$ahash = @{}
$afile | ForEach-Object {
    $ahash[$_] = $true
}

and process the files like this ($bhash is built from the B file in the same way as $ahash):

$afile | Where-Object {
    -not $bhash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_A.txt'

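If that still doesn't help, you need to identify the bottleneck (reading the files, comparing the data, doing multiple comparisons, ...) and proceed from there.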
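
One way to look for the bottleneck is to time each step separately with Measure-Command. The following is only a sketch on synthetic data: the 2000-line arrays and the variable names are assumptions, and real timings depend on the machine and the actual files.

```powershell
# Synthetic data (an assumption): 2000 lines per "file", half of them shared.
$a = 1..2000    | ForEach-Object { "line $_" }
$b = 1001..3000 | ForEach-Object { "line $_" }

# Build the lookup table for B once. Keys in @{} are case-insensitive by default,
# which matches the case-insensitive behaviour of -contains / -notcontains.
$bhash = @{}
foreach ($line in $b) { $bhash[$line] = $true }

# Time the array scan and the hashtable lookup separately.
$scanTime = Measure-Command { $a | Where-Object { $b -notcontains $_ } }
$hashTime = Measure-Command { $a | Where-Object { -not $bhash.ContainsKey($_) } }

# Sanity check that the fast variant still produces the expected lines.
$onlyA = @($a | Where-Object { -not $bhash.ContainsKey($_) })

"Array scan:       {0:N0} ms" -f $scanTime.TotalMilliseconds
"Hashtable lookup: {0:N0} ms" -f $hashTime.TotalMilliseconds
"Lines only in A:  $($onlyA.Count)"
```

Wrapping the Get-Content calls in Measure-Command in the same way would show whether reading the files, rather than comparing them, dominates the run time.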

Answer 2

Try this:

$All=@()
$All+= Get-Content "c:\temp\a.txt" | %{[pscustomobject]@{Row=$_;File="A"}}
$All+= Get-Content "c:\temp\b.txt" | %{[pscustomobject]@{Row=$_;File="B"}}
$All | group row | %{
    $InA=$_.Group.File.Contains("A")
    $InB=$_.Group.File.Contains("B")

    if ($InA -and $InB)
    {
        $_.Group.Row | select -unique | Out-File c:\temp\lines_in_A_And_B.txt -Append
    }
    elseif ($InA)
    {
        $_.Group.Row | select -unique | Out-File c:\temp\lines_Only_A.txt -Append
    }
    else
    {
        $_.Group.Row | select -unique | Out-File c:\temp\lines_Only_B.txt -Append
    }
}

Answer 3

Following up on my suggestion to use a binary search, I created a reusable Search-SortedArray function for this:

Description

Search-SortedArray (alias Search) performs a binary search for a string in a sorted array. If the string is found, the index of the matching element in the array is returned. Otherwise, if the string is not found, $Null is returned.

Function Search-SortedArray ([String[]]$SortedArray, [String]$Find, [Switch]$CaseSensitive) {
    # Halve the [l, r] window until the string is found or the window is empty
    $l = 0; $r = $SortedArray.Count - 1
    While ($l -le $r) {
        $m = [int](($l + $r) / 2)
        Switch ([String]::Compare($find, $SortedArray[$m], !$CaseSensitive)) {
            -1 {$r = $m - 1}
            1 {$l = $m + 1}
            Default {Return $m}
        }
    }
    # No output ($Null) when the string is not in the array
}; Set-Alias Search Search-SortedArray

$afile | ? {(Search $bfile $_) -eq $Null} | Set-Content lines_ONLY_in_A.txt
$bfile | ? {(Search $afile $_) -eq $Null} | Set-Content lines_ONLY_in_B.txt
$afile | ? {(Search $bfile $_) -ne $Null} | Set-Content lines_in_BOTH_A_and_B.txt

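Note 1: Due to the overhead, a binary search only pays off for (very) large arrays.

Note 2: The arrays must be sorted, otherwise the results will be unpredictable.

Note 3: The search does not take duplicates into account. If there are duplicate values, only one index is returned (which is not a concern for this particular question).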
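
Notes 2 and 3 can be illustrated with the .NET built-ins [Array]::Sort and [Array]::BinarySearch, used here in place of the Search-SortedArray function. This is a sketch on made-up sample data; the important assumption is that sorting and searching use the same comparer.

```powershell
# Hypothetical unsorted sample data, including a duplicate.
[string[]]$b = 'pear', 'apple', 'fig', 'apple'

# Sort and search with the same comparer; mixing comparers (or searching
# an unsorted array) makes the result unpredictable.
$comparer = [System.StringComparer]::OrdinalIgnoreCase
[Array]::Sort($b, $comparer)

# BinarySearch returns the index of a match, or a negative number if there is none.
$found   = [Array]::BinarySearch($b, 'fig', $comparer)
$missing = [Array]::BinarySearch($b, 'kiwi', $comparer)

# For the duplicated value only one of the matching indexes is returned.
$dupe = [Array]::BinarySearch($b, 'apple', $comparer)

"Sorted: $($b -join ', ')"
"fig found at index $found; kiwi found: $($missing -ge 0); apple at index $dupe"
```

Unlike Search-SortedArray, which returns $Null for a miss, [Array]::BinarySearch signals a miss with a negative number, so the test becomes -ge 0 rather than -ne $Null.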

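Added 2017-11-07, based on a comment from @Ansgar Wiechers:

"Quick benchmark with 2 files with a couple thousand lines each (including duplicate lines): binary search: 2400ms; Compare-Object: 1850ms; hashtable lookup: 250ms"

The idea is that the binary search gains the advantage in the long run: the larger the arrays, the more it gains proportionally.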

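Taking $afile |? { $bfile -notcontains $_ } as an example, and assuming that the "couple thousand lines" in the measurements quoted from the comment are 3000 lines:

For a standard search, you need an average of 1500 iterations in $bfile (*1):
```
(3000 + 1) / 2 = 3001 / 2 = 1500
```
For a binary search, you need an average of 6.27 iterations in $bfile:
```
(log₂ 3000 + 1) / 2 = (11.55 + 1) / 2 = 6.27
```

In both cases you do this 3000 times (once for each item in $afile). This means that each single iteration takes:

For a standard search: 250ms / 1500 / 3000 = 56 nanoseconds
For a binary search: 2400ms / 6.27 / 3000 = 127482 nanoseconds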

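The break-even point will be at about:

56 * ((x + 1) / 2 * 3000) = 127482 * ((log₂ x + 1) / 2 * 3000)

which, according to my calculations, is at about 40000 entries.

(*1) Presuming that a hashtable lookup does not do a binary search itself, as it is unaware that the array is already sorted.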
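
The break-even equation above has no closed-form solution, but it can be solved numerically. The sketch below uses plain bisection; the helper name Get-CostGap and the search bounds are assumptions made for this example, and the two cost constants are the per-iteration estimates from the calculation above.

```powershell
# Per-iteration costs in nanoseconds, taken from the estimate above.
$linearCost = 56
$binaryCost = 127482

# Left side minus right side of the break-even equation for array size x.
# The common factor 3000 / 2 cancels out on both sides.
function Get-CostGap([double]$x) {
    $linearCost * ($x + 1) - $binaryCost * ([Math]::Log($x, 2) + 1)
}

# Bisection: the gap is negative at 1000 entries and positive at 1000000.
$lo = 1000.0
$hi = 1000000.0
for ($i = 0; $i -lt 60; $i++) {
    $mid = ($lo + $hi) / 2
    if ((Get-CostGap $mid) -lt 0) { $lo = $mid } else { $hi = $mid }
}

$breakEven = [Math]::Round($hi)
"Break-even at roughly $breakEven entries"
```

With these inputs the crossover lands in the high 30000s, which is in line with the rough figure of 40000 entries given above.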

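Added 2017-11-07

Conclusion from the comments: hashtables appear to use similar associative-array algorithms, which cannot be outperformed with low-level programming commands.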

Answer 4

Complete code for the best option (by @ansgar-wiechers). Lines unique to A, lines unique to B, and lines shared by A and B:

$afile = Get-Content (Read-Host "Enter 'A' file")
$ahash = @{}
$afile | ForEach-Object {
    $ahash[$_] = $true
}

$bfile = Get-Content (Read-Host "Enter 'B' file")
$bhash = @{}
$bfile | ForEach-Object {
    $bhash[$_] = $true
}

$afile | Where-Object {
    -not $bhash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_A.txt'

$bfile | Where-Object {
    -not $ahash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_B.txt'

$afile | Where-Object {
    $bhash.ContainsKey($_)
} | Set-Content 'lines_in_BOTH_A_and_B.txt'
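
A possible variation on the hashtable code is to use .NET HashSet objects and their set operations. This is a different technique from the one in the answer above, shown only as a sketch: it assumes PowerShell 5 or later (for ::new), uses made-up sample data instead of Get-Content, and, unlike the hashtable version, drops duplicate lines from the output.

```powershell
# Hypothetical sample data standing in for the Get-Content results.
[string[]]$afile = 'apple', 'banana', 'cherry', 'banana'
[string[]]$bfile = 'banana', 'cherry', 'date'

# HashSet[string] is case-sensitive by default; this comparer mirrors the
# case-insensitive keys of a PowerShell @{} hashtable.
$comparer = [System.StringComparer]::OrdinalIgnoreCase

# Lines only in A: start with A, remove everything that occurs in B.
$onlyA = [System.Collections.Generic.HashSet[string]]::new($afile, $comparer)
$onlyA.ExceptWith($bfile)

# Lines only in B: start with B, remove everything that occurs in A.
$onlyB = [System.Collections.Generic.HashSet[string]]::new($bfile, $comparer)
$onlyB.ExceptWith($afile)

# Lines in both: start with A, keep only what also occurs in B.
$both = [System.Collections.Generic.HashSet[string]]::new($afile, $comparer)
$both.IntersectWith($bfile)

"Only in A: $($onlyA -join ', ')"
"Only in B: $($onlyB -join ', ')"
"In both:   $($both -join ', ')"
```

Each set can then be piped to Set-Content just like the filtered arrays in the code above. If duplicate lines must be preserved in the output files, the hashtable version remains the better fit.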
