Question

I have two files, file1 and file2. I need to check whether every entry in file1 appears somewhere in file2. The contents of file1 are:

ABC1234
BFD7890

The contents of file2 are:

ABC1234_20180902_XYZ
BFD7890_20110890_123

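The lines are in no particular order, and they cannot be split on a delimiter because the delimiter differs from line to line. The only thing I need to confirm is that each string from file1 appears somewhere in a line of file2. The same pattern never occurs twice.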

Both files contain more than 20,000 lines.

This is what I have so far:

$filesfromDB   = gc file1.txt
$filesfromSFTP = gc file2.txt
foreach ($f in $filesfromDB) {
    $FilePresentStatus = $filesfromSFTP | Select-String -Quiet -Pattern $f
    if ($FilePresentStatus -ne $true) {
        $MissingFiles += $f
    }
}

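This works fine when the files are small, but it is really slow when I run it in prod. The loop takes about 4 hours to complete. How can I optimize this script?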

Answer 1

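20,000 is not many, but in the worst case you have to perform 20,000 x 20,000 = 400,000,000 operations. The key is to stop each search as early as possible. You can also use the faster [string].Contains method instead of Select-String, which is regex-based unless you use the -SimpleMatch switch.

See the following demo: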

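# Stand-in for file1: a range of numeric IDs.
$db =   1000000..1020000
# Stand-in for file2: the lines joined into one big string.
$sftp = (1001000..1021000 | % { "$($_)_SomeNotImportantTextHere" }) -join "`r`n"

# Keep the IDs that do not occur anywhere in the big string.
$missingFiles = $db | where { !$sftp.Contains([string]$_) }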

Each collection contains about 20,000 items; about 19,000 are common to both and 1,000 exist only in $db. It runs in a few seconds.

To read $filesfromSFTP as one big string, use:

gc file2.txt -Raw

To convert the result to a single string, use $missingFiles -join 'separator'.
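
Putting the pieces of this answer together for the files in the question, a minimal sketch might look like the following. It assumes file1.txt and file2.txt sit in the current directory and that a case-sensitive match is acceptable (String.Contains is case-sensitive by default). The two sample files, including the extra ZZZ0000 entry, are made up here only so the snippet runs on its own.

```powershell
# Hypothetical sample data so the sketch is self-contained; in practice the files already exist.
Set-Content -Path file1.txt -Value 'ABC1234', 'BFD7890', 'ZZZ0000'
Set-Content -Path file2.txt -Value 'BFD7890_20110890_123', 'ABC1234_20180902_XYZ'

# Read file1 line by line and file2 as one big string.
$filesfromDB   = Get-Content -Path file1.txt
$filesfromSFTP = Get-Content -Path file2.txt -Raw

# Keep every non-empty file1 entry that does not occur anywhere in file2.
$MissingFiles = @($filesfromDB | Where-Object { $_ -and -not $filesfromSFTP.Contains($_) })

# With the sample data above, only ZZZ0000 is listed.
$MissingFiles
```

Wrapping the pipeline in @( ) keeps $MissingFiles an array even when zero or one entry is missing, so $MissingFiles.Count behaves the same in every case.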

Answer 2

I think your problem is the += operator. Try the approach described here: https://powershell.org/2013/09/16/powershell-performance-the-operator-and-when-to-avoid-it/
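
As a rough illustration of this suggestion: += on an array builds a new array and copies the existing elements each time it is called, so it gets slower as the array grows. Two common alternatives are sketched below, assuming PowerShell 5 or later for the ::new() syntax. The sample values and the $present string are made up for the example and stand in for the real file contents.

```powershell
# Hypothetical sample data; $present stands in for the contents of file2.
$filesfromDB = 'ABC1234', 'BFD7890', 'ZZZ0000', 'YYY1111'
$present     = 'ABC1234_20180902_XYZ', 'BFD7890_20110890_123' -join "`n"

# Option 1: let PowerShell collect whatever the loop emits.
$MissingFiles = @(foreach ($f in $filesfromDB) {
    if (-not $present.Contains($f)) { $f }
})

# Option 2: append to a generic list, which does not copy all elements on every add.
$MissingList = [System.Collections.Generic.List[string]]::new()
foreach ($f in $filesfromDB) {
    if (-not $present.Contains($f)) { $MissingList.Add($f) }
}
```

Both variants end up with the same two entries (ZZZ0000 and YYY1111); the first is usually the shorter one to write, while the list is handy when items are added from several places.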

Answer 3

Using a hashtable, the code below takes about 15 minutes on my laptop with two files of 20,000 lines each.

$filesfromDB   = gc file1.txt
$filesfromSFTP = gc file2.txt
$MissingFiles  = @()
$hashtbl       = @{}

foreach ($f in $filesfromDB) {
    $hashtbl."Line$($f.ReadCount)"=[regex]$f
}

foreach ($key in $hashtbl.Keys) {
    $FilePresentStatus = $hashtbl[$key].Matches($filesfromSFTP)
    if ($FilePresentStatus.Count -eq 0) {
        $MissingFiles += $hashtbl[$key].ToString()
    }
}

Optimizing a loop through file contents

3 answers: