Question

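I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in yyyy-MM-dd hh:mm:ss format, so no special processing is needed for the sort itself).

The simplest and most obvious thing that comes to mind is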

 Get-Content unsorted.txt | sort | get-unique > sorted.txt

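I am guessing (without having tried it) that doing this with Get-Content will take forever on my 1 GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious whether an efficient solution could be put together using it.

Thanks to anyone who has a more efficient idea.

[edit]

I tried this afterwards, and it took a very long time: roughly 10 minutes for 400 MB.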

Answer 1

Get-Content is terribly inefficient for reading large files. Sort-Object is not very fast either.

Let's set up a baseline:

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

With a 40 MB file of 1.6 million lines (made up of 100k unique lines repeated 16 times), this script produces the following output on my machine:

Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663

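Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes the duplicates, then copy the data to a List and sort it there, then use a StreamWriter to dump the results.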

$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $t = $hs.Add($line)
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
    $f = New-Object System.IO.StreamWriter "d:\result2.txt";
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

This script produces:

read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802

On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.

Answer 2

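(Edited for clarity based on the comments)

It might be a memory issue. Since you are loading the entire file into memory to sort it (and adding the overhead of the pipe into Sort-Object and the pipe into Get-Unique), you may be hitting the machine's memory limit and forcing it to page to disk, which slows things down a lot. One thing you might consider is splitting the logs up before sorting them, and then splicing them back together.

This probably won't match your format exactly, but if I have a large log file for, say, 2012-08-16 that spans several hours, I can split it into one file per hour using something like this: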

# Zero-pad the hour so that 0-9 match "00".."09" in the timestamp
for($i=0; $i -le 23; $i++){ $h = '{0:d2}' -f $i; Get-Content .\u_ex120816.log | ? { $_ -match "^2012-08-16 $h`:" } | Set-Content -Path "$i.log" }

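This builds a regular expression for each hour of that day and dumps all the matching log entries into a smaller log file named after the hour (e.g. 16.log, 17.log).

Then I can run your sort-and-unique process on these much smaller subsets, which should run a lot faster: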

 for($i=0; $i -le 23; $i++){ Get-Content "$i.log" | sort | get-unique > "${i}sorted.txt" }

Then you can splice them back together.
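The splicing step could look roughly like the sketch below. It assumes the per-hour files hold disjoint time ranges and are each already sorted, so concatenating them in hour order yields a globally sorted file. The function name Join-SortedChunk, its parameters, and the output name sorted.txt are made up for illustration.

```powershell
# Hypothetical helper: concatenate already-sorted chunk files in the given order.
function Join-SortedChunk {
    param(
        [string[]] $ChunkPaths,   # chunk files, listed in chronological order
        [string]   $Destination   # file that receives the combined result
    )

    # Start from a clean output file.
    if (Test-Path -LiteralPath $Destination) {
        Remove-Item -LiteralPath $Destination
    }

    foreach ($chunk in $ChunkPaths) {
        # Hours with no log entries may have produced no file at all.
        if (Test-Path -LiteralPath $chunk) {
            Get-Content -LiteralPath $chunk | Add-Content -LiteralPath $Destination
        }
    }
}

# Example call, assuming chunk names of the form 0sorted.txt .. 23sorted.txt:
# $chunks = 0..23 | ForEach-Object { "${_}sorted.txt" }
# Join-SortedChunk -ChunkPaths $chunks -Destination 'sorted.txt'
```

Because identical lines carry identical timestamps, they always land in the same hourly chunk, so de-duplicating each chunk separately should be enough.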

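Depending on how frequent the log entries are, it might make more sense to split them by day or by minute; the main thing is to get them into more manageable chunks for sorting.

Again, this only makes sense if you are hitting the machine's memory limit (or if Sort-Object is using a really inefficient algorithm).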

Answer 3

If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry has the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):

Get-Content unsorted.txt | ForEach-Object {
    # Ignore empty lines; can substitute with [String]::IsNullOrWhiteSpace($_) on PowerShell 3.0 and above
    if (-not [String]::IsNullOrEmpty($_))
    {
        # Split into at most two fields, even if the message itself contains ': '
        [String[]] $fields = $_ -split ': ', 2;

        return New-Object -TypeName 'PSObject' -Property @{
            Timestamp = [DateTime] $fields[0];
            Message   = $fields[1];
        };
    }
} | Sort-Object -Property 'Timestamp', 'Message';

If you are processing the input file for interactive display, you can pipe the above into Out-GridView to view the results. If you need to save the sorted results, you can pipe the above into a further step that turns each entry back into a line of text and writes it to a file.
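One possible way to save the sorted entries is sketched below. It assumes objects with Timestamp and Message properties, as produced by the pipeline above, and rebuilds each line in the yyyy-MM-dd HH:mm:ss: <Message> layout. The function name ConvertTo-LogLine and the output file name are hypothetical.

```powershell
# Hypothetical helper: turn a { Timestamp; Message } object back into a log line.
function ConvertTo-LogLine {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $Entry
    )
    process {
        # Use the invariant culture so ':' is always emitted as the time separator.
        $culture = [System.Globalization.CultureInfo]::InvariantCulture
        $stamp = $Entry.Timestamp.ToString('yyyy-MM-dd HH:mm:ss', $culture)
        "${stamp}: $($Entry.Message)"
    }
}

# Example: append to the end of the sorting pipeline.
# ... | Sort-Object -Property 'Timestamp', 'Message' | ConvertTo-LogLine | Set-Content sorted.txt
```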

Answer 4

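I've grown to hate this part of Windows PowerShell; it is a memory hog on these larger files. One trick is to read the lines with [System.IO.File]::ReadLines:

[System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii

Another trick, seriously, is to just use Linux.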

cat file.txt | sort -u > output.txt

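Linux is so insanely fast at this that it makes me wonder what Microsoft is thinking with this setup.

I understand this may not be feasible in all cases, but if you have a Linux machine, you can copy the 500 MB file to it, sort and de-duplicate it, and copy it back in under a couple of minutes.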

Answer 5

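Get-Content can be faster than you think. In addition to the solutions above, check this code snippet (here $hs is a HashSet[string], as in the StreamReader solution above, and $file is the input path):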

foreach ($block in (get-content $file -ReadCount 100)) {
    foreach ($line in $block){[void] $hs.Add($line)}
}
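For reference, here is a self-contained sketch of the same idea that also sorts the unique lines. The function name Get-UniqueSortedLine and the default block size of 1000 are assumptions for illustration, and the ordinal comparer is one possible choice (it orders timestamp-prefixed lines chronologically, but sorts uppercase before lowercase).

```powershell
# Hypothetical helper: read a file in blocks, drop duplicates, return sorted lines.
function Get-UniqueSortedLine {
    param(
        [string] $Path,
        [int]    $ReadCount = 1000   # lines per block handed over by Get-Content
    )

    $set = New-Object 'System.Collections.Generic.HashSet[string]'

    # With -ReadCount, Get-Content emits arrays of lines instead of single lines,
    # which cuts down the per-line pipeline overhead.
    foreach ($block in (Get-Content -LiteralPath $Path -ReadCount $ReadCount)) {
        foreach ($line in $block) { [void] $set.Add($line) }
    }

    # Copy the unique lines into a list and sort them by ordinal (byte-wise) order.
    $list = New-Object 'System.Collections.Generic.List[string]'
    $list.AddRange($set)
    $list.Sort([System.StringComparer]::Ordinal)

    # Emit the sorted lines one by one.
    $list
}

# Example:
# Get-UniqueSortedLine -Path .\unsorted.txt | Set-Content .\sorted.txt
```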

Answer 6

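There doesn't seem to be a great way to do this in PowerShell, including [IO.File]::ReadLines(), but with either the native Windows sort.exe or the GNU sort.exe, run from cmd.exe, 30 million random numbers can be sorted in about 5 minutes using around 1 GB of RAM. GNU sort automatically breaks the input up into temp files to save RAM. Both commands have options to start the sort at a certain character column. GNU sort can also merge sorted files. See external sorting.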
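To illustrate the merge step that external sorting relies on, here is a minimal PowerShell sketch that merges two already-sorted text files while holding only one line from each in memory. It is not how sort.exe is implemented; the function name Merge-SortedFile and its parameters are hypothetical, and it assumes both inputs were sorted by ordinal comparison.

```powershell
# Hypothetical helper: two-way merge of files that are already sorted (ordinal order).
# Pass full paths: .NET file APIs resolve relative paths against the process
# working directory, which can differ from the current PowerShell location.
function Merge-SortedFile {
    param([string] $PathA, [string] $PathB, [string] $Destination)

    $a = $null; $b = $null; $w = $null
    try {
        $a = [System.IO.File]::OpenText($PathA)
        $b = [System.IO.File]::OpenText($PathB)
        $w = New-Object System.IO.StreamWriter $Destination

        $la = $a.ReadLine()   # ReadLine returns $null at end of file
        $lb = $b.ReadLine()
        while ($null -ne $la -or $null -ne $lb) {
            # Take from A when B is exhausted or A's current line sorts first.
            if ($null -eq $lb)     { $takeA = $true }
            elseif ($null -eq $la) { $takeA = $false }
            else { $takeA = [string]::CompareOrdinal($la, $lb) -le 0 }

            if ($takeA) { $w.WriteLine($la); $la = $a.ReadLine() }
            else        { $w.WriteLine($lb); $lb = $b.ReadLine() }
        }
    }
    finally {
        # Close whatever was successfully opened.
        foreach ($h in $w, $b, $a) { if ($null -ne $h) { $h.Dispose() } }
    }
}
```

This version keeps duplicate lines; dropping them would only require remembering the last line written and skipping equal ones.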

Test file of 30 million lines:

& { foreach ($i in 1..300kb) { get-random } } | set-content file.txt

Then in cmd:

copy file.txt+file.txt file2.txt
copy file2.txt+file2.txt file3.txt
copy file3.txt+file3.txt file4.txt
copy file4.txt+file4.txt file5.txt
copy file5.txt+file5.txt file6.txt
copy file6.txt+file6.txt file7.txt
copy file7.txt+file7.txt file8.txt

Using the GNU sort.exe from http://gnuwin32.sourceforge.net/packages/coreutils.htm. Don't forget the DLLs it depends on: libiconv2.dll and libintl3.dll. In cmd.exe:

.\sort.exe < file8.txt > filesorted.txt

Or the Windows sort.exe in cmd.exe:

sort.exe < file8.txt > filesorted.txt

Sorting very large text files in PowerShell

6 answers: