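The script below works and produces the required output, but it takes a very long time on large XML files (2 GB and above). I'm looking for expert suggestions on how to make it faster, either through multithreading or through any other technique available in a PowerShell script.
Reference article for the logic behind the script below: Parse XML to extract data with grouping in PowerShell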
# Create XML object to load data into
$xml = New-Object -TypeName System.Xml.XmlDocument
# Load in XML file
$xml.Load("test.xml")
# Group XML child nodes by Priority
$groups = $xml.'ABC-FOF-PROCESS'.ChildNodes | Group-Object -Property PRIORITY
# Iterate groups and create PSCustomObject for each grouping
& {
foreach ($group in $groups)
{
[PSCustomObject]@{
PRIORITY = [int]$group.Name
KEY = ($group.Group.KEY | Select-Object -Unique).Count
HITS = $group.Count
}
}
} | Sort-Object -Property PRIORITY -Descending | Out-File -FilePath output.txt
# Pipe output here
Output:
PRIORITY KEY HITS
-------- --- ----
1 1 1
-3 2 2
-14 2 3
XML:
<ABC-FOF-PROCESS>
<H>
<PRIORITY>-14</PRIORITY>
<KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
</H>
<H>
<PRIORITY>-14</PRIORITY>
<KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
</H>
<H>
<PRIORITY>-3</PRIORITY>
<KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
</H>
<H>
<PRIORITY>1</PRIORITY>
<KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
</H>
<H>
<PRIORITY>-3</PRIORITY>
<KEY>4EFR02B4-ADFDAF12-3C123II2-ADAFADFD</KEY>
</H>
<H>
<PRIORITY>-14</PRIORITY>
<KEY>5D2702B2-ECE8F1FB-3CEC3229-5FE4C4BC</KEY>
</H>
</ABC-FOF-PROCESS>
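Answer 0 (score: 2)
If the XML has a fixed format, you can read the file line by line and update the results as you go.
It is not parallel, it is not as robust as using a real XML parser, and it won't win any beauty prizes, but it should be fast.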
$hits = @{} # Hashtable containing number of hits per priority
$keys = @{} # Hashtable containing unique keys per priority
switch -Regex -File $env:temp\test.xml
{
'^\s+<PRIORITY>(?<priority>[-]?\d+)'
{
$currentPriority = $matches.Priority
$hits[$currentPriority] = $hits[$currentPriority]+1
continue
}
'^\s+<KEY>(?<key>[\w-]+)'
{
$currentKey = $matches.Key
if ($null -eq $keys[$currentPriority]) { $keys[$currentPriority] = @{} }
$keys[$currentPriority][$currentKey] = $null
}
}
$hits.GetEnumerator() | % {
[PSCustomObject]@{
PRIORITY = [int]$_.Key
KEY = $keys[$_.Key].Count
HITS = [int]$_.Value
}
} | Sort PRIORITY -Descending
Tested on a 500 MB XML file:
PRIORITY KEY HITS
-------- --- ----
1 1 1000000
-3 2 2000000
-14 2 3000000
$timer
IsRunning Elapsed ElapsedMilliseconds ElapsedTicks
--------- ------- ------------------- ------------
False 00:02:25.7186698 145718 413249113
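The `$timer` value shown above has the shape of a `System.Diagnostics.Stopwatch`, but the answer does not show how it was created. The following is only a sketch of one way such a measurement could be taken. The sample file name and its contents are made up for illustration, and only the hit counting is timed here.

```powershell
# Hypothetical sample file, written to the temp folder for illustration only.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'stopwatch-sample.xml'
@'
<ABC-FOF-PROCESS>
  <H>
    <PRIORITY>-14</PRIORITY>
    <KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
  </H>
  <H>
    <PRIORITY>1</PRIORITY>
    <KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
  </H>
</ABC-FOF-PROCESS>
'@ | Set-Content -Path $path

$hits = @{}   # number of <PRIORITY> lines seen per priority value

# Start the stopwatch, run the work being measured, then stop it.
$timer = [System.Diagnostics.Stopwatch]::StartNew()
switch -Regex -File $path {
    '^\s+<PRIORITY>(?<priority>-?\d+)' {
        $p = $Matches.priority
        $hits[$p] = $hits[$p] + 1   # $null + 1 evaluates to 1 on first sight
    }
}
$timer.Stop()

$timer   # displays IsRunning, Elapsed, ElapsedMilliseconds, ElapsedTicks
```

Wrapping the whole script in `Measure-Command { ... }` would probably give a comparable elapsed time without managing the stopwatch by hand.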
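Answer 1 (score: 1)
I guess this is an example of focusing on a single command (Runtime of Foreach-Object vs Foreach loop) rather than on the complete solution.
In general, I recommend looking at the complete solution rather than at single statements, because the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts.
In your case, I think you overshoot the goal if you have to instantiate a script block and invoke it with the call operator & just because you want to use the foreach statement.
For the small file you supplied, this (using the pipeline with the ForEach-Object cmdlet):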
$groups | ForEach-Object {
[PSCustomObject]@{
PRIORITY = [int]$_.Name
KEY = ($_.Group.KEY | Select-Object -Unique).Count
HITS = $_.Count
}
} | Sort-Object -Property PRIORITY -Descending # | Out-File -FilePath output.txt
generally appears to be faster than this (using the foreach statement and the call operator):
& {
foreach ($group in $groups)
{
[PSCustomObject]@{
PRIORITY = [int]$group.Name
KEY = ($group.Group.KEY | Select-Object -Unique).Count
HITS = $group.Count
}
}
} | Sort-Object -Property PRIORITY -Descending | Out-File -FilePath output.txt
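Because of the nature of the Sort-Object cmdlet (it needs all objects before it can sort them), it has to stall the pipeline in order to reorder them. For the same reason, a multithreaded approach might not make much sense.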
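To check the comparison above on your own machine, both variants can be timed side by side. This is a minimal sketch under stated assumptions: the `$records` collection is made-up sample data standing in for the `<H>` nodes, and the variable names `$viaPipeline`, `$viaStatement`, `$pipelineMs` and `$statementMs` are only illustrative. Timings on such a small sample say little about a 2 GB file.

```powershell
# Made-up sample records standing in for the <H> nodes; not the real data.
$records = foreach ($i in 1..3000) {
    [PSCustomObject]@{
        PRIORITY = ($i % 3) - 1        # yields -1, 0 or 1
        KEY      = "KEY-$($i % 5)"     # five distinct keys overall
    }
}
$groups = $records | Group-Object -Property PRIORITY

# Variant 1: pipeline with the ForEach-Object cmdlet.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$viaPipeline = $groups | ForEach-Object {
    [PSCustomObject]@{
        PRIORITY = [int]$_.Name
        KEY      = ($_.Group.KEY | Select-Object -Unique).Count
        HITS     = $_.Count
    }
} | Sort-Object -Property PRIORITY -Descending
$sw.Stop()
$pipelineMs = $sw.Elapsed.TotalMilliseconds

# Variant 2: foreach statement wrapped in a script block and the call operator.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$viaStatement = & {
    foreach ($group in $groups) {
        [PSCustomObject]@{
            PRIORITY = [int]$group.Name
            KEY      = ($group.Group.KEY | Select-Object -Unique).Count
            HITS     = $group.Count
        }
    }
} | Sort-Object -Property PRIORITY -Descending
$sw.Stop()
$statementMs = $sw.Elapsed.TotalMilliseconds

# Both variants should produce the same rows; only the elapsed time may differ.
"ForEach-Object pipeline: $pipelineMs ms"
"foreach + call operator: $statementMs ms"
```

In both variants Sort-Object only emits its first row after the last input object has arrived, so the sort remains a serial step whichever loop construct feeds it.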