Multithreading a PowerShell script for faster data extraction from large XML files

Posted: 2020-04-22 05:51:44

Tags: powershell

The following script works and produces the desired output, but it takes a very long time to process large XML files (2 GB and above). I'm looking for suggestions from the experts on how to make it faster, whether through multithreading or other techniques in the PowerShell script.

Reference article with more background on the logic of the script below: Parse XML to extract data with grouping in PowerShell

# Create XML object to load data into
$xml = New-Object -TypeName System.Xml.XmlDocument

# Load in XML file
$xml.Load("test.xml")

# Group XML child nodes by Priority
$groups = $xml.'ABC-FOF-PROCESS'.ChildNodes | Group-Object -Property PRIORITY

# Iterate groups and create PSCustomObject for each grouping
& {
    foreach ($group in $groups)
    {
        [PSCustomObject]@{
            PRIORITY = [int]$group.Name
            KEY = ($group.Group.KEY | Select-Object -Unique).Count
            HITS = $group.Count
        }
    }
} | Sort-Object -Property PRIORITY -Descending | Out-File -FilePath output.txt
# Pipe output here

Output:

PRIORITY KEY HITS
-------- --- ----
       1   1    1
      -3   2    2
     -14   2    3

xml:

<ABC-FOF-PROCESS>
<H>
 <PRIORITY>-14</PRIORITY>
 <KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
</H>
<H>
 <PRIORITY>-14</PRIORITY>
 <KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
</H>
<H>
 <PRIORITY>-3</PRIORITY>
 <KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
</H>
<H>
 <PRIORITY>1</PRIORITY>
 <KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
</H>
<H>
 <PRIORITY>-3</PRIORITY>
 <KEY>4EFR02B4-ADFDAF12-3C123II2-ADAFADFD</KEY>
</H>
<H>
 <PRIORITY>-14</PRIORITY>
 <KEY>5D2702B2-ECE8F1FB-3CEC3229-5FE4C4BC</KEY>
</H>
</ABC-FOF-PROCESS>

2 Answers:

Answer 0 (score: 2):

If the XML has a fixed format, you can read the file line by line and tally the results as you go.

It isn't parallel, it isn't as robust as using an XML parser, and it won't win any beauty prizes, but it should be fast.

$hits = @{} # Hashtable containing number of hits per priority
$keys = @{} # Hashtable containing unique keys per priority
switch -Regex -File $env:temp\test.xml
{
    '^\s+<PRIORITY>(?<priority>[-]?\d+)'
    {
        $currentPriority = $matches.Priority
        $hits[$currentPriority] = $hits[$currentPriority]+1
        continue
    }
    '^\s+<KEY>(?<key>[\w-]+)'
    {
        $currentKey = $matches.Key
        if ($keys[$currentPriority] -eq $null) {$keys[$currentPriority] = @{}}
        $keys[$currentPriority][$currentKey] = $null
    }
}

# Convert the hashtables into one object per priority, matching the original output format
$hits.GetEnumerator() | ForEach-Object {
    [PSCustomObject]@{
        PRIORITY = [int]$_.Key
        KEY = $keys[$_.Key].Count
        HITS = [int]$_.Value
    }
} | Sort-Object PRIORITY -Descending

Tested on a 500 MB XML file:

PRIORITY KEY    HITS
-------- ---    ----
       1   1 1000000
      -3   2 2000000
     -14   2 3000000

$timer

IsRunning Elapsed          ElapsedMilliseconds ElapsedTicks
--------- -------          ------------------- ------------
    False 00:02:25.7186698              145718    413249113
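
For reference, an elapsed-time report like the $timer output above can be produced with a System.Diagnostics.Stopwatch. A minimal sketch of how such a measurement might be taken (the wrapping shown here is an assumption, not part of the original answer):

# Start a stopwatch, run the parsing, then inspect the elapsed time.
$timer = [System.Diagnostics.Stopwatch]::StartNew()

# ... run the switch -Regex -File parsing and the $hits/$keys post-processing here ...

$timer.Stop()
$timer   # prints IsRunning, Elapsed, ElapsedMilliseconds and ElapsedTicks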

Answer 1 (score: 1):

I suspect this is an example of focusing on a single command (Runtime of Foreach-Object vs Foreach loop) rather than on the complete solution.

In general, I recommend looking at the whole solution rather than at individual statements, because the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts.

In your case, instantiating a script block and invoking it with the call operator & just so you can use the foreach statement probably defeats the purpose:

For the small file you supplied, this (using the pipeline with ForEach-Object):

$groups | ForEach-Object {
    [PSCustomObject]@{
        PRIORITY = [int]$_.Name
        KEY = ($_.Group.KEY | Select-Object -Unique).Count
        HITS = $_.Count
    }
} | Sort-Object -Property PRIORITY -Descending # | Out-File -FilePath output.txt

usually turns out to be faster than this (using the foreach statement and the call operator):

& {
    foreach ($group in $groups)
    {
        [PSCustomObject]@{
            PRIORITY = [int]$group.Name
            KEY = ($group.Group.KEY | Select-Object -Unique).Count
            HITS = $group.Count
        }
    }
} | Sort-Object -Property PRIORITY -Descending | Out-File -FilePath output.txt
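
To check which variant wins on your own data, both can be timed with Measure-Command. A minimal sketch, assuming $groups has already been built by Group-Object as in the question:

# Time both variants; the results are TimeSpan objects.
$pipelineTime = Measure-Command {
    $groups | ForEach-Object {
        [PSCustomObject]@{
            PRIORITY = [int]$_.Name
            KEY      = ($_.Group.KEY | Select-Object -Unique).Count
            HITS     = $_.Count
        }
    } | Sort-Object -Property PRIORITY -Descending
}

$statementTime = Measure-Command {
    & {
        foreach ($group in $groups)
        {
            [PSCustomObject]@{
                PRIORITY = [int]$group.Name
                KEY      = ($group.Group.KEY | Select-Object -Unique).Count
                HITS     = $group.Count
            }
        }
    } | Sort-Object -Property PRIORITY -Descending
}

"ForEach-Object pipeline : $($pipelineTime.TotalMilliseconds) ms"
"foreach statement + &   : $($statementTime.TotalMilliseconds) ms"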

Because of the nature of the Sort-Object cmdlet (it needs all objects before it can sort them), it has to stall the pipeline to reorder the items; for the same reason, a multithreaded approach probably wouldn't make much sense either.
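
For completeness, a multithreaded variant of the per-group step is possible in PowerShell 7+ with ForEach-Object -Parallel. The sketch below is only an illustration of the point above, not a recommendation from the original answer: the expensive work (loading the XML and Group-Object) still runs single-threaded up front, and Sort-Object still has to wait for every object.

# Illustrative sketch only (PowerShell 7+): parallelize the per-group projection.
# The XML load and Group-Object above remain single-threaded, so gains are limited.
$groups | ForEach-Object -Parallel {
    [PSCustomObject]@{
        PRIORITY = [int]$_.Name
        KEY      = ($_.Group.KEY | Select-Object -Unique).Count
        HITS     = $_.Count
    }
} -ThrottleLimit 4 | Sort-Object -Property PRIORITY -Descending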