In a PowerShell script, I have two datasets with multiple columns. Not all of these columns are shared.
For example, dataset 1:
A B    XY   ZY
- -    --   --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
And dataset 2:
A B    ABC  GH
- -    ---  --
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
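I want to merge these two datasets, specifying which columns act as the key (A and B in my simple case). The expected result is:
A B    XY   ZY   ABC  GH
- -    --   --   ---  --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7           foo7 bar7
8 val8           foo8 bar8
The concept is very similar to a SQL full outer join.
I have been able to write a function that merges the objects successfully. Unfortunately, the computation time grows much faster than the dataset size (roughly quadratically, judging by the timings below).
If I generate my datasets using: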
$dsLength = 10
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
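I get these results:
$dsLength = 10     ==> 33ms (fine)
$dsLength = 100    ==> 89ms (fine)
$dsLength = 1000   ==> 1563ms (acceptable)
$dsLength = 5000   ==> 35764ms (too much)
$dsLength = 10000  ==> 138047ms (too much)
$dsLength = 20000  ==> 573614ms (far too much)
How can I merge the datasets efficiently when they are large (my target is around 20,000 items)?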
Right now, I have defined this function:
function Merge-Objects{
param(
[Parameter(Mandatory=$true)]
[object[]]$Dataset1,
[Parameter(Mandatory=$true)]
[object[]]$Dataset2,
[Parameter()]
[string[]]$Properties
)
$result = @()
$ds1props = $Dataset1 | gm -MemberType Properties
$ds2props = $Dataset2 | gm -MemberType Properties
$ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
$ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
foreach($row1 in $Dataset1){
$result += $row1
$ds2propsNotInDs1Props | % {
$row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
}
foreach($row2 in $Dataset2){
$existing = foreach($candidate in $result){
$match = $true
foreach($prop in $Properties){
if(-not ($row2.$prop -eq $candidate.$prop)){
$match = $false
break
}
}
if($match){
$candidate
break
}
}
if(!$existing){
$ds1propsNotInDs2Props | % {
$row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
$result += $row2
}else{
$ds2propsNotInDs1Props | % {
$existing.$($_.Name) = $row2.$($_.Name)
}
}
}
$result
}
I call the function like this:
Measure-Command -Expression {
$data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
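My feeling is that the slowness comes from the second loop, where I try to find a matching existing row on every iteration.
[Edit] A second approach, using a hash as an index. Surprisingly, it is even slower than the first attempt.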
$dsLength = 1000
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
function Get-Hash{
param(
[Parameter(Mandatory=$true)]
[object]$InputObject,
[Parameter()]
[string[]]$Properties
)
$InputObject | Select-object $properties | Out-String
}
function Merge-Objects{
param(
[Parameter(Mandatory=$true)]
[object[]]$Dataset1,
[Parameter(Mandatory=$true)]
[object[]]$Dataset2,
[Parameter()]
[string[]]$Properties
)
$result = @()
$index = @{}
$ds1props = $Dataset1 | gm -MemberType Properties
$ds2props = $Dataset2 | gm -MemberType Properties
$allProps = $ds1props + $ds2props | select -Unique
$ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
$ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
$ds1index = @{}
foreach($row1 in $Dataset1){
$tempObject = new-object psobject
$result += $tempObject
$ds2propsNotInDs1Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
$ds1props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
}
$hash1 = Get-Hash -InputObject $row1 -Properties $Properties
$ds1index.Add($hash1, $tempObject)
}
foreach($row2 in $Dataset2){
$hash2 = Get-Hash -InputObject $row2 -Properties $Properties
if($ds1index.ContainsKey($hash2)){
# merge object
$existing = $ds1index[$hash2]
$ds2propsNotInDs1Props | % {
$existing.$($_.Name) = $row2.$($_.Name)
}
$ds1index.Remove($hash2)
}else{
# add object
$tempObject = new-object psobject
$ds1propsNotInDs2Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
$ds2props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
}
$result += $tempObject
}
}
$result
}
Measure-Command -Expression {
$data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
[Edit2] Putting Measure-Command around each of the two loops shows that even the first loop is slow. In fact, the first loop takes more than 50% of the total time.
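One possible contributor to the slow first loop, offered as an assumption rather than something measured in the question: PowerShell arrays have a fixed size, so every $result += ... allocates a new array and copies all existing elements. The sketch below compares that pattern with a generic list. The names $n, $viaArray and $viaList are made up for illustration, and the actual timings depend on the machine and the PowerShell version.

```powershell
$n = 10000

# Array append: each += builds a new array and copies the old contents.
$viaArray = Measure-Command {
    $result = @()
    foreach ($i in 1..$n) { $result += $i }
}

# List append: Add() grows the list in place (amortized constant time).
$viaList = Measure-Command {
    $list = [System.Collections.Generic.List[object]]::new()
    foreach ($i in 1..$n) { $list.Add($i) }
}

'{0:N0} ms with array +=, {1:N0} ms with List.Add' -f $viaArray.TotalMilliseconds, $viaList.TotalMilliseconds
```

The per-row Add-Member pipeline calls in the same loop are probably another cost, but this sketch does not isolate them.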
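Answer 0 (score: 2)
I agree with @Matt. Use a hashtable, something like the code below. It should run in m + 2n time rather than mn time.
Timings on my system
The original solution above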
#10 TotalSeconds : 0.07788
#100 TotalSeconds : 0.37937
#1000 TotalSeconds : 5.25092
#10000 TotalSeconds : 242.82018
#20000 TotalSeconds : 906.01584
This definitely looks like O(n^2).
The solution below
#10 TotalSeconds : 0.094
#100 TotalSeconds : 0.425
#1000 TotalSeconds : 3.757
#10000 TotalSeconds : 45.652
#20000 TotalSeconds : 92.918
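This looks linear.
Solution
I used three techniques to increase the speed:
- A hashtable as the lookup index, which removes the nested loop over the result rows.
- An ArrayList to collect the results instead of += on an array.
- PSObject.Copy() to create each output object from the source row.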
function Get-Hash{
param(
[Parameter(Mandatory=$true)]
[object]$InputObject,
[Parameter()]
[string[]]$Properties
)
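# Collect the key values, then join them into a single string key.
$arr = [System.Collections.ArrayList]::new()
# Use Add(): += would silently replace the ArrayList with a new object[] on every iteration.
foreach($p in $Properties) { [void]$arr.Add($InputObject.$p) }
return ( $arr -join ':' )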
}
function Merge-Objects{
param(
[Parameter(Mandatory=$true)]
[object[]]$Dataset1,
[Parameter(Mandatory=$true)]
[object[]]$Dataset2,
[Parameter()]
[string[]]$Properties
)
$results = [System.Collections.ArrayList]::new()
$ds1props = $Dataset1 | gm -MemberType Properties
$ds2props = $Dataset2 | gm -MemberType Properties
$ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
$ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
$hash = @{}
$Dataset2 | % { $hash.Add( (Get-Hash $_ $Properties), $_) }
foreach ($row in $dataset1) {
$key = Get-Hash $row $Properties
$tempObject = $row.PSObject.Copy()
if ($hash.containskey($key)) {
$r2 = $hash[$key]
$hash.remove($key)
$ds2propsNotInDs1Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
}
} else {
$ds2propsNotInDs1Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
}
[void]$results.Add($tempObject)
}
foreach ($row in $hash.values ) {
# add missing dataset2 objects and extend
$tempObject = $row.PSObject.Copy()
$ds1propsNotInDs2Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
[void]$results.Add($tempObject)
}
$results
}
########
$dsLength = 10000
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
Measure-Command -Expression {
$data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
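Answer 1 (score: 0)
I had a lot of doubts about incorporating a hash-based lookup (a hash table) into my Join-Object cmdlet (see also: In Powershell, what's the best way to join two tables into one?), because there are a few issues to overcome that the example in the question conveniently leaves out.
Unfortunately, I can't compete with the performance of @mhhollomon's solution: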
dsLength Steve1 Steve2 mhhollomon Join-Object
-------- ------ ------ ---------- -----------
10 19 129 21 50
100 145 915 158 329
1000 2936 9646 1575 3355
5000 56129 69558 5814 12653
10000 183813 95472 14740 25730
20000 761450 265061 36822 80644
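But I think I can add some value:
The hash keys are strings, which means you need to cast the related properties to strings. That is a little questionable because: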
$Left -eq $Right ≠ "$Left" -eq "$Right"
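In most cases it will work, especially when the source is a .csv file, but it can go wrong, e.g. when the data comes from a cmdlet where $Null really means something other than an empty string (''). Therefore, I recommend explicitly defining a key for $Null, e.g. with a control character.
And since property values can easily contain a colon (:), I would also recommend using a control character to separate (join) multiple keys.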
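A minimal sketch of both recommendations, under stated assumptions: Get-JoinKey is a hypothetical helper (it is not part of Join-Object), and the choice of the ASCII unit separator (0x1F) as delimiter and the substitute character (0x1A) as the $Null marker is arbitrary. A value that happens to contain those characters could still collide.

```powershell
# Hypothetical helper: builds a string key from the given properties.
function Get-JoinKey {
    param(
        [Parameter(Mandatory = $true)] [object]$InputObject,
        [Parameter(Mandatory = $true)] [string[]]$Properties
    )
    $separator = [string][char]0x1F   # control character that joins the key parts
    $nullMark  = [string][char]0x1A   # control character that stands in for $null
    $parts = foreach ($p in $Properties) {
        $value = $InputObject.$p
        if ($null -eq $value) { $nullMark } else { [string]$value }
    }
    $parts -join $separator
}

# $null and '' are different values, but both become '' when cast to a string.
$withNull  = [pscustomobject]@{ A = $null; B = 'x' }
$withEmpty = [pscustomobject]@{ A = '';    B = 'x' }
$sameNullKey = (Get-JoinKey -InputObject $withNull -Properties A, B) -eq
               (Get-JoinKey -InputObject $withEmpty -Properties A, B)

# With ':' as the separator, these two rows would both produce 'x:y:z'.
$row1 = [pscustomobject]@{ A = 'x:y'; B = 'z' }
$row2 = [pscustomobject]@{ A = 'x';   B = 'y:z' }
$sameColonKey = (Get-JoinKey -InputObject $row1 -Properties A, B) -eq
                (Get-JoinKey -InputObject $row2 -Properties A, B)

"null vs empty collide: $sameNullKey; colon rows collide: $sameColonKey"
```

Both comparisons come out False here, whereas a plain string cast joined with ':' would treat each pair as the same key.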
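There is another pitfall with using a hash table, although it does not have to be a problem: what if the left side ($dataset1) and/or the right side ($dataset2) has multiple matches? Take, for example, the following datasets: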
$dataset1 = ConvertFrom-SourceTable '
A B    XY    ZY
- -    --    --
1 val1 foo1  bar1
2 val2 foo2  bar2
3 val3 foo3  bar3
4 val4 foo4  bar4
4 val4 foo4a bar4a
5 val5 foo5  bar5
6 val6 foo6  bar6
'
$dataset2 = ConvertFrom-SourceTable '
A B    ABC   GH
- -    ---   --
3 val3 foo3  bar3
4 val4 foo4  bar4
5 val5 foo5  bar5
5 val5 foo5a bar5a
6 val6 foo6  bar6
7 val7 foo7  bar7
8 val8 foo8  bar8
'
In this case, I would expect a result similar to that of a SQL join, and no "Item has already been added. Key in dictionary" error:
$Dataset1 | FullJoin $dataset2 -On A, B | Format-Table
A B    XY    ZY    ABC   GH
- -    --    --    ---   --
1 val1 foo1  bar1
2 val2 foo2  bar2
3 val3 foo3  bar3  foo3  bar3
4 val4 foo4  bar4  foo4  bar4
4 val4 foo4a bar4a foo4  bar4
5 val5 foo5  bar5  foo5  bar5
5 val5 foo5  bar5  foo5a bar5a
6 val6 foo6  bar6  foo6  bar6
7 val7             foo7  bar7
8 val8             foo8  bar8
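One way to get this behaviour from a hash table is to store a list of rows per key instead of a single row. The following is only a sketch of that idea, not the Join-Object implementation: Join-FullOuter and all of its internals are hypothetical names, and it assumes that every row on a side has the same properties as the first row on that side.

```powershell
# Hypothetical full outer join that tolerates duplicate keys on both sides.
function Join-FullOuter {
    param(
        [Parameter(Mandatory = $true)] [object[]]$Left,
        [Parameter(Mandatory = $true)] [object[]]$Right,
        [Parameter(Mandatory = $true)] [string[]]$On
    )
    $sep        = [string][char]0x1F
    $leftNames  = @($Left[0].PSObject.Properties.Name)
    $rightNames = @($Right[0].PSObject.Properties.Name)
    $allNames   = $leftNames + @($rightNames | Where-Object { $_ -notin $leftNames })

    # Builds the string key for one row.
    $getKey = { param($o) @(foreach ($p in $On) { [string]$o.$p }) -join $sep }
    # Builds one output row; either side may be $null.
    $newRow = {
        param($l, $r)
        $row = [ordered]@{}
        foreach ($n in $allNames) {
            if ($null -ne $l -and $n -in $leftNames) { $row[$n] = $l.$n }
            elseif ($null -ne $r -and $n -in $rightNames) { $row[$n] = $r.$n }
            else { $row[$n] = $null }
        }
        [pscustomobject]$row
    }

    # Index the right side: key -> list of all rows sharing that key.
    $index = @{}
    foreach ($r in $Right) {
        $key = & $getKey $r
        if (-not $index.ContainsKey($key)) {
            $index[$key] = [System.Collections.Generic.List[object]]::new()
        }
        $index[$key].Add($r)
    }

    # Left rows: one output row per matching right row, or one unmatched row.
    $seen = @{}
    foreach ($l in $Left) {
        $key = & $getKey $l
        if ($index.ContainsKey($key)) {
            $seen[$key] = $true
            foreach ($r in $index[$key]) { & $newRow $l $r }
        } else {
            & $newRow $l $null
        }
    }
    # Right rows whose key never appeared on the left.
    foreach ($r in $Right) {
        if (-not $seen.ContainsKey((& $getKey $r))) { & $newRow $null $r }
    }
}

$left = @(
    [pscustomobject]@{ A = 1; B = 'val1'; XY = 'foo1';  ZY = 'bar1'  }
    [pscustomobject]@{ A = 4; B = 'val4'; XY = 'foo4';  ZY = 'bar4'  }
    [pscustomobject]@{ A = 4; B = 'val4'; XY = 'foo4a'; ZY = 'bar4a' }
    [pscustomobject]@{ A = 5; B = 'val5'; XY = 'foo5';  ZY = 'bar5'  }
)
$right = @(
    [pscustomobject]@{ A = 4; B = 'val4'; ABC = 'foo4';  GH = 'bar4'  }
    [pscustomobject]@{ A = 5; B = 'val5'; ABC = 'foo5';  GH = 'bar5'  }
    [pscustomobject]@{ A = 5; B = 'val5'; ABC = 'foo5a'; GH = 'bar5a' }
    [pscustomobject]@{ A = 7; B = 'val7'; ABC = 'foo7';  GH = 'bar7'  }
)
$joined = Join-FullOuter -Left $left -Right $right -On A, B
$joined | Format-Table
```

Because assigning to an existing key replaces the list lookup rather than calling Add() on the hashtable itself, duplicate keys never raise the "Item has already been added" error.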
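As you might have figured out, there is no reason to put both sides in a hash table; instead, you might consider streaming the left side (rather than blocking the input). In the example in the question, both datasets are loaded directly into memory, which is hardly a typical use case. More commonly, your data comes from somewhere else, e.g. remotely from Active Directory, in which case each incoming object can be looked up in the hash table while the next object is still being retrieved. The same goes for the cmdlet that follows in the pipeline: it can start processing the output right away and does not have to wait until your cmdlet is finished (note that data is released from the Join-Object cmdlet as soon as it is ready). In such a case, measuring the performance with Measure-Command requires a completely different approach...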
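To illustrate what streaming the left side can look like, here is a rough sketch, again not the Join-Object implementation. Join-LeftStreaming is a hypothetical function that indexes the right side once in its begin block and then emits each left object from the process block as soon as it arrives. For brevity it performs a left join only, keeps just the last right row per key, and leaves unmatched left rows without the right-hand columns.

```powershell
# Hypothetical streaming left join.
function Join-LeftStreaming {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)] [object]$InputObject,
        [Parameter(Mandatory = $true)] [object[]]$Right,
        [Parameter(Mandatory = $true)] [string[]]$On
    )
    begin {
        # Runs once, before the first pipeline object arrives.
        $sep   = [string][char]0x1F
        $index = @{}
        foreach ($r in $Right) {
            $key = @(foreach ($p in $On) { [string]$r.$p }) -join $sep
            $index[$key] = $r   # last row wins; duplicates are ignored in this sketch
        }
    }
    process {
        # Runs once per incoming left object and emits the result immediately.
        $key = @(foreach ($p in $On) { [string]$InputObject.$p }) -join $sep
        $row = [ordered]@{}
        foreach ($prop in $InputObject.PSObject.Properties) { $row[$prop.Name] = $prop.Value }
        if ($index.ContainsKey($key)) {
            foreach ($prop in $index[$key].PSObject.Properties) {
                if (-not $row.Contains($prop.Name)) { $row[$prop.Name] = $prop.Value }
            }
        }
        [pscustomobject]$row
    }
}

$right = 5..15 | ForEach-Object {
    [pscustomobject]@{ A = $_; B = "val$_"; ABC = "foo$_"; GH = "bar$_" }
}
$left = 0..10 | ForEach-Object {
    [pscustomobject]@{ A = $_; B = "val$_"; XY = "foo$_"; ZY = "bar$_" }
}

# Select-Object -First stops the pipeline after three rows, so the join
# does not need to process the remaining left objects.
$firstThree = $left | Join-LeftStreaming -Right $right -On A, B | Select-Object -First 3
$all        = $left | Join-LeftStreaming -Right $right -On A, B
```

Since output appears while input is still arriving, a single Measure-Command around the whole pipeline mostly measures the slowest stage rather than the join itself.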
See also: Computer Programming: Is the PowerShell pipeline sequential mode more memory efficient? Why or why not?