Question

I have large XML files (in the "ONIX" standard) that I would like to split. The basic structure is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<!-- DOCTYPE is not always present and might look differently -->
<ONIXmessage> <!-- sometimes with an attribute -->
<header>
...
</header> <!-- up to this line every out-file should be identical to source -->
<product> ... </product>
<product> ... </product>
...
<product> ... </product>
</ONIXmessage>

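What I want to do is split this file into n smaller files of roughly equal size. To do that, I would count the <product> nodes, divide that number by n, and clone the nodes into n new XML files. I have searched a lot, and the task seems harder than I expected.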

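What I have not been able to solve so far is cloning a new XML document that has the same XML declaration, doctype, root element, and <header> node, but no <product>s. I could do this with a regular expression, but I would rather use XML tools.
What is the smartest way to transfer a number of <product> nodes to a new XML document? Object notation such as $xml.ONIXmessage.product | % { copy... }, XPath() queries (can you select n nodes with XPath()?) together with CloneNode(), or XMLReader?
The content of the nodes should stay identical in formatting and encoding. How can I make sure of that?

I would be very grateful for a nudge in the right direction!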

Answer 1

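One way is to:

Make a copy of the XML file.
Remove all product nodes from the copy.
Use a loop to copy one product at a time from the original file into one of the copies.
When the per-file product limit is reached, save the current file (copy) and create a new one.

Example: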

param($path, [int]$maxitems)

$file = Get-ChildItem $path

################

#Read file
$xml = [xml](Get-Content -Path $file.FullName | Out-String)
$product = $xml.SelectSingleNode("//product")
$parent = $product.ParentNode

#Create copy-template
$copyxml = [xml]$xml.OuterXml
$copyproduct = $copyxml.SelectSingleNode("//product")
$copyparent = $copyproduct.ParentNode
#Remove all but one product (to know where to insert new ones)
$copyparent.SelectNodes("product") | Where-Object { $_ -ne $copyproduct } | ForEach-Object { $copyparent.RemoveChild($_) } > $null

$allproducts = @($parent.SelectNodes("product"))
$totalproducts = $allproducts.Count

$fileid = 1
$i = 0

foreach ($p in $allproducts) {
    #If at the beginning or the current file is full, create a new file
    if($i % $maxitems -eq 0) {
        #Create copy of file
        $newFile = [xml]($copyxml.OuterXml)
        #Get parentnode
        $newparent = $newFile.SelectSingleNode("//product").ParentNode
        #Remove all products
        $newparent.SelectNodes("product") | ForEach-Object { $newparent.RemoveChild($_) } > $null
    }

    #Copy productnode
    $cur = $newFile.ImportNode($p,$true)
    $newparent.AppendChild($cur) > $null

    #Add 1 to "items moved"
    $i++ 

    #IF Full file, save
    if(($i % $maxitems -eq 0) -or ($i -eq $totalproducts)) {
        $newfilename = $file.FullName.Replace($file.Extension,"$fileid$($file.Extension)")
        $newFile.Save($newfilename)
        $fileid++
    }

}
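
The script above takes a per-file limit ($maxitems), while the question asks for n files. A minimal sketch of the conversion, assuming a helper named Get-MaxItems and a script file named Split-Xml.ps1 (both names are hypothetical and not part of the original script):

```powershell
# Hypothetical helper: turn a desired file count into a per-file product limit.
function Get-MaxItems {
    param([int]$TotalProducts, [int]$FileCount)
    if ($FileCount -lt 1) { throw "FileCount must be at least 1" }
    # Round up so that at most $FileCount files are written;
    # never return 0, because the script computes "$i % $maxitems".
    [math]::Max(1, [int][math]::Ceiling($TotalProducts / $FileCount))
}

Get-MaxItems -TotalProducts 16001 -FileCount 17   # 942

# Possible call, assuming the script was saved as Split-Xml.ps1 (hypothetical name):
# .\Split-Xml.ps1 -path .\onix.xml -maxitems (Get-MaxItems -TotalProducts 16001 -FileCount 17)
```

Rounding up means the last file may hold fewer products than the others, and for some combinations fewer than n files are written.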

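UPDATE: Since performance matters here, I created a new version of the script that uses a foreach loop and an XML template for the copies, which removes 99% of the read and delete operations. The concept is still the same, but it is executed differently.

Benchmark: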

10 items, 3 per xml OLD solution: 0.0448831 seconds
10 items, 3 per xml NEW solution: 0.0138742 seconds
16001 items, 1000 per xml OLD solution: 73.1934346 seconds
16001 items, 1000 per xml NEW solution: 5.337443 seconds
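
On the question's point about keeping node content identical in formatting and encoding, here is a minimal sketch, not taken from the script above. It assumes that whitespace inside <product> matters and that the external DTD should not be downloaded; the sample document and its element names are made up for illustration. If a file relies on entities declared in the DTD, loading without a resolver may fail.

```powershell
# Made-up sample standing in for a real ONIX file
$source = @'
<?xml version="1.0" encoding="utf-8"?>
<ONIXmessage>
<header><m174>Sender</m174></header>
<product>  <a001>1</a001>
</product>
</ONIXmessage>
'@

$xml = New-Object System.Xml.XmlDocument
$xml.PreserveWhitespace = $true   # must be set before loading, otherwise insignificant whitespace is dropped
$xml.XmlResolver = $null          # do not resolve an external DTD, if one is declared
$xml.LoadXml($source)
# For a file on disk, $xml.Load($path) lets the XML parser detect the
# encoding from the declaration instead of relying on Get-Content.

$product = $xml.SelectSingleNode('//product')
$product.OuterXml                 # the two spaces after <product> are still there
```

As far as I know, XmlDocument.Save() with a file path writes using the encoding named in the XML declaration, so a document loaded this way should be saved in the same encoding as the source.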

Answer 2

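Just an idea for you to consider; it is neither tested nor complete:

Import the XML into an array. Divide array.count by n, then loop over the array, exporting to new XML files. You may have to create n arrays before exporting.

For example, using the Import-Clixml and Export-Clixml cmdlets.

This assumes that all XML nodes are the same object type.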
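
The "divide by n" step of this idea could be sketched as follows. Split-IntoGroups is a hypothetical helper name, and the integers stand in for product nodes, which could be collected with something like @($xml.SelectNodes('//product')). Note that Import-Clixml reads files in the CLIXML format written by Export-Clixml, so it is unlikely to load an ONIX file directly.

```powershell
# Hypothetical helper: split an array into consecutive groups of near-equal size
function Split-IntoGroups {
    param([object[]]$Items, [int]$Count)
    if ($Count -lt 1) { throw "Count must be at least 1" }
    $size = [int][math]::Ceiling($Items.Count / $Count)
    $groups = @()
    for ($start = 0; $start -lt $Items.Count; $start += $size) {
        $end = [math]::Min($start + $size, $Items.Count) - 1
        # The leading comma adds the slice as one nested array instead of flattening it
        $groups += ,@($Items[$start..$end])
    }
    # The leading comma stops PowerShell from unrolling the outer array on return
    return ,$groups
}

$groups = Split-IntoGroups -Items (1..10) -Count 3
$groups.Count          # 3
$groups[0] -join ','   # 1,2,3,4
```

Because the group size is rounded up, the last group can be smaller, and some inputs yield fewer than n groups (10 items with n = 7 gives 5 groups of 2).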

How to split an XML file into smaller files using PowerShell
