Question

I know how to get a node's path with DOMDocument:

$dom = new DOMDocument;

$dom->loadXML('<fruits><fruit><name>Apple</name><name>Banana</name></fruit></fruits>');

foreach($dom->getElementsByTagName('*') as $node){
    // e.g. $node->getNodePath();
};

My problem is: I need to get all nodes plus the number of times each one occurs in the file, and my files are very large.

An example file is:

<products>
    <product>
        <properties>
            <property></property>
            <property></property>
        </properties>
    </product>
    ...
</products>

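The node <products> occurs once (because it is the root node)
The node <product> occurs 60,000 times
The node <property> occurs 120,000 times (2 per product)

Warning: because every file is different, I don't know the name of the root node! (In this example it is <products>, but it could be something else.) To get the name of the root node, I use the following code: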

$simpleXML = simplexml_load_file(<-- filename goes here -->);
$root = $simpleXML->getName();

I found this repository: https://github.com/dkrnl/SimpleXMLReader

Then I use this code:

$reader = new SimpleXMLReader;

$reader->open(<!-- filename goes here -->);

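$nodes = array();
$counter = 0;

// The closure needs $nodes and $counter by reference to update them.
$reader->registerCallback($root,function($reader) use (&$nodes,&$counter){

    $xml = $reader->expandDomDocument();

    foreach($xml->childNodes as $child){

        list($nodes,$counter) = getChildrenOfAllNodes($child,$nodes,$counter);

    }

});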

$reader->parse();

$reader->close();

This is my "getChildrenOfAllNodes" function:

    function getChildrenOfAllNodes(DOMNode $node,$nodes,$counter){

        foreach($node->childNodes as $child){

            if($child->hasChildNodes()){

                list($nodes,$counter) = getChildrenOfAllNodes($child,$nodes,$counter);

            };

            if(strpos($child->nodeName,'#') === false){

                if(array_key_exists($child->nodeName,$nodes)){

                    $nodes[$child->nodeName]['count'] += 1;

                    $nodes[$child->nodeName]['path'] = $child->getNodePath();

                }else{

                    $nodes[$child->nodeName] = array(
                        'name'  => $child->nodeName,
                        'path'  => $child->getNodePath(),
                        'count' => 1
                    );

                }

                $counter++;

            };

        };

        return array($nodes,$counter);

    };

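It works for files with about 1,000 nodes, but for files with more than 1,000 nodes it just keeps on processing.

My question is: is there a (better) solution (than this one) for getting all names + node paths in an XML file when the files are very large?

Thanks!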

Answer 1

XMLReader is the way to go. But you should not expand the whole document (which is what happens in your example).
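To illustrate reading without expanding anything, here is a minimal sketch that only counts element names while streaming. The inline `$xml` string is a hypothetical stand-in for a large file, and `$counts` is a name introduced for this example; with a real file you would presumably call `$reader->open($filename)` instead of `$reader->XML($xml)`.

```php
<?php
// Hypothetical sample data standing in for a very large file.
$xml = '<products>'
     . '<product><properties><property/><property/></properties></product>'
     . '<product><properties><property/><property/></properties></product>'
     . '</products>';

$reader = new XMLReader();
$reader->XML($xml); // assumption: swap for $reader->open($filename) on real data

$counts = []; // element name => number of occurrences

// read() moves node by node; nothing is expanded into DOM here,
// so memory use does not grow with the number of records.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $counts[$reader->name] = ($counts[$reader->name] ?? 0) + 1;
    }
}
$reader->close();

print_r($counts);
```

Because only a small array of counters is kept, this should stay cheap even for files with hundreds of thousands of records, though I have not benchmarked it.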

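Use XMLReader::read() and XMLReader::next() to navigate to the node that represents your record (product). Expand that node into DOM, then use DOM methods/XPath to get the data and DOMNode::getNodePath() to get the partial node path.

Then prefix that path manually with the outer structure, adapting it as needed.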

$reader = new XMLReader();
$reader->open('php://stdin');

$document = new DOMDocument();
$xpath= new DOMXpath($document);

while ($reader->read() and $reader->localName != 'fruit') { 
}

if ($reader->localName == 'fruit') {
  $counter = 0;
  do {
    $fruit = $reader->expand($document);
    $counter++;
    foreach ($xpath->evaluate('name', $fruit) as $name) {
      var_dump(
        [ 
          'name' => $name->textContent,
          'local_path' => $name->getNodePath(),
          // $1 is the record element name captured from the local path
          'path' => preg_replace(
            '(^/(\w+))', '/fruits/$1['.$counter.']', $name->getNodePath()
          )  
        ]
      );
    }
  } while ($reader->next('fruit'));
}

Output:

array(3) {
  ["name"]=>
  string(5) "Apple"
  ["local_path"]=>
  string(14) "/fruit/name[1]"
  ["path"]=>
  string(24) "/fruits/fruit[1]/name[1]"
}
array(3) {
  ["name"]=>
  string(6) "Banana"
  ["local_path"]=>
  string(14) "/fruit/name[2]"
  ["path"]=>
  string(24) "/fruits/fruit[1]/name[2]"
}

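If you don't know the node names themselves, you have to iterate over the structure, check the node type, and store the node names you find in variables.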

$nodeNames = [
  'list' => NULL,
  'item' => NULL
];
while ($reader->read()) {
  if ($reader->nodeType == XML_ELEMENT_NODE) {
    if (NULL === $nodeNames['list']) {
      // first element: the root (list) node
      $nodeNames['list'] = $reader->localName;
    } else {
      // second element: the record (item) node; stop here so the
      // reader stays positioned on it for the loop below
      $nodeNames['item'] = $reader->localName;
      break;
    }
  }
}

var_dump($nodeNames);
if ($reader->nodeType == XML_ELEMENT_NODE && $reader->localName == $nodeNames['item']) {
  $counter = 0;
  do {
    $item = $reader->expand($document);
    var_dump($item->getNodePath());
  } while ($reader->next($nodeNames['item']));
}
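For the goal stated in the question (every node name with a path and a count), a sketch along the following lines might be enough on its own. It keeps a stack of element names indexed by `XMLReader::$depth` and builds an index-free path such as /fruits/fruit/name for each element, so the root name never has to be known in advance. The inline `$xml` and the names `$stack` and `$paths` are assumptions made for this example, not part of any library.

```php
<?php
// Hypothetical sample; for a real file use $reader->open($filename).
$xml = '<fruits><fruit><name>Apple</name><name>Banana</name></fruit></fruits>';

$reader = new XMLReader();
$reader->XML($xml);

$stack = []; // depth => name of the element currently open at that depth
$paths = []; // generic path (no positional indexes) => occurrence count

while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue; // skip text, whitespace, comments, end tags
    }
    $depth = $reader->depth;            // the root element has depth 0
    $stack[$depth] = $reader->name;     // overwrite any older sibling branch
    // Only the first $depth + 1 entries belong to the current ancestry.
    $path = '/' . implode('/', array_slice($stack, 0, $depth + 1));
    $paths[$path] = ($paths[$path] ?? 0) + 1;
}
$reader->close();

print_r($paths);
```

Keying on the path rather than the bare name keeps same-named elements in different places apart; if only names matter, the last path segment can be used as the key instead.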
