Question

I have an xml file that contains multiple declarations, as shown below

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node>
  <element1>Stefan</element1>
  <element2>42</element2>
  <element3>Shirt</element3>
  <element4>3000</element4>  
</node>
</root>

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node>
  <element1>Damon</element1>
  <element2>32</element2>
  <element3>Jeans</element3>
  <element4>4000</element4>  
</node>
</root>

When I try to load the xml using

$data = simplexml_load_file("testdoc.xml") or die("Error: Cannot create object");

it gives me the following error

Warning: simplexml_load_file(): testdoc.xml:11: parser error : XML declaration allowed only at the start of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3

Warning: simplexml_load_file(): <?xml version="1.0" encoding="UTF-8"?> in C:\xampp\htdocs\crea\services\testxml.php on line 3

Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3

Warning: simplexml_load_file(): testdoc.xml:12: parser error : Extra content at the end of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3

Warning: simplexml_load_file(): <root> in C:\xampp\htdocs\crea\services\testxml.php on line 3

Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Error: Cannot create object

Please let me know how to parse this xml, or how to split it into a number of xml files so that I can read it. The file size is around 1 GB.

Answer 1

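The second

<?xml version="1.0" encoding="UTF-8"?>

declaration needs to be removed. A file may contain only one XML declaration, and it must come at the very start of the document.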

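Strictly speaking, you also need a single root element (though I have seen lenient parsers). Just wrap the content in a synthetic tag so that your file looks like: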

<?xml version="1.0" encoding="UTF-8"?>
<metaroot><!-- synthetic unique root, no semantics attached -->
    <root>
        <!-- ... -->
    </root>
    <root>
        <!-- ... -->
    </root>

    <!-- ... -->
</metaroot>

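Solution for (very) large files:

Use sed to eliminate the offending XML declarations and printf to add a single XML declaration and a unique root element. The sequence of bash commands is as follows:

  printf "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<metaroot>\n" >out.xml
  sed '/<?xml /d' in.xml >>out.xml
  printf "\n</metaroot>\n" >>out.xml

in.xml denotes your original file, out.xml the cleaned-up result.

printf prints the single XML declaration and the opening/closing tags. sed is a tool that edits a file line by line and performs actions based on regular-expression matches. The pattern to match is the start of an XML declaration (<?xml ) and the action to perform is to delete that line. This assumes each declaration sits on a line of its own, as in your sample.

Notes:

The backslashes in the printf commands escape the double quotes inside the double-quoted strings and introduce the \n newlines. The ? in the sed pattern needs no backslash: in a basic regular expression it is a literal character, whereas GNU sed reads \? as a "zero or one" quantifier, which would make the pattern match any line containing "xml ".
sed is also available for Windows / macOS.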
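Before running this against the 1 GB file, the cleanup can be tried on a small sample. The following is a minimal sketch: the scratch directory, the variable name workdir and the shortened two-document sample are assumptions made for the demo, and the pattern is written with an unescaped ?, which is a literal character in a basic regular expression.

```shell
set -eu

# Hypothetical scratch directory for the demo; not part of the original answer.
workdir=$(mktemp -d)

# A tiny input made of two concatenated XML documents, like the file in the question.
cat > "$workdir/in.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node><element1>Stefan</element1></node>
</root>

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node><element1>Damon</element1></node>
</root>
EOF

# Write one declaration and the synthetic root, then the input without
# its declaration lines, then the closing tag of the synthetic root.
{
  printf '<?xml version="1.0" encoding="UTF-8"?>\n<metaroot>\n'
  sed '/<?xml /d' "$workdir/in.xml"
  printf '</metaroot>\n'
} > "$workdir/out.xml"

# Sanity checks: number of lines with a declaration, number of <root> lines.
grep -c '<?xml ' "$workdir/out.xml"   # prints 1
grep -c '<root>' "$workdir/out.xml"   # prints 2
```

If xmllint happens to be installed, running xmllint --noout on the result is one more way to confirm that it is well-formed.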

Alternative solution

Another option is to split the file into separate well-formed files (taken from this SO answer):

csplit -z -f 'temp' -b 'out%03d.xml' in.xml '/<?xml /' '{*}'

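This produces files named tempout000.xml, tempout001.xml, ... (the -f prefix followed by the -b suffix). You should know at least the order of magnitude of the number of individual documents that were concatenated into the input file, so that the automatic numbering is safe (though you could of course use the number of digits of the input file's size in bytes as the width in the -b suffix format of the command above).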
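The split can be checked on a small sample as well. This is a minimal sketch that assumes GNU csplit (the -z and -b options are not available in every csplit implementation); the scratch directory, the part prefix and the shortened sample are made up for the demo.

```shell
set -eu

# Hypothetical scratch directory for the demo.
splitdir=$(mktemp -d)

# Two concatenated XML documents, as in the question.
cat > "$splitdir/in.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node><element1>Stefan</element1></node>
</root>

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <node><element1>Damon</element1></node>
</root>
EOF

# Start a new output file at every line containing "<?xml ".
# -z drops the empty piece that precedes the first declaration,
# '{*}' repeats the pattern as often as the input allows.
csplit -z -f "$splitdir/part" -b '%03d.xml' "$splitdir/in.xml" '/<?xml /' '{*}'

# Each resulting file should contain exactly one declaration.
for f in "$splitdir"/part*.xml; do
  printf '%s: %s declaration(s)\n' "${f##*/}" "$(grep -c '<?xml ' "$f")"
done
```

Keep in mind that a 1 GB input made of many small documents turns into a correspondingly large number of files, so wrapping everything in one synthetic root may be easier to handle.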

Answer 2

That is not valid XML. You will need to split it using string functions, or more precisely, read it document by document.

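$filename = 'testdoc.xml'; // the file from the question
$xmlDeclaration = '<?xml version="1.0" encoding="UTF-8"?>';

$file = new SplFileObject($filename, 'r');
$file->setFlags(SplFileObject::SKIP_EMPTY);
$buffer = '';
foreach ($file as $line) {
  if (false === strpos($line, $xmlDeclaration)) {
    // still inside the current document: keep collecting its lines
    $buffer .= $line;
  } else {
    // a declaration starts the next document: process what was collected
    outputBuffer($buffer);
    $buffer = $line;
  }
}
// process the last document in the file
outputBuffer($buffer);

// Parses one complete document and prints the text of its first element1.
function outputBuffer($buffer) {
  if (trim($buffer) !== '') {
    $dom = new DOMDocument();
    $dom->loadXML($buffer);
    $xpath = new DOMXPath($dom);
    echo $xpath->evaluate('string(//element1)'), "\n";
  }
}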

Output:

Stefan
Damon
