Question

We have a script that parses user-generated XML feeds, which from time to time contain improperly formatted entries with special characters.

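Normally I would just run utf8_encode() on the line, but I'm not sure how to do that here, because the DOM reads the file incrementally and the error is thrown when the expand() call happens.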

Because SimpleXML chokes on the result, the subsequent lines fail as well.

Here is the code.

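$z = new XMLReader;
$z->open($filename);
$doc = new DOMDocument('1.0', 'UTF-8');

// Skip ahead to the first <product> element.
while ($z->read() && $z->name !== 'product');

// Expand each <product> into a SimpleXML element.
while ($z->nodeType == XMLReader::ELEMENT && $z->name === 'product') {
    $producti = simplexml_import_dom($doc->importNode($z->expand(), true));
    print_r($producti);
    $z->next('product'); // advance to the next <product> sibling
}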

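Errors:

Message: XMLReader::expand(): foo.xml:29081: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x20 0x2D 0x35

Severity: Warning

Message: XMLReader::expand(): An error occurred while expanding

Filename: controllers/feeds.php

Line Number: 106

Message: Argument 1 passed to DOMDocument::importNode() must be an instance of DOMNode, boolean given

Filename: controllers/feeds.php

Line Number: 106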

Answer 1

First, clean up the string with the HTML Tidy library.

Also, I would rather use DOMDocument than XMLReader.

Something like this:

        $tidy = new Tidy;

        // Tidy options applied while repairing the markup
        $config = array(
                'drop-font-tags' => true,
                'drop-proprietary-attributes' => true,
                'hide-comments' => true,
                'indent' => true,
                'logical-emphasis' => true,
                'numeric-entities' => true,
                'output-xhtml' => true,
                'wrap' => 0
        );

        // Parse the raw string as UTF-8 and repair it
        $tidy->parseString($html, $config, 'utf8');
        $tidy->cleanRepair();

        $xml = $tidy->value; // Get the cleaned string

        // Load the cleaned markup into the DOM
        $dom = new DOMDocument;
        $dom->loadXML($xml);

        ...
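
If Tidy is not available, the same clean-first idea can be sketched in plain PHP by sanitizing the feed line by line before XMLReader opens it. This is only a minimal sketch under stated assumptions: sanitizeXmlLine() and sanitizeXmlFile() are hypothetical helper names that do not appear in the original post, and any line that is not valid UTF-8 is assumed to be ISO-8859-1 as a whole (the same assumption utf8_encode() makes), so a line that mixes both encodings would come out double-encoded.

```php
<?php
// Hypothetical helper (not from the original post): returns $line as valid
// UTF-8 with the control characters that XML 1.0 forbids removed.
function sanitizeXmlLine(string $line): string
{
    // preg_match() with the /u flag fails on invalid UTF-8, which makes it a
    // cheap validity check. Invalid lines are ASSUMED to be ISO-8859-1.
    if (preg_match('//u', $line) !== 1) {
        $line = iconv('ISO-8859-1', 'UTF-8', $line);
    }

    // Drop control characters that XML 1.0 does not allow, such as 0x05.
    // Tab (0x09), LF (0x0A) and CR (0x0D) are legal and are kept.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/u', '', $line);

    // preg_replace() returns null on failure; fall back to the converted line.
    return $clean ?? $line;
}

// Hypothetical helper: writes a cleaned copy of $source to a temporary file
// and returns its path, so the feed can still be streamed with XMLReader.
function sanitizeXmlFile(string $source): string
{
    $target = tempnam(sys_get_temp_dir(), 'xml');
    $in  = fopen($source, 'rb');
    $out = fopen($target, 'wb');

    // fgets() splits on "\n", which never occurs inside a multi-byte
    // UTF-8 sequence, so characters are not cut in half.
    while (($line = fgets($in)) !== false) {
        fwrite($out, sanitizeXmlLine($line));
    }

    fclose($in);
    fclose($out);

    return $target;
}
```

With these helpers, the reader from the question could open the cleaned copy with $z->open(sanitizeXmlFile($filename)) and leave the rest of the loop unchanged. The temporary file should be deleted with unlink() once parsing is done.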

Forcing UTF-8 with PHP's XMLReader, DOM and SimpleXML
