PHP Warning: unterminated entity reference (XML)

Asked: 2017-01-03 10:33:46

Tags: php xml

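I have a problem at the moment. I want to modify some XML values. For example, I want to strip the "<![CDATA[" and "]]>" markers from the values.

Strangely, it works for title, price and image_link, but not for url ...

Here is my code: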

$dom = new DOMDocument('1.0', 'utf-8');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->load('data/kinguin.xml');

$past = time();
echo '(Kinguin) - Starting to remove tags' . "\n";
deleteChildren($dom, 'id');
echo '(Kinguin) - id removed' . "\n";
deleteChildren($dom, 'description');
echo '(Kinguin) - description removed' . "\n";
deleteChildren($dom, 'google_product_category');
echo '(Kinguin) - google_product_category removed' . "\n";
deleteChildren($dom, 'brand');
echo '(Kinguin) - brand removed' . "\n";
deleteChildren($dom, 'mpn');
echo '(Kinguin) - mpn removed' . "\n";
deleteChildren($dom, 'condition');
echo '(Kinguin) - condition removed' . "\n";
deleteChildren($dom, 'product_type');
echo '(Kinguin) - product_type removed' . "\n";
deleteChildren($dom, 'availability');
echo '(Kinguin) - availability removed' . "\n";
deleteChildren($dom, 'quantity');
echo '(Kinguin) - quantity removed' . "\n";
deleteChildren($dom, 'identifier_exists');
echo '(Kinguin) - identifier_exists removed' . "\n";

removeCDATA($dom, 'title');
echo '(Kinguin) - title CDATA removed' . "\n";
removeCDATA($dom, 'price');
echo '(Kinguin) - price CDATA removed' . "\n";
removeCDATA($dom, 'image_link');
echo '(Kinguin) - image_link CDATA removed' . "\n";
removeCDATA($dom, 'url');
echo '(Kinguin) - url CDATA removed' . "\n";

$dom->saveXML();
$dom->save('data/kinguin.xml');

$xml = file_get_contents('data/kinguin.xml');
renameTags($xml, 'link', 'url', 'data/kinguin.xml');
echo '(Kinguin) - Renamed link' . "\n";

$now = time();
echo "(Kinguin) - Time needed: " . ($now - $past) . "s" . "\n";
echo "\n";

Functions:

function deleteChildren($dom, $children){
    $root = $dom->documentElement;
    $marker = $root->getElementsByTagName($children);
    for($i = $marker->length - 1; $i >= 0 ; $i--){
        $child = $marker->item($i);
        $marker->item($i)->parentNode->removeChild($child);
    }
}

function renameTags($xml, $old, $new, $path){
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $dom->loadXML($xml);

    $nodes = $dom->getElementsByTagName($old);
    $toRemove = array();
    foreach ($nodes as $node) {
        $newNode = $dom->createElement($new);
        foreach ($node->attributes as $attribute) {
            $newNode->setAttribute($attribute->name, $attribute->value);
        }

        foreach ($node->childNodes as $child) {
            $newNode->appendChild($node->removeChild($child));
        }

        $node->parentNode->appendChild($newNode);
        $toRemove[] = $node;
    }

    foreach ($toRemove as $node) {
        $node->parentNode->removeChild($node);
    }

    $dom->saveXML();
    $dom->save($path);
}
function removeCDATA($dom, $tagName){

    $root = $dom->documentElement;
    $marker = $root->getElementsByTagName($tagName);
    for($i = $marker->length - 1; $i >= 0 ; $i--){
        $rename = $marker->item($i)->textContent;
        $newValue = preg_replace('/(<!\[CDATA\[)/', '', $rename);
        $newValue = preg_replace('/(]]>)/', '', $newValue);
        $newValue = preg_replace('/( EUR)/', '', $newValue);
        //ey-Shop\Cronjob.php on line 350 PHP Warning:  preg_replace(): Delimiter must not be alphanumeric or backslash in 351

        $marker->item($i)->nodeValue = $newValue;
    }
}

This is the XML output:

<?xml version="1.0" encoding="UTF-8"?>
<rss>
  <channel xmlns:g="http://base.google.com/ns/1.0" version="2.0">
    <title>google_EUR_english_1</title>
    <item>
      <title>Anno 2070 Uplay CD Key</title>
      <g:price>3.27</g:price>
      <g:image_link>http://cdn.kinguin.net/media/catalog/category/anno_8.jpg</g:image_link>
      <url><![CDATA[http://www.kinguin.net/category/4/anno-2070/?nosalesbooster=1&country_store=1&currency=EUR]]></url>
    </item>
    <item>
      <title>Anno 2070: Deep Ocean DLC Uplay CD Key</title>
      <g:price>4.75</g:price>
      <g:image_link>http://cdn.kinguin.net/media/catalog/category/anno-2070-deep-ocean-releasing-this-spring-1089268_1.jpg</g:image_link>
      <url><![CDATA[http://www.kinguin.net/category/5/anno-2070-deep-ocean-expansion-pack-dlc/?nosalesbooster=1&country_store=1&currency=EUR]]></url>
    </item>
    <item>

This is the error message:

Warning: removeCDATA(): unterminated entity reference  All Stars-Racing Transformed RU VPN in C:\Users\Jan\PhpstormProjects\censored\Cronjob.php on line 353
PHP Warning:  removeCDATA(): unterminated entity reference  SUV DLC Steam Gift in C:\Users\Jan\PhpstormProjects\censored\Cronjob.php on line 353

Line 353:

$marker->item($i)->nodeValue = $newValue;

Regards, and thanks!

2 answers:

Answer 0 (score: 0)

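If you remove the CDATA section, you end up with an element containing bare & characters, which is not legal: an & may only appear as its named entity escape (&amp;) or inside a CDATA section.

That is why the CDATA is there in the first place, and it should be left to the consuming parser to deal with.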
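
To illustrate the point about bare ampersands, here is a minimal sketch, assuming three hypothetical sample strings (the example.com URLs are made up and are not from the feed). It asks DOMDocument to parse the same URL written with a bare &, with &amp;, and inside a CDATA section.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical sample inputs: the same URL written three ways.
$samples = [
    'bare'    => '<url>http://example.com/?a=1&b=2</url>',
    'escaped' => '<url>http://example.com/?a=1&amp;b=2</url>',
    'cdata'   => '<url><![CDATA[http://example.com/?a=1&b=2]]></url>',
];

$results = [];
foreach ($samples as $label => $xml) {
    $doc = new DOMDocument();
    // loadXML() returns false when the input is not well-formed.
    $ok = $doc->loadXML($xml);
    $results[$label] = $ok ? $doc->documentElement->textContent : false;
    libxml_clear_errors();
    echo $label . ': ' . ($ok ? 'parsed, text = ' . $results[$label] : 'not well-formed') . "\n";
}
```

Only the bare variant should be rejected. The escaped and CDATA variants both parse and yield the same text content, which is why stripping the CDATA wrapper without escaping the & breaks the document.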

Answer 1 (score: 0)

If you really feel you need to remove any CDATA sections from an element node, then simply do $foo->textContent = $foo->textContent; see http://sandbox.onlinephpfunctions.com/code/cca5093433218c7c134f120725988fe6808f906c

function removeCDATA($dom, $tagName){

    $marker = $dom->getElementsByTagName($tagName);
    for($i = $marker->length - 1; $i >= 0 ; $i--){
        $marker->item($i)->textContent = $marker->item($i)->textContent;
    }
}

$xml = '<root><items><item><url><![CDATA[http://example.com/search?a=1&b=2&c=3]]></url></item><item><url><![CDATA[http://example.com/search?a=4&b=5&c=6]]></url></item></items></root>';

$doc = new DOMDocument();
$doc->loadXML($xml);

removeCDATA($doc, 'url');

echo $doc->saveXML();

And the output:

<root><items><item><url>http://example.com/search?a=1&amp;b=2&amp;c=3</url></item><item><url>http://example.com/search?a=4&amp;b=5&amp;c=6</url></item></items></root>
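
The question's function also strips a trailing " EUR" from values, which the short version above drops. A possible variant that keeps that step is sketched below. It assumes a hypothetical helper name (replaceWithPlainText) and made-up sample XML, neither of which appears in the original post. It reads textContent, which already excludes the CDATA markers, and writes the value back through createTextNode() so that any & is escaped on output.

```php
<?php
// Hypothetical helper (not from the original post): rewrites the content of
// every <$tagName> element as one plain text node, optionally dropping a suffix.
function replaceWithPlainText(DOMDocument $dom, string $tagName, string $suffix = ''): void
{
    // Copy the live node list into an array before modifying the tree.
    $elements = iterator_to_array($dom->getElementsByTagName($tagName));
    foreach ($elements as $element) {
        $text = $element->textContent; // CDATA markers are not part of this string
        if ($suffix !== '' && substr($text, -strlen($suffix)) === $suffix) {
            $text = substr($text, 0, -strlen($suffix));
        }
        // Remove the existing children (text and CDATA nodes alike).
        while ($element->firstChild !== null) {
            $element->removeChild($element->firstChild);
        }
        // createTextNode() takes literal text, so "&" is serialized as "&amp;".
        $element->appendChild($dom->createTextNode($text));
    }
}

// Made-up sample document for illustration only.
$doc = new DOMDocument();
$doc->loadXML('<root><price><![CDATA[3.27 EUR]]></price><url><![CDATA[http://example.com/?a=1&b=2]]></url></root>');

replaceWithPlainText($doc, 'price', ' EUR');
replaceWithPlainText($doc, 'url');

// Serialize only the root element, without the XML declaration.
$out = $doc->saveXML($doc->documentElement);
echo $out . "\n";
```

Writing through a text node avoids assigning an unescaped string to nodeValue, which is the assignment that raised the "unterminated entity reference" warning in the question.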