DOMDocument::loadHTML() is converting htmlentities

Time: 2017-03-05 21:59:52

Tags: php xml domdocument html-entities php-5.6

A related question is "Preventing DOMDocument::loadHTML() from converting entities", but it did not produce a solution.

This code:

$html = "<span>&#x1F183;&#x1F174;&#x1F182;&#x1F183;</span>";
$doc = new DOMDocument;
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('span') as $node)
{
    var_dump($node->nodeValue);
    var_dump(htmlentities($node->nodeValue));
    var_dump(htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)));
}

produces this output:

string(16) "🆃🅴🆂🆃"
string(16) "🆃🅴🆂🆃"
string(0) ""

But what I want is &#x1F183;&#x1F174;&#x1F182;&#x1F183;

I am running PHP version 5.6.29, and ini_get("default_charset") returns UTF-8.

1 answer:

Answer 0: (score: 0)

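After reading more at http://php.net/manual/en/function.htmlentities.php, I noticed that htmlentities() does not encode all Unicode characters; it only converts characters that have a named HTML entity. Someone posted a superentities function in the comments, but it did not seem to work for me. The UTF8entities function does.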
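The limitation can be seen in a short check. This is a minimal sketch, not part of the original answer: the sample string is an assumption chosen for illustration, written as raw UTF-8 bytes so the file encoding does not matter.

```php
<?php
// "é" (U+00E9) has a named entity in the HTML 4.01 table; U+1F183 does not.
$text = "\xC3\xA9 \xF0\x9F\x86\x83";   // "é 🆃" as raw UTF-8 bytes

$encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');

// The accented letter becomes &eacute;, the 4-byte character passes through untouched.
var_dump($encoded === "&eacute; \xF0\x9F\x86\x83"); // bool(true)
```

This is why the second var_dump() in the question still reports a 16-byte string: none of the four characters has a named entity to convert to.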

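Below are the two functions I adapted from the comments section, together with the code that calls them. The result is not exactly what I wanted (decimal rather than hexadecimal references), but it does give me HTML-encoded values.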

$html = "<span>&#x1F183;&#x1F174;&#x1F182;&#x1F183;</span>";
$doc = new DOMDocument;
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('span') as $node)
{
    var_dump(UTF8entities($node->nodeValue));
}


function UTF8entities($content = "") {
    // Split at every character boundary; the /u modifier keeps multi-byte characters intact
    $characterArray = preg_split('/(?<!^)(?!$)/u', $content);
    $rv = "";  // initialise the result to avoid an undefined-variable notice
    foreach ($characterArray as $character) {
        $rv .= unicode_entity_replace($character);
    }
    return $rv;
}

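function unicode_entity_replace($c) { //m. perez
    if ($c === "") {
        return $c;  // nothing to convert
    }
    $h = ord($c[0]);  // lead byte; $c[0] replaces the $c{0} syntax removed in PHP 8
    if ($h < 0xC2) {
        // ASCII, or a byte that cannot start a multi-byte sequence: leave unchanged
        return $c;
    }

    if ($h <= 0xDF) {
        // two-byte sequence: 5 payload bits in the lead byte
        $h = ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
    } else if ($h <= 0xEF) {
        // three-byte sequence: 4 payload bits in the lead byte
        $h = ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6 | (ord($c[2]) & 0x3F);
    } else if ($h <= 0xF4) {
        // four-byte sequence: 3 payload bits in the lead byte
        $h = ($h & 0x07) << 18 | (ord($c[1]) & 0x3F) << 12 | (ord($c[2]) & 0x3F) << 6 | (ord($c[3]) & 0x3F);
    } else {
        return $c;  // not a valid UTF-8 lead byte
    }
    return "&#" . $h . ";";  // decimal numeric character reference
}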

Returns:

string(36) "&#127363;&#127348;&#127362;&#127363;"