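I have a string value from which I am trying to extract list items. I want to extract the text and any child nodes, but DOMDocument is converting entities to characters instead of leaving them in their original state.
I have tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. It should be noted that I am running PHP 5.2.17 on Win7.
The example code is: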
$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li></ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($example);
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}
function _get_inner_html( $node ) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }
    return $innerHTML;
}
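The &frac12; ends up being converted to ½ (the single-character UTF-8 version rather than the entity version), which is not the desired format.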
Answer 0: (score: 5)
A solution for PHP versions that are not 5.3.6 or later:
$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;
$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}
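On newer PHP versions the default charset of htmlentities() is UTF-8, so the iconv() round trip to ISO-8859-1 may no longer behave the same way. The following is a minimal sketch, not part of the original answer, that passes the charset explicitly instead. The helper name encode_text_entities() is hypothetical, and the sketch assumes the text read from the DOM is UTF-8, which is how DOMDocument exposes node values.

```php
<?php
// Hypothetical helper: re-encode UTF-8 text as named HTML entities.
// Assumes $text is UTF-8, which is how DOMDocument exposes node values.
function encode_text_entities($text) {
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>text</li><li>&frac12; of this is <strong>strong</strong></li></ul>');

$items = array();
foreach ($doc->getElementsByTagName('li') as $node) {
    // nodeValue is the text content only, so child tags such as <strong> are dropped.
    $items[] = encode_text_entities($node->nodeValue);
}
echo implode("\n", $items), "\n";
```

As with the answer above, this only recovers the text of each list item; child element markup is lost because nodeValue flattens the subtree.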
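Answer 1: (score: 2)
Based on the answer provided by ajreal, I expanded the example variable to handle more cases, and changed _get_inner_html() to call itself recursively and to handle the entity conversion for text nodes.
This is probably not the best answer, since it makes some assumptions about the elements (for example, that they have no attributes). But since my particular needs do not require attributes to be passed through (yet... I am sure my sample data will throw that at me later), this solution works for me.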
$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;
$doc->loadHTML($example);
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
$value = trim(_get_inner_html($domNodeList->item($idx)));
/* remainder of processing and storing in database */
echo 'Saved '.$value.PHP_EOL;
}
function _get_inner_html( $node ) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        echo 'Node type is '.$child->nodeType.PHP_EOL;
        switch ($child->nodeType) {
            case XML_TEXT_NODE: // constant value 3
                $innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue));
                break;
            default:
                echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
                echo 'Node name '.$child->nodeName.PHP_EOL;
                // Attributes are intentionally not emitted here.
                $innerHTML .= '<'.$child->nodeName.'>';
                $innerHTML .= _get_inner_html( $child );
                $innerHTML .= '</'.$child->nodeName.'>';
                break;
        }
    }
    return $innerHTML;
}
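The limitation mentioned in this answer (attributes are dropped) could be lifted by also walking each element's attribute map. The sketch below is an illustration only, not the answerer's code: the function name inner_html_with_attributes() is hypothetical, it passes the charset to htmlentities() explicitly rather than using iconv(), and it skips non-element, non-text nodes such as comments.

```php
<?php
// Hypothetical variant of _get_inner_html() that also re-emits attributes.
function inner_html_with_attributes(DOMNode $node) {
    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // Text nodes: turn UTF-8 characters back into named entities.
            $html .= htmlentities($child->nodeValue, ENT_QUOTES, 'UTF-8');
            continue;
        }
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // comments and other node types are skipped in this sketch
        }
        $html .= '<' . $child->nodeName;
        foreach ($child->attributes as $attr) {
            $html .= ' ' . $attr->nodeName . '="'
                   . htmlentities($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
        }
        // Every element gets an explicit closing tag, including void ones like <br>.
        $html .= '>' . inner_html_with_attributes($child) . '</' . $child->nodeName . '>';
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Nested <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li></ul>');
$li = $doc->getElementsByTagName('li')->item(0);
$result = inner_html_with_attributes($li);
echo $result, PHP_EOL;
```

Rebuilding the tags by hand keeps full control over how text is encoded, at the cost of having to decide how each node type is serialized.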
Answer 2: (score: 0)
There is no need to iterate over the child nodes:
function innerHTML($node)
{
    // Serialize the whole node, then strip its own opening and closing tag.
    $html = $node->ownerDocument->saveXML($node);
    return preg_replace("%^<{$node->nodeName}[^>]*>|</{$node->nodeName}>$%", '', $html);
}