Question

PHP experts,

I found a bug while using the simple_html_dom class.

The HTML string I have to parse looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title>

<meta name="site" content="Reports"/>

<meta name="description" content="Description with twinned planes {11&#"/>

<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/>


...


</body>
</html>

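I tried to get the meta tag named image with find("meta[name=image]"), but it does not work.

I looked into the cause and found that it is the characters '&#' in the middle of the line just above that tag: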

<meta name="description" content="Description with twinned planes {11&#"/>

This is what I get as the content attribute of that description meta tag:

 Description with twinned planes {11&#"/>   <meta name="image" ....

So in this case, what should I do to make simple_html_dom parse the HTML correctly?

Otherwise, is there another library that can parse this HTML correctly?

Answer 1

Try this code, which uses PHP's DOMDocument.

You can get the meta elements with getElementsByTagName and read an attribute value with getAttribute.

$hml = '<!DOCTYPE html>
<html lang="en">
<head>
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title>

<meta name="site" content="Reports"/>

<meta name="description" content="Description with twinned planes {11&#"/>

<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/>
</head>
<body>

</body>
</html>';

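// Collect libxml warnings internally instead of printing them for malformed markup.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($hml);

// Walk every <meta> element and print the content of the one named "image".
foreach ($dom->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('name') === 'image') {
        echo $meta->getAttribute('content');
    }
}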

Output:

https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a
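If the same lookup is needed for several meta names, it could be wrapped in a small function. The following is only a sketch: metaContent is a hypothetical helper name, the example.com URL is a placeholder, and DOMXPath is used here as one possible way to narrow the search to meta elements that carry a name attribute.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the content
// attribute of the first <meta> whose name matches, or null if none is found.
function metaContent(string $html, string $name): ?string
{
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only <meta> elements that have a name attribute.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//meta[@name]') as $meta) {
        if ($meta->getAttribute('name') === $name) {
            return $meta->getAttribute('content');
        }
    }
    return null;
}

// Placeholder markup that reproduces the problematic '&#' sequence.
$html = '<!DOCTYPE html><html lang="en"><head>'
      . '<meta name="description" content="Description with twinned planes {11&#"/>'
      . '<meta name="image" content="https://example.com/icon.png"/>'
      . '</head><body></body></html>';

echo metaContent($html, 'image'), PHP_EOL;
```

Comparing the name in PHP rather than interpolating it into the XPath expression avoids having to escape quotes in the name.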

Note: if you want to load the content from a file, use $dom->loadHTMLFile("your_pagename.html"); instead of $dom->loadHTML($hml);
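A minimal sketch of the file-based variant, assuming the page is stored on disk. The temporary file and its contents are placeholders created only so the example runs on its own; in practice the path would point to your own page.

```php
<?php
// Write a small sample page to a temporary file so the sketch is self-contained.
// The file and its contents are placeholders, not taken from the original post.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents(
    $path,
    '<!DOCTYPE html><html lang="en"><head>'
    . '<meta name="image" content="https://example.com/icon.png"/>'
    . '</head><body></body></html>'
);

libxml_use_internal_errors(true);    // collect parser warnings instead of printing them
$dom = new DOMDocument();
$loaded = $dom->loadHTMLFile($path); // returns false on failure
libxml_clear_errors();
unlink($path);                       // remove the temporary file again

// Look for the meta tag named "image" only if the document was loaded.
$image = null;
if ($loaded) {
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('name') === 'image') {
            $image = $meta->getAttribute('content');
        }
    }
}

echo $image ?? 'no image meta tag found', PHP_EOL;
```

Checking the return value of loadHTMLFile before walking the document keeps a missing or unreadable file from being mistaken for a page without the tag.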
