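I am trying to extract all the meta tags from a web page. I am currently using preg_match_all to get them, but unfortunately it returns empty strings for the array indexes.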
<?php
$meta_tag_pattern = '/<meta(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>/';
$meta_url = file_get_contents('test.html');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches) == 1)
echo "there is a match <br>";
print_r($matches);
?>
The returned array:
Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) )
Answer 0: (score: 3)
An example using DOMDocument:
$url = 'test.html';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$metas = $dom->getElementsByTagName('meta');
foreach ($metas as $meta) {
    echo htmlspecialchars($dom->saveHTML($meta));
}
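Building on the DOMDocument approach, here is a minimal sketch that collects each tag's name (or property) and content into an array instead of echoing the raw markup. The inline $html string and the $tags variable are assumptions made for illustration, standing in for test.html; they are not part of the original answer.

```php
<?php
// Hypothetical sample markup standing in for test.html.
$html = '<html><head>'
      . '<meta charset="utf-8">'
      . '<meta name="description" content="Example page">'
      . '<meta property="og:type" content="website">'
      . '</head><body></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$tags = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    // getAttribute() returns an empty string when the attribute is absent.
    $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
    if ($key !== '') {
        $tags[$key] = $meta->getAttribute('content');
    }
}

print_r($tags);
```

Tags without a name or property attribute, such as the charset declaration, are skipped in this sketch. Escape the values with htmlspecialchars before printing them into an HTML page.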
Answer 1: (score: 1)
Update: an example that grabs the meta tags from a URL:
$meta_tag_pattern = '/<meta\s[^>]+>/';
$meta_url = file_get_contents('http://stackoverflow.com/questions/10551116/html-php-escape-and-symbols-while-echoing');
if (preg_match_all($meta_tag_pattern, $meta_url, $matches)) {
    echo "there is a match <br>";
    // Escape each tag so the browser shows it as text instead of parsing it.
    foreach ($matches[0] as $value) {
        print htmlentities($value) . '<br>';
    }
}
Output:
there is a match
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="og:type" content="website" />
...
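It looks like part of the problem is that the browser interprets the matched strings as real meta tags instead of displaying them as text, which is why the print_r output appears empty. The matches need to be escaped before they are printed.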
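To make the escaping point concrete, here is a small sketch that captures the print_r output as a string and escapes it before sending it to the browser. The inline $html string is a made-up sample, and wrapping the result in a pre element is just one way to keep the array layout readable.

```php
<?php
// Hypothetical sample input; in the question this came from file_get_contents().
$html = '<meta name="a" content="b"><meta name="c" content="d"/>';

preg_match_all('/<meta\s[^>]+>/', $html, $matches);

// Passing true as the second argument makes print_r return the dump as a string,
// so it can be escaped before the browser sees any "<" characters.
$escaped = htmlspecialchars(print_r($matches, true));

echo '<pre>' . $escaped . '</pre>';
```

Viewing the page source, or running the script from the command line, is another way to confirm that the matches are present without escaping them.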