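PHP has a simple function for reading a web page's meta tags (get_meta_tags), but it only works for meta tags that have a name attribute. The Open Graph protocol, however, is becoming more and more popular. What is the easiest way to get the og values from a web page? For example: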
<meta property="og:url" content="">
<meta property="og:title" content="">
<meta property="og:description" content="">
<meta property="og:type" content="">
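The basic approach I can see is to fetch the page with cURL and parse it with a regular expression. Any ideas?
Answer 0 (score: 39)
Really simple, and it works well:
Use https://github.com/scottmac/opengraph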
$graph = OpenGraph::fetch('http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html');
print_r($graph);
This returns
OpenGraph Object
(
[_values:OpenGraph:private] => Array
(
[type] => article
[video] => http://www.avessotv.com.br/player/flowplayer/flowplayer-3.2.7.swf?config=%7B%27clip%27%3A%7B%27url%27%3A%27http%3A%2F%2Fwww.avessotv.com.br%2Fmedia%2Fprogramas%2Fpantene.flv%27%7D%7D
[image] => /wp-content/thumbnails/9025.jpg
[site_name] => Programa Avesso - Bastidores
[title] => Bastidores “Pantene Institute Experience” P&G
[url] => http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html
[description] => Confira os bastidores do Pantene Institute Experience, da Procter & Gamble. www.pantene.com.br Mais imagens:
)
[_position:OpenGraph:private] => 0
)
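Answer 1 (score: 24)
You really shouldn't use regular expressions to parse data out of HTML. Take a look at the DOMXPath query function (DOMXPath::query).
Now, the actual code could be:
[EDIT] Stefan Gehrig suggested a better XPath query, so the code can be shortened to: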
libxml_use_internal_errors(true); // collect libxml parse warnings internally instead of silencing them with @
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
$rmetas[$property] = $content;
}
var_dump($rmetas);
instead of:
$doc = new DomDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
$property = $meta->getAttribute('property');
$content = $meta->getAttribute('content');
if(!empty($property) && preg_match('#^og:#', $property)) {
$rmetas[$property] = $content;
}
}
var_dump($rmetas);
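For reference, here is a self-contained sketch of the XPath approach above, wrapped in a function and run against an inline HTML string. The function name extractOgTags and the sample markup are invented for illustration; in practice $html would come from cURL or file_get_contents().

```php
<?php
// Hypothetical helper (not from the answer): returns the og:* properties found in $html.
function extractOgTags(string $html): array
{
    // Collect libxml warnings internally; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = [];
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $tags[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $tags;
}

// Sample markup, invented for this example. Attribute order does not matter to the DOM parser.
$html = '<html><head>'
      . '<meta name="description" content="Plain description">'
      . '<meta property="og:title" content="Example title">'
      . '<meta content="article" property="og:type">'
      . '</head><body></body></html>';

$ogTags = extractOgTags($html);
print_r($ogTags);
```

Because the parser works on attributes rather than on raw text, the tag with content written before property is picked up as well.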
Answer 2 (score: 3)
How about:
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $str, $matches);
So yes, grab the page whichever way you like and parse it with a regular expression.
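One caveat with this pattern, sketched below on invented markup: it expects property to come directly after meta, double-quoted values, and content immediately after property. Tags written any other way are silently skipped.

```php
<?php
$pattern = '~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i';

// Invented sample: three valid ways of writing an Open Graph tag.
$str = '<meta property="og:title" content="Matched">' . "\n"
     . '<meta content="Missed: content comes first" property="og:description">' . "\n"
     . "<meta property='og:type' content='Missed: single quotes'>";

preg_match_all($pattern, $str, $matches);

// Pair each captured property name with its captured content.
$found = array_combine($matches[1], $matches[2]);
print_r($found);
```

Only the first tag ends up in $found, which is the kind of fragility the DOM-based answer avoids.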
Answer 3 (score: 2)
With this approach you get an array of key/value pairs for the Facebook Open Graph tags.
$url="http://fbcpictures.in";
$site_html= file_get_contents($url);
$matches=null;
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $site_html,$matches);
$ogtags=array();
for($i=0;$i<count($matches[1]);$i++)
{
$ogtags[$matches[1][$i]]=$matches[2][$i];
}
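Note that the loop above keeps only the last value when a property appears more than once. The Open Graph protocol allows repeated tags, such as several og:image entries, so a variant that collects values into lists may be useful. A minimal sketch, with the function name collectOgTags and the sample HTML invented for illustration:

```php
<?php
// Hypothetical helper: groups every og:* value under its property name.
function collectOgTags(string $html): array
{
    preg_match_all(
        '~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i',
        $html,
        $matches,
        PREG_SET_ORDER
    );

    $tags = [];
    foreach ($matches as $match) {
        // $match[1] is the property name, $match[2] its content.
        $tags[$match[1]][] = $match[2];
    }
    return $tags;
}

// Invented sample with a repeated og:image tag.
$html = '<meta property="og:title" content="Gallery">'
      . '<meta property="og:image" content="https://example.com/a.jpg">'
      . '<meta property="og:image" content="https://example.com/b.jpg">';

$ogtags = collectOgTags($html);
print_r($ogtags);
```

Every value is an array here, even for properties that occur once, which keeps the calling code uniform.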
Answer 4 (score: 0)
A more XML-like approach, using XPath:
$xml = simplexml_load_file('http://ogp.me/');
$xml->registerXPathNamespace('h', 'http://www.w3.org/1999/xhtml');
$result = array();
foreach ($xml->xpath('//h:meta[starts-with(@property, \'og:\')]') as $meta) {
$result[(string)$meta['property']] = (string)$meta['content'];
}
print_r($result);
Unfortunately, the namespace has to be registered if the HTML document uses a namespace declaration in its <html> tag.
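simplexml_load_file() also requires the page to be well-formed XML, which many HTML pages are not. One possible workaround, sketched here on invented markup, is to parse with DOMDocument::loadHTML() and hand the result to simplexml_import_dom(). In this sample the parsed elements carry no namespace, so the XPath expression needs no prefix.

```php
<?php
// Invented sample: the unclosed <meta> and <p> tags make this invalid as XML.
$html = '<html><head>'
      . '<meta property="og:title" content="Open Graph protocol">'
      . '<meta property="og:type" content="website">'
      . '</head><body><p>Hello</body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Convert the DOM tree to SimpleXML so the loop from the answer can be reused.
$xml = simplexml_import_dom($doc);

$result = array();
foreach ($xml->xpath('//meta[starts-with(@property, "og:")]') as $meta) {
    $result[(string)$meta['property']] = (string)$meta['content'];
}
print_r($result);
```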
Answer 5 (score: 0)
This function does the job without any dependencies and without DOM parsing:
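function getOgTags($html)
{
    // Other attributes (e.g. itemprop) may sit between property and content;
    // tags where property is not the first attribute are still skipped.
    $pattern = '/<\s*meta\s+property="og:([^"]+)"\s+(?:[^>]*?\s+)?content="([^"]*)/i';
    if (preg_match_all($pattern, $html, $out)) {
        return array_combine($out[1], $out[2]);
    }
    return array();
}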
Test code:
// A nowdoc is used because the sample HTML contains both single and double quotes.
$x = <<<'HTML'
<title>php - Using domDocument, and parsing info, I would like to get the 'href' contents of an 'a' tag - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="referrer" content="origin" />
<meta property="og:type" content="website"/>
<meta property="og:url" content="https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"/>
<meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="twitter:title" property="og:title" itemprop="title name" content="Using domDocument, and parsing info, I would like to get the 'href' contents of an 'a' tag" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Possible Duplicate:
Regular expression for grabbing the href attribute of an A element
This displays the what is between the a tag, but I would like a way to get the href contents as well.
Is..." />
HTML;
echo '<pre>';
var_dump(getOgTags($x));
You get:
array(3) {
["type"]=>
string(7) "website"
["url"]=>
string(119) "https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"
["image"]=>
string(85) "https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded"
}
Answer 6 (score: 0)
This is what I use to extract OG tags.
function get_og_tags($get_url = "", $ret = 0)
{
if ($get_url != "") {
$title = "";
$description = "";
$keywords = "";
$og_title = "";
$og_image = "";
$og_url = "";
$og_description = "";
$full_link = "";
$image_urls = array();
$og_video_name = "";
$youtube_video_url="";
// file_get_contents_curl() and addhttp() are helpers that are not PHP built-ins and are not shown here.
$ret_data = file_get_contents_curl($get_url);
//$html = file_get_contents($get_url);
$html = $ret_data['curlData'];
$full_link = $ret_data['full_link'];
$full_link = addhttp($full_link);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
if ($nodes->length == 0) {
$title = $get_url;
} else {
$title = $nodes->item(0)->nodeValue;
}
//get and display what you need:
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if ($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
// Open Graph tags are ordinary <meta> elements, so the $metas list is scanned again.
for ($i = 0; $i < $metas->length; $i++) {
$meta = $metas->item($i);
if ($meta->getAttribute('property') == 'og:title')
$og_title = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:url')
$og_url = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:image')
$og_image = $meta->getAttribute('content');
if ($meta->getAttribute('property') == 'og:description')
$og_description = $meta->getAttribute('content');
// for sociotube video share
if ($meta->getAttribute('property') == 'og:video_name')
$og_video_name = $meta->getAttribute('content');
// for sociotube youtube video share
if ($meta->getAttribute('property') == 'og:youtube_video_url')
$youtube_video_url = $meta->getAttribute('content');
}
//if no image found grab images from body
if ($og_image != "") {
$image_urls[] = $og_image;
} else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img"); // find your image
$imgCount = 0;
for ($i = 0; $i < $nodelist->length; $i++) {
$src = null; // reset so the previous image's src is not reused
$node = $nodelist->item($i); // current <img> element
if (isset($node->attributes->getNamedItem('src')->nodeValue)) {
$src = $node->attributes->getNamedItem('src')->nodeValue;
}
if (isset($node->attributes->getNamedItem('src')->value)) {
$src = $node->attributes->getNamedItem('src')->value;
}
if (isset($src)) {
if (!preg_match('/blank.(.*)/i', $src) && filter_var($src, FILTER_VALIDATE_URL)) {
$image_urls[] = $src;
if ($imgCount == 10) break;
$imgCount++;
}
}
}
}
$page_title = ($og_title == "") ? $title : $og_title;
if(!empty($og_video_name)){
// for sociotube video share
$page_body = $og_video_name;
}else{
// for post share
$page_body = ($og_description == "") ? $description : $og_description;
}
$output = array('title' => $page_title, 'images' => $image_urls, 'content' => $page_body, 'link' => $full_link,'video_name'=>$og_video_name,'youtube_video_url'=>$youtube_video_url);
if ($ret == 1) {
return $output; // return the array to the caller
}
echo json_encode($output); // otherwise print it as JSON and stop
die;
} else {
$data = array('error' => "Url not found");
if ($ret == 1) {
return $data; // return the error array to the caller
}
echo json_encode($data);
die;
}
}
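Using the function (pass 1 as the second argument to get the array back instead of having the JSON echoed before the script exits):
$url = "https://www.alectronics.com";
$tagsArray = get_og_tags($url, 1);
print_r($tagsArray);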
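The core idea of that function, preferring the Open Graph value and falling back to the regular tag, can be sketched much more compactly. This is only an illustration for the title; the function name pageTitle and the sample markup are invented, and the fetching helpers are left out.

```php
<?php
// Hypothetical helper: prefer og:title, fall back to the <title> element.
function pageTitle(string $html): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // string(...) evaluates to '' when no matching node exists.
    $ogTitle = $xpath->evaluate('string(//meta[@property="og:title"]/@content)');
    if ($ogTitle !== '') {
        return $ogTitle;
    }
    return trim($xpath->evaluate('string(//title)'));
}

$plain = '<html><head><title>Plain title</title></head><body></body></html>';
$withOg = '<html><head><title>Plain title</title>'
        . '<meta property="og:title" content="OG title"></head><body></body></html>';

echo pageTitle($plain), "\n";
echo pageTitle($withOg), "\n";
```

The same pattern would apply to og:description with the description meta tag as the fallback.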
Answer 7 (score: -1)
There is a native PHP function for this: get_meta_tags().
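As the question points out, get_meta_tags() only picks up tags that have a name attribute. A quick way to check this, using a temporary file with invented markup:

```php
<?php
// Invented sample page with one name-based and one property-based tag.
$html = '<html><head>'
      . '<meta name="description" content="Plain description">'
      . '<meta property="og:title" content="Open Graph title">'
      . '</head><body></body></html>';

// get_meta_tags() reads from a file or URL, so write the sample to disk first.
$file = tempnam(sys_get_temp_dir(), 'og');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

print_r($tags); // the og:title tag has no name attribute and is not included
```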