I want to figure out how to get the
<title>A common title</title>
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />
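even if they appear in any order. I've heard of the PHP Simple HTML DOM Parser, but I don't really want to use it. Is there a solution that doesn't rely on the PHP Simple HTML DOM Parser?
Will preg_match be unable to do this if the HTML is invalid?
Can cURL do something like this together with preg_match?
Facebook does something like this, but there it works properly because the pages use: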
<meta property="og:description" content="Description blabla" />
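I want something like this so that when someone posts a link, it retrieves the title and the meta tags. If there are no meta tags, they are ignored or the user can set them (but I'll handle that part myself later).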
Answer 0: (score: 154)
This is the way it should be done:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://example.com/");
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
//default to empty strings so pages without these tags do not raise notices
$title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : '';
$description = '';
$keywords = '';
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('name') == 'description')
$description = $meta->getAttribute('content');
if($meta->getAttribute('name') == 'keywords')
$keywords = $meta->getAttribute('content');
}
echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
Answer 1: (score: 34)
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
Answer 2: (score: 8)
get_meta_tags will help you with the meta tags. To get the title, just use a regular expression.
$url = 'http://some.url.com';
preg_match("/<title>(.+)<\/title>/siU", file_get_contents($url), $matches);
$title = $matches[1];
Hope that helps.
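The pattern in this answer only matches a bare `<title>` tag and reads `$matches[1]` even when nothing matched. As a minimal sketch, assuming you still want a regex rather than a parser, the hypothetical helper below (`extract_title_regex` is not from the original answer) tolerates attributes on the tag, decodes HTML entities, and returns null when there is no title.

```php
<?php
// Hypothetical helper: returns the decoded <title> text, or null if there is none.
function extract_title_regex(string $html): ?string
{
    // \b[^>]* allows attributes on the tag; "s" lets "." span newlines, "i" ignores case.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $matches) !== 1) {
        return null;
    }
    // Decode entities such as &amp; and strip surrounding whitespace.
    return trim(html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

$title = extract_title_regex('<head><title lang="en"> A &amp; B </title></head>');
$missing = extract_title_regex('<p>no title here</p>');
var_dump($title, $missing);
```

This is still a regex over HTML, so it can be fooled by things like a `<title>` inside a comment or an inline SVG; it only narrows the most common failure cases.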
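Answer 3: (score: 6)
PHP's native function: get_meta_tags()
Answer 4: (score: 4)
Your best bet is to use a DOM parser; it's the "right way" to do this. In the long run it will save you more time than it takes to learn. Parsing HTML with regular expressions is known to be unreliable and intolerant of special cases.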
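To make the point about special cases concrete, here is a small sketch that runs a naive pattern and DOMDocument over the same markup. The sample HTML and the pattern are made up for illustration; the only difference from the question's tags is attribute order and quote style.

```php
<?php
// Two ordinary meta tags: one lists content before name, the other uses single quotes.
$html = '<html><head>'
      . '<meta content="First description" name="description">'
      . "<meta name='keywords' content='a, b'>"
      . '</head><body></body></html>';

// A naive pattern that expects name before content, with double quotes.
$pattern = '/<meta name="description" content="(.*?)"/i';
$regexFound = preg_match($pattern, $html, $m) === 1;

// DOMDocument reads attributes regardless of their order or quoting.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$found = [];
foreach ($doc->getElementsByTagName('meta') as $meta) {
    $name = strtolower($meta->getAttribute('name'));
    if ($name !== '') {
        $found[$name] = $meta->getAttribute('content');
    }
}

var_dump($regexFound);
print_r($found);
```

The regex could of course be widened to cover these two cases, but each new variation (extra attributes, line breaks, unquoted values) needs another patch, which is the maintenance cost this answer is warning about.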
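Answer 5: (score: 4)
Unfortunately, the built-in PHP function get_meta_tags() requires the name attribute, and some sites (Twitter, for example) leave it off in favor of the property attribute. This function, which uses a mix of regular expressions and DOMDocument, returns a keyed array of meta tags from a web page. It checks the name attribute first, then the property attribute. It has been tested on Instagram, Pinterest and Twitter.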
/**
* Extract metatags from a webpage
*/
function extract_tags_from_url($url) {
$tags = array();
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$contents = curl_exec($ch);
curl_close($ch);
if (empty($contents)) {
return $tags;
}
if (preg_match_all('/<meta([^>]+)content="([^>]+)>/', $contents, $matches)) {
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . implode($matches[0]));
$tags = array();
foreach($doc->getElementsByTagName('meta') as $metaTag) {
if($metaTag->getAttribute('name') != "") {
$tags[$metaTag->getAttribute('name')] = $metaTag->getAttribute('content');
}
elseif ($metaTag->getAttribute('property') != "") {
$tags[$metaTag->getAttribute('property')] = $metaTag->getAttribute('content');
}
}
}
return $tags;
}
Answer 6: (score: 4)
get_meta_tags does not work for the title.
Only meta tags with a name attribute, like
<meta name="description" content="the description">
will be parsed.
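A quick way to see this limitation without a network request is to point get_meta_tags() at a local file, which it accepts in place of a URL. The sketch below is only an illustration: the page content is adapted from the question, and the temporary file is an assumption made for the demo.

```php
<?php
// Write a small page to a temporary file so no HTTP request is needed.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents(
    $path,
    '<html><head><title>A common title</title>'
    . '<meta name="description" content="This is the description">'
    . '<meta property="og:description" content="Description blabla">'
    . '</head><body></body></html>'
);

// Only the tag carrying a name attribute ends up in the result;
// the <title> element and the property-based Open Graph tag are skipped.
$tags = get_meta_tags($path);
unlink($path);

print_r($tags);
```

So for the title and for Open Graph tags you need one of the DOMDocument-based approaches from the other answers.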
Answer 7: (score: 3)
A simple function that shows how to retrieve the og: tags, the title and the description; adapt it to your needs.
function read_og_tags_as_json($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$HTML_DOCUMENT = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
$doc->loadHTML($HTML_DOCUMENT);
// fetch <title>
$res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;
// fetch og:tags
foreach( $doc->getElementsByTagName('meta') as $m ){
// if had property
if( $m->getAttribute('property') ){
$prop = $m->getAttribute('property');
// here search only og:tags
if( preg_match("/og:/i", $prop) ){
// get results on an array -> nice for templating
$res['og_tags'][] =
array( 'property' => $m->getAttribute('property'),
'content' => $m->getAttribute('content') );
}
}
// end if had property
// fetch <meta name="description" ... >
if( $m->getAttribute('name') == 'description' ){
$res['description'] = $m->getAttribute('content');
}
}
// end foreach
// render JSON
echo json_encode($res, JSON_PRETTY_PRINT |
JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}
For this page it returns the following (there could be more information):
{
"title": "php - Getting title and meta tags from external website - Stack Overflow",
"og_tags": [
{
"property": "og:type",
"content": "website"
},
{
"property": "og:url",
"content": "https://stackoverflow.com/questions/3711357/getting-title-and-meta-tags-from-external-website"
},
{
"property": "og:site_name",
"content": "Stack Overflow"
},
{
"property": "og:image",
"content": "https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon@2.png?v=73d79a89bded"
},
{
"property": "og:title",
"content": "Getting title and meta tags from external website"
},
{
"property": "og:description",
"content": "I want to try figure out how to get the\n\n<title>A common title</title>\n<meta name=\"keywords\" content=\"Keywords blabla\" />\n<meta name=\"description\" content=\"This is the descript..."
}
]
}
Answer 8: (score: 3)
http://php.net/manual/en/function.get-meta-tags.php
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
Answer 9: (score: 2)
We use Apache Tika from PHP (through its command-line utility), with -j for JSON output:
<?php
shell_exec( 'java -jar tika-app-1.4.jar -j http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying' );
?>
This is sample output from a random Guardian article:
{
"Content-Encoding":"UTF-8",
"Content-Length":205599,
"Content-Type":"text/html; charset\u003dUTF-8",
"DC.date.issued":"2013-07-21",
"X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
"application-name":"The Guardian",
"article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
"article:modified_time":"2013-07-21T22:42:21+01:00",
"article:published_time":"2013-07-21T22:00:03+01:00",
"article:section":"Politics",
"article:tag":[
"Lynton Crosby",
"Health policy",
"NHS",
"Health",
"Healthcare industry",
"Society",
"Public services policy",
"Lobbying",
"Conservatives",
"David Cameron",
"Politics",
"UK news",
"Business"
],
"content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
"description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
"fb:app_id":180444840287,
"keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
"msapplication-TileColor":"#004983",
"msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
"news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
"og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
"og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
"og:site_name":"the Guardian",
"og:title":"Tory strategist Lynton Crosby in new lobbying row",
"og:type":"article",
"og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"resourceName":"tory-strategist-lynton-crosby-lobbying",
"title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
"twitter:app:id:googleplay":"com.guardian",
"twitter:app:id:iphone":409128287,
"twitter:app:name:googleplay":"The Guardian",
"twitter:app:name:iphone":"The Guardian",
"twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"twitter:card":"summary_large_image",
"twitter:site":"@guardian"
}
Answer 10: (score: 1)
Easy, and a built-in PHP function.
Answer 11: (score: 1)
<?php
// ------------------------------------------------------
function curl_get_contents($url) {
$timeout = 5;
$useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
// ------------------------------------------------------
function fetch_meta_tags($url) {
$html = curl_get_contents($url);
$mdata = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
$titlenode = $doc->getElementsByTagName('title');
$title = $titlenode->item(0)->nodeValue;
$metanodes = $doc->getElementsByTagName('meta');
foreach($metanodes as $node) {
$key = $node->getAttribute('name');
$val = $node->getAttribute('content');
if (!empty($key)) { $mdata[$key] = $val; }
}
$res = array($url, $title, $mdata);
return $res;
}
// ------------------------------------------------------
?>
Answer 12: (score: 1)
Get meta tags from a URL, PHP function example:
// Named get_meta_tags_from_url because PHP already defines get_meta_tags(); redeclaring it is a fatal error.
// load_content() is this answer's own helper (not shown); it is expected to return an array with "content" and "msg" keys.
function get_meta_tags_from_url ($url){
$html = load_content ($url,false,"");
print_r ($html);
preg_match_all ("/<title>(.*)<\/title>/", $html["content"], $title);
preg_match_all ("/<meta name=\"description\" content=\"(.*)\"\/>/i", $html["content"], $description);
preg_match_all ("/<meta name=\"keywords\" content=\"(.*)\"\/>/i", $html["content"], $keywords);
$res["content"] = @array("title" => $title[1][0], "description" => $description[1][0], "keywords" => $keywords[1][0]);
$res["msg"] = $html["msg"];
return $res;
}
Example:
print_r (get_meta_tags_from_url ("bing.com") );
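Answer 13: (score: 1)
Nowadays most websites add meta tags to their pages that provide information about the site or about a particular article page, for example on news or blog sites.
I have created a Meta API that gives you the metadata you need, such as OpenGraph, Schema.Org and so on.
Answer 14: (score: 1)
My solution (adapted from parts of cronoklee's and shamittomar's posts), so I can call it from anywhere and get JSON back. The result can easily be parsed into anything.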
<?php
header('Content-type: application/json; charset=UTF-8');
if (!empty($_GET['url']))
{
file_get_contents_curl($_GET['url']);
}
else
{
echo "No Valid URL Provided.";
}
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
echo json_encode(getSiteOG($data), JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}
function getSiteOG( $OGdata){
$doc = new DOMDocument();
@$doc->loadHTML($OGdata);
$res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;
foreach ($doc->getElementsByTagName('meta') as $m){
$tag = $m->getAttribute('name') ?: $m->getAttribute('property');
if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = utf8_decode($m->getAttribute('content'));
}
return $res;
}
?>
Answer 15: (score: 0)
As already mentioned, this does the job:
$url='http://stackoverflow.com/questions/3711357/get-title-and-meta-tags-of-external-site/4640613';
$meta=get_meta_tags($url);
echo $title=$meta['title'];
//php - Get Title and Meta Tags of External site - Stack Overflow
Answer 16: (score: 0)
I made this small Composer package based on the top answer: https://github.com/diversen/get-meta-tags
composer require diversen/get-meta-tags
Then:
use diversen\meta;
$m = new meta();
// Simple usage, gets title, description, and keywords by default
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags');
print_r($ary);
// With more params
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags', array ('description' ,'keywords'), $timeout = 10);
print_r($ary);
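It requires cURL and DOMDocument, like the top answer, and is built the same way, but it has an option for setting the cURL timeout (and for getting all kinds of meta tags).
Answer 17: (score: 0)
I got this working in a different way and thought I would share it. It is less code than the others, found here. I added a few things so that it loads the metadata of the page you are on rather than of some fixed page. I wanted it to copy the default page title and description into the og tags automatically.
For some reason, though, whichever way I tried (different scripts), the page loads very slowly online, while it otherwise loads instantly. I'm not sure why, so I will probably go with a switch case instead, since the site isn't big.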
<?php
$url = 'http://sitename.com'.$_SERVER['REQUEST_URI'];
$fp = fopen($url, 'r');
$content = "";
while(!feof($fp)) {
$buffer = trim(fgets($fp, 4096));
$content .= $buffer;
}
$start = '<title>';
$end = '<\/title>';
preg_match("/$start(.*)$end/s", $content, $match);
$title = $match[1];
$metatagarray = get_meta_tags($url);
$description = $metatagarray["description"];
echo "<div><strong>Title:</strong> $title</div>";
echo "<div><strong>Description:</strong> $description</div>";
?>
And in the HTML head:
<meta property="og:title" content="<?php echo $title; ?>" />
<meta property="og:description" content="<?php echo $description; ?>" />
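Answer 18: (score: 0)
An improvement on @shamittomar's answer above, to get the meta tags (or one specified tag) from an HTML source.
It can be improved further. The difference from PHP's default get_meta_tags is that it works even when the content contains Unicode strings.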
function getMetaTags($html, $name = null)
{
$doc = new DOMDocument();
try {
@$doc->loadHTML($html);
} catch (Exception $e) {
}
$metas = $doc->getElementsByTagName('meta');
$data = [];
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if (!empty($meta->getAttribute('name'))) {
// will ignore repeating meta tags !!
$data[$meta->getAttribute('name')] = $meta->getAttribute('content');
}
}
if (!empty($name)) {
return !empty($data[$name]) ? $data[$name] : false;
}
return $data;
}
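On the Unicode point: DOMDocument::loadHTML() does not assume UTF-8 when the markup declares no charset, which can mangle non-ASCII content. A minimal sketch, assuming the input really is UTF-8, is to prepend the same XML encoding declaration that the `extract_tags_from_url` answer above uses. The function name `meta_from_utf8_html` and the sample string are illustrative only.

```php
<?php
// Hypothetical helper: collect name => content pairs from UTF-8 HTML.
function meta_from_utf8_html(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    // Assumption: $html is UTF-8. The prefix hints the encoding to libxml.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $data = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = $meta->getAttribute('name');
        if ($name !== '') {
            $data[$name] = $meta->getAttribute('content');
        }
    }
    return $data;
}

$html = '<html><head><meta name="description" content="Café 中文"></head><body></body></html>';
$tags = meta_from_utf8_html($html);
echo $tags['description'], PHP_EOL;
```

If the page is served in another encoding, convert it to UTF-8 first; the prefix only tells the parser what to expect, it does not transcode anything.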
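Answer 19: (score: 0)
Shouldn't we be using OG?
The chosen answer is good, but it doesn't work when a site is redirected (very common!), and it doesn't return OG tags. Here is a small function that is more useful in 2018. It tries to get the OG tags and falls back to the meta tags if it can't find them: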
function getSiteOG( $url, $specificTags=0 ){
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($url));
$res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;
foreach ($doc->getElementsByTagName('meta') as $m){
$tag = $m->getAttribute('name') ?: $m->getAttribute('property');
if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = $m->getAttribute('content');
}
return $specificTags? array_intersect_key( $res, array_flip($specificTags) ) : $res;
}
/////////////
//SAMPLE USE:
print_r(getSiteOG("http://www.stackoverflow.com")); //note the incorrect url
/////////////
//OUTPUT:
Array
(
[title] => Stack Overflow - Where Developers Learn, Share, & Build Careers
[description] => Stack Overflow is the largest, most trusted online community for developers to learn, shareâ âtheir programming âknowledge, and build their careers.
[type] => website
[url] => https://stackoverflow.com/
[site_name] => Stack Overflow
[image] => https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded
)
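In the function above, whichever of the og: and plain tags appears later in the page overwrites the earlier one under the same key. If you want the Open Graph value to win regardless of order, one possible variation is sketched below; it works on an HTML string rather than a URL, and the name `pick_og_or_meta` and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: prefer og: values, fall back to <title> and name-based meta tags.
function pick_og_or_meta(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $og = [];
    $plain = [];
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $plain['title'] = trim($titleNode->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $m) {
        $property = $m->getAttribute('property');
        $name = $m->getAttribute('name');
        $content = $m->getAttribute('content');
        if (strpos($property, 'og:') === 0) {
            $og[substr($property, 3)] = $content; // "og:title" becomes "title"
        } elseif ($name !== '') {
            $plain[strtolower($name)] = $content;
        }
    }
    // Array union keeps left-hand keys, so og: values win and plain values fill the gaps.
    return $og + $plain;
}

$html = '<html><head><title>Page title</title>'
      . '<meta name="description" content="Plain description">'
      . '<meta name="keywords" content="a, b">'
      . '<meta property="og:title" content="OG title">'
      . '<meta property="og:description" content="OG description">'
      . '</head><body></body></html>';
$result = pick_og_or_meta($html);
print_r($result);
```

Keeping the two sets in separate arrays until the end makes the precedence explicit in one place, which is easier to change than order-dependent overwriting.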
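Answer 20: (score: 0)
If you are working with PHP, check out the Pear packages at pear.php.net and see whether you find anything useful. I have used the RSS package effectively, and it saves a lot of time, provided you can follow how they implement their code from their examples.
Specifically, take a look at Sax 3 and see whether it fits your needs. Sax 3 is no longer updated, but it might be sufficient.
Answer 21: (score: -1)
This is the PHP Simple HTML DOM class: a couple of lines of code to get the page's META details.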
$html = file_get_html($link);
$meta_description = $html->find('head meta[name=description]', 0)->content;
$meta_keywords = $html->find('head meta[name=keywords]', 0)->content;