Hello. I want to extract the content of the og:image meta tag from a page's HTML source. How can I do that?
This is the meta tag:
<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />
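How can I identify the meta tag with a regular expression?
This is the function I currently use to scrape image URLs from img tags. What changes are needed to make it read the og:image meta tag instead?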
function feeds_imagegrabber_scrape_images($content, $base_url, array $options = array(), &$error_log = array()) {
  // Merge the default options.
  $options += array(
    'expression' => '//img',
    'getsize' => TRUE,
    'max_imagesize' => 512000,
    'timeout' => 10,
    'max_redirects' => 3,
    'feeling_lucky' => 0,
  );
  // Try to parse the content as XML first, then fall back to HTML.
  $doc = new DOMDocument();
  if (@$doc->loadXML($content) === FALSE && @$doc->loadHTML($content) === FALSE) {
    $error_log['code'] = -5;
    $error_log['error'] = "unable to parse the xml//html content";
    return FALSE;
  }
  $xpath = new DOMXPath($doc);
  $hrefs = @$xpath->evaluate($options['expression']);
  if ($options['getsize']) {
    timer_start(__FUNCTION__);
  }
  $images = array();
  $imagesize = 0;
  for ($i = 0; $i < $hrefs->length; $i++) {
    $url = $hrefs->item($i)->getAttribute('src');
    if (empty($url)) {
      continue;
    }
    if (function_exists('encode_url')) {
      $url = encode_url($url);
    }
    $url = url_to_absolute($base_url, $url);
    if ($url == FALSE) {
      continue;
    }
    if ($options['getsize']) {
      // Keep the image only if its download size is within the limit.
      if (($imagesize = feeds_imagegrabber_validate_download_size($url, $options['max_imagesize'], ($options['timeout'] - timer_read(__FUNCTION__) / 1000))) != -1) {
        $images[$url] = $imagesize;
        // The option lives in $options; $settings is never defined in this function.
        if ($options['feeling_lucky']) {
          break;
        }
      }
      if (($options['timeout'] - timer_read(__FUNCTION__) / 1000) <= 0) {
        $error_log['code'] = FIG_HTTP_REQUEST_TIMEOUT;
        $error_log['error'] = "timeout occurred while scraping the content";
        break;
      }
    }
    else {
      $images[$url] = $imagesize;
      if ($options['feeling_lucky']) {
        break;
      }
    }
  }
  return $images;
}
Answer 0 (score: 3)
If you must use a regular expression, this works for the tag shown above:
<meta.*property="og:image".*content="(.*)".*\/>
Regex example: http://regex101.com/r/rX1zK7
PHP example:
$html = '<html>
<head>
<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />
</head>
<body>
</body>
</html>';
preg_match_all('/<meta.*property="og:image".*content="(.*)".*\/>/', $html, $matches);
echo $matches[1][0];
Output:
http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg
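The pattern above relies on greedy `.*`, so it assumes `property` comes before `content`, double quotes, a self-closing `/>`, and a single meta tag per line. A minimal sketch of a more tolerant variant follows. The function name `extract_og_image_regex` is hypothetical (not part of the original answer), and the type hints assume PHP 7.1 or later. It still treats HTML as text, so it can be fooled by unusual markup such as a `>` inside an attribute value.

```php
<?php
// Hypothetical helper (not from the original answer): returns the first
// og:image URL found in $html, or null when there is none.
function extract_og_image_regex(string $html): ?string {
    // Step 1: collect every <meta ...> tag. [^>]* cannot run past the
    // closing '>', so neighbouring tags are never swallowed.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Step 2: keep only tags whose property attribute is og:image
        // (single or double quotes).
        if (!preg_match('/\bproperty\s*=\s*(["\'])og:image\1/i', $tag)) {
            continue;
        }
        // Step 3: read the content attribute, wherever it sits in the tag.
        if (preg_match('/\bcontent\s*=\s*(["\'])(.*?)\1/i', $tag, $m)) {
            // Decode entities such as &amp; that may appear in the URL.
            return html_entity_decode($m[2], ENT_QUOTES);
        }
    }
    return null;
}
```

Calling `extract_og_image_regex($html)` with the `$html` string from the example above should return the same image URL.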
Answer 1 (score: 2)
Use the DOMDocument class:
<?php
$html = '<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
// Walk every <meta> tag and print the content of the og:image one.
foreach ($dom->getElementsByTagName('meta') as $tag) {
    if ($tag->getAttribute('property') === 'og:image') {
        echo $tag->getAttribute('content');
    }
}
Output:
http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg
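Since the function in the question already builds a DOMXPath object, the same lookup can probably be expressed as an XPath query. In that function, the analogous change would likely be to set the 'expression' option to `//meta[@property="og:image"]` and to read `getAttribute('content')` instead of `getAttribute('src')`. The standalone sketch below illustrates the idea. The name `extract_og_images_xpath` is hypothetical, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer): returns the content
// of every og:image meta tag in $html, in document order.
function extract_og_images_xpath(string $html): array {
    // loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($loaded === false) {
        return [];
    }
    $xpath = new DOMXPath($doc);
    $urls = [];
    // Select the content attribute nodes of the matching meta tags directly.
    foreach ($xpath->query('//meta[@property="og:image"]/@content') as $attr) {
        $url = trim($attr->nodeValue);
        if ($url !== '') {
            $urls[] = $url;
        }
    }
    return $urls;
}
```

A page may declare several og:image tags, which is why this sketch returns an array; take the first element if only one image is wanted.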