Question

I'm trying to get a page's title in two ways:

From the HTML <title> tag, and from the Open Graph og:title meta tag.

So I'm using the following regular expressions:

$title_expression = "/<title>([^<]*)<\/title>/"; 
$title_og_expression = "/og:title[^>]+content=\"([^\"]*)\"[^>]*>/"; 

preg_match($this->title_expression, $this->content, $match_title);
preg_match($this->title_og_expression, $this->content, $match_title2);

$output = $match_title[1].'+'.$match_title2[1];

Is there a way to do this with just one preg_match?

Please note that I don't want one OR the other, but BOTH values.

Thanks for any suggestions!

Answer 1

DOMDocument is better suited to this task:

$doc = new DOMDocument();
@$doc->loadHTML($this->content);

$title = $doc->getElementsByTagName('title')->item(0)->textContent;

$metas = $doc->getElementsByTagName('meta');

$ogtitle = '';

foreach ($metas as $meta) {
    if ($meta->getAttribute('property') == 'og:title') {
        $ogtitle = $meta->getAttribute('content');
        break;
    }
}
$output = $title . '+' . $ogtitle;
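
The snippet above assumes the page always has a <title> element; item(0) returns null when it does not, and reading textContent from null fails. A minimal sketch of a more defensive variant follows. The function name extractTitles and the sample HTML are hypothetical, and libxml_use_internal_errors() is used here in place of the @ operator as one possible way to keep parser warnings quiet:

```php
<?php
// Hypothetical helper; not part of the original answer.
function extractTitles(string $html): array
{
    $doc = new DOMDocument();

    // Buffer libxml warnings about sloppy markup instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is null when the document has no <title> element.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode !== null ? $titleNode->textContent : '';

    $ogTitle = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === 'og:title') {
            $ogTitle = $meta->getAttribute('content');
            break;
        }
    }

    return [$title, $ogTitle];
}

// Sample input without a <title>, to exercise the null check.
$sample = '<html><head><meta property="og:title" content="OG Title"></head></html>';
[$title, $ogTitle] = extractTitles($sample);
$output = $title . '+' . $ogTitle;
echo $output, "\n";
```

With this shape a missing tag yields an empty string rather than an error, so the caller can decide how to handle pages that carry only one of the two titles.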

Answer 2

Here is an expression that matches both, but you have to check both capture groups to see which one matched. Does this help?

/<title>(.*?)<\/title>|og:title.*?content="(.*?)"/i

Fiddle: http://www.rexfiddle.net/N3Hth2o

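Edit: The expression below matches with a single capture group, but it is dangerous: if the title contains HTML-like characters, it may match something inside the title. Again, the better approach is to use a DOM parser.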

/(?:<title>|og:title.*?content=")(.*?)(?:<\/title>|".*?>)/i

Fiddle: http://www.rexfiddle.net/NBXP5rq
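
A single preg_match call stops at the first match, so an alternation like the two-group expression above returns only one of the two values per call. A rough sketch of collecting both in one call is to switch to preg_match_all and inspect which group was filled in each match. The sample $content is hypothetical, and the pattern assumes og:title appears in a meta tag's property attribute ahead of content:

```php
<?php
// Hypothetical sample input; in the question this would be $this->content.
$content = '<html><head><title>Title</title>'
         . '<meta property="og:title" content="OG Title" /></head></html>';

// Two alternatives, each with its own capture group.
// The s flag lets .*? span line breaks; i makes the match case-insensitive.
$pattern = '/<title>(.*?)<\/title>|og:title.*?content="(.*?)"/is';

$title = '';
$ogTitle = '';

// PREG_SET_ORDER yields one array per match, in document order.
if (preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        if (isset($m[2]) && $m[2] !== '') {
            $ogTitle = $m[2];   // the og:title alternative matched
        } elseif (isset($m[1]) && $m[1] !== '') {
            $title = $m[1];     // the <title> alternative matched
        }
    }
}

$output = $title . '+' . $ogTitle;
echo $output, "\n";
```

This keeps the regex approach asked for in the question, but it inherits the fragility noted above: attribute order, single quotes, or markup inside the title will defeat the pattern.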

Answer 3

Don't use RegEx to parse HTML. DOM + XPath are the tools you need.

DOMXPath::evaluate() lets you do this with a single XPath expression:

$html = <<<'HTML'
<html prefix="og: http://ogp.me/ns#">
  <head>
    <title>Title</title>
    <meta property="og:title" content="OG Title" />
  </head>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);

$title = $xpath->evaluate('concat(/html/head/title, "+", /html/head/meta[@property = "og:title"]/@content)');
var_dump($title);

Output:

string(14) "Title+OG Title"

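concat() is an XPath function that concatenates all of its arguments. If an argument is a node set, the text content of its first node is used.

/html/head/title selects the title element.

/html/head/meta[@property = "og:title"]/@content fetches the content attribute of the meta element whose property attribute is "og:title".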
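
If the two values are needed separately rather than as one joined string, the same two location paths can each be wrapped in the XPath string() function. This is a small sketch under the same assumptions as the example above (a hypothetical sample document with both tags inside head); it uses two evaluate() calls instead of one:

```php
<?php
// Hypothetical sample document, mirroring the one used above.
$html = '<html><head><title>Title</title>'
      . '<meta property="og:title" content="OG Title" /></head></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() converts a node set to the text of its first node,
// and to an empty string when the node set is empty.
$title   = $xpath->evaluate('string(/html/head/title)');
$ogTitle = $xpath->evaluate('string(/html/head/meta[@property = "og:title"]/@content)');

var_dump($title, $ogTitle);
```

Because string() returns an empty string for an empty node set, a page missing either tag produces '' for that value instead of an error.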

2 preg_match calls vs. 1?
