Question

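I'm trying to write a script that extracts the canonical URL from a remote page. I'm not a professional developer, so if anything in my code looks ugly, any explanation would be (and will be) appreciated.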

What I want to do is look for:

    <meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />
    <link rel="canonical" href="http://www.another-canonical-url.com/is-here" />

...and extract the URL from them.

My code so far:

    $content = file_get_contents($url);
    $content = strtolower($content);
    $content = preg_replace("'<style[^>]*>.*</style>'siU",'',$content);  // strip css
    $content = preg_replace("'<script[^>]*>.*</script>'siU",'',$content); // strip js
    $split = explode("\n",$content); // Separate each line

    foreach ($split as $k => $v) // For each line
    {
        if (strpos(' '.$v,'<meta') || strpos(' '.$v,'<link')) // If contains a <meta or <link
        {
        // Check with regex and if found, return what I need (the URL)
        }
    }
    return $split_content;

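I've been fighting with regular expressions for hours trying to figure out how to do this, but it seems to be well beyond my knowledge.

Does anyone know how I should define this rule? Also, does my script look OK to you, or is there room for improvement?

Thanks a lot!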

Answer 1

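Consider using DOMDocument: load the HTML into a DOMDocument object, call getElementsByTagName, then loop over the results until one has the right attribute. It works much as it would if you were writing JavaScript.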
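
The loop described above could look roughly like the following. This is only a minimal sketch: the function name `findCanonicalLink` and the `$sample` markup are made up for illustration, and the `libxml_use_internal_errors` calls are an added assumption to keep parser warnings about sloppy real-world HTML out of the output.

```php
<?php
// Hypothetical helper (not from the answer): returns the href of the first
// <link rel="canonical"> element, or null when none is present.
function findCanonicalLink(string $html): ?string
{
    $dom = new DOMDocument();
    // Assumption: silence libxml warnings, since real pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens and is case-insensitive.
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('canonical', $rels, true)) {
            return $link->getAttribute('href');
        }
    }
    return null;
}

// Illustrative input only.
$sample = '<html><head><link rel="stylesheet" href="/style.css" />'
        . '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />'
        . '</head><body></body></html>';
echo findCanonicalLink($sample), PHP_EOL;
```

Only the `rel` value is lowercased for the comparison, so the URL keeps its original case, unlike a `strtolower` over the whole page.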

Answer 2

With DOMDocument, this is how you get the attributes and their values:

    $html = '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />';
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $attr = array();
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttributes()) {
            foreach ($meta->attributes as $attribute) {
                $attr[$attribute->nodeName] = $attribute->nodeValue;
            }
        }
    }

    print_r($attr);

Output:

    Array
    (
        [property] => og:url
        [content] => http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html
    )

You can get the second URL the same way:

    $html = '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />';
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $attr = array();
    foreach ($dom->getElementsByTagName('link') as $link) {
        if ($link->hasAttributes()) {
            foreach ($link->attributes as $attribute) {
                $attr[$attribute->nodeName] = $attribute->nodeValue;
            }
        }
    }

    print_r($attr);

Output:

    Array
    (
        [rel] => canonical
        [href] => http://www.another-canonical-url.com/is-here
    )
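
Both snippets above parse a single tag. On a full page with several `<meta>` and `<link>` elements, `$attr` would be overwritten on every iteration and end up holding only the last tag's attributes. One way around that, sketched here under assumptions, is to test the identifying attribute inside the loop and return as soon as it matches. The helper name `findAttributeValue` and the `$page` markup are hypothetical, and suppressing libxml warnings is an added assumption for real-world HTML.

```php
<?php
// Hypothetical helper: scans all <$tag> elements and returns the $wanted
// attribute of the first one whose $key attribute equals $value.
function findAttributeValue(DOMDocument $dom, string $tag, string $key, string $value, string $wanted): ?string
{
    foreach ($dom->getElementsByTagName($tag) as $element) {
        // Compare case-insensitively; the returned value is left untouched.
        if (strcasecmp($element->getAttribute($key), $value) === 0
            && $element->hasAttribute($wanted)) {
            return $element->getAttribute($wanted);
        }
    }
    return null;
}

// Illustrative page with extra tags around the two we care about.
$page = '<html><head>'
      . '<meta property="og:title" content="Some title" />'
      . '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />'
      . '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />'
      . '<meta name="description" content="Some description" />'
      . '</head><body></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // assumption: hide warnings from imperfect markup
$dom->loadHTML($page);
libxml_clear_errors();

$urls = array(
    'og:url'    => findAttributeValue($dom, 'meta', 'property', 'og:url', 'content'),
    'canonical' => findAttributeValue($dom, 'link', 'rel', 'canonical', 'href'),
);
print_r($urls);
```

Either entry is `null` when the page lacks the corresponding tag, so the caller can decide which of the two URLs to prefer.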

Regex to retrieve the og:url meta or link rel="canonical"
