Question

I want to collect all URLs from an HTML page into an array. Given markup like this:

This is a webpage <a href="http://somesite.com/link1.php">with</a> 
different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>

The output I want is:

with => http://somesite.com/link1.php

What I get now is:

<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php

I don't want links that contain an image between the opening and closing <a> tags, only the ones that contain text.

My current code is:

<?php

function innerHTML($node) {
    $ret = '';

    // Use a separate loop variable so the $node parameter is not overwritten.
    foreach ($node->childNodes as $child) {
        $ret .= $child->ownerDocument->saveHTML($child);
    }

    return $ret;
}

$html = file_get_contents('http://somesite.com/'.$_GET['apt']);

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses the warnings raised for malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();

foreach ($links as $link) {
    //$node = $link->nodeValue;
    $node = innerHTML($link);
    $href = $link->getAttribute('href');

    if (preg_match('/\.pdf$/i', $href))
            $result[$node] = $href;
}

print_r($result);

?>

Answer 1

Add a second preg_match to the condition, so that links whose inner HTML contains an <img> tag are skipped:

if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
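
As an alternative to matching the inner HTML with a regular expression, the check can be done on the DOM itself by asking each anchor whether it has any <img> descendants. The following is only a minimal sketch under a few assumptions: collectTextLinks() is a hypothetical helper name, the sample markup is taken from the question, the array is keyed by the trimmed link text rather than the inner HTML, and the .pdf filter from the question's code is left out so that the sample link is not excluded.

```php
<?php
// Hypothetical helper: returns [link text => href] for anchors without images.
function collectTextLinks(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings for malformed HTML internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Skip anchors that wrap an image anywhere in their subtree.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }

        // Skip anchors that have no visible text at all.
        $text = trim($link->textContent);
        if ($text === '') {
            continue;
        }

        $result[$text] = $link->getAttribute('href');
    }

    return $result;
}

// Sample markup from the question.
$sample = 'This is a webpage <a href="http://somesite.com/link1.php">with</a> '
        . 'different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>';

print_r(collectTextLinks($sample));
```

For the sample markup this should keep only the "with" entry. Because the array is keyed by link text, two links with the same text would overwrite each other, which is also true of the original code.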

PHP: DOM get URLs and anchor text (but not IMG)
