I want to collect all URLs from an HTML page into an array. The page looks like this:
This is a webpage <a href="http://somesite.com/link1.php">with</a>
different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>
The output I want is:
with => http://somesite.com/link1.php
What I get now is:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I don't want links that have an image between the opening and closing <a> tags, only the ones that contain text.
My current code is:
<?php
// Return the inner HTML of a DOM node as a string.
function innerHTML($node) {
    $ret = '';
    // Use a separate loop variable so the $node parameter is not overwritten.
    foreach ($node->childNodes as $child) {
        $ret .= $child->ownerDocument->saveHTML($child);
    }
    return $ret;
}

$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses the warnings caused by malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
    //$node = $link->nodeValue;
    $node = innerHTML($link);
    $href = $link->getAttribute('href');
    // Keep only links whose href ends in .pdf
    if (preg_match('/\.pdf$/i', $href)) {
        $result[$node] = $href;
    }
}
print_r($result);
?>
Answer 0 (score: -1)
Add a second preg_match to the condition:
// \b also matches <img> and <img/>, which the pattern '/<img .*>/i' would miss
if (preg_match('/\.pdf$/i', $href) && !preg_match('/<img\b/i', $node)) {
    $result[$node] = $href;
}
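As an alternative to running a regular expression over the serialized inner HTML, the check can probably be done on the DOM itself. The following is only a minimal sketch, not the asker's or the answerer's code: `textLinks` is a hypothetical helper name, it skips every anchor that contains an `<img>` descendant, and it uses `textContent` as the array key. The `.pdf` filter from the question is left out here so the sample page from the question can be used as input; it could be added back inside the loop.

```php
<?php
// Hypothetical helper (not part of the original post): returns an array of
// link text => href for anchors that contain no <img> element.
function textLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Skip anchors that have an <img> anywhere inside them.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }
        // textContent is the plain text of the anchor, without any markup.
        $text = trim($link->textContent);
        if ($text !== '') {
            $result[$text] = $link->getAttribute('href');
        }
    }
    return $result;
}

// Sample input taken from the question.
$html = 'This is a webpage <a href="http://somesite.com/link1.php">with</a> '
      . 'different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>';
print_r(textLinks($html));
```

One design consequence to be aware of: an anchor that mixes text and an image is dropped entirely by this sketch. If such links should be kept under their text, the `continue` branch would need to be relaxed.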