Question

I'm trying to use preg_match_all to extract all URLs from a block of HTML code. I also want to ignore all images.

Sample HTML block:

$html = '<p>This is a test</p><br>http://www.facebook.com<br><img src="http://www.google.com/photo.jpg">www.yahoo.com https://www.aol.com<br>';

I'm using the following to try to build an array of URLs only (not images):

if(preg_match_all('~(?:(?:https://)|(?:http://)|(?:www\.))(?![^" ]*(?:jpg|png|gif|"))[^" <>]+~', $html, $links))
{ 
 print_r($links); 
}

In the example above, the $links array should contain:

http://www.facebook.com, www.yahoo.com, https://www.aol.com

Google is excluded because its URL contains the .jpg image extension. The problem appears when I add an image like this to $html:

<img src="http://www.google.com/image%201.jpg">

It seems the percent sign causes preg_match_all to split the URL and extract the following "link":

http://www.google.com/image

Any ideas on how to grab only the URLs that are not images (even if they contain the special characters URLs commonly have)?

Answer 1

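Using the DOM lets you work with the structure of the HTML document. In your case, it lets you identify the parts you want to fetch URLs from.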

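1. Use DOM to load the HTML
2. Use XPath to fetch the URLs from link href attributes (only if you want them, too)
3. Use XPath to fetch the text nodes
4. Use a RegEx on the text node values to match URLs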

Here is an example implementation:

$html = <<<'HTML'
  <p>This is a test</p>
  <br>
  http://www.facebook.com
  <br>
  <img src="http://www.google.com/photo.jpg">
  www.yahoo.com 
  https://www.aol.com
  <a href="http://www.google.com">Link</a>
  <!-- http://comment.ignored.url -->
  <br>
HTML;

$urls = array();

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// fetch urls from link href attributes
foreach ($xpath->evaluate('//a[@href]/@href') as $href) {
  $urls[] = $href->value;
}

// fetch urls inside text nodes
$pattern = '(
 (?:(?:https?://)|(?:www\.))
 (?:[^"\'\\s]+)
)xS';
foreach ($xpath->evaluate('/html/body//text()') as $text) {
  $matches = array();
  preg_match_all($pattern, $text->nodeValue, $matches);
  foreach ($matches[0] as $href) {
    $urls[] = $href;
  }
}

var_dump($urls);

Output:

array(4) {
  [0]=>
  string(21) "http://www.google.com"
  [1]=>
  string(23) "http://www.facebook.com"
  [2]=>
  string(13) "www.yahoo.com"
  [3]=>
  string(19) "https://www.aol.com"
}
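
The example above skips image URLs only because they sit in an img src attribute, which is neither a link href nor a text node. If an image URL can also appear as plain text or as a link target, one possible extension is to filter the collected URLs by file extension. The following is a minimal sketch under stated assumptions, not part of the original answer: the function name extractNonImageUrls and the extension list (jpg, jpeg, png, gif) are hypothetical, and libxml warnings are silenced because HTML fragments often trigger them.

```php
<?php
// Hypothetical helper (not from the original answer): collects URLs from
// link href attributes and text nodes, then drops those that look like images.
function extractNonImageUrls(string $html): array
{
    $dom = new DOMDocument();
    // Silence libxml warnings that fragments or sloppy markup can trigger.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $candidates = [];

    // URLs from link href attributes.
    foreach ($xpath->evaluate('//a[@href]/@href') as $href) {
        $candidates[] = $href->value;
    }

    // URLs inside text nodes (comments and attributes are not text nodes).
    $pattern = '~(?:https?://|www\.)[^"\'\s<>]+~';
    foreach ($xpath->evaluate('/html/body//text()') as $text) {
        if (preg_match_all($pattern, $text->nodeValue, $matches)) {
            array_push($candidates, ...$matches[0]);
        }
    }

    // Assumed list of image extensions; adjust as needed.
    $imageExtensions = ['jpg', 'jpeg', 'png', 'gif'];

    $urls = [];
    foreach ($candidates as $url) {
        // For scheme-less input such as "www.example.com/a.png",
        // parse_url() reports the whole string as the path.
        $path = (string) parse_url($url, PHP_URL_PATH);
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (!in_array($extension, $imageExtensions, true)) {
            $urls[] = $url;
        }
    }

    return $urls;
}
```

Because the extension is read from the parsed path rather than matched inside the regular expression, a percent-encoded name such as image%201.jpg is still recognized as an image, and a query string after the file name does not hide the extension.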
