Question

I'm working with a complete HTML document and need to extract URLs from it, but only the ones that match a specific domain.

<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>

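From that junk I need to get the full URL for example.com, but not the other one (notexample.com). That would be "http://example.com/foo/bar" or, even better, only the last part of that URL (bar), which of course will be different each time.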

I hope I've been clear enough. Thanks a lot!

EDIT: using PHP

Answer 1

You should avoid using regex to parse HTML like this. Here is DOM parser based code that does what you need:

$html = <<< EOF
<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a[@href]"); // gets all the links that have an href
foreach ($nodelist as $node) {
    $val = $node->getAttribute('href');
    // keep only example.com links and capture whatever follows /foo/
    if (preg_match('#^https?://example\.com/foo/(.*)$#', $val, $m)) {
        echo "$m[1]\n"; // prints bar
    }
}
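
If the path prefix is not always /foo/, a variation on the same idea is to let parse_url() split each href and compare the host directly, then take the last path segment. The sketch below is only an illustration under that assumption: the helper name lastSegmentsForHost is hypothetical and not part of the answer above, and it matches the host exactly (so www.example.com would not count as example.com).

```php
<?php
// Hypothetical helper (not from the original answer): returns the last path
// segment of every link whose host is exactly $host.
function lastSegmentsForHost(string $html, string $host): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep sloppy markup from emitting warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $segments = [];
    foreach ($xpath->query('//a[@href]') as $link) {
        $parts = parse_url($link->getAttribute('href'));
        // Skip malformed URLs, relative links, and links on any other host
        if ($parts === false || strcasecmp($parts['host'] ?? '', $host) !== 0) {
            continue;
        }
        $path = rtrim($parts['path'] ?? '', '/');
        if ($path !== '') {
            $segments[] = basename($path); // last part of the path
        }
    }
    return $segments;
}

$html = '<html><div>junk <a href="http://example.com/foo/bar">example.com</a> '
      . 'morejunk <a href="http://notexample.com/foo/bar">notexample.com</a></div></html>';

print_r(lastSegmentsForHost($html, 'example.com'));
```

For the sample HTML from the question this should return a single entry, bar, because the notexample.com link is rejected by the host comparison rather than by a pattern on the whole URL.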

Regex to get part of a URL from an HTML string