Question

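I am using cURL to pull the contents of a remote site. I need to check every "href=" attribute and determine whether it holds a relative or an absolute path, then take the link's value and rewrite it as href="http://www.website.com/index.php?url=[ABSOLUTE_PATH]"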

Any help is greatly appreciated.

Answer 1

A combination of a regular expression* and PHP's parse_url() should help:

// find all links in a page written with href="" or href='' syntax
$links = array();
preg_match_all('/href=(?:"([^"]+)"|\'([^\']+)\')/i', $page_contents, $links, PREG_SET_ORDER);

// iterate through each match and check whether it is "absolute"
$urls = array();
foreach ($links as $match) {
    // group 1 holds a double-quoted value, group 2 a single-quoted one
    $link = ($match[1] !== '') ? $match[1] : $match[2];
    $path = $link;
    if ((substr($link, 0, 7) == 'http://') || (substr($link, 0, 8) == 'https://')) {
        // the current link is an "absolute" URL - parse it to get just the path
        $parsed = parse_url($link);
        // a bare host such as http://example.com has no path component
        $path = isset($parsed['path']) ? $parsed['path'] : '/';
    }
    $urls[] = 'http://www.website.com/index.php?url=' . $path;
}

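To decide whether a URL is absolute, I simply check whether it begins with http:// or https://. If your URLs use other schemes, such as ftp:// or tel:, you may need to handle those as well.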
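One way to cover other schemes without listing every prefix is to let parse_url() report the scheme. This is only a minimal sketch: the helper name is_absolute_url() is hypothetical and not part of the answer above, and treating protocol-relative links (//host/path) as absolute is an assumption. parse_url() is not a validator, so unusual inputs may be classified unexpectedly.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when $url
// carries its own scheme (http:, https:, ftp:, tel:, ...) or is protocol-relative.
function is_absolute_url($url)
{
    // Assumption: "//host/path" has no scheme but still names a host, so count it as absolute.
    if (substr($url, 0, 2) === '//') {
        return true;
    }

    // parse_url() yields null when the component is missing and false on malformed input.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme) && $scheme !== '';
}
```

With this in place, the http:// and https:// substr() checks in the loop could be replaced by a single is_absolute_url($link) call.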

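* This solution does use a regular expression to parse HTML, which is generally frowned upon. To avoid that, you could switch to DOMDocument, but if the regex is not causing any problems there is no need for the extra code.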
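For comparison, a DOMDocument version of the link extraction might look like the sketch below. The function name extract_hrefs() is hypothetical, and restricting the search to <a> elements is an assumption; the regex above matches href on any tag.

```php
<?php
// Hypothetical helper (not from the original answer): collects the href value
// of every <a> element in the given HTML string.
function extract_hrefs($html)
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; buffer parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip anchors such as <a name="top"> that carry no href at all.
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }

    return $hrefs;
}
```

Calling extract_hrefs($page_contents) returns a flat array of link values, so the absolute-or-relative check in the loop can run on each entry directly, regardless of which quote style the page used.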

Answer 2

If I have understood the question correctly, here is a possible solution:

$prefix = 'http://www.website.com/index.php?url=';
$regex = '~(<a.*?href\s*=\s*")(.*?)(".*?>)~is';
$html = file_get_contents('http://cnn.com');

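$html = preg_replace_callback($regex, function($input) use ($prefix) {
  $parsed = parse_url($input[2]);

  // a purely relative link parses to nothing but a path component
  if (is_array($parsed) && sizeof($parsed) == 1 && isset($parsed['path'])) {
    return $input[1] . $prefix . $parsed['path'] . $input[3];
  }

  // leave every other link untouched; returning nothing would delete the whole tag
  return $input[0];
}, $html);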

echo $html;

PHP regex to determine relative or absolute path
