Question

I am parsing an external document and making all of the links in it absolute. For example:

    <link rel="stylesheet" type="text/css" href="/css/style.css" />

would be replaced with:

    <link rel="stylesheet" type="text/css" href="http://www.hostsite.com/css/style.css" />

where http://www.hostsite.com is the base URL of the document.

Here is what I tried, which failed:

    $linkfix1 = str_replace('href=\"\/', 'href=\"$url\/', $code);

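There are several questions on the site about doing this replacement on a single URL string, but I couldn't find anything that works on URLs embedded in a document. Are there any good suggestions for making all of these links absolute?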

Answer 1

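You don't need to escape double quotes in a string that uses single quotes.

You don't need to escape the forward slashes at all either.

You just want: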

str_replace('href="', 'href="http://hostsite.com', $replace_me);

To be safe, so that you don't prefix every link with hostsite:

str_replace('href="/css/', 'href="http://hostsite.com/css/', $replace_me);
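
To make the quoting point concrete, here is a small sketch of why the attempt in the question changes nothing, next to a replacement that targets only root-relative links. `$base` and `$html` are made-up sample values, and the `href="/` search string assumes double-quoted attributes with no extra whitespace.

```php
<?php
// Hypothetical sample input; $base and $html are illustrative only.
$base = 'http://www.hostsite.com';
$html = '<link rel="stylesheet" href="/css/style.css" />'
      . '<a href="http://other.example/page">external</a>';

// In single quotes, \" stays a literal backslash plus quote and $url is
// not interpolated, so this search string never occurs in the markup.
$unchanged = str_replace('href=\"\/', 'href=\"$url\/', $html);

// Match only root-relative links (href="/...) and prepend the base URL.
$fixed = str_replace('href="/', 'href="' . $base . '/', $html);

echo $fixed, PHP_EOL;
```

This leaves the external link alone, but it would also rewrite protocol-relative links such as `href="//cdn.example/x.css"` and it misses single-quoted attributes, so it is only a reasonable shortcut when the markup is known and uniform.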

Answer 2

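Public service announcement: don't use regular expressions to rewrite elements of a formatted document.

The correct way to do this is to load the document as an entity (DOMDocument or SimpleXMLElement) and process it in terms of nodes and values. The original solution also didn't handle src attributes or resolve base-relative URLs (e.g. /css/style.css).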
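
As a rough illustration of why working on nodes is more robust than matching text, the sketch below parses a made-up fragment whose attributes use single quotes, upper case, and spaces around the equals sign. The markup in `$html` is hypothetical; the only calls used are the standard DOMDocument ones.

```php
<?php
// Hypothetical markup with quoting and spacing variants that a literal
// search for href="/ would miss.
$html = "<html><head><link rel='stylesheet' href='/css/a.css'>"
      . '<link rel="stylesheet" HREF = "/css/b.css"></head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$found = array();
foreach ($doc->getElementsByTagName('link') as $link) {
    // The parser has already dealt with quoting, spacing and attribute case.
    $found[] = $link->getAttribute('href');
}

print_r($found);
```

Both values come back through the same `getAttribute('href')` call, which is what lets the rewriting logic ignore how the attribute was written in the source.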

Here is a mostly proper solution, which can be extended if needed:

# Example URL
$url = "http://www.stackoverflow.com/";

# Get the root and current directory
$pattern = "/(.*\/\/[^\/]+\/)([^?#]*\/)?/";
/*  The pattern has two groups: one for the domain (anything before
    the first two slashes, the slashes, anything until the next slash,
    and the next slash) and one for the current directory (anything
    that isn't an anchor or query string, then the last slash before
    any anchor or query string).  For a URL such as
    http://stackoverflow.com/question/123412341234 this yields:
    - [0]: http://stackoverflow.com/question/
    - [1]: http://stackoverflow.com/
    - [2]: question/
    We only need [0] (the entire match) and [1] (just the first group).
*/
$matches = array();
preg_match($pattern, $url, $matches);
$cd = $matches[0];
$root = $matches[1];

# Normalizes the URL on the provided element's attribute
function normalizeAttr($element, $attr){
    global $cd, $root;
    $href = $element->getAttribute($attr);
    # Ignore missing or empty attributes (e.g. an inline <script>) and fragment-only links
    if($href === '' || $href[0] == '#')
        return;
    # If this is an external URL (it has a scheme or is protocol-relative), ignore
    if(preg_match('/^([a-z][a-z0-9+.-]*:|\/\/)/i', $href))
        return;
    # If this is a base-relative URL, prepend the root
    elseif(substr($href, 0, 1) == '/')
        $element->setAttribute($attr, $root . substr($href, 1));
    # Otherwise this is a relative URL, so prepend the current directory
    else
        $element->setAttribute($attr, $cd . $href);
}

# Load in the data, ignoring HTML5 errors
$page = new DOMDocument();
libxml_use_internal_errors(true);
$page->loadHTMLFile($url);
libxml_use_internal_errors(false);
$page->normalizeDocument();

# Normalize <link href="..."/>
foreach($page->getElementsByTagName('link') as $link)
    normalizeAttr($link, 'href');
# Normalize <a href="...">...</a>
foreach($page->getElementsByTagName('a') as $anchor)
    normalizeAttr($anchor, 'href');
# Normalize <img src="..."/>
foreach($page->getElementsByTagName('img') as $image)
    normalizeAttr($image, 'src');
# Normalize <script src="..."></script>
foreach($page->getElementsByTagName('script') as $script)
    normalizeAttr($script, 'src');

# Render normalized data
print $page->saveHTML();
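
To check what the pattern actually captures, the following sketch runs it on a made-up URL and shows how the two pieces are combined with root-relative and plain relative links. The pattern is the one from the code above; the URL and the two link values are illustrative assumptions.

```php
<?php
// Same pattern as in the answer; the URL is a made-up example.
$pattern = "/(.*\/\/[^\/]+\/)([^?#]*\/)?/";
$url = "http://stackoverflow.com/questions/1234/some-title?tab=votes#top";

preg_match($pattern, $url, $matches);

$cd   = $matches[0];  // root plus directory path, up to the last slash before ? or #
$root = $matches[1];  // scheme and host with a trailing slash

echo $cd, PHP_EOL;
echo $root, PHP_EOL;

// A root-relative link resolves against $root, a plain relative one against $cd.
$fromRoot = $root . substr('/css/style.css', 1);
$fromDir  = $cd . 'img/logo.png';

echo $fromRoot, PHP_EOL;
echo $fromDir, PHP_EOL;
```

Because this is plain string concatenation, `../` segments in a relative link are left in the result rather than collapsed; the browser will still resolve them, but the output is not a fully normalized URL.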

Fixing relative links in PHP
