How do I get the final URL after HTTP redirects in pure PHP?

Time: 2010-09-26 18:06:00

Tags: php http http-headers

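What I'd like to do is find out the last/final URL after following all redirects.

I'd prefer not to use cURL. I'd like to stick with pure PHP (stream wrappers).

Right now I have a URL (let's say http://domain.test), and I use get_headers() to get specific headers from that page. get_headers() also returns multiple Location: headers (see Edit below). Is there a way to use those headers to build the final URL? Or is there a PHP function that does this automatically?

Edit: get_headers() follows redirects and returns all the headers for each response/redirect, so I have all the Location: headers.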
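Since the question already has every Location: header from get_headers(), one way to pick out the last one can be sketched as follows. This is a minimal sketch, not a built-in: `last_location_from_lines()` is a hypothetical helper, and the sample array stands in for what get_headers($url) might return as a flat list of raw header lines.

```php
<?php
/**
 * Return the value of the last Location header found in a flat list of raw
 * header lines. Hypothetical helper for illustration; not part of PHP.
 */
function last_location_from_lines(array $lines)
{
    $last = null;
    foreach ($lines as $line) {
        // Header names are case-insensitive, so accept "location:" as well.
        if (preg_match('/^Location:\s*(.*)$/i', $line, $m)) {
            $last = trim($m[1]);
        }
    }
    return $last; // null when no Location header was seen
}

// Sample data (an assumption) shaped like a flat get_headers() result.
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://www.domain.test/',
    'HTTP/1.1 302 Found',
    'location: http://www.domain.test/home',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
);

echo last_location_from_lines($sample), "\n";
```

The last Location value can still be a relative reference, in which case it has to be resolved against the URL that produced it before it is usable as a final URL.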

6 answers:

Answer 0 (score: 40)

function getRedirectUrl ($url) {
    stream_context_set_default(array(
        'http' => array(
            'method' => 'HEAD'
        )
    ));
    $headers = get_headers($url, 1);
    if ($headers !== false && isset($headers['Location'])) {
        return $headers['Location'];
    }
    return false;
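}

Additionally...

As mentioned in the comments, the final item in $headers['Location'] will be the final URL after all redirects. It's important to note, though, that it won't always be an array. Sometimes it's just a plain, non-array string. In that case, trying to access the last array element will most likely return a single character. Not ideal.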

If you're only interested in the final URL after all redirects, I'd suggest changing

return $headers['Location'];

to

return is_array($headers['Location']) ? array_pop($headers['Location']) : $headers['Location'];

...

which is just shorthand for
if(is_array($headers['Location'])){
     return array_pop($headers['Location']);
}else{
     return $headers['Location'];
}

This fix handles both cases (array and non-array) and removes the need to clean up the final URL after calling the function.
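To make the string-versus-array pitfall concrete, here is a small sketch with made-up values. `last_location()` is a hypothetical helper that uses an array cast instead of the ternary; it is only one way to normalise the two shapes.

```php
<?php
// With a single redirect the Location entry is a plain string.
$single = 'http://domain.test/final';
// With several redirects it is an array, in the order they were followed.
$multiple = array('http://domain.test/a', 'http://domain.test/final');

// Indexing the string form as if it were an array yields one character.
$wrong = $single[strlen($single) - 1]; // the letter "l", not a URL

// Casting to an array first makes both shapes behave the same way.
function last_location($location)
{
    $all = (array) $location; // a string becomes a one-element array
    return end($all);         // last element of the array
}

echo last_location($single), "\n";
echo last_location($multiple), "\n";
```

Both calls print http://domain.test/final, whereas `$wrong` holds only the trailing character of the string.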

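If there is no redirect, the function returns false. Likewise, it returns false for invalid URLs (invalid for whatever reason). So it's important to check the URL for validity before running this function, or else to incorporate the redirect check into your validation.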
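One possible validity check is sketched below. It assumes that "valid" means a syntactically well-formed absolute http or https URL; `is_checkable_url()` is a hypothetical name, and whether this is strict enough depends on the application.

```php
<?php
/**
 * Rough pre-check before asking for redirect headers.
 * Hypothetical helper; it only tests syntax, not whether the host exists.
 */
function is_checkable_url($url)
{
    // filter_var() rejects strings without a scheme and host.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Only HTTP(S) URLs can answer with an HTTP redirect.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

var_dump(is_checkable_url('http://domain.test/page'));
var_dump(is_checkable_url('domain.test'));
```

With a check like this in front, a false result from the redirect function can be read as "no redirect" rather than "bad input".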

Answer 1 (score: 29)

/**
 * get_redirect_url()
 * Gets the address that the provided URL redirects to,
 * or FALSE if there's no redirect. 
 *
 * @param string $url
 * @return string
 */
function get_redirect_url($url){
    $redirect_url = null; 

    $url_parts = @parse_url($url);
    if (!$url_parts) return false;
    if (!isset($url_parts['host'])) return false; //can't process relative URLs
    if (!isset($url_parts['path'])) $url_parts['path'] = '/';

    $sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    if (!$sock) return false;

    $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
    $request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
    $request .= "Connection: Close\r\n\r\n"; 
    fwrite($sock, $request);
    $response = '';
    while(!feof($sock)) $response .= fread($sock, 8192);
    fclose($sock);

    if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
        if ( substr($matches[1], 0, 1) == "/" )
            return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
        else
            return trim($matches[1]);

    } else {
        return false;
    }

}

/**
 * get_all_redirects()
 * Follows and collects all redirects, in order, for the given URL. 
 *
 * @param string $url
 * @return array
 */
function get_all_redirects($url){
    $redirects = array();
    while ($newurl = get_redirect_url($url)){
        if (in_array($newurl, $redirects)){
            break;
        }
        $redirects[] = $newurl;
        $url = $newurl;
    }
    return $redirects;
}

/**
 * get_final_url()
 * Gets the address that the URL ultimately leads to. 
 * Returns $url itself if it isn't a redirect.
 *
 * @param string $url
 * @return string
 */
function get_final_url($url){
    $redirects = get_all_redirects($url);
    if (count($redirects)>0){
        return array_pop($redirects);
    } else {
        return $url;
    }
}

And, as always, credit where it's due:

http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/
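The code above only rebuilds Location values that start with a slash, and it drops any port from the original URL. A hedged sketch of a slightly more general resolver follows; `resolve_location()` is a hypothetical helper that does not normalise `../` segments or handle query-only references.

```php
<?php
/**
 * Resolve a Location header value against the URL that returned it.
 * Hypothetical helper; covers the common cases only.
 */
function resolve_location($base, $location)
{
    $location = trim($location);

    // Already absolute: it carries its own scheme.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host']
            . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative: reuse only the scheme of the base URL.
    if (substr($location, 0, 2) === '//') {
        return $b['scheme'] . ':' . $location;
    }
    // Root-relative: keep scheme, host and port.
    if (substr($location, 0, 1) === '/') {
        return $origin . $location;
    }
    // Path-relative: replace everything after the last slash of the base path.
    $path = isset($b['path']) ? $b['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}

echo resolve_location('http://domain.test:8080/a/b', '/c'), "\n";
echo resolve_location('http://domain.test/a/b', 'c'), "\n";
```

The two calls print http://domain.test:8080/c and http://domain.test/a/c respectively.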

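Answer 2 (score: 3)

While the OP wanted to avoid cURL, it's best to use it when it's available. Here's a solution with the following advantages:

  • uses curl for all the heavy lifting, so it works with https
  • copes with servers that return a lower-cased location header name (neither xaav's nor webjay's answer handles this)
  • lets you control how deep you want to go before giving up

Here's the function: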

function findUltimateDestination($url, $maxRequests = 10)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRequests);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    //customize user agent if you desire...
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Link Checker)');

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);

    $url=curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    curl_close ($ch);
    return $url;
}

Here's a more verbose version, which lets you inspect the redirect chain yourself rather than letting curl follow it.

function findUltimateDestination($url, $maxRequests = 10)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    //customize user agent if you desire...
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Link Checker)');

    while ($maxRequests--) {

        //fetch
        curl_setopt($ch, CURLOPT_URL, $url);
        $response = curl_exec($ch);

        //try to determine redirection url
        $location = '';
        if (in_array(curl_getinfo($ch, CURLINFO_HTTP_CODE), [301, 302, 303, 307, 308])) {
            if (preg_match('/Location:(.*)/i', $response, $match)) {
                $location = trim($match[1]);
            }
        }

        if (empty($location)) {
            //we've reached the end of the chain...
            return $url;
        }

        //build next url
        if ($location[0] == '/') {
            $u = parse_url($url);
            $url = $u['scheme'] . '://' . $u['host'];
            if (isset($u['port'])) {
                $url .= ':' . $u['port'];
            }
            $url .= $location;
        } else {
            $url = $location;
        }
    }

    return null;
}

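As an example of a redirect chain that this function handles but the others do not, try:

echo findUltimateDestination('http://dx.doi.org/10.1016/j.infsof.2016.05.005');

At the time of writing, this involves 4 requests, with a mixture of Location: and location: headers.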
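One detail worth noting in the verbose version: its pattern matches "Location:" anywhere in the response, so a Content-Location: header that appears earlier would be picked up instead. A hedged variation that anchors the match to the start of a header line is sketched below; `extract_location()` and the sample response are illustrative assumptions.

```php
<?php
/**
 * Pull the Location value out of a raw header block.
 * Hypothetical helper; returns '' when there is no Location header.
 */
function extract_location($rawHeaders)
{
    // With the m flag, ^ anchors at the start of each header line;
    // the i flag accepts "Location:" and "location:" alike.
    if (preg_match('/^Location:[ \t]*(.*)$/mi', $rawHeaders, $m)) {
        return trim($m[1]); // trim() also drops the trailing "\r"
    }
    return '';
}

// Made-up response containing both Content-Location and a lower-cased location.
$raw = "HTTP/1.1 302 Found\r\n"
     . "Content-Location: /variant.html\r\n"
     . "location: /next\r\n\r\n";

echo extract_location($raw), "\n";
```

For this sample the anchored pattern yields /next, while the unanchored one would capture /variant.html.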

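Answer 3 (score: 2)

xaav's answer is very good, except for the following two issues:

  • It does not support the HTTPS protocol => a solution is proposed in a comment on the original site: http://w-shadow.com/blog/2008/07/05/how-to-get-redirect-url-in-php/
  • Some sites won't work because they don't recognise the underlying user agent (client browser) => this is easily fixed by adding a User-Agent header field. I added an Android user agent (you can find other user agent examples, according to your needs, at http://www.useragentstring.com/pages/useragentstring.php):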

    $request .= "User-Agent: Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\r\n";

Here's the modified answer:

/**
 * get_redirect_url()
 * Gets the address that the provided URL redirects to,
 * or FALSE if there's no redirect. 
 *
 * @param string $url
 * @return string
 */
function get_redirect_url($url){
    $redirect_url = null; 

    $url_parts = @parse_url($url);
    if (!$url_parts) return false;
    if (!isset($url_parts['host'])) return false; //can't process relative URLs
    if (!isset($url_parts['path'])) $url_parts['path'] = '/';

    $sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    if (!$sock) return false;

    $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
    $request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
    $request .= "User-Agent: Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\r\n";
    $request .= "Connection: Close\r\n\r\n"; 
    fwrite($sock, $request);
    $response = '';
    while(!feof($sock)) $response .= fread($sock, 8192);
    fclose($sock);

    if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
        if ( substr($matches[1], 0, 1) == "/" )
            return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
        else
            return trim($matches[1]);

    } else {
        return false;
    }

}

/**
 * get_all_redirects()
 * Follows and collects all redirects, in order, for the given URL. 
 *
 * @param string $url
 * @return array
 */
function get_all_redirects($url){
    $redirects = array();
    while ($newurl = get_redirect_url($url)){
        if (in_array($newurl, $redirects)){
            break;
        }
        $redirects[] = $newurl;
        $url = $newurl;
    }
    return $redirects;
}

/**
 * get_final_url()
 * Gets the address that the URL ultimately leads to. 
 * Returns $url itself if it isn't a redirect.
 *
 * @param string $url
 * @return string
 */
function get_final_url($url){
    $redirects = get_all_redirects($url);
    if (count($redirects)>0){
        return array_pop($redirects);
    } else {
        return $url;
    }
}
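On the first issue noted in this answer, one way HTTPS could be handled with fsockopen() is to prefix the host with ssl:// and default the port to 443. This is a sketch under the assumption that PHP was built with OpenSSL support; `socket_target()` is a hypothetical helper, and it is not necessarily the fix proposed in the linked comment.

```php
<?php
/**
 * Work out the host string and port to pass to fsockopen()
 * from the result of parse_url(). Hypothetical helper.
 */
function socket_target(array $url_parts)
{
    $scheme = isset($url_parts['scheme']) ? strtolower($url_parts['scheme']) : 'http';
    $secure = ($scheme === 'https');

    // An explicit port wins; otherwise use the scheme's default.
    $port = isset($url_parts['port']) ? (int) $url_parts['port'] : ($secure ? 443 : 80);

    // The ssl:// transport prefix makes fsockopen() negotiate TLS.
    $host = ($secure ? 'ssl://' : '') . $url_parts['host'];

    return array($host, $port);
}

list($host, $port) = socket_target(parse_url('https://domain.test/page'));
echo $host, ':', $port, "\n";

// Inside get_redirect_url() the socket could then be opened like this:
// $sock = fsockopen($host, $port, $errno, $errstr, 30);
```

The Host: request header should still carry the bare host name from `$url_parts['host']`, without the ssl:// prefix.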

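Answer 4 (score: 0)

Added to the code from the answers by @xaav and @Houssem BDIOUI: handling for the 404 error case and for the case where the URL gives no response. In those cases get_final_url($url) returns the string 'Error: 404 Not Found' or 'Error: No Responce' (spelled that way in the code).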

/**
 * get_redirect_url()
 * Gets the address that the provided URL redirects to,
 * or FALSE if there's no redirect,
 * or 'Error: No Responce',
 * or 'Error: 404 Not Found'
 *
 * @param string $url
 * @return string
 */
function get_redirect_url($url)
{
    $redirect_url = null;

    $url_parts = @parse_url($url);
    if (!$url_parts)
        return false;
    if (!isset($url_parts['host']))
        return false; //can't process relative URLs
    if (!isset($url_parts['path']))
        $url_parts['path'] = '/';

    $sock = @fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    if (!$sock) return 'Error: No Responce';

    $request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?' . $url_parts['query'] : '') . " HTTP/1.1\r\n";
    $request .= 'Host: ' . $url_parts['host'] . "\r\n";
    $request .= "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36\r\n";
    $request .= "Connection: Close\r\n\r\n";
    fwrite($sock, $request);
    $response = '';
    while (!feof($sock))
        $response .= fread($sock, 8192);
    fclose($sock);

    if (stripos($response, '404 Not Found') !== false)
    {
        return 'Error: 404 Not Found';
    }

    if (preg_match('/^Location: (.+?)$/m', $response, $matches))
    {
        if (substr($matches[1], 0, 1) == "/")
            return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
        else
            return trim($matches[1]);

    } else
    {
        return false;
    }

}

/**
 * get_all_redirects()
 * Follows and collects all redirects, in order, for the given URL.
 *
 * @param string $url
 * @return array
 */
function get_all_redirects($url)
{
    $redirects = array();
    while ($newurl = get_redirect_url($url))
    {
        if (in_array($newurl, $redirects))
        {
            break;
        }
        $redirects[] = $newurl;
        $url = $newurl;
    }
    return $redirects;
}

/**
 * get_final_url()
 * Gets the address that the URL ultimately leads to.
 * Returns $url itself if it isn't a redirect,
 * or 'Error: No Responce'
 * or 'Error: 404 Not Found',
 *
 * @param string $url
 * @return string
 */
function get_final_url($url)
{
    $redirects = get_all_redirects($url);
    if (count($redirects) > 0)
    {
        return array_pop($redirects);
    } else
    {
        return $url;
    }
}
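Because errors and URLs both come back as strings here, the caller has to tell them apart. A minimal sketch, assuming every error string keeps the 'Error:' prefix used above; `is_error_result()` is a hypothetical helper.

```php
<?php
/**
 * True when a value returned by get_final_url() is one of the
 * 'Error: ...' strings rather than a URL. Hypothetical helper.
 */
function is_error_result($result)
{
    return is_string($result) && strpos($result, 'Error:') === 0;
}

// Example with literal values standing in for get_final_url() results.
foreach (array('http://domain.test/final', 'Error: 404 Not Found') as $result) {
    if (is_error_result($result)) {
        echo "Lookup failed: $result\n";
    } else {
        echo "Final URL: $result\n";
    }
}
```

Returning false or throwing an exception would avoid the string comparison altogether, at the cost of changing the functions' contract.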

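Answer 5 (score: 0)

After hours of reading Stackoverflow, trying out all the custom functions people have written, and trying all the cURL suggestions, none of which followed more than 1 redirect, I managed to implement logic of my own.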

$url = 'facebook.com';
// First let's find out if we just typed the domain name alone or we prepended with a protocol 
if (!preg_match('/(http|https):\/\/[a-z0-9]+[a-z0-9_\/]*/', $url)) {
    $url = 'http://' . $url;
    echo '<p>No protocol given, defaulting to http://</p>';
}
// Let's print out the initial URL
echo '<p>Initial URL: ' . $url . '</p>';
// Prepare the HEAD method when we send the request
stream_context_set_default(array('http' => array('method' => 'HEAD')));
// Probe for headers
$headers = get_headers($url, 1);
// If there is a Location header, trigger logic
if (isset($headers['Location'])) {
    // If there is more than 1 redirect, Location will be array
    if (is_array($headers['Location'])) {
        // If that's the case, we are interested in the last element of the array (thus the last Location)
        echo '<p>Redirected URL: ' . $headers['Location'][array_key_last($headers['Location'])] . '</p>';
        $url = $headers['Location'][array_key_last($headers['Location'])];
    } else {
        // If it's not an array, it means there is only 1 redirect
        //var_dump($headers['Location']);
        echo '<p>Redirected URL: ' . $headers['Location'] . '</p>';
        $url = $headers['Location'];
    }
} else {
    echo '<p>URL: ' . $url . '</p>';
}
// You can now send get_headers to the latest location
$headers = get_headers($url, 1);
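Note that stream_context_set_default() changes the default for every later stream call in the script. On PHP 7.1 or newer, get_headers() accepts a context as its third argument, so the HEAD method can be scoped to a single call instead. A hedged sketch follows; `head_context()` is a hypothetical helper, and the redirect limit of 10 is an arbitrary choice.

```php
<?php
/**
 * Build a stream context that sends HEAD requests and caps the number
 * of redirects the http wrapper will follow. Hypothetical helper.
 */
function head_context($maxRedirects = 10)
{
    return stream_context_create(array(
        'http' => array(
            'method'        => 'HEAD',
            'max_redirects' => $maxRedirects,
        ),
    ));
}

// Usage (PHP 7.1+); the default context is left untouched:
// $headers = get_headers($url, 1, head_context());

// Inspect the options without making a network request.
print_r(stream_context_get_options(head_context()));
```

The same context could also be passed to file_get_contents() or fopen() if the body of the final page is needed afterwards.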