preg_replace skips some tags

Date: 2014-05-27 12:42:08

Tags: php curl preg-replace preg-match

I want to use cURL to log in to a website on a remote domain, then navigate to different pages and make various data requests.

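The problem is that some of the links on this site are relative. As a result, the browser resolves them against my own server, as if those pages were local (which of course they are not).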

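After some digging, I realized I need to use preg_match to find and identify the relative links, and preg_replace to turn them into absolute URLs that point to the .js and .css files that actually exist on that server.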

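When I run this code, it rewrites only a few of the links it should. The only one that goes through as it should is:
<link rel="stylesheet" type="text/css" href="popcalendar.css"> -> <link rel="stylesheet" type="text/css" href="http://www.example.com/popcalendar.css">
The other relative links stay unchanged, and I don't understand why. The .css that is replaced correctly is not even the first one that should be replaced.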

This is the PHP script I am using to try to access the remote site:

<?php
$username = 'myuser';
$password = 'mypass';
$loginUrl = 'http://www.example.com/index.php/';

//init curl
$ch = curl_init();

//Set the URL to work with
curl_setopt($ch, CURLOPT_URL, $loginUrl);

// ENABLE HTTP POST
curl_setopt($ch, CURLOPT_POST, 1);

//Set the post parameters
curl_setopt($ch, CURLOPT_POSTFIELDS, 'uName='.$username.'&uPw='.$password.'&Submit=OK');

//Handle cookies for the login
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');

//Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
//not to print out the results of its query.
//Instead, it will return the results as a string return value
//from curl_exec() instead of the usual true/false.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

//execute the request (the login)
$store = curl_exec($ch);

//the login is now done and you can continue to get the
//protected content.

//set the URL to the protected file
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/ask_for_info.php');

//execute the request
$result = curl_exec($ch);
curl_close($ch);
if (!preg_match('/src="http?:\/\/"/', $result)) {
    $result = preg_replace('/src="(http:\/\/([^\/]+)\/)?([^"]+)"/', "src=\"http://www.example.com/\\3\"", $result);
    echo 'THIS';
}
if (!preg_match('/href="http?:\/\/"/', $result)) {
    $result = preg_replace('/href="(http:\/\/([^\/]+)\/)?([^"]+)"/', "href=\"http://www.example.com/\\3\"", $result);
    echo 'THAT';
}


print_r($result);
?>
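
One thing that can be checked in isolation is what the two patterns in the script do to individual attributes. The sketch below runs the `src` guard and the `src` rewrite from the script against three made-up sample tags (the tags and the `$results` array are assumptions for illustration, not taken from the remote page). It does not by itself explain why some links stay unchanged, but it shows that the guard never matches these samples, that an external host is rewritten to www.example.com, and that a root-relative path ends up with a double slash.

```php
<?php
// Hypothetical sample tags; the real remote page may look different.
$samples = [
    '<script src="js/prototype.js"></script>',
    '<script src="http://cdn.example.org/lib.js"></script>',
    '<script src="/js/app.js"></script>',
];

// The two patterns, copied from the script above.
$guard   = '/src="http?:\/\/"/';
$rewrite = '/src="(http:\/\/([^\/]+)\/)?([^"]+)"/';

$results = [];
foreach ($samples as $tag) {
    $results[] = [
        // The guard needs a closing quote right after "://",
        // so it returns 0 for every sample here.
        'guard'     => preg_match($guard, $tag),
        // Group 3 is whatever follows the optional "http://host/" prefix.
        'rewritten' => preg_replace($rewrite, 'src="http://www.example.com/\3"', $tag),
    ];
}

foreach ($results as $row) {
    echo $row['guard'], ' | ', $row['rewritten'], PHP_EOL;
}
```

Because the guard is tested against the whole page rather than against each attribute, a per-attribute decision (rewrite this one, skip that one) has to live inside the replacement pattern itself or in a callback.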

When I check the Google Chrome console while running the code, I get results like this:

Resource interpreted as Stylesheet but transferred with MIME type text/html: "http://example.com/example.css". login4.php:6
Resource interpreted as Script but transferred with MIME type text/html: "http://example.com/js/prototype.js". login4.php:7
Uncaught SyntaxError: Unexpected token < prototype.js:1
Resource interpreted as Script but transferred with MIME type text/html: "http://example.com/js/popcalendar3_ajax.js?ver=2". login4.php:9
Uncaught SyntaxError: Unexpected token < 

Any ideas? Thanks for any help you can provide!

1 Answer:

Answer 0: (score: 1)

An example with DOMDocument and XPath:

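// Parts of the base URL that relative links are resolved against
$scheme = 'http';
$host = 'example.com';
$path = '/';

// Parse the fetched page; @ suppresses warnings caused by malformed HTML
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath = new DOMXPath($dom);

// Select every attribute that may hold a URL
$xquery = '//a/@href | //img/@src | //script/@src | //link/@href';
$urlAttrNodes = $xpath->query($xquery);

// Match the start of a value unless it already begins with http(s)://,
// www., // or the current host; an optional leading "/" or "./" is consumed
$pattern = '~^(?!https?:// | www\. | // | ' . preg_quote($host)
         . '(?=/|$) )  (\.?/)?~xi';

foreach($urlAttrNodes as $urlAttrNode) {
    // Prepend the base URL to relative values; other values are left as they are
    $absoluteUrl = preg_replace($pattern, "$scheme://www.$host$path",
                                $urlAttrNode->nodeValue);
    $urlAttrNode->ownerElement->setAttribute($urlAttrNode->name, $absoluteUrl);
}

// Serialize the modified document back to an HTML string
$result = $dom->saveHTML();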

Note that the pattern only skips the current host; you can easily add other domains if needed.
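
As a rough illustration of adding other domains, the sketch below wraps the same DOMDocument/XPath approach in a function that takes a list of hosts to leave untouched. The function name `absolutizeUrls`, the `$skipHosts` parameter, the `cdn.example.org` host, and the sample HTML are all assumptions made for this example. It also swaps the `@` operator for `libxml_use_internal_errors()`, which is one common way to silence parser warnings.

```php
<?php
// Sketch only: absolutizeUrls() and $skipHosts are illustrative names.
// Assumes $skipHosts holds at least one host and $base has no "$" or "\".
function absolutizeUrls(string $html, string $base, array $skipHosts): string
{
    // Escape each host for use inside a "~"-delimited pattern.
    $hosts = implode('|', array_map(
        function ($h) { return preg_quote($h, '~'); },
        $skipHosts
    ));

    // Skip values that start with http(s)://, "//", "www." or a listed host;
    // otherwise consume an optional leading "/" or "./".
    $pattern = '~^(?!https?://|www\.|//|(?:' . $hosts . ')(?=/|$))(\.?/)?~i';

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//a/@href | //img/@src | //script/@src | //link/@href';
    foreach ($xpath->query($query) as $attr) {
        $absolute = preg_replace($pattern, $base, $attr->nodeValue);
        $attr->ownerElement->setAttribute($attr->name, $absolute);
    }

    return $dom->saveHTML();
}

// Made-up sample page with a relative link, a link on a skipped host,
// and a root-relative link.
$html = '<html><head>'
      . '<link rel="stylesheet" type="text/css" href="popcalendar.css">'
      . '<script src="cdn.example.org/lib.js"></script>'
      . '</head><body><a href="/ask_for_info.php">info</a></body></html>';

$result = absolutizeUrls($html, 'http://www.example.com/', ['example.com', 'cdn.example.org']);
echo $result;
```

With this input, the stylesheet and the anchor are prefixed with the base URL, while the value that starts with `cdn.example.org` is left as it was.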