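I want to use cURL to log in to a website on a remote domain, then navigate to different pages and make various data requests.
The problem is that some of the links on this site are relative, which makes my code treat those pages as local (which of course they are not).
After some digging, I realized I need to use preg_match to find and identify the relative links, and preg_replace to turn them into absolute URLs for the .js and .css files that actually exist on that server.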
When I run this code, it fails to rewrite all but a few of the required links.
The one link that does get rewritten is:
<link rel="stylesheet" type="text/css" href="popcalendar.css">
->
<link rel="stylesheet" type="text/css" href="http://www.example.com/popcalendar.css">
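The other relative links stay unchanged, and I don't understand why.
The .css link that is replaced correctly is not even the first one that should be replaced.
Here is the PHP script I am using to try to access the remote site: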
<?php
$username = 'myuser';
$password = 'mypass';
$loginUrl = 'http://www.example.com/index.php/';
//init curl
$ch = curl_init();
//Set the URL to work with
curl_setopt($ch, CURLOPT_URL, $loginUrl);
// ENABLE HTTP POST
curl_setopt($ch, CURLOPT_POST, 1);
//Set the post parameters
curl_setopt($ch, CURLOPT_POSTFIELDS, 'uName='.$username.'&uPw='.$password.'&Submit=OK');
//Handle cookies for the login
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
//Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
//not to print out the results of its query.
//Instead, it will return the results as a string return value
//from curl_exec() instead of the usual true/false.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
//execute the request (the login)
$store = curl_exec($ch);
//the login is now done and you can continue to get the
//protected content.
//set the URL to the protected file
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/ask_for_info.php');
//execute the request
$result = curl_exec($ch);
curl_close($ch);
if (!preg_match('/src="http?:\/\/"/', $result)) {
    $result = preg_replace('/src="(http:\/\/([^\/]+)\/)?([^"]+)"/', "src=\"http://www.example.com/\\3\"", $result);
    echo 'THIS';
}
if (!preg_match('/href="http?:\/\/"/', $result)) {
    $result = preg_replace('/href="(http:\/\/([^\/]+)\/)?([^"]+)"/', "href=\"http://www.example.com/\\3\"", $result);
    echo 'THAT';
}
print_r($result);
?>
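To separate the rewriting step from the cURL request, the `href` expressions from the script above can be run against a hard-coded fragment. This is only a minimal sketch: the `$sample` markup and the `$guard` and `$rewritten` names are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical fragment standing in for the fetched page (not the real markup).
$sample = '<link rel="stylesheet" type="text/css" href="popcalendar.css">' . "\n"
        . '<link rel="stylesheet" type="text/css" href="https://cdn.example.org/site.css">' . "\n"
        . "<link rel='stylesheet' type='text/css' href='print.css'>";

// Same guard as in the script above. It only matches the literal text
// href="http://" (or href="htt://") followed directly by the closing quote.
$guard = preg_match('/href="http?:\/\/"/', $sample);

// Same replacement as in the script above.
$rewritten = preg_replace(
    '/href="(http:\/\/([^\/]+)\/)?([^"]+)"/',
    "href=\"http://www.example.com/\\3\"",
    $sample
);

var_dump($guard);
echo $rewritten, "\n";
```

On this fragment the guard finds nothing, so the replacement always runs. The double-quoted relative link is rewritten, the single-quoted one is left as it was, and the `https` link ends up with the prefix prepended to its full URL.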
Checking the Google Chrome console while running the code, I get results like these:
Resource interpreted as Stylesheet but transferred with MIME type text/html: "http://example.com/example.css". login4.php:6
Resource interpreted as Script but transferred with MIME type text/html: "http://example.com/js/prototype.js". login4.php:7
Uncaught SyntaxError: Unexpected token < prototype.js:1
Resource interpreted as Script but transferred with MIME type text/html: "http://example.com/js/popcalendar3_ajax.js?ver=2". login4.php:9
Uncaught SyntaxError: Unexpected token <
Any ideas? Thanks for any help you can offer!
Answer 0 (score: 1)
An example using DOMDocument and XPath:
$scheme = 'http';
$host = 'example.com';
$path = '/';
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath = new DOMXPath($dom);
$xquery = '//a/@href | //img/@src | //script/@src | //link/@href';
$urlAttrNodes = $xpath->query($xquery);
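// Leave URLs alone if they are already absolute, protocol-relative, or start
// with "www." or the current host; otherwise match an optional leading "/" or "./".
$pattern = '~^(?!https?:// | www\. | // | ' . preg_quote($host, '~')
         . '(?=/|$) ) (\.?/)?~xi';
foreach ($urlAttrNodes as $urlAttrNode) {
    $absoluteUrl = preg_replace($pattern, "$scheme://www.$host$path",
                                $urlAttrNode->nodeValue);
    $urlAttrNode->ownerElement->setAttribute($urlAttrNode->name, $absoluteUrl);
}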
$result = $dom->saveHTML();
Note that the pattern only skips the current host; you can easily add other domains if needed.
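Adding other domains could look roughly like the following. It is a sketch under assumptions rather than part of the original answer: `absolutizeUrls()`, `$skipHosts`, `cdn.example.net`, and the sample markup are hypothetical, and `libxml_use_internal_errors()` stands in for the `@` suppression purely as a matter of taste.

```php
<?php
// Hypothetical helper: prefixes relative URLs with $base, leaving alone any URL
// that is absolute, protocol-relative, or starts with "www." or a host in $skipHosts.
function absolutizeUrls(string $html, string $base, array $skipHosts): string
{
    $alternatives = ['https?://', '//', 'www\.'];
    foreach ($skipHosts as $host) {
        // Quote the host for a "~"-delimited pattern; require "/" or end of string after it.
        $alternatives[] = preg_quote($host, '~') . '(?=/|$)';
    }
    // If none of the alternatives match, consume an optional leading "/" or "./".
    $pattern = '~^(?!' . implode('|', $alternatives) . ')(\.?/)?~i';

    $previous = libxml_use_internal_errors(true); // keep parser warnings off the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//a/@href | //img/@src | //script/@src | //link/@href';
    foreach ($xpath->query($query) as $attr) {
        $absolute = preg_replace($pattern, $base, $attr->nodeValue);
        $attr->ownerElement->setAttribute($attr->name, $absolute);
    }
    return $dom->saveHTML();
}

// Made-up markup for illustration only.
$html = '<a href="page.php">a</a><img src="/img/logo.png">'
      . '<script src="cdn.example.net/lib.js"></script>';
echo absolutizeUrls($html, 'http://www.example.com/', ['example.com', 'cdn.example.net']);
```

Like the pattern above, this still prefixes values such as `#top` or `mailto:` links, so those would need their own alternatives in the list if the page contains them.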