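I am trying to fetch a web page's content so that I can extract the RSS links from it. I wrote the following code. It retrieves the page, but part of the content I need is missing!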
<?php
function getUrl($url)
{
    $ch = curl_init();
    $timeout = 5; // set to zero for no timeout
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    $file_contents = curl_exec($ch); // perform the request and capture the response
    curl_close($ch);
    return $file_contents;
}
echo getUrl("http://www.journaltocs.ac.uk/index.php?action=browse&subAction=pub&publisherID=10&local_page=1&sorType=DESC&sorCol=2&pageb=1");
?>
This is the part of the page at the URL above that I need. It contains a link with title="Journal TOC RSS feeds".
<p style="text-align:left;">Publisher: <b><a href="http://www.law.ed.ac.uk/ahrc" target="_blank"><b>AHRC Research Centre</b></a> <a href="http://www.law.ed.ac.uk/ahrc" title="Publisher Homepage" target="_blank"><img src="images/link_external.png" border="0" style="vertical-align:middle;margin:0;"></a> </b> (Total: 1 journals)</p><table style="width:100%"><tr valign="top"><td style="width:25px;"><input type="checkbox" class="nobox" id="search_result_journal_19827xxx19039" name="journal[]" onclick="process_journal_tick(this, 'my_tocs');" value="19827xxx19039" /></td><td><a href="index.php?action=browse&subAction=pub&publisherID=10&journalID=19827&pageb=1&userQueryID=&sort=&local_page=1&sorType=DESC&sorCol=2">SCRIPTed - A J. of Law, Technology & Society</a> <a href="http://www.law.ed.ac.uk/ahrc/script-ed/index.asp" title="Journal Homepage" target="_blank"><img src="images/layout_elements/triangle.png" border="0" style="vertical-align:middle;margin:0;"></a> <a href="http://feeds.feedburner.com/Script-ed?format=xml" title="Journal TOC RSS feeds" target="_blank"><img src="images/icon_feed.jpg" border="0" style="vertical-align:middle;margin:0;"></a> <img src="images/icon_oa.jpg" border="0" style="vertical-align:middle;margin:0;" title="Open Access" alt="Open Access"> <span style="color:#A8A8A8;">(<span style="color:#808080;">Followers:</span> 7)</span> </td>
</tr></table>
But what I get from my code is:
<p style="text-align:left;">Publisher: <b><a href="http://www.law.ed.ac.uk/ahrc" target="_blank"><b>AHRC Research Centre</b></a> <a href="http://www.law.ed.ac.uk/ahrc" title="Publisher Homepage" target="_blank"><img src="images/link_external.png" border="0" style="vertical-align:middle;margin:0;"></a> </b> (Total: 1 journals)</p><table style="width:100%"><tr valign="top"><td style="width:25px;"><input type="checkbox" class="nobox" id="search_result_journal_19827xxx0" name="journal[]" onclick="process_journal_tick(this, 'my_tocs');" value="19827xxx0" /></td><td><a href="index.php?action=browse&subAction=pub&publisherID=10&journalID=19827&pageb=1&userQueryID=&sort=&local_page=1&sorType=DESC&sorCol=2">SCRIPTed - A J. of Law, Technology & Society</a> <a href="http://www.law.ed.ac.uk/ahrc/script-ed/index.asp" title="Journal Homepage" target="_blank"><img src="images/layout_elements/triangle.png" border="0" style="vertical-align:middle;margin:0;"></a> <img src="images/icon_oa.jpg" border="0" style="vertical-align:middle;margin:0;" title="Open Access" alt="Open Access"> <span style="color:#A8A8A8;">(<span style="color:#808080;">Followers:</span> 7)</span> </td>
</tr></table>
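As you can see, the link with the title "Journal TOC RSS feeds" has been removed!
I also tried file_get_contents($url), but it did not help. Could you help me solve this? I don't know what the problem is.
Thanks in advance.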
Answer 0 (score: 1)
// $cookies defaults to false so the login call can omit it.
function SendCurl($url, $post, $post_data, $user_agent, $cookies = false){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    if($user_agent)
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    if($post) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
    }
    if($cookies) {
        // send the cookies saved by an earlier request
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    } else {
        // save the cookies received from this request
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    }
    $response = curl_exec($ch);
    $http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array($http, $response);
}
$login_email = "your email";
$login_password = "your password";
$login_url = "http://www.journaltocs.ac.uk/?action=login";
$user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5'; //optional
$login_data = array(
'f_user'=>$login_email,
'f_pass'=>$login_password
);
$webpage_url = "http://www.journaltocs.ac.uk/index.php?action=browse&subAction=pub&publisherID=10&local_page=1&sorType=DESC&sorCol=2&pageb=1";
try{
    //login first and save cookies
    $response = SendCurl($login_url, true, $login_data, $user_agent, false);
    //if login failed (strpos can return 0, so compare against false)
    if( strpos($response[1], "Username or Password is incorrect") !== false )
        throw new Exception("Username or Password is incorrect");
    //fetch the webpage, sending the saved cookies
    $response = SendCurl($webpage_url, false, false, $user_agent, "cookies.txt");
    if( strpos($response[1], "Journal TOC RSS feeds") !== false )
        die("Journal TOC RSS feeds button is found");
}catch(Exception $e){
    die($e->getMessage());
}
The RSS feed icon is only shown when you are logged in, so you need to log in before fetching the page content.
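Once the logged-in page has been fetched, the feed URL itself still has to be pulled out of the HTML. Below is a minimal sketch of one way to do that with PHP's DOM extension. It is not part of the original answer: the helper name extractFeedLinks is hypothetical, the sample markup is trimmed from the snippet in the question, and it assumes the dom extension is enabled and that the title contains no double quotes.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// collect the href of every <a> whose title attribute equals $title.
// Assumes $html is non-empty and $title contains no double quotes.
function extractFeedLinks(string $html, string $title = 'Journal TOC RSS feeds'): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely well-formed, so collect libxml
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@title="' . $title . '"]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Sample markup trimmed from the logged-in page shown in the question.
$sample = '<p><a href="http://feeds.feedburner.com/Script-ed?format=xml" '
        . 'title="Journal TOC RSS feeds" target="_blank">'
        . '<img src="images/icon_feed.jpg" border="0"></a></p>';

print_r(extractFeedLinks($sample));
```

In practice the second element of the array returned by SendCurl would be passed in place of $sample. Matching on the title attribute is only one option; matching on the icon_feed.jpg image inside the link would work as well if the title text ever changes.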