I am trying to scrape some data from a website using cURL. Here is my code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent());
$result7 = htmlspecialchars_decode(curl_exec ($ch));
curl_close($ch);
$html7 = new simple_html_dom();
$html7->load($result7);
But I get the following warning:
Warning: file_get_contents(<!DOCTYPE html><html xmlns:mml="http://www.w3.org/1998/Math/MathML" lang="en"><head><script type="text/javascript">var JiffyParams = {jsStart: (new Date()).getTime()};</script><meta name="robots" content="noarchive,noindex,nofollow,NOODP" /><meta name="MSSmartTagsPreventParsing" content="true" /><title>JSTOR: An Error Occurred Setting Your User Cookie</title><meta charset="UTF-8" /><link rel="shortcut icon" href="/templates/jsp/favicon.ico" type="image/vnd.microsoft.icon" /><link rel="stylesheet" type="text/css" media="screen" href="/jawrcss/N815843185/bundles/jstor.css" /><link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Roboto:400,5 in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76
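I don't understand what I am doing wrong; I am a beginner with cURL. Maybe I have to set some cookies for JSTOR, but I don't know how to do that. Thanks for your help.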
Edit:
I just added the two cookie options below, and the error changed:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent());
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
$result7 = htmlspecialchars_decode(curl_exec ($ch));
curl_close($ch);
Error:
Warning: file_get_contents(<!DOCTYPE html><!--[if IE 8]><html class="no-js lt-ie9" lang="en"><![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en"><!--<![endif]--><head><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"VwACUF9VGwsGXVRbAwA="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o?o:e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({QJf3ax:[function(t,e){function n(t){function e(e,n,a){t&&t(e,n,a),a||(a={});for(var c=s(e),f=c.length,u=i(a,o,r),d=0;f>d;d++)c[d].apply(u,n);return u}function a(t,e){f[t]=s(t).concat(e)}function s(t){return f[t]||[]}function c(){return n(e)}var f={};return{on:a,emit:e,create:c,listeners:s,_events: in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76
Here is the code from simple_html_dom.php around line 76:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
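Answer 0 (score: 0)
Cookies are a browser thing: a browser stores them and sends them back on its own.
cURL is a system-level thing (bash, Linux, or whatever).
PHP includes cURL (sometimes the library is actually compiled in). Calling it is more or less a system call, with no browser involved.
So you need to set the cookies with cURL yourself:
http://curl.haxx.se/docs/http-cookies.html
But you are right -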
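As a rough illustration of the cookie handling described above, here is a minimal sketch. The function name fetch_with_cookies() and the cookie file path are hypothetical, made up for this example; only the curl_* functions and CURLOPT_* constants come from PHP itself. It also checks the return value of curl_exec(), so that a failed request is not passed on to the parser as if it were HTML.

```php
<?php
// Hypothetical helper (not part of PHP or simple_html_dom):
// fetch a URL with cURL and keep cookies between requests.
function fetch_with_cookies(string $url, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here and sent with requests
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // Capture the error message before closing the handle.
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    curl_close($ch);
    return $body;
}
```

A call could look like fetch_with_cookies('http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois', __DIR__ . '/cookies.txt'). Using an absolute path for the cookie file is an assumption on my part that tends to help: a relative 'cookies.txt' is resolved against the current working directory, which may not be the script's directory or may not be writable.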
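Answer 1 (score: 0)
Are you sure file_get_html() is a good way to do this?
That function calls file_get_contents(), which expects a URI to open, but you are passing it a string (containing your HTML data).
I think str_get_html() from PHP Simple HTML DOM would be a good way to go.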
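To make the difference concrete, here is a small sketch. The $markup string is a made-up stand-in for what curl_exec() returns, and the str_get_html() call is guarded with function_exists() so the snippet also runs when simple_html_dom.php has not been included; with the library loaded, that branch parses the string directly.

```php
<?php
// Made-up stand-in for the HTML string returned by curl_exec().
$markup = '<!DOCTYPE html><html><head><title>Demo</title></head><body></body></html>';

// file_get_contents() treats its first argument as a filename or URL.
// Given markup, it looks for a file with that name and fails
// (the @ only hides the warning shown in the question).
$fromPath = @file_get_contents($markup);
var_dump($fromPath); // bool(false)

// str_get_html() takes the markup itself, so nothing has to be opened.
// Guarded: it only exists once simple_html_dom.php has been included.
if (function_exists('str_get_html')) {
    $dom = str_get_html($markup);
    if ($dom !== false) {
        echo $dom->find('title', 0)->plaintext, PHP_EOL;
    }
}
```

In the question's code, that would mean passing $result7 to str_get_html() rather than handing it to anything that ends up in file_get_contents().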