Question

I'm trying to scrape some data from a website with curl. Here is my code:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent());
$result7 = htmlspecialchars_decode(curl_exec ($ch));
curl_close($ch);

$html7 = new simple_html_dom();
$html7->load($result7);

But I get the following warning:

Warning: file_get_contents(&lt;!DOCTYPE html&gt;&lt;html xmlns:mml="http://www.w3.org/1998/Math/MathML" lang="en"&gt;&lt;head&gt;&lt;script type="text/javascript"&gt;var JiffyParams={jsStart:(new Date()).getTime()};&lt;/script&gt;&lt;meta name="robots" content="noarchive,noindex,nofollow,NOODP" /&gt;&lt;meta name="MSSmartTagsPreventParsing" content="true" /&gt;&lt;title&gt;JSTOR: An Error Occurred Setting Your User Cookie&lt;/title&gt;&lt;meta charset="UTF-8" /&gt;&lt;link rel="shortcut icon" href="/templates/jsp/favicon.ico" type="image/vnd.microsoft.icon" /&gt;&lt;link rel="stylesheet" type="text/css" media="screen" href="/jawrcss/N815843185/bundles/jstor.css" /&gt;&lt;link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=Roboto:400,5 in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76

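I don't understand what my mistake is; I'm a beginner with curl... Maybe I have to set some cookies from JSTOR, but I don't know how to do that. Thanks for your help.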

EDIT:

I just added the cookie options below, and the error changed:

$ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://www.jstor.org/action/doBasicSearch?Query=Les+bourgeois');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, random_user_agent());
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
    $result7 = htmlspecialchars_decode(curl_exec ($ch));
    curl_close($ch);

Error:

Warning: file_get_contents(&lt;!DOCTYPE html&gt;&lt;!--[if IE 8]&gt;&lt;html class="no-js lt-ie9" lang="en"&gt;&lt;![endif]--&gt;&lt;!--[if gt IE 8]&gt;&lt;!--&gt;&lt;html class="no-js" lang="en"&gt;&lt;!--&lt;![endif]--&gt;&lt;head&gt;&lt;script type="text/javascript"&gt;(window.NREUM||(NREUM={})).loader_config={xpid:"VwACUF9VGwsGXVRbAwA="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o?o:e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o&lt;n.length;o++)r(n[o]);return r}({QJf3ax:[function(t,e){function n(t){function e(e,n,a){t&amp;t(e,n,a),a||(a={});for(var c=s(e),f=c.length,u=i(a,o,r),d=0;f>d;d++)c[d].apply(u,n);return function a(t,e){f[t]=s(t).concat(e)}function s(t){return f[t]||[]}function c(){return n(e)}var f={};return{on:a,emit:e,create:c,listeners:s,_events: in C:\wamp\www\scrap_cairn\simple_html_dom.php on line 76

Here is the code around line 76 of simple_html_dom.php:

    function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

Answer 1

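Cookies are a browser thing.

curl is a system-level tool (bash, Linux, or the like).

PHP includes curl (sometimes the library is actually compiled in). It is more or less a system call, with no browser involved, so nothing stores or sends cookies for you.

So you need to set the cookies with curl yourself: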

http://curl.haxx.se/docs/http-cookies.html

But you are right -
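A minimal sketch of what "setting cookies with curl" can look like from PHP. The helper names `build_cookie_curl_options()` and `fetch_with_cookies()` are hypothetical, not part of curl or PHP; the `CURLOPT_*` constants are the standard ones. Setting `CURLOPT_COOKIEFILE` turns on curl's cookie engine, so cookies received during a redirect chain are sent back on the following requests, and `CURLOPT_COOKIEJAR` writes them to disk when the handle is released.

```php
<?php
// Hypothetical helper: cURL options that keep cookies across redirects and runs.
function build_cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect that sets the cookie
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from here (enables the cookie engine)
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle is released
    ];
}

// Hypothetical helper: performs the request and returns the raw response body.
function fetch_with_cookies(string $url, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_cookie_curl_options($url, $cookieFile));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope, which also flushes the jar.
    return $body;
}

// An absolute path avoids surprises about the working directory under a web server.
$cookieFile = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'curl_cookies.txt';
$options = build_cookie_curl_options('http://www.example.com/', $cookieFile);
```

Calling `fetch_with_cookies($url, $cookieFile)` then gives you the page body as a string; the URL above is only a placeholder.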

Answer 2

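Are you sure file_get_html() is the right way to do this? That function calls file_get_contents(), which expects a URI to open, but you are passing it a string (the HTML you already downloaded). That is why the whole page shows up inside the warning message.

I think str_get_html() from PHP Simple HTML DOM would be a better fit.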
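A rough sketch of the str_get_html() approach, assuming simple_html_dom.php sits next to the script. The HTML string here is a made-up stand-in for whatever curl_exec() returned, so the snippet needs no network access; the variable names follow the question.

```php
<?php
// Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) is in the same directory.
require_once __DIR__ . '/simple_html_dom.php';

// Stand-in for the string returned by curl_exec(); made up for illustration.
$result7 = '<html><head><title>Search results</title></head>'
         . '<body><a href="/stable/1">First hit</a></body></html>';

// str_get_html() parses a string that is already in memory, whereas
// file_get_html() treats its argument as a path or URL to open.
$html7 = str_get_html($result7);
if ($html7 === false) {
    // The parser returns false when it rejects the input (for example, an empty string).
    throw new RuntimeException('Could not parse the HTML string.');
}

// find($selector, 0) returns the first matching element.
$title = $html7->find('title', 0)->plaintext;
$firstLink = $html7->find('a', 0)->href;
echo $title, "\n", $firstLink, "\n";
```

In the real script, `$result7` would be the curl response body passed in unchanged.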

Scraping with curl: error setting cookie
