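I'm trying to scrape content from a website that sits behind a standard login form (in case it matters, both my site and the target site use HTTPS).
I can log in to the page successfully with a POST request: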
include("inc/simple_html_dom.php");
$url = "https://account.tfl.gov.uk/Login";
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$cookie = 'cookies.txt';
$timeout = 60;
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch,CURLOPT_POSTFIELDS,"UserName=USER&Password=PASSWORD&AppId=00000000-0000-0000-0000-000000000000&ReturnUrl=");
$result = curl_exec($ch);
Once logged in, I want to scrape the user's journey history (https://oyster.tfl.gov.uk/oyster/journeyHistoryThrottle.do?_qs=_qv=[SESSION CODE]). To get the session code, I use SimpleHTMLDom:
$html = str_get_html($result);
$codeRaw = $html->getElementById('Oyster')->childNodes(1);
$code1 = explode("?_o=",$codeRaw);
$code2 = explode('"',$code1[1]);
$codeReal = $code2[0];
I then try to access that page with another cURL request:
$url = "https://oyster.tfl.gov.uk/oyster/journeyHistoryThrottle.do?_qs=_qv=".$codeReal;
echo $url;
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$cookie = 'cookies.txt';
$timeout = 60;
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
$result = str_replace('"/','"https://oyster.tfl.gov.uk/',curl_exec($ch));
curl_close($ch);
echo $result;
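But all I get back is a login page. I suspect this is because the two cURL requests end up in different "sessions" on the TfL site?
Is there a way to force cURL to reuse the previous session? In case it's relevant, I will probably also need to make further requests while browsing the journey history.
Or is there some other way to achieve this? (TfL doesn't provide an API for this purpose.)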
Answer 0: (score: 1)
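For simple session handling, just set the CURLOPT_COOKIEFILE option to an empty string. This enables cURL's cookie handling and keeps the cookies in memory for as long as the handle is open. See the documentation.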
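I can see a couple of possible problems. Your two URLs are on different hosts. Is that intentional, and if so, are you sure cookies set by account.tfl.gov.uk will be read on oyster.tfl.gov.uk? You also didn't switch the method from POST back to GET for the second URL. I assume that's a mistake, since the second retrieval doesn't post any data, and I've corrected it below.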
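One way to check the cross-host concern is to look at the cookies the handle is actually holding. After a request, `curl_getinfo($ch, CURLINFO_COOKIELIST)` returns them as tab-separated Netscape-format lines, with the cookie's domain in the first field. The sketch below is only an illustration: `cookieAppliesToHost()` is a hypothetical helper, the matching is deliberately simplified (it ignores path, secure flag and expiry), and the sample lines in the usage note are invented, not real TfL cookies.

```php
<?php
// Hypothetical helper: does one Netscape-format cookie line (as returned by
// curl_getinfo($ch, CURLINFO_COOKIELIST)) apply to the given host?
// Simplified on purpose: only the domain and subdomain flag are checked.
function cookieAppliesToHost(string $cookieLine, string $host): bool
{
    // Fields: domain, include-subdomains flag, path, secure, expiry, name, value
    $fields = explode("\t", $cookieLine);
    if (count($fields) < 7) {
        return false; // not a well-formed cookie line
    }

    // HttpOnly cookies carry a "#HttpOnly_" prefix on the domain field.
    $domain = strtolower(preg_replace('/^#HttpOnly_/', '', $fields[0]));
    $includeSubdomains = strtoupper($fields[1]) === 'TRUE';
    $host = strtolower($host);
    $bare = ltrim($domain, '.');

    if ($host === $bare) {
        return true; // exact host match
    }

    // Domain-wide cookies also match any subdomain of the bare domain.
    $suffix = '.' . $bare;
    return $includeSubdomains && substr($host, -strlen($suffix)) === $suffix;
}
```

With a made-up line such as `"account.tfl.gov.uk\tFALSE\t/\tTRUE\t0\tsid\tabc"` the helper returns false for `oyster.tfl.gov.uk`, while a domain-wide line starting with `.tfl.gov.uk\tTRUE` returns true. If none of the lines from the login request apply to the second host, the session cookie is probably not being sent there.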
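It's also worth mentioning that you may not be getting $codeReal in the most efficient way, but I can't see the HTML you're working with. All those explode() calls suggest there may be a better way!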
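As a rough idea of what a "better way" could look like, the sketch below reads the link's href from the DOM and lets `parse_url()` and `parse_str()` pick out the query parameter instead of splitting strings by hand. It rests on assumptions: that the element with id `Oyster` contains a link whose href carries the code in an `_o` query parameter (as the explode() calls suggest), and `extractLinkParam()` is a hypothetical helper. It uses PHP's built-in DOM extension rather than SimpleHTMLDom.

```php
<?php
// Hypothetical helper: return one query parameter from the first link inside
// the element with the given id, or null if it cannot be found.
// The markup shape is an assumption; adjust the XPath to the real page.
function extractLinkParam(string $html, string $elementId, string $param): ?string
{
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML rarely validates, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query("//*[@id='$elementId']//a[@href]");
    if ($links === false || $links->length === 0) {
        return null;
    }

    // Take the query string of the first matching link and parse it.
    $href = $links->item(0)->getAttribute('href');
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $params); // values are URL-decoded here

    return isset($params[$param]) && is_string($params[$param]) ? $params[$param] : null;
}
```

For instance, with the invented markup `<div id="Oyster"><span>x</span><a href="https://example.com/entry?_o=abc123">Oyster</a></div>`, calling `extractLinkParam($html, 'Oyster', '_o')` yields `abc123`. Because `parse_str()` decodes the value, it still needs `urlencode()` before being placed in the next URL.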
<?php
include("inc/simple_html_dom.php");
$url = "https://account.tfl.gov.uk/Login";
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER=>true,
CURLOPT_URL=>$url,
CURLOPT_FOLLOWLOCATION=>true,
CURLOPT_TIMEOUT=>10,
CURLOPT_CONNECTTIMEOUT=>60,
CURLOPT_COOKIEFILE=>"",
CURLOPT_POST=>true,
CURLOPT_POSTFIELDS=>[
"UserName"=>"USER",
"Password"=>"PASSWORD",
"AppId"=>"00000000-0000-0000-0000-000000000000",
"ReturnUrl"=>"",
],
]);
$result = curl_exec($ch);
// get your code, be sure to escape it
$html = str_get_html($result);
$codeRaw = $html->getElementById('Oyster')->childNodes(1);
$code1 = explode("?_o=",$codeRaw);
$code2 = explode('"',$code1[1]);
$codeReal = $code2[0];
$codeReal = urlencode($codeReal);
$url = "https://oyster.tfl.gov.uk/oyster/journeyHistoryThrottle.do?_qs=_qv=$codeReal";
// most of your options are the same, just change URL and disable POST
curl_setopt_array($ch, [
CURLOPT_URL=>$url,
CURLOPT_POST=>false,
]);
$result = curl_exec($ch);
curl_close($ch);
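One difference from the code in the question is worth knowing about: when CURLOPT_POSTFIELDS is given an array, PHP's cURL extension sends the body as multipart/form-data, whereas a string is sent as application/x-www-form-urlencoded. Whether the login endpoint accepts both is not something I can tell from here, so if the login stops working with the array form, a minimal sketch like the following builds the same URL-encoded string the question used. The field names and placeholder values are the ones from the question.

```php
<?php
// Field names and placeholder values as given in the question.
$fields = [
    "UserName"  => "USER",
    "Password"  => "PASSWORD",
    "AppId"     => "00000000-0000-0000-0000-000000000000",
    "ReturnUrl" => "",
];

// Build an application/x-www-form-urlencoded body. The separator is passed
// explicitly so the result does not depend on the arg_separator.output ini setting.
$body = http_build_query($fields, '', '&');

echo $body, PHP_EOL;
```

The resulting `$body` can then be passed as `CURLOPT_POSTFIELDS => $body` in place of the array.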