PHP cURL - requests from the same 'user'

Asked: 2017-02-15 21:43:58

Tags: php curl web-scraping

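I am trying to scrape the content of a website that sits behind a standard login form (both my site and the target site use HTTPS, in case that matters).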

I can log in to the page successfully by making a POST request:

include("inc/simple_html_dom.php");

$url = "https://account.tfl.gov.uk/Login";

$ch = curl_init();    
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

curl_setopt($ch, CURLOPT_URL, $url);
$cookie = 'cookies.txt';
$timeout = 60;

curl_setopt($ch, CURLOPT_FOLLOWLOCATION,  1);
curl_setopt($ch, CURLOPT_TIMEOUT,         10); 
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,  $timeout);
curl_setopt($ch, CURLOPT_COOKIEJAR,       $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE,      $cookie);

curl_setopt ($ch, CURLOPT_POST, 1); 
curl_setopt ($ch,CURLOPT_POSTFIELDS,"UserName=USER&Password=PASSWORD&AppId=00000000-0000-0000-0000-000000000000&ReturnUrl=");     

$result = curl_exec($ch);

Once logged in, I want to scrape the user's journey history (https://oyster.tfl.gov.uk/oyster/journeyHistoryThrottle.do?_qs=_qv=[SESSION CODE]). To get the session code, I use SimpleHTMLDom:

$html = str_get_html($result);
$codeRaw = $html->getElementById('Oyster')->childNodes(1);
$code1 = explode("?_o=",$codeRaw);
$code2 = explode('"',$code1[1]);
$codeReal = $code2[0];

I then try to access that page with another cURL request:

$url = "https://oyster.tfl.gov.uk/oyster/journeyHistoryThrottle.do?_qs=_qv=".$codeReal;

echo $url;

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

curl_setopt($ch, CURLOPT_URL, $url);
$cookie = 'cookies.txt';
$timeout = 60;

curl_setopt($ch, CURLOPT_FOLLOWLOCATION,  1);
curl_setopt($ch, CURLOPT_TIMEOUT,         10); 
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,  $timeout);
curl_setopt($ch, CURLOPT_COOKIEJAR,       $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE,      $cookie);

$result = str_replace('"/','"https://oyster.tfl.gov.uk/',curl_exec($ch));

curl_close($ch); 
echo $result;

But all I get back is a login page. I suspect this is because the two cURL requests create different "sessions" on the TfL site?

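Is there a way to force cURL to use the previous session? If it is relevant, I will probably also need to make further requests while browsing the journey history.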

Or is there some other way to achieve this? (TfL does not provide an API for this purpose.)

1 answer:

Answer 0: (score: 1)

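For simple session handling, just set the CURLOPT_COOKIEFILE option to an empty string. This turns on cURL's in-memory cookie handling for the handle without reading or writing a file. See the documentation for details.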

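I see a couple of possible problems. Your two URLs are on different hosts. Is that intentional, and if so, are you sure that cookies set by account.tfl.gov.uk will be read on oyster.tfl.gov.uk? You also did not change the method from POST to GET for the second URL. I assume that is a mistake, since no data is posted on the second fetch, and I have corrected it below.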
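One way to check the host concern is to dump the handle's cookies after the login request with curl_getinfo($ch, CURLINFO_COOKIELIST) and see which of them would apply to the second host. The following is only a rough sketch: cookieAppliesToHost() is a hypothetical helper, the sample lines are made up, and it compares domains only (path, Secure flag and expiry are ignored).

```php
<?php
// Hypothetical helper: decide whether one cookie line, in the Netscape
// cookie-file format returned by curl_getinfo($ch, CURLINFO_COOKIELIST),
// would be sent to $host. Only the domain is compared.
function cookieAppliesToHost(string $line, string $host): bool
{
    // libcurl marks HttpOnly cookies with this prefix; strip it before parsing.
    if (strpos($line, '#HttpOnly_') === 0) {
        $line = substr($line, strlen('#HttpOnly_'));
    }
    // Fields: domain, include-subdomains flag, path, secure, expiry, name, value.
    $fields = explode("\t", $line);
    if (count($fields) < 7) {
        return false; // not a well-formed cookie line
    }
    $domain = strtolower(ltrim($fields[0], '.'));
    $includeSubdomains = ($fields[1] === 'TRUE');
    $host = strtolower($host);

    if ($host === $domain) {
        return true;
    }
    // Domain cookies also match any subdomain of $domain.
    $suffix = '.' . $domain;
    return $includeSubdomains && substr($host, -strlen($suffix)) === $suffix;
}

// Made-up sample lines; real ones would come from:
//   $lines = curl_getinfo($ch, CURLINFO_COOKIELIST);
$lines = [
    "account.tfl.gov.uk\tFALSE\t/\tTRUE\t0\tsessionA\tabc",
    "#HttpOnly_.tfl.gov.uk\tTRUE\t/\tTRUE\t0\tsessionB\tdef",
];
foreach ($lines as $line) {
    $name = explode("\t", $line)[5];
    $sent = cookieAppliesToHost($line, 'oyster.tfl.gov.uk') ? 'yes' : 'no';
    echo "$name sent to oyster.tfl.gov.uk: $sent\n";
}
```

If the login cookies turn out to be host-only for account.tfl.gov.uk, the second site has to set its own session cookie, which it can only do if the handle follows whatever redirect or link the login flow normally goes through.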

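It is also worth mentioning that you may not be extracting $codeReal in the most efficient way, though I can't see the HTML you are working with. All those explode() calls suggest there may be a better way!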
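As an illustration of what a "better way" might look like, here is a sketch that reads the code from a link's query string instead of splitting the raw markup. It rests on assumptions I cannot verify: that the element with id "Oyster" contains an <a> whose href carries the code in an _o parameter, which is what the explode() calls hint at. The function name extractOysterCode(), the sample markup and the entry.do path are all invented for the example.

```php
<?php
// Hypothetical helper: return the value of the "_o" query parameter from
// the first matching link inside the element with id="Oyster", or null.
function extractOysterCode(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumed structure: <a href="...?_o=CODE"> somewhere under #Oyster.
    foreach ($xpath->query('//*[@id="Oyster"]//a[@href]') as $link) {
        $query = parse_url($link->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string on this link
        }
        parse_str($query, $params);
        if (isset($params['_o']) && $params['_o'] !== '') {
            return $params['_o']; // already URL-decoded by parse_str()
        }
    }
    return null;
}

// Made-up markup standing in for the real login response.
$sample = '<div id="Oyster"><span>Oyster</span>'
        . '<a href="https://oyster.tfl.gov.uk/oyster/entry.do?_o=abc%2F123">Go</a></div>';
var_dump(extractOysterCode($sample));
```

Because parse_str() returns the decoded value, it still needs urlencode() before being placed in the next URL, as the code below does. A null return also gives a clear signal that the login response was not the page you expected.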

<?php
include("inc/simple_html_dom.php");

$url = "https://account.tfl.gov.uk/Login";

$ch = curl_init();    
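curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER=>true,
    CURLOPT_URL=>$url,
    CURLOPT_FOLLOWLOCATION=>true,
    CURLOPT_TIMEOUT=>10,
    CURLOPT_CONNECTTIMEOUT=>60,
    // an empty string enables in-memory cookie handling for this handle
    CURLOPT_COOKIEFILE=>"",
    CURLOPT_POST=>true,
    // http_build_query() keeps the body URL-encoded, as in the question;
    // passing a raw array would send multipart/form-data instead
    CURLOPT_POSTFIELDS=>http_build_query([
        "UserName"=>"USER",
        "Password"=>"PASSWORD",
        "AppId"=>"00000000-0000-0000-0000-000000000000",
        "ReturnUrl"=>"",
    ]),
]);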
$result = curl_exec($ch);

// get your code, be sure to escape it
$html = str_get_html($result);
$codeRaw = $html->getElementById('Oyster')->childNodes(1);
$code1 = explode("?_o=",$codeRaw);
$code2 = explode('"',$code1[1]);
$codeReal = $code2[0];

$codeReal = urlencode($codeReal);

$url = "https://oyster.tfl.gov.uk/oyster/journeyHistoryThrottle.do?_qs=_qv=$codeReal";

// most of your options are the same, just change URL and disable POST
curl_setopt_array($ch, [
    CURLOPT_URL=>$url,
    CURLOPT_POST=>false,
]);
$result = curl_exec($ch);
curl_close($ch);