How do I post to an ASP.NET login form using PHP/cURL?

Asked: 2014-08-28 02:27:49

Tags: php asp.net forms curl screen-scraping

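I need to build a tool in PHP that posts to an ASP.NET login form, so that I can collect details from the summary page shown after the user logs in.

Because the site uses ASP.NET and the form has __VIEWSTATE and __EVENTVALIDATION hidden fields, my understanding is that I must fetch those values first and then submit them in the POST to the login form for it to work.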
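To make the hidden-field step concrete, here is a minimal sketch that uses PHP's built-in DOMDocument instead of simple_html_dom. The helper name extractHiddenFields() and the sample markup are made up for illustration; a real login page will contain different values and possibly more hidden inputs.

```php
<?php
// Sketch: collect every named hidden <input> on a page so its value can be sent back in the POST.
// extractHiddenFields() is a hypothetical helper, not part of the original script.
function extractHiddenFields(string $html): array
{
    $fields = [];
    if ($html === '') {
        return $fields; // nothing to parse
    }

    // Real-world HTML is rarely well formed, so collect parser warnings instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Made-up sample markup standing in for the login page
$sample = '<form>'
        . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4" />'
        . '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="AbC/+=" />'
        . '<input type="text" name="user" />'
        . '</form>';
$hidden = extractHiddenFields($sample);
```

Collecting all hidden inputs rather than naming them one by one means fields such as __PREVIOUSPAGE are picked up too, if the page has them.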

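I am new to PHP. The script I wrote is supposed to do the following:

1) GET the login form and grab __VIEWSTATE and __EVENTVALIDATION.

2) POST to the login form with the appropriate post data.

3) GET the summary.htm page, which should be accessible once I am authenticated.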
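For step 2, the body has to be encoded as application/x-www-form-urlencoded. The sketch below shows one way to do that with http_build_query(); the helper name buildFormBody() and the shortened field names are assumptions for illustration, not the real control names from the site.

```php
<?php
// Sketch: turn an array of form fields into an application/x-www-form-urlencoded body.
// buildFormBody() is a hypothetical helper; the field names below are shortened placeholders.
function buildFormBody(array $fields): string
{
    // PHP_QUERY_RFC3986 makes http_build_query() percent-encode the same way rawurlencode() does
    return http_build_query($fields, '', '&', PHP_QUERY_RFC3986);
}

$body = buildFormBody([
    '__EVENTTARGET'  => '',
    '__VIEWSTATE'    => 'abc/+=',
    'ctl00$btnLogin' => 'Login >',
]);
```

Empty strings are kept as `key=`, but http_build_query() skips entries whose value is null, so a missing hidden field should be sent as '' if the server expects the key to be present.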

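It is not clear to me what actually happens. After the POST to the login form I receive a cookie, but I cannot tell whether that cookie indicates that I am authenticated. When I then try to fetch summary.htm, I am redirected back to the login page as if I had not authenticated.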
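One way to see which cookies were actually received is to read the cookie jar that curl writes, which uses the tab-separated Netscape format. The sketch below parses such a file; readCookieJar() is a hypothetical helper and the jar contents are made up. It assumes the site uses the ASP.NET defaults, ASP.NET_SessionId for the session and .ASPXAUTH for Forms Authentication, which a given site may well have renamed.

```php
<?php
// Sketch: list the cookies stored in a curl cookie jar (Netscape format, 7 tab-separated columns).
// readCookieJar() is a hypothetical helper, not part of the original script.
function readCookieJar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // curl marks HttpOnly cookies with a "#HttpOnly_" prefix; other "#" lines are comments
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue;
        }
        // Columns: domain, subdomain flag, path, secure flag, expiry, name, value
        $parts = explode("\t", $line);
        if (count($parts) === 7) {
            $cookies[$parts[5]] = $parts[6];
        }
    }
    return $cookies;
}

// Made-up jar contents; the cookie names are ASP.NET defaults and are an assumption here
$sampleJar = "# Netscape HTTP Cookie File\n"
           . "#HttpOnly_www.mysite.com\tFALSE\t/\tTRUE\t0\tASP.NET_SessionId\tabc123\n"
           . "www.mysite.com\tFALSE\t/\tTRUE\t0\t.ASPXAUTH\tDEADBEEF\n";
$cookies = readCookieJar($sampleJar);
$hasAuthCookie = isset($cookies['.ASPXAUTH']);
```

On a live handle, curl_getinfo($ch, CURLINFO_HTTP_CODE) after each request is another useful signal: a 302 on the summary request suggests the server still considers the session unauthenticated.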

I am hoping that someone more familiar with PHP will spot something obvious that I am missing.

Here is the code:

<?php

require_once  ("Includes/simple_html_dom.php");

ini_set('display_errors', 'On');
error_reporting(E_ALL);

// Create curl connection
$url = 'https://www.mysite.com/account/login.htm';
$cookieFile = 'cookie.txt';
$ch = curl_init();

// We must request the login page and get the ViewState and EventValidation hidden values
// and pass those along in the post request.

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setOpt($ch, CURLOPT_REFERER, 'https://www.mysite.com/account/login.htm');
curl_setopt($ch, CURLOPT_HTTPHEADER,array('Origin: https://www.mysite.com', 'Host: www.mysite.com'));


$curl_scraped_page = curl_exec($ch);

// Grab ViewState and EventValidation data
$html = str_get_html($curl_scraped_page);
$viewState = $html->find("#__VIEWSTATE", 0);
$eventValidation = $html->find("#__EVENTVALIDATION", 0);
$previousPage = $html->find("#__PREVIOUSPAGE", 0);


//create array of data to be posted
// This matches exactly what I am seeing being posted when looking at Fiddler
$post_data['__EVENTTARGET'] = '';
$post_data['__EVENTARGUMENT'] = '';
$post_data['__VIEWSTATE'] = $viewState->value;
$post_data['__EVENTVALIDATION'] = $eventValidation->value;
$post_data['__PREVIOUSPAGE'] = $previousPage->value;
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$LoginFields$txtUsername'] = 'bsmith';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$LoginFields$txtPassword'] = 'Weez442';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$LoginFields$chkLoginPersist'] = 'on';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$btnLogin'] = 'Login >';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateTopHeader$IncludeHeader$LoginModal$LoginFields$txtModalUsername'] = '';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateTopHeader$IncludeHeader$LoginModal$LoginFields$txtModalPassword'] = '';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateTopHeader$IncludeHeader$SearchForm$inputText'] = '';

//traverse array and prepare data for posting (key1=value1)
foreach ( $post_data as $key => $value) {
    $post_items[] = rawurlencode($key) . '=' . rawurlencode($value);
}

//create the final string to be posted using implode()
$post_string = implode ('&', $post_items);

//Set options for post
curl_setOpt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch,CURLOPT_HTTPHEADER,array('Origin: https://www.mysite.com', 'Host: www.mysite.com', 'Content-Type: application/x-www-form-urlencoded'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_URL, $url);   
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
curl_setOpt($ch, CURLOPT_REFERER, 'https://www.mysite.com/account/login.htm');

// Perform our post request
$curl_scraped_page = curl_exec($ch);

echo $curl_scraped_page;

// Now get our account summary page
$urlAcctSummary = "https://www.mysite.com/my-account/summary.htm";
//Set options
curl_setOpt($ch, CURLOPT_HTTPGET, TRUE);
curl_setOpt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlAcctSummary);   
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); 

$curl_scraped_page = curl_exec($ch);

echo $curl_scraped_page;

curl_close($ch);

?>

1 Answer:

Answer 0 (score: 2)

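I figured it out. I adjusted the code in a few ways, but I believe the root of my problem was that ASP.NET wants to set a session cookie on the first GET request, whereas I had set CURLOPT_COOKIEJAR only for the POST request and CURLOPT_COOKIEFILE only for the final GET request.

Once I set CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE before the first GET request, it worked as designed.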
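The idea can be isolated in a small sketch: turn on curl's cookie handling when the handle is created, before any request is made, and reuse that one handle for the GET, the POST, and the final GET. createSessionHandle() and the temp-file path are hypothetical names chosen for illustration.

```php
<?php
// Sketch: enable curl's cookie engine before the first request, so cookies set by the
// initial GET are sent again on the POST and on later GETs made with the same handle.
// createSessionHandle() is a hypothetical helper, not part of the original script.
function createSessionHandle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // load any saved cookies and turn the cookie engine on
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies here when the handle is released
    ]);
    return $ch;
}

// The path is an assumption; any writable location works
$ch = createSessionHandle(sys_get_temp_dir() . '/cookie-sketch.txt');
// Per request, only the URL and method options would change, for example:
// curl_setopt($ch, CURLOPT_URL, $url); $page = curl_exec($ch);
```

Because all three requests share one handle, cookies are carried in memory between them; the jar file mainly matters if a later script run needs to reuse the session.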

Here is what my code looks like after moving them:

<?php

require_once  ("Includes/simple_html_dom.php");

ini_set('display_errors', 'On');
error_reporting(E_ALL);

// Create curl connection
$url = 'https://www.mysite.com/account/login.htm';
$cookieFile = 'cookie.txt';
$ch = curl_init();

// We must request the login page and get the ViewState and EventValidation hidden values
// and pass those along in the post request.

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Note: this turns off TLS certificate verification
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.mysite.com/account/login.htm');
curl_setopt($ch, CURLOPT_HTTPHEADER,array('Origin: https://www.mysite.com', 'Host: www.mysite.com'));
// Enable cookie handling before the first GET so the cookies set by that response are kept
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);


$curl_scraped_page = curl_exec($ch);

// Grab ViewState and EventValidation data
$html = str_get_html($curl_scraped_page);
$viewState = $html->find("#__VIEWSTATE", 0);
$eventValidation = $html->find("#__EVENTVALIDATION", 0);
$previousPage = $html->find("#__PREVIOUSPAGE", 0);


//create array of data to be posted
// This matches exactly what I am seeing being posted when looking at Fiddler
$post_data['__EVENTTARGET'] = '';
$post_data['__EVENTARGUMENT'] = '';
$post_data['__VIEWSTATE'] = $viewState->value;
$post_data['__EVENTVALIDATION'] = $eventValidation->value;
$post_data['__PREVIOUSPAGE'] = $previousPage->value;
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$LoginFields$txtUsername'] = 'bsmith';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$LoginFields$txtPassword'] = 'Weez442';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$LoginFields$chkLoginPersist'] = 'on';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateContent$MyAccountLogin967$btnLogin'] = 'Login >';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateTopHeader$IncludeHeader$LoginModal$LoginFields$txtModalUsername'] = '';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateTopHeader$IncludeHeader$LoginModal$LoginFields$txtModalPassword'] = '';
$post_data['ctl00$ctl00$cphMasterBody$cphPageTemplateTopHeader$IncludeHeader$SearchForm$inputText'] = '';

//traverse array and prepare data for posting (key1=value1)
$post_items = array();
foreach ( $post_data as $key => $value) {
    $post_items[] = rawurlencode($key) . '=' . rawurlencode($value);
}

//create the final string to be posted using implode()
$post_string = implode ('&', $post_items);

//Set options for post
curl_setOpt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch,CURLOPT_HTTPHEADER,array('Origin: https://www.mysite.com', 'Host: www.mysite.com', 'Content-Type: application/x-www-form-urlencoded'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_URL, $url);   
curl_setOpt($ch, CURLOPT_REFERER, 'https://www.mysite.com/account/login.htm');

// Perform our post request
$curl_scraped_page = curl_exec($ch);

echo $curl_scraped_page;

// Now get our account summary page
$urlAcctSummary = "https://www.mysite.com/my-account/summary.htm";
//Set options
curl_setOpt($ch, CURLOPT_HTTPGET, TRUE);
curl_setOpt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlAcctSummary);   
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);

$curl_scraped_page = curl_exec($ch);

echo $curl_scraped_page;

curl_close($ch);

?>