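I am trying to log in to a website with cURL and scrape some data from it. This is a homework project. However, the site has 3 different form data fields whose values change every time I log in.
Is it possible to get around that and log in, or is it simply impossible? If it is possible, could someone point me in the right direction?
The cURL code I have tried is: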
<?php
include("simple_html_dom.php");
$cofile = dirname(__FILE__).'/cookie.txt';
$postfield = array(
    "SM" => "UpPnlLogin|btnLogin",
    "__LASTFOCUS" => "",
    "__EVENTTARGET" => "btnLogin",
    "__EVENTARGUMENT" => "",
    "__VIEWSTATE" => "hly8ipIDyvfEpBj01vjkB/HmrAyIw+UuyvBkGc5NHMexWF+PvAVQZYkSrcwJM4rO9aaz93ogQuFxowVMDPueJz5DU3obstDtyl7KuLvZXQ+GJ1JKRGEtTTRl5vM2RIi7mwL+j3LRqHgl+ZW1wftsnt2qnUy7rrxSC6j0eoqabUM/hpS1hveORvLcEbo+5o1J+rW0+UYYnZ/cFQcUNhx5538uRaD8PIxq6GxTrT/qI2efDDLJB5qmmANILYPxsVg++dXFmQFD59MvETq+R3Om0g==",
    "__VIEWSTATEGENERATOR" => "CADA6983",
    "__EVENTVALIDATION" => "y2iWoj4pBfE6Ij55U/HfSq/mWPNVk4Hv4Nvg7IDxuN6KElLeNsq4iUIbHMfGQS8s6oProuk3wXUrqQWG6VleouPj+M3LLkKYR8XhLzmwe4Cck3tqa/YpGmNLZiNOLkbN4/RhPFq+onAiQ2GDc4gHlU5aU94WwONQ9ItyzsH4V111bPhKX3gjr9YXhpPg9UiyWwkNXohLJSWRM9jGfHrgMg==",
    "txtCustNo" => "username",
    "txtPassword" => "password",
    "__ASYNCPOST" => "true",
    "btnLogin" => "Нэвтрэх"
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, $cofile);
curl_setopt($ch, CURLOPT_URL, "https://e.khanbank.com/"); // URL that is requested when logging in
curl_setopt($ch, CURLOPT_REFERER, "https://e.khanbank.com/"); // CURLOPT_REFERER
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postfield));
ob_start(); // prevent any output
curl_exec ($ch); // execute the curl command
ob_end_clean(); // stop preventing output
curl_close ($ch);
unset($ch);
$ch = curl_init();
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cofile);
curl_setopt($ch, CURLOPT_URL, "https://e.khanbank.com/pageMain?content=ucMain_Welcome");
$result = curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
Answer 0 (score: 1)
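You cannot hardcode these values: they change on every login and are tied to the cookie session. That means the EVENTVALIDATION you got from your browser is tied to your browser's cookie session and will not work with curl.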
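To illustrate reading those fields fresh from each response instead of hardcoding them, here is a minimal offline sketch. It runs against a made-up HTML fragment, not the bank's real page; the helper name extract_form_fields() and the placeholder token values "AAA" and "BBB" are assumptions for illustration only.

```php
<?php
declare(strict_types=1);

/**
 * Collect name => value pairs from every named <input> inside the form
 * with the given id. extract_form_fields() is a hypothetical helper name,
 * not part of any library.
 */
function extract_form_fields(string $html, string $formId): array
{
    $dom = new DOMDocument();
    // Suppress warnings about imperfect real-world markup.
    if (!@$dom->loadHTML($html)) {
        throw new RuntimeException('could not parse HTML');
    }
    $xp = new DOMXPath($dom);
    $fields = [];
    foreach ($xp->query("//form[@id='" . $formId . "']//input[@name]") as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Made-up sample markup; the token values are placeholders, not real tokens.
$sample = <<<'HTML'
<html><body>
<form id="Form1" method="post">
  <input type="hidden" name="__VIEWSTATE" value="AAA">
  <input type="hidden" name="__EVENTVALIDATION" value="BBB">
  <input type="text" name="txtCustNo" value="">
</form>
</body></html>
HTML;

$fields = extract_form_fields($sample, 'Form1');
var_dump($fields);
```

Against a live site the HTML would come from the GET request for the login page, made with the same cookie session that later sends the POST.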
To write an example: first add this function somewhere, you will need it (it makes DOMDocument load the HTML as UTF-8, which is not DOMDocument's default, but khanbank uses UTF-8):
function my_dom_loader(string $html): \DOMDocument
{
    $html = trim($html);
    if (empty($html)) {
        //....
    }
    if (false === stripos($html, '<?xml encoding=')) {
        $html = '<?xml encoding="UTF-8">' . $html;
    }
    $ret = new DOMDocument('', 'UTF-8');
    $ret->preserveWhiteSpace = false;
    $ret->formatOutput = true;
    if (!(@$ret->loadHTML($html, LIBXML_NOBLANKS | LIBXML_NONET | LIBXML_BIGLINES))) {
        throw new \Exception("failed to create DOMDocument from input html!");
    }
    $ret->preserveWhiteSpace = false;
    $ret->formatOutput = true;
    return $ret;
}
Then create an hhb_curl handle:
<?php
declare (strict_types = 1);
require_once('hhb_.inc.php');
$hc = new hhb_curl('', true);
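Now, khanbank.com uses a browser whitelist: if you are not using a whitelisted browser, you cannot log in. One example of a whitelisted browser is Google Chrome 75 x64, so impersonate it by setting the user agent: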
$hc->setopt(CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36');
Next, fetch the login page to get the cookies and the EVENTVALIDATION data:
$html = $hc->exec('https://e.khanbank.com/')->getStdOut();
Now that we have the EVENTVALIDATION data in the HTML, we need to parse it out:
$domd = my_dom_loader($html);
$xp = new DOMXPath($domd);
$form = $domd->getElementById("Form1");
$post_data = array();
foreach ($form->getElementsByTagName("input") as $input) {
    $post_data[$input->getAttribute("name")] = $input->getAttribute("value");
}
assert(isset($post_data['txtCustNo']), "ERROR: COULD NOT FIND USERNAME INPUT!");
assert(isset($post_data['txtPassword']), "ERROR: COULD NOT FIND PASSWORD INPUT!");
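One caveat with the assert() checks above: assert() can be switched off through the zend.assertions ini setting, as production configurations often do, in which case a missing field would go unnoticed. A minimal sketch of an explicit check follows; require_fields() is a hypothetical helper name, and the sample array is placeholder data standing in for the parsed form.

```php
<?php
declare(strict_types=1);

/**
 * Throw if any expected field name is missing from the parsed form data.
 * require_fields() is a hypothetical helper, not part of hhb_curl or PHP.
 */
function require_fields(array $postData, array $names): void
{
    // array_key_exists (unlike isset) also accepts fields whose value is null.
    $missing = array_values(array_filter(
        $names,
        static fn (string $name): bool => !array_key_exists($name, $postData)
    ));
    if ($missing !== []) {
        throw new RuntimeException('missing form fields: ' . implode(', ', $missing));
    }
}

// Placeholder data standing in for the parsed login form.
$sample_post_data = ['txtCustNo' => '', 'txtPassword' => ''];
require_fields($sample_post_data, ['txtCustNo', 'txtPassword']); // passes silently
```

Unlike assert(), this check runs regardless of the ini configuration, at the cost of a few extra lines.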
Now $post_data contains:
array (
'__VIEWSTATE' => '9GT5O4HrKQJrWbF7PRSXu9RiMlpkqY5hO+sN9H0OXxmwYjWMfr2uf4yIgpHtk9sp56RWot30dvKeuGF3+eoOhpNu5nsuGBjtrpb8g8AGMaDbQ0nxpEKS3HILkqccMwFfn7y0LThLfjm0Ow84RGosJa+/5iM9YfP/HFM5HnyHKGJkM84nGEh7QZfoGYwMOU9SSb5dKmxfnmrIo/xXUUh4DT8+LOFGCQ2H5+nPFudTonwfgX6AKBNhkRijlfrUY+ns7HMq699AU38bsaxgD67KEw==',
'__VIEWSTATEGENERATOR' => 'CADA6983',
'__EVENTVALIDATION' => '4FZipDfTouUXBNMfIqlf/SXhPNyW5SBkcH/JIZB/j8kdaJUlMAQzvodpEq2n6WBRvxs6IBGVASOFouDQbqjygKK8+01KbRa9CpEGRiYGdxSIlt0wbZ2wJZeN6kB2ncn2DSd3C3nymCcz1kGHIdR3Dy5l2OlS6JngVCVoXuhpDzsjDQbrRwHST85XOlXdF6jl8/aQPYkSlZkSRQ5BFzdbnw==',
'txtCustNo' => '',
'txtPassword' => '',
'chkRemUser' => '',
)
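These are tied to this particular cookie session, so you have to parse them out of the HTML every time; you cannot hardcode them. Some variables are still missing, though (because they are set with JavaScript rather than in the HTML), so add the following: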
$post_data['SM'] = 'UpPnlLogin|btnLogin';
$post_data['__LASTFOCUS'] = '';
$post_data['__EVENTARGUMENT'] = '';
$post_data['__EVENTTARGET'] = 'btnLogin';
$post_data['__ASYNCPOST'] = 'true';
Now set the username and password:
$post_data['txtCustNo'] = "username";
$post_data['txtPassword'] = "password";
Then send the actual login request:
$html = $hc->setopt_array(array(
    CURLOPT_POST => 1,
    CURLOPT_POSTFIELDS => http_build_query($post_data),
    CURLOPT_URL => 'https://e.khanbank.com/'
))->exec()->getStdOut();
Finally, check for login errors:
$domd = my_dom_loader($html);
$xp = new DOMXPath($domd);
$login_errors = array();
//uk-alert uk-alert-warning
foreach ($xp->query("//*[contains(@class,'alert')]") as $login_error) {
    $login_error = trim($login_error->textContent);
    if (!empty($login_error)) {
        $login_errors[] = $login_error;
    }
}
if (!empty($login_errors)) {
    var_dump($login_errors);
    throw new \RuntimeException("login errors: " . json_encode($login_errors, JSON_PRETTY_PRINT));
}
echo "logged in successfully! :)";
which yields:
$ php wtf4.php
array(1) {
[0]=>
string(69) "Нэвтрэх нэр эсвэл нууц үг буруу байна!"
}
PHP Fatal error: Uncaught RuntimeException: login errors: [
"\u041d\u044d\u0432\u0442\u0440\u044d\u0445 \u043d\u044d\u0440 \u044d\u0441\u0432\u044d\u043b \u043d\u0443\u0443\u0446 \u04af\u0433 \u0431\u0443\u0440\u0443\u0443 \u0431\u0430\u0439\u043d\u0430!"
] in /cygdrive/c/projects/misc/wtf4.php:63
Stack trace:
#0 {main}
thrown in /cygdrive/c/projects/misc/wtf4.php on line 63
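The \u0431\u0430\u0439\u043d\u0430 sequences also look odd. That is not a limitation of PHP's exception messages: json_encode() escapes non-ASCII characters as \uXXXX by default, and the error message is written in Cyrillic script (it is Mongolian, roughly "Login name or password is incorrect!"). Passing JSON_UNESCAPED_UNICODE to json_encode() would keep the text readable.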
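The escaped output comes from json_encode() rather than from the exception itself. A small sketch of the difference, using the same error string as in the output above; the variable names are only for illustration:

```php
<?php
declare(strict_types=1);

$errors = ["Нэвтрэх нэр эсвэл нууц үг буруу байна!"];

// Default behaviour: non-ASCII characters become \uXXXX escape sequences.
$escaped = json_encode($errors);

// With JSON_UNESCAPED_UNICODE the characters are emitted literally as UTF-8.
$readable = json_encode($errors, JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL;
echo $readable, PHP_EOL;
```

In the login check, the flag can be combined with the existing one as JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE.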