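Logging in to a web page using cURL

Asked: 2016-02-05 12:40:43

Tags: php curl web-scraping

I bought a book about web scraping with PHP. In it, the author logs in to https://www.packtpub.com/. The book is out of date, so I can't really test its examples, because the site has changed since it was published. Below is the modified code I'm using, but the login does not succeed: the string "Account Options" is not in the $results variable. What should I change? I think the error comes from specifying the destination incorrectly.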

<?php
// Function to submit form using cURL POST method
function curlPost($postUrl, $postFields, $successString) {
    // Setting the user agent of a popular browser (kept on one line so the header contains no line break)
    $useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';
    $cookie = 'cookie.txt';  // Setting a cookie file to store cookies
    $ch = curl_init();  // Initialising cURL session
    // Setting cURL options
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);  // Prevent cURL from verifying the SSL certificate
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);  // Fail on HTTP status codes >= 400 instead of returning the error page
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);  // Start a new cookie session (ignore stored session cookies)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // Follow Location: headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // Return the transfer as a string
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);  // Setting the cookie file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);  // Setting the cookie jar
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);  // Setting the user agent
    curl_setopt($ch, CURLOPT_URL, $postUrl);  // Setting URL to POST to
    curl_setopt($ch, CURLOPT_POST, TRUE);  // Setting method as POST
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));  // Setting POST fields as a URL-encoded string
    $results = curl_exec($ch);  // Executing cURL session
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo "$httpcode";
    curl_close($ch);  // Closing cURL session
    // Checking if login was successful by checking existence of string.
    // strpos() returns 0 for a match at the start, so compare against false explicitly.
    if ($results !== false && strpos($results, $successString) !== false) {
        echo "I'm in.";
        return $results;
    } else {
        echo "Nope, sth went wrong.";
        return FALSE;
    }
}

$userEmail = 'youremail@email.com';  // Setting your email address for site login
$userPass = 'yourpass';  // Setting your password for site login
$postUrl = 'https://www.packtpub.com';  // Setting URL to POST to
// Setting form input fields as 'name' => 'value'
$postFields = array(
        'email' => $userEmail,
        'password' => $userPass,
        'destination' => 'https://www.packtpub.com',
        'form_id' => 'packt-user-login-form'
);
$successString = 'Account Options';
$loggedIn = curlPost($postUrl, $postFields, $successString);  // Executing curlPost login and storing results page in $loggedIn

Edit: the POST request:

[image: screenshot of the POST request]

I replaced

'destination' => 'https://www.packtpub.com'

with

'op' => 'Login'

added

'form_build_id' => ''

and changed the URL to

$postUrl = 'https://www.packtpub.com/register';

since that is the URL I got when I chose "Copy as cURL" and pasted the result into an editor.

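I still get the "Nope, sth went wrong." message. I think this is because $successString is never present in the cURL response in the first place. What should form_build_id be set to? It changes on every login.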

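2 answers:

Answer 0: (score: 2)

I'm posting this answer because I think it may help you whenever you run into problems like this. I do this regularly when writing web scrapers.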

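  1. Open Firefox and press CTRL + SHIFT + Q.
  2. Select the Network tab.
  3. Visit the website. You will see the HTTP requests being monitored.
  4. Log in successfully while the HTTP requests are being monitored.
  5. Once logged in, right-click the HTTP request that was made at login and choose Copy as cURL.
  6. Now you have the cURL request. Replicate that HTTP request with PHP's cURL and test again.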

    For web scraping you should be very comfortable with monitoring HTTP headers. You can use:

    • Network Monitor (Chrome, Firefox)

    • Fiddler

    • Wireshark

    • MITMProxy

    • Charles

    and so on.
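
Step 6 above can be sketched roughly as follows. This is only an illustration, not the exact request Packt expects: the helper name buildCurlOptions, the Referer header, and the field values are assumptions made up for the example. Substitute whatever your own "Copy as cURL" output contains.

```php
<?php
// Hypothetical helper: maps the pieces of a copied "Copy as cURL" command
// (the URL, the -H headers, the --data fields) onto PHP cURL options.
function buildCurlOptions(string $url, array $headers, array $postFields): array
{
    // Each "-H 'Name: value'" flag becomes a "Name: value" string
    // in the list passed to CURLOPT_HTTPHEADER.
    $headerLines = [];
    foreach ($headers as $name => $value) {
        $headerLines[] = $name . ': ' . $value;
    }

    return [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPHEADER     => $headerLines,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($postFields), // URL-encoded body
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => 'cookie.txt',
        CURLOPT_COOKIEJAR      => 'cookie.txt',
    ];
}

// Example values only; copy the real ones from the browser's network monitor.
$options = buildCurlOptions(
    'https://www.packtpub.com/register',
    ['Referer' => 'https://www.packtpub.com/register'],
    ['email' => 'youremail@email.com', 'password' => 'yourpass', 'op' => 'Login']
);

// To actually send the request:
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $response = curl_exec($ch);
// curl_close($ch);
```

Building the options array separately from sending the request makes it easy to compare it, field by field, with what the browser sent.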

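Answer 1: (score: 2)

The book you are using is out of date, and Packt Publishing has changed their website. It now includes a CSRF token, and you will never be able to log in without passing it.

I have developed a working solution. It uses pQuery to parse the HTML. You can install it with Composer, or download the package and include it in your application. If you do the latter, remove require __DIR__ . '/vendor/autoload.php'; and replace it with the location of the pQuery package on your system.

To test it from the command line, just run: php packt_example.php

You will also notice that many of the headers, such as the user agent, are not even needed. I have left them out.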

<?php

require __DIR__ . '/vendor/autoload.php';

$email = 'myemail@gmail.com';
$password = 'mypassword';

# Initialize a cURL session.
$ch = curl_init('https://www.packtpub.com/register');

# Set the cURL options.
$options = [
    CURLOPT_COOKIEFILE      => 'cookies.txt',
    CURLOPT_COOKIEJAR       => 'cookies.txt',
    CURLOPT_RETURNTRANSFER  => 1
];

# Set the options
curl_setopt_array($ch, $options);

# Execute
$html = curl_exec($ch);

# Grab the CSRF token from the HTML source
$dom = pQuery::parseStr($html);
$csrfToken = $dom->query('[name="form_build_id"]')->val();

# Now that we have the form_build_id (aka the CSRF token) we can
# proceed with making the POST request to log in. First,
# let's create an array of post data to send with the POST
# request.
$postData = [
    'email'         => $email,
    'password'      => $password,
    'op'            => 'Login',
    'form_build_id' => $csrfToken,
    'form_id'       => 'packt_user_login_form'
];


# Convert the post data array to URL encoded string
$postDataStr = http_build_query($postData);

# Append some fields to the CURL options array to make a POST request.
$options[CURLOPT_POST] = 1;
$options[CURLOPT_POSTFIELDS] = $postDataStr;
$options[CURLOPT_HEADER] = 1;

curl_setopt_array($ch, $options);

# Execute
$response = curl_exec($ch);

# Extract the headers from the response
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$headers = substr($response, 0, $headerSize);

# Close cURL handle
curl_close($ch);

# If login is successful, the headers will contain a location header
# to the url http://www.packtpub.com/index
# (strpos() is compared against false explicitly so a match at offset 0 still counts)
if (strpos($headers, 'packtpub.com/index') === false)
{
    print 'Login Failed';
    exit;
}

print 'Logged In';
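
If installing pQuery is not an option, the token-extraction step can probably be done with PHP's bundled DOM extension instead. The following is a minimal sketch under that assumption, not part of the solution above: the helper name extractFormBuildId and the sample markup are made up for illustration, and the real login page may structure its form differently.

```php
<?php
// Hypothetical helper: returns the value of the hidden form_build_id input,
// or null if the page does not contain one.
function extractFormBuildId(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Find the first <input> whose name attribute is form_build_id.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="form_build_id"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return $nodes->item(0)->getAttribute('value');
}

// Made-up sample markup standing in for the login page source.
$sampleHtml = '<form><input type="hidden" name="form_build_id" value="form-abc123"></form>';
$token = extractFormBuildId($sampleHtml);
```

In the script above, the result of extractFormBuildId($html) would take the place of $csrfToken. A null return is a signal that the page layout has changed again.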