Question

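Edit: Why the downvote?

This is what I'm trying to do:

I'm trying to log in to my school's website with cURL and grab the schedule so I can use it for my AI.

So I need to log in with my password and number, but the form on the school site also needs a hidden "token".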

<form action="index.php" method="post">
    <input type="hidden" name="token" value="becb14a25acf2a0e697b50eae3f0f205" />
    <input type="text" name="user" />
    <input type="password" name="password" />
    <input type="submit" value="submit">
</form>

I'm able to retrieve the token successfully. Then I try to log in, but it fails.

// Getting the whole website
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.school.com');
$data = curl_exec($ch);

// Retrieving the token and putting it in a POST
$regex = '/<regexThatWorks>/';
preg_match($regex,$data,$match);
$postfields = "user=<number>&password=<secret>&token=$match[1]";

// Should I use a fresh cURL here?

// Setting the POST options, etc.
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);

// I won't use CURLOPT_RETURNTRANSFER yet, first I want to see results. 
$data = curl_exec($ch);

curl_close($ch);

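Well... it doesn't work...

Is it possible that the token changes on every curl_exec? Because the site doesn't recognize the script the second time...
Should I create a new cURL instance(?) for the second part?
Is there another way to get the token within one connection?
Caching?

Sorry for my terrible English, I'm Dutch.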

Answer 1

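What error message are you getting? Independent of that: your school's website might check the Referer header and make sure that the request is coming from (an application pretending to be...) its login page.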
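If the Referer check turns out to be the problem, a minimal sketch could look like the one below. The helper name `buildLoginOptions`, the URL and the POST body are placeholders I made up for illustration; only the `CURLOPT_*` constants and `curl_setopt_array` come from PHP's cURL extension.

```php
<?php
// Hypothetical helper: collects the cURL options for the login POST,
// including a Referer header that points at the login page itself.
function buildLoginOptions(string $loginUrl, string $postBody): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postBody,
        CURLOPT_REFERER        => $loginUrl, // sent as the "Referer:" request header
        CURLOPT_RETURNTRANSFER => true,      // return the page as a string
    ];
}

// Placeholder values, not the real site or credentials.
$options = buildLoginOptions(
    'http://www.school.com/index.php',
    'user=123&password=secret&token=abc'
);

$ch = curl_init();
curl_setopt_array($ch, $options);
// $html = curl_exec($ch); // uncomment to actually send the request
```

Whether the site really checks the Referer is a guess; comparing the headers your browser sends with the ones your script sends is the quickest way to find out.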

Answer 2

This is how I solved it. The problem was probably the 'not-using-cookies' part. It's still "ugly" code, so any improvements are welcome!

// This part is for retrieving the token from the hidden field.
// To be honest, I have no idea what the cookie lines actually do, but it works.
$getToken = curl_init();
curl_setopt($getToken, CURLOPT_URL, '<schoolsite>');       // Set the link
curl_setopt($getToken, CURLOPT_COOKIEJAR, 'cookies.txt');  // Magic (cookies are written to this file on close)
curl_setopt($getToken, CURLOPT_COOKIEFILE, 'cookies.txt'); // Magic (cookies are read from this file)
curl_setopt($getToken, CURLOPT_RETURNTRANSFER, 1);         // Return the page as a string instead of printing it
$data = curl_exec($getToken);                              // Perform action

// Print the error if there is one, then close the connection either way
if (curl_errno($getToken)) { print curl_error($getToken); }
curl_close($getToken);

// Use a regular expression to fetch the token
$regex = '/name="token" value="(.*?)"/';
preg_match($regex,$data,$match);

// Put the login info and the token in a POST body string
// (field names as in the form shown in the question)
$postfield = "token=$match[1]&user=<number>&password=<mine>";
echo($postfield);

// This part is for logging in and getting the data.
$site = curl_init();
curl_setopt($site, CURLOPT_URL, '<schoolsite>');
curl_setopt($site, CURLOPT_COOKIEJAR, 'cookies.txt');    // Magic (cookies are written to this file on close)
curl_setopt($site, CURLOPT_COOKIEFILE, 'cookies.txt');   // Magic (sends the cookies saved by the first request)
curl_setopt($site, CURLOPT_POST, 1);                     // Use POST (not GET)
curl_setopt($site, CURLOPT_POSTFIELDS, $postfield);      // Set the POST body
$forevil_uuh_no_GOOD_purposes = curl_exec($site);        // Without CURLOPT_RETURNTRANSFER the page is printed directly

// Close connection if no errors           
if(curl_errno($site)){print curl_error($site);}
else{curl_close($site);}
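As for the "should I use a fresh cURL?" question: a second handle plus a cookie file works, but one handle with cURL's in-memory cookie engine should work too. Below is a rough sketch under that assumption. The function names are made up for illustration, the field names follow the form in the question, and I have not run it against the real site.

```php
<?php
// Extract the hidden token with the same regular expression used above.
function extractToken(string $html): ?string
{
    return preg_match('/name="token" value="(.*?)"/', $html, $m) ? $m[1] : null;
}

// Build a URL-encoded POST body, so special characters in the password survive.
function buildLoginBody(string $token, string $user, string $password): string
{
    return http_build_query(['token' => $token, 'user' => $user, 'password' => $password]);
}

// Fetch the form and log in on ONE handle, so the session cookie is kept in between.
function login(string $url, string $user, string $password): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');       // empty string: enable the cookie engine without a file
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return pages as strings

    $form = curl_exec($ch);                         // first request: GET the login form
    if ($form === false) {
        throw new RuntimeException(curl_error($ch));
    }

    $token = extractToken($form);
    if ($token === null) {
        throw new RuntimeException('token field not found');
    }

    curl_setopt($ch, CURLOPT_POST, true);           // second request: POST on the same handle
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildLoginBody($token, $user, $password));
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException(curl_error($ch));
    }
    return $result;
}

// $page = login('<schoolsite>', '<number>', '<secret>'); // performs real HTTP requests
```

The token is most likely tied to the session cookie, which is why fetching the form and posting it without cookies fails: the server sees the POST as a brand-new visitor.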

Answer 3

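When you build a scraper, you can create your own classes to handle what you need to do in your domain. You can start by creating your own set of request and response classes that deal with what you need.

Creating your own request class lets you implement the curl requests the way you need them. Creating your own response class can help you access/parse the returned HTML.

This is a simple usage example of some classes I created for a demo: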

# simple get request
$request = new MyRequest('http://hakre.wordpress.com/');
$response = new MyResponse($request);
foreach($response->xpath('//div[@id="container"]//div[contains(concat(" ", normalize-space(@class), " "), " post ")]') as $node)
{
    if (!$node->h2->a) continue;
    echo $node->h2->a, "\n<", $node->h2->a['href'] ,">\n\n"; 
}

It returns my blog posts:

Will Automattic join Dec 29 move away from GoDaddy day?
<http://hakre.wordpress.com/2011/12/23/will-automattic-join-dec-29-move-away-from-godaddy-day/>

PHP UTF-8 string Length
<http://hakre.wordpress.com/2011/12/13/php-utf-8-string-length/>

Title belongs into Head
<http://hakre.wordpress.com/2011/11/02/title-belongs-into-head/>

...

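Sending a GET request is then easy, and the response can be accessed easily with an XPath expression (here via SimpleXML). XPath is useful for selecting the token from the form field, because it lets you query the document's data more easily than a regular expression does.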
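To make the XPath point concrete without the demo classes, here is a small sketch that pulls the token out of the form from the question. It uses PHP's built-in DOMDocument and DOMXPath rather than SimpleXML, and the function name `extractTokenWithXPath` is just something I picked for the example.

```php
<?php
// Select the value attribute of the hidden "token" input with XPath.
function extractTokenWithXPath(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely clean; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="token"]/@value');

    return ($nodes !== false && $nodes->length > 0) ? $nodes->item(0)->nodeValue : null;
}

// The form from the question, used here as test input.
$html = '<form action="index.php" method="post">
    <input type="hidden" name="token" value="becb14a25acf2a0e697b50eae3f0f205" />
    <input type="text" name="user" />
    <input type="password" name="password" />
    <input type="submit" value="submit">
</form>';

echo extractTokenWithXPath($html), "\n"; // prints becb14a25acf2a0e697b50eae3f0f205
```

Unlike the regular expression, this keeps working if the site reorders the attributes or switches between single and double quotes.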

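Sending a POST request was the next thing to build. I tried writing a login script for my blog, and it turned out to work quite well. I also needed to parse the response headers, so I added some more routines to my request and response classes.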

# simple post request
$request = new MyRequest('https://example.wordpress.com/wp-login.php');
$postFields = array(
    'log' => 'username', 
    'pwd' => 'password',
);
$request->setPostFields($postFields);
$response = new MyResponse($request->returnHeaders(1)->execute());
echo (string) $response; # output to view headers
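The header-parsing routines themselves are not shown here. As a rough idea of what such a routine can look like, the sketch below splits a raw response (as cURL returns it when CURLOPT_HEADER is enabled) into status line, headers and body. The function name `splitResponse` is hypothetical, and the sketch assumes a single header block, so responses with followed redirects or a "100 Continue" would need extra handling.

```php
<?php
// Split "headers + blank line + body" into its parts.
// Header names are lowercased; repeated headers (e.g. Set-Cookie) are collected in a list.
function splitResponse(string $raw): array
{
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');

    $lines  = explode("\r\n", $head);
    $status = array_shift($lines); // e.g. "HTTP/1.1 302 Found"

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))][] = trim($value);
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Made-up sample response for illustration.
$raw = "HTTP/1.1 302 Found\r\nLocation: /wp-admin/\r\nSet-Cookie: a=1\r\nSet-Cookie: b=2\r\n\r\n<html></html>";
$parsed = splitResponse($raw);
print_r($parsed['headers']['set-cookie']);
```

For a login, the interesting headers are usually Set-Cookie (the session) and Location (where the site sends you after a successful login).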

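Considering your scenario, you might want to edit your own request class to better deal with what you need. I already make use of cookies, since you use them as well. So some code based on these classes for your scenario could look like the following: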

# input values
$url = '<schoolsite>';
$user  = '<number>';
$password = '<secret>';

# execute the first get request to obtain token
$response = new MyResponse(new MyRequest($url));
$token = (string) $response->xpath('//input[@name="token"]/@value');

# execute the second login post request
$request = new MyRequest($url);
$postFields = array(
    'user' => $user, 
    'password' => $password,
    'token' => $token
);
$request->setPostFields($postFields)->execute();

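Demo and code as a gist.

If you want to improve this further, the next step would be to create a "school service" class for yourself that you can use to fetch the schedule from: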

class MySchoolService
{
    private $url, $user, $pass;
    private $isLoggedIn;
    public function __construct($url, $user, $pass)
    {
        $this->url = $url;
        ...
    }
    public function getSchedule()
    {
        $this->ensureLogin();

        # your code to obtain the schedule, e.g. in form of an array.
        $schedule = ...

        return $schedule;
    }
    private function ensureLogin($reuse = TRUE)
    {
        if ($reuse && $this->isLoggedIn) return;

        # execute the first get request to obtain token
        $response = new MyResponse(new MyRequest($this->url));
        $token = (string) $response->xpath('//input[@name="token"]/@value');

        # execute the second login post request
        $request = new MyRequest($this->url);
        $postFields = array(
            'user' => $this->user, 
            'password' => $this->pass,
            'token' => $token
        );
        $request->setPostFields($postFields)->execute();

        $this->isLoggedIn = TRUE;
    }
}

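Once you have nicely wrapped the request/response logic in the MySchoolService class, you only need to instantiate it with the proper configuration and you can easily use it inside your website: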

$school = new MySchoolService('<schoolsite>', '<number>', '<secret>');
$schedule = $school->getSchedule();

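Your main script only uses MySchoolService.

MySchoolService takes care of using MyRequest and MyResponse objects.

MyRequest takes care of doing the HTTP requests (here with cURL), including cookies and such.

MyResponse helps a bit with parsing the HTTP responses.

Compare this with a standard internet browser: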

Browser: Handles cookies and sessions, does HTTP requests and parses responses.

MySchoolService: Handles cookies and sessions for your school, does HTTP requests and parses responses.

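So you can now use your school browser in your script to do what you want. If you need more options, you can easily extend it.

I hope this is helpful. The starting point was to avoid writing the same lines of cURL code over and over again, and to give you a better interface for parsing the return values. MySchoolService is some sugar on top that makes things easy to deal with in your own website/application code.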
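Since the request and response classes themselves are not reproduced in this answer, here is a deliberately small sketch of how such a pair could be structured. These are hypothetical stand-ins (named SketchRequest and SketchResponse to avoid confusion), not the MyRequest/MyResponse code from the gist. The cookie file name is an assumption as well.

```php
<?php
// Hypothetical stand-in for a request class: one cURL call per execute(),
// with cookies shared between requests through a cookie file.
class SketchRequest
{
    private $url;
    private $cookieFile;
    private $postFields = null;

    public function __construct(string $url, string $cookieFile = 'cookies.txt')
    {
        $this->url = $url;
        $this->cookieFile = $cookieFile;
    }

    public function setPostFields(array $fields): self
    {
        $this->postFields = $fields;
        return $this; // allow chaining: ->setPostFields(...)->execute()
    }

    public function execute(): string
    {
        $ch = curl_init($this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookieFile);  // write cookies here
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookieFile); // read cookies from here
        if ($this->postFields !== null) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($this->postFields));
        }
        $body  = curl_exec($ch);
        $error = curl_error($ch);
        unset($ch); // releasing the handle lets cURL write the cookie jar
        if ($body === false) {
            throw new RuntimeException($error);
        }
        return $body;
    }
}

// Hypothetical stand-in for a response class: wraps the HTML for XPath queries.
class SketchResponse
{
    private $xpath;

    public function __construct(string $html)
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        $this->xpath = new DOMXPath($doc);
    }

    // Return the text of the first node matching the expression, or null.
    public function first(string $expression): ?string
    {
        $nodes = $this->xpath->query($expression);
        return ($nodes !== false && $nodes->length > 0) ? $nodes->item(0)->nodeValue : null;
    }
}

// Usage sketch (performs real HTTP requests, so it is left commented out):
// $form  = new SketchResponse((new SketchRequest('<schoolsite>'))->execute());
// $token = $form->first('//input[@name="token"]/@value');
// (new SketchRequest('<schoolsite>'))
//     ->setPostFields(['user' => '<number>', 'password' => '<secret>', 'token' => $token])
//     ->execute();
```

Keeping the HTTP part and the parsing part in separate classes means the parsing can be tried out on saved HTML without touching the network.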

Can't get my schedule data from my school site. Login with cURL won't work
