Combining PHP cURL and DOM

Time: 2016-03-13 04:58:04

Tags: php dom curl

I have a script that combines cURL and DOM. My code:

<?php

// Create temp file to store cookies
$ckfile = tempnam ("/tmp", "CURLCOOKIE");

// URL to login page
$url = "https://www.investagrams.com/login";

// Get Login page and its cookies and save cookies in the temp file
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Accepts all CAs
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
#$output = curl_exec($ch);

$fields = array(
'ctl00$WelcomePageMainContent$ctl00$Username' => '********',
'ctl00$WelcomePageMainContent$ctl00$Password' => '********',
);

$fields_string = '';
foreach($fields as $key=>$value) {
$fields_string .= $key . '=' . $value . '&';
}
rtrim($fields_string, '&');

// Post login form and follow redirects
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Accepts all CAs
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, count($fields));
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 
#$output = curl_exec($ch);

$url = "https://www.investagrams.com/Stock/RealTimeMonitoring";
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Accepts all CAs
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
#echo $output;

$dom = new DomDocument;
$dom->loadHtmlFile($output);

$xpath = new DomXPath($dom);

// collect header names
$headerNames = array();
foreach ($xpath->query('//table[@id="StockQuoteTable"]//th') as $node) {
$headerNames[] = $node->nodeValue;
}

// collect data
$data = array();
foreach ($xpath->query('//tbody[@id="StockQuoteTable:tbody_element"]/tr')  as $node) {
$rowData = array();
foreach ($xpath->query('td', $node) as $cell) {
    $rowData[] = $cell->nodeValue;
}

$data[] = array_combine($headerNames, $rowData);
}

print_r($data);


?>

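This only prints an empty "Array ( )". I don't know which part is wrong. The cURL part works 100%; the error is in the DOM part. Thanks. Here is the markup of the table I want to extract: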

<div class="dataTables_scrollBody" style="overflow: auto; height: 300px;  width: 100%;">


<table id="StockQuoteTable" class="table dataTable no-footer" role="grid" aria-describedby="StockQuoteTable_info" style="width: 1166px;">
    <thead></thead>
    <tbody>
        <tr id="num1" class="odd" role="row"

1 Answer:

Answer 0 (score: 0)

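I was able to find part of the problem in your code, but the HTML returned by the cURL request also seems to have issues that prevent DOMXPath::query from returning any matches.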
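
One possible reason, judging only from the table markup quoted in the question: the `<tbody>` there has no `id` attribute and the `<thead>` is empty, so expressions keyed on `tbody[@id="StockQuoteTable:tbody_element"]` or on `//th` would match nothing. Below is a minimal sketch of that effect, not a verified fix for the live page. The sample HTML string and its cell values are made up for illustration, and the sketch assumes the rows are present in the HTML that cURL receives (the `dataTable` class hints that they may instead be filled in later by JavaScript, in which case no XPath expression will find them).

```php
<?php
// Hypothetical sample that mimics the structure quoted in the question:
// an empty <thead> and a <tbody> without an id. Cell values are invented.
$html = '<table id="StockQuoteTable"><thead></thead><tbody>'
      . '<tr id="num1"><td>AAA</td><td>1.23</td></tr>'
      . '<tr id="num2"><td>BBB</td><td>4.56</td></tr>'
      . '</tbody></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The expression from the script finds nothing: no tbody carries that id.
$byTbodyId = $xpath->query('//tbody[@id="StockQuoteTable:tbody_element"]/tr');

// Anchoring on the table id instead reaches the rows that contain cells.
$byTableId = $xpath->query('//table[@id="StockQuoteTable"]//tr[td]');

$rows = array();
foreach ($byTableId as $tr) {
    $cells = array();
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    $rows[] = $cells; // plain list: there are no header names to combine with
}

echo $byTbodyId->length, " rows via tbody id, ", count($rows), " rows via table id\n";
```

The rows are kept as plain lists here because `array_combine()` fails when the header list and the cell list differ in length (a warning and `false` before PHP 8, a `ValueError` from PHP 8 on), which is what an empty `<thead>` would cause.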

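The problem I fixed is that you used DOMDocument::loadHTMLFile instead of DOMDocument::loadHTML to load the HTML retrieved by the cURL request. loadHTMLFile expects a file path or URL, whereas loadHTML parses a string of markup, which is what curl_exec returns here. So the working script should be: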

<?php

// Create temp file to store cookies
$ckfile = tempnam ("/tmp", "CURLCOOKIE");

// URL to login page
$url = "https://www.investagrams.com/login";

// Get Login page and its cookies and save cookies in the temp file
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Disables certificate verification (insecure, testing only)
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
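curl_exec($ch);   // run the request so the login page's cookies are received
curl_close($ch);  // closing the handle writes the cookie jar to $ckfile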

$fields = array(
'ctl00$WelcomePageMainContent$ctl00$Username' => '********',
'ctl00$WelcomePageMainContent$ctl00$Password' => '********',
);

// Build the URL-encoded POST body (http_build_query escapes keys and values)
$fields_string = http_build_query($fields);

// Post login form and follow redirects
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Disables certificate verification (insecure, testing only)
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true); // boolean switch, not a field count
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 
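curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile); // also save the cookies set by the login response
curl_exec($ch);   // submit the login form
curl_close($ch);  // closing the handle writes the cookie jar to $ckfile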

$url = "https://www.investagrams.com/Stock/RealTimeMonitoring";
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Disables certificate verification (insecure, testing only)
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);

// loadHTML() parses markup from a string; loadHTMLFile() expects a path or URL
$dom = new DOMDocument();
@$dom->loadHTML($output);

$xpath = new DomXPath($dom);

// collect header names
$headerNames = array();
foreach ($xpath->query('//table[@id="StockQuoteTable"]//th') as $node) {
    $headerNames[] = $node->nodeValue;
}

// collect data
$data = array();
foreach ($xpath->query('//tbody[@id="StockQuoteTable:tbody_element"]/tr') as $node) {
    $rowData = array();
    foreach ($xpath->query('td', $node) as $cell) {
        $rowData[] = $cell->nodeValue;
    }

    $data[] = array_combine($headerNames, $rowData);
}

print_r($data);


?>
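
The difference between the two methods can be checked without any network access. The following is a small sketch in which a made-up HTML string stands in for the cURL response: `loadHTML()` parses the string itself, while `loadHTMLFile()` only gives the same result once the markup has been written to a file.

```php
<?php
// Stand-in for the string returned by curl_exec(); the content is invented.
$output = '<html><body><table id="StockQuoteTable">'
        . '<tr><th>Symbol</th></tr></table></body></html>';

// loadHTML() parses markup held in a string.
$fromString = new DOMDocument();
$okString = $fromString->loadHTML($output);

// loadHTMLFile() expects a filename, so the markup is saved to a temp file first.
$tmp = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($tmp, $output);
$fromFile = new DOMDocument();
$okFile = $fromFile->loadHTMLFile($tmp);
unlink($tmp);

// Both documents now contain the same header cell.
$th1 = $fromString->getElementsByTagName('th')->item(0)->nodeValue;
$th2 = $fromFile->getElementsByTagName('th')->item(0)->nodeValue;
var_dump($okString, $okFile, $th1 === $th2);
```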

In addition, I put the @ operator in front of the loadHTML call to suppress the warnings it raises on malformed markup.
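
As an alternative to `@`, libxml's error handling can be switched to collect parse problems instead of printing them, which keeps them available for inspection. A minimal sketch, using a made-up HTML string that contains an unknown tag; whether and how such a tag is reported depends on the libxml version, so the loop may print nothing on some installations.

```php
<?php
// Invented markup with an unknown tag, which libxml typically reports.
$html = '<html><body><foo>bar</foo><p>Quote</p></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the internal error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

// The document is still usable even when problems were reported.
$text = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $text, "\n";
```

Unlike `@`, this only affects libxml diagnostics, so unrelated PHP errors on the same line remain visible.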