I need to scrape this HTML page...
...using PHP and XPath to get the value 10 in the green box under the table named "PO G. TATARELLA-CERIGNOLA".
(Note: if you browse the page yourself you may see different values. That doesn't matter; the values change over time.)
I'm using this PHP code sample to print the value:
<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
$url = 'https://www.sanita.puglia.it/monitorpo/aslfg/monitorps-web/monitorps/monitorPSperASL.do?codNazionale=160115';
$xpath_for_parsing = '/html/body/div[4]/table/tbody/tr[2]/td[4]/div';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
    $theValue = $node->nodeValue;
}
print $theValue;
?>
This way I get "N.D." as output, not "10" as I expected.
The page source is as follows...
In my code I would rather not use an "absolute XPath", so I tried a syntax like the following (I know it doesn't work, but I'm new to XPath...)
$xpath_for_parsing = '//*[div="cRiga3 boxtriageS"]';
But the result is always the same.
Any suggestions or examples?
Answer 0 (score: 1)
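I think the following should help. You will need to adjust the XPath query, perhaps to target a specific table and therefore a specific cell's content, but the main code appears to work fine. I suspect the problem with the original code is that the URL is https, which usually requires additional configuration settings when making a curl request. Some of the settings in the _curlrequest function can be removed; I simply copied them from another script.
Change the path in $cacert to point to a copy of cacert.pem on your system, or to the live version on curl.haxx.se.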
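If the HTTPS setup really is the culprit, curl_exec() returns false and curl_error() says why. The question's code never checks this, so a failed request looks the same as a page with no matching nodes. Here is a minimal sketch of that check, separate from the full code that follows. The helper name fetch_with_diagnostics is hypothetical, and the unsupported-scheme URL is only there to force a failure without touching the network:

```php
<?php
// Hypothetical helper (not part of the original answer): fetch a URL and
// report why the transfer failed instead of silently returning false.
function fetch_with_diagnostics(string $url, ?string $cacert = null): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    if ($cacert !== null) {
        // Path to a CA bundle such as cacert.pem, used to verify the server certificate.
        curl_setopt($ch, CURLOPT_CAINFO, $cacert);
    }

    $body = curl_exec($ch); // false on failure, the response body otherwise

    return [
        'ok'        => $body !== false,
        'body'      => $body === false ? null : $body,
        'errno'     => curl_errno($ch),
        'error'     => curl_error($ch),
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
}

// Assumption for illustration: a made-up scheme, so the request fails locally.
$r = fetch_with_diagnostics('nosuchscheme://example.invalid/');
echo $r['ok'] ? "fetched\n" : "curl failed ({$r['errno']}): {$r['error']}\n";
```

Running the same helper against the real URL, with and without a $cacert path, should show whether certificate verification is what stops the request.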
$url = 'https://www.sanita.puglia.it/monitorpo/aslfg/monitorps-web/monitorps/monitorPSperASL.do?codNazionale=160115';
function _curlrequest( $url=null, $options=null ){
    $cacert='c:/wwwroot/cacert.pem';
    $vbh = fopen('php://temp', 'w+');

    $res=array(
        'response' => null,
        'verbose' => null,
        'info' => array( 'http_code' => 100 ),
        'headers' => null,
        'errors' => null
    );
    if( is_null( $url ) ) return (object)$res;

    session_write_close();

    /* Initialise curl request object */
    $curl=curl_init();
    if( parse_url( $url,PHP_URL_SCHEME )=='https' ){
        curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, true );
        curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 2 );
        curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
    }

    /* Define standard options */
    curl_setopt( $curl, CURLOPT_URL,trim( $url ) );
    curl_setopt( $curl, CURLOPT_AUTOREFERER, true );
    curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $curl, CURLOPT_FAILONERROR, true );
    curl_setopt( $curl, CURLOPT_HEADER, false );
    curl_setopt( $curl, CURLINFO_HEADER_OUT, false );
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $curl, CURLOPT_BINARYTRANSFER, true );
    curl_setopt( $curl, CURLOPT_CONNECTTIMEOUT, 20 );
    curl_setopt( $curl, CURLOPT_TIMEOUT, 60 );
    curl_setopt( $curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36' );
    curl_setopt( $curl, CURLOPT_MAXREDIRS, 10 );
    curl_setopt( $curl, CURLOPT_ENCODING, '' );
    curl_setopt( $curl,CURLOPT_VERBOSE,true );
    curl_setopt( $curl,CURLOPT_NOPROGRESS,true );
    curl_setopt( $curl,CURLOPT_STDERR,$vbh );

    /* Assign runtime parameters as options */
    if( isset( $options ) && is_array( $options ) ){
        foreach( $options as $param => $value ) curl_setopt( $curl, $param, $value );
    }

    /* Execute the request and store responses */
    $res=(object)array(
        'response' => curl_exec( $curl ),
        'info' => (object)curl_getinfo( $curl ),
        'errors' => curl_error( $curl )
    );
    rewind( $vbh );
    $res->verbose=stream_get_contents( $vbh );
    fclose( $vbh );
    curl_close( $curl );
    return $res;
}
function getdom( $data=false, $debug=false ){
    try{
        if( !$data )throw new Exception('No data passed whilst trying to invoke DOMDocument');
        libxml_use_internal_errors( true );
        $dom = new DOMDocument();
        $dom->validateOnParse=false;
        $dom->standalone=true;
        $dom->strictErrorChecking=false;
        $dom->recover=true;
        $dom->formatOutput=false;
        $dom->loadHTML( $data );
        $errors=libxml_get_errors();
        libxml_clear_errors();
        return !empty( $errors ) && $debug ? $errors : $dom;
    }catch( Exception $e ){
        echo $e->getMessage();
    }
}
$obj=_curlrequest( $url );
if( $obj->info->http_code==200 ){
    $dom=getdom( $obj->response );
    $xp=new DOMXPath( $dom );
    $query='//div[ contains( @class,"cRiga3 boxtriageS" ) ]';
    $col=$xp->query( $query );
    if( !empty( $col ) && $col->length > 0 ){
        foreach( $col as $node )echo $node->nodeValue . '<br />';
    }
}
This outputs:
2
20
37
>1h
1
2
24
10
5
7
32
29
0
3
25
5
0
0
6
2
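To narrow the result to a single table, as suggested above, the query can first pick the table by its name and then the box by its class attribute. The sketch below runs against made-up markup, because the real page source is not reproduced here: the `<th>` title cell, the second hospital name and the `boxtriageR` class are all assumptions. It also shows why the attempt from the question matches nothing. `div="..."` compares the text of a child `<div>` with the string, whereas `@class` tests the attribute. A further point that may be worth checking is that absolute paths copied from a browser often contain `tbody`, which the browser adds itself. DOMDocument does not add it, so such a path can return no nodes even when the download succeeds.

```php
<?php
// Hypothetical markup: the structure of the real page is an assumption.
$html = <<<HTML
<html><body>
<table>
  <tr><th colspan="2">PO EXAMPLE-ONE</th></tr>
  <tr><td><div class="cRiga3 boxtriageR">4</div></td>
      <td><div class="cRiga3 boxtriageS">7</div></td></tr>
</table>
<table>
  <tr><th colspan="2">PO G. TATARELLA-CERIGNOLA</th></tr>
  <tr><td><div class="cRiga3 boxtriageR">3</div></td>
      <td><div class="cRiga3 boxtriageS">10</div></td></tr>
</table>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Attempt from the question: tests the text of a child <div>, not its class.
$wrong = $xp->query('//*[div="cRiga3 boxtriageS"]');

// Pick the table whose text contains the name, then the div by class attribute.
$name  = 'PO G. TATARELLA-CERIGNOLA';
$query = '//table[contains(., "' . $name . '")]//div[contains(@class, "boxtriageS")]';
$right = $xp->query($query);

// The sample markup has no <tbody>, and the parser does not insert one.
$withTbody = $xp->query('//table/tbody/tr');

echo 'question attempt: ', $wrong->length, " node(s)\n";
echo 'scoped query: ', trim($right->item(0)->nodeValue), "\n";
echo 'rows under tbody: ', $withTbody->length, "\n";
```

Note that contains() on @class is a plain substring test, so "boxtriageS" would also match a longer class name that merely starts with it. If the real page has such classes, a stricter predicate is needed.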