I want to use PHP to collect some data from a website.
Here is my code:
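<?php
// Create (or truncate) the cookie jar and close the handle straight away.
fclose(fopen("cookies.txt", "w"));
$url = "http://example.com";
$ch = curl_init();
// cURL builds the request line and the Host header from the URL,
// so only the extra headers are listed here.
$header = array(
    'User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
    'Connection: keep-alive');
// The path that was previously sent as a hand-written "GET" header line.
curl_setopt($ch, CURLOPT_URL, $url . "/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
// An empty string makes cURL send Accept-Encoding for every encoding it
// supports and decode the response body itself (replaces the manual gzinflate).
curl_setopt($ch, CURLOPT_ENCODING, '');
$html = curl_exec($ch);
if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
?>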
The site's response comes back compressed, so I decode it and store the result in $html.
$html now holds the page's HTML.
Suppose the HTML looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
... .... ...
... .... ...
<article class="article">
<div class="company-contact-information">
<h2>Company Contact Information <div class="verified-info">
<span class="icon-verified"></span>information has been verified
</div>
</h2>
<table class="company-info-data table">
<tr>
<td class="icon-col"> <span class="icon-verified"></span>
</td>
<th>Company Name:</th>
<td>Example Co., Ltd.</td>
</tr>
<tr>
<td class="icon-col"> <span class="icon-verified"></span>
</td>
<th>Operational Address:</th>
<td>Ab. 123, blah blah, No. 11, blah blah. blah, blah St., blah Dist., blah, Blah</td>
</tr>
... .... ...
... .... ...
</body>
</html>
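Now I want to collect these details:
Company Name: Example Co., Ltd.
Operational Address: Ab. 123, blah blah, No. 11, blah blah. blah, blah St., blah Dist., blah, Blah
and so on,
in a usable format such as XLS.
Please help.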
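
One possible approach, sketched under assumptions: parse the HTML with PHP's built-in DOMDocument and DOMXPath, read each row's label and value from the company-info-data table, and write the result as CSV, which spreadsheet programs can open. The helper name extractCompanyInfo(), the output file name company.csv, and the trimmed-down $html sample are all made up for illustration; a real page may need different XPath expressions, and a true .xls/.xlsx file would need a spreadsheet library such as PhpSpreadsheet instead of fputcsv.

```php
<?php
// Hypothetical stand-in for the $html returned by the cURL request.
$html = <<<'HTML'
<html><body>
<table class="company-info-data table">
<tr>
<td class="icon-col"> <span class="icon-verified"></span>
</td>
<th>Company Name:</th>
<td>Example Co., Ltd.</td>
</tr>
<tr>
<td class="icon-col"> <span class="icon-verified"></span>
</td>
<th>Operational Address:</th>
<td>Ab. 123, blah blah, No. 11, blah blah. blah, blah St., blah Dist., blah, Blah</td>
</tr>
</table>
</body></html>
HTML;

// Illustrative helper (not part of any library): returns label => value pairs.
function extractCompanyInfo(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the table by one of its class names, then take every row in it.
    $rows = $xpath->query(
        '//table[contains(concat(" ", normalize-space(@class), " "), " company-info-data ")]//tr'
    );

    $info = [];
    foreach ($rows as $row) {
        $label = $xpath->query('./th', $row)->item(0);
        // Skip the icon cell; the first remaining <td> holds the value.
        $value = $xpath->query('./td[not(contains(@class, "icon-col"))]', $row)->item(0);
        if ($label === null || $value === null) {
            continue;
        }
        $key = rtrim(trim($label->textContent), ':');
        $info[$key] = trim($value->textContent);
    }
    return $info;
}

$info = extractCompanyInfo($html);

// One header row and one data row; repeat the second fputcsv per company page.
$out = fopen('company.csv', 'w');
fputcsv($out, array_keys($info), ',', '"', '\\');
fputcsv($out, array_values($info), ',', '"', '\\');
fclose($out);
```

Keying the array by the <th> text means new rows in the table (phone, website, and so on) are picked up without changing the code, as long as they follow the same th/td layout.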