Question

I want to collect some data from a website using PHP.

Here is my code:

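<?php
// Start each run with an empty cookie file, and close the handle right away.
fclose(fopen("cookies.txt", "w"));
// The page path belongs in the URL; cURL builds the request line and the
// Host header from it, so neither is listed in the extra headers below.
$url = "http://example.com/index.html";
$ch = curl_init();
$header = array(
    'User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:33.0) Gecko/20100101 Firefox/33.0',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
    'Connection: keep-alive',
);

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0); // 0 = wait indefinitely for the connection
curl_setopt($ch, CURLOPT_COOKIESESSION, true);

curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
// Let cURL send Accept-Encoding and decompress the body itself,
// instead of slicing the gzip header and trailer off by hand.
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
$result = curl_exec($ch);
if ($result === false) {
    die('cURL error: ' . curl_error($ch));
}

$html = $result; // already decoded by cURL

?>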

The site's response comes back in compressed (encoded) form, so I decode it and store the result in $html.

Now $html holds the HTML code.
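If the raw body is decoded by hand rather than by cURL, a slightly safer approach than cutting fixed byte offsets might look like the sketch below. The helper name decodeBody is hypothetical, and the sketch assumes PHP's zlib extension is available; it only decodes when the body actually starts with the gzip magic bytes, and otherwise returns the body unchanged.

```php
<?php
// Hypothetical helper (not part of the original script): decode a response
// body only when it is actually gzip-compressed.
function decodeBody(string $body): string
{
    // Gzip streams begin with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    // Not gzip (or decoding failed): hand back the body as received.
    return $body;
}
```

With this in place, $html = decodeBody($result); works whether or not the server chose to compress the response.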

Suppose the HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
    <meta charset="utf-8" />

... .... ...
... .... ...

<article class="article">
    <div class="company-contact-information">
        <h2>Company Contact Information     <div class="verified-info">
        <span class="icon-verified"></span>information has been verified
    </div>
  </h2>
        <table class="company-info-data table">
                        <tr>
                <td class="icon-col">      <span class="icon-verified"></span>
   </td>
                <th>Company Name:</th>
                <td>Example Co., Ltd.</td>
            </tr>

                        <tr>
                <td class="icon-col">      <span class="icon-verified"></span>
  </td>
                <th>Operational Address:</th>
                <td>Ab. 123, blah blah, No. 11, blah blah. blah, blah St., blah Dist., blah, Blah</td>
            </tr>


... .... ...
... .... ...

</body>
</html>

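Now I want to collect these details:

Company Name: Example Co., Ltd.
Operational Address: Ab. 123, blah blah, No. 11, blah blah. blah, blah St., blah Dist., blah, Blah

and so on, into a usable format such as xls.

Please help me.

Collect the required data from HTML using PHP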
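One possible way to approach this, sketched under a few assumptions: the table keeps the class company-info-data shown in the sample, each row has one th label and one value td next to the icon-col cell, and a CSV file (which Excel can open) is an acceptable stand-in for a true xls file. The function names extractCompanyInfo and writeCsv are hypothetical; the sketch uses PHP's built-in DOMDocument and DOMXPath.

```php
<?php
// Hypothetical helper: returns e.g. ['Company Name' => 'Example Co., Ltd.', ...]
function extractCompanyInfo(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Note: pages with non-ASCII text may need an explicit encoding hint.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the table by one of its classes, then take every row in it.
    $rows = $xpath->query(
        "//table[contains(concat(' ', normalize-space(@class), ' '), ' company-info-data ')]//tr"
    );

    $info = [];
    foreach ($rows as $row) {
        $label = $xpath->query('./th', $row)->item(0);
        $value = $xpath->query('./td[not(contains(@class, "icon-col"))]', $row)->item(0);
        if ($label === null || $value === null) {
            continue; // skip rows that do not follow the label/value layout
        }
        $key = rtrim(trim($label->textContent), ':');
        $info[$key] = trim(preg_replace('/\s+/', ' ', $value->textContent));
    }
    return $info;
}

// Hypothetical helper: one header row of labels, one row of values.
function writeCsv(array $info, string $path): void
{
    $fh = fopen($path, 'w');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fputcsv($fh, array_keys($info), ',', '"', '\\');
    fputcsv($fh, array_values($info), ',', '"', '\\');
    fclose($fh);
}
```

Usage would be along the lines of writeCsv(extractCompanyInfo($html), 'company.csv');. When scraping several pages, appending one value row per page instead of rewriting the header each time is probably the more practical layout.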

0 answers: