Beginner PHP scraping help - getting the img src?

Date: 2013-04-03 16:14:17

Tags: php preg-match screen-scraping preg-match-all

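I am currently trying to improve my knowledge of PHP, and I have set myself the task of scraping a website and converting the retrieved data to JSON.

Here is a sample row of the data I am trying to parse: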

 <tr>
 <td class="first">
     <img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />              
 </td>
 <td >
      Copenhagen
 </td>
 <td>
      Sas
 </td>
 <td>
     SK537
 </td>
 <td>
     02 Apr 10:20
 </td>
 <td class="last">
     Delayed 11:30
 </td>
 </tr>

Here is my PHP code so far:

$raw = file_get_contents($url);

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

$start = strpos($content,'<table width="100%" cellspacing="0" cellpadding="0" border="0" summary="Departure times detail information"');

$end = strpos($content,'</table>',$start) + 8;

$table = substr($content,$start,$end-$start);

preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){

    if ((strpos($row,'<th')===false)){

        preg_match_all("|<td(.*)</td>|U",$row,$cells);

        $url_src = strip_tags($cells[0][0]);

        $airport = strip_tags($cells[0][1]);

        $airline = strip_tags($cells[0][2]);

        $flightnum = strip_tags($cells[0][3]);

        $schedule = strip_tags($cells[0][4]);

        $status = strip_tags($cells[0][5]);

        echo "{$url_src} - {$airport} - {$airline} - {$flightnum} - {$schedule} - {$status}<br>\n";

    }

}

I currently get almost all of the values correctly, except that I can't seem to get any content from the cell that contains this:

<td class="first">
     <img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />              
</td>

Can anyone help me get what I need from the img string? I would be happy just to get the whole string inside the <td></td>, like this:

<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />

But it would be very helpful if the src string itself could be parsed out.

1 answer:

Answer 0 (score: 1)

Your <img> tag is not opened at all, which is why your regular expression cannot parse it.

Try:

<td class="first">
     <img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />              
</td>
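
One more thing worth checking in the code as posted: strip_tags() removes the <img> element together with the <td> tags, and that cell has no text content, so $url_src ends up empty even when the regular expression matches the cell. A minimal sketch of pulling the src attribute out of the raw cell instead follows. It assumes PHP 7.1 or later, and first_img_src() is a hypothetical helper name, not something from the original post.

```php
<?php
// Sample cell copied from the question; in the real loop this would be $cells[0][0].
$cell = '<td class="first"><img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" /></td>';

// strip_tags() drops the <img> tag itself, so nothing is left for this cell.
$stripped = trim(strip_tags($cell));

// Hypothetical helper: return the src of the first <img> in $html, or null if none is found.
function first_img_src(string $html): ?string
{
    // Capture the quoted value of src (single or double quotes) inside the first <img ...> tag.
    if (preg_match('/<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1/i', $html, $m)) {
        return $m[2];
    }
    return null;
}

$url_src = first_img_src($cell);

var_dump($stripped);  // empty string
echo $url_src, "\n";  // ../../images/t1.jpg
```

In the loop from the question, this would mean replacing strip_tags($cells[0][0]) with first_img_src($cells[0][0]) and leaving the other cells unchanged. For markup less regular than this table, an HTML parser such as PHP's DOMDocument is generally more robust than regular expressions.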