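I'm currently trying to improve my knowledge of PHP, and I've set myself the task of scraping a website and converting the retrieved data to JSON.
Here is a sample row of the data I want to parse: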
<tr>
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
<td >
Copenhagen
</td>
<td>
Sas
</td>
<td>
SK537
</td>
<td>
02 Apr 10:20
</td>
<td class="last">
Delayed 11:30
</td>
</tr>
This is my PHP code so far:
// Fetch the page and strip tabs, newlines, double spaces and other whitespace debris.
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Cut out the departures table by its opening tag and the next </table>.
$start = strpos($content,'<table width="100%" cellspacing="0" cellpadding="0" border="0" summary="Departure times detail information"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);

// Split the table into rows, then each non-header row into cells.
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $url_src = strip_tags($cells[0][0]); // image cell (comes back empty)
        $airport = strip_tags($cells[0][1]);
        $airline = strip_tags($cells[0][2]);
        $flightnum = strip_tags($cells[0][3]);
        $schedule = strip_tags($cells[0][4]);
        $status = strip_tags($cells[0][5]);
        echo "{$url_src} - {$airport} - {$airline} - {$flightnum} - {$schedule} - {$status}<br>\n";
    }
}
I'm getting almost all of the values correctly, except that I can't seem to get any content from the cell that contains this:
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
Can anyone help me get what I need out of the img string? I'd be happy just to get the whole string inside the <td></td>, like this:
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
But it would be very helpful if the src string itself could be parsed out.
Answer 0 (score: 1)
Your <img> tag is not opened at all, which is why your regular expression cannot parse it.
Try:
<td class="first">
<img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" />
</td>
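
Separately from the markup itself, it may help to know that strip_tags() removes the whole <img> element, so a cell that contains nothing but an image is empty after stripping. Below is a minimal sketch, not a definitive fix, of reading the src attribute before the tags are stripped and then encoding the row as JSON, which was the stated goal. It assumes PHP 7.1 or later. The helper name extract_img_src, the array keys, and the collapsed one-line sample row are all made up for illustration and are not from the original post.

```php
<?php
// Sample row from the question, collapsed onto one line the way the
// str_replace() cleanup in the original code would leave it.
$row = '<tr><td class="first"><img id="ctl00_Content_ctl00_rptInfo_ctl01_Image1" alt="Active" src="../../images/t1.jpg" style="border-width:0px;" /></td><td >Copenhagen</td><td>Sas</td><td>SK537</td><td>02 Apr 10:20</td><td class="last">Delayed 11:30</td></tr>';

// Hypothetical helper (not part of the original post): returns the src
// attribute of the first <img> found in $html, or null if there is none.
function extract_img_src(string $html): ?string
{
    // Group 1 captures the quote character, group 2 the attribute value.
    if (preg_match('/<img\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1/i', $html, $m)) {
        return $m[2];
    }
    return null;
}

// Capture only the inner content of each cell (group 1), skipping attributes.
preg_match_all('|<td[^>]*>(.*)</td>|Us', $row, $cells);

// Array keys are illustrative names, chosen to mirror the original variables.
$flight = [
    'image'     => extract_img_src($cells[1][0]), // read src BEFORE stripping tags
    'airport'   => trim(strip_tags($cells[1][1])),
    'airline'   => trim(strip_tags($cells[1][2])),
    'flightnum' => trim(strip_tags($cells[1][3])),
    'schedule'  => trim(strip_tags($cells[1][4])),
    'status'    => trim(strip_tags($cells[1][5])),
];

// JSON_UNESCAPED_SLASHES keeps the path readable instead of "..\/..\/images".
echo json_encode($flight, JSON_UNESCAPED_SLASHES), "\n";
```

In the original loop, the same helper could be called on $cells[0][0] in place of strip_tags(). For anything beyond a quick exercise, PHP's DOMDocument is usually a more robust choice than regular expressions for HTML, but that is a different technique from the one shown here.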