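I'm using curl to scrape an HTML page. It scrapes the data between the pre tags perfectly. However, I want to skip the first five lines. Is there anything I can add to the code to do this? Here is my code: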
<?php
function curl_download($Url){
    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    $start = strpos($output, '<pre>');
    $end = strpos($output, '</pre>', $start);
    $length = $end-$start;
    $output = substr($output, $start, $length);
    curl_close($ch);
    return $output;
}
print curl_download('http://athleticsnews.co.za/results/20140207BOLALeague3/140207F006.htm');
?>
This is what the HTML contains:
<pre>
AllTrax Timing - Contractor License 4/22/2014 - 8:31 AM
Boland Athletics League 3 - 2/7/2014
Hosted by Maties AC
Coetzenburg, Stellenbosch

Event 6 Girls 14-15 200 Meter Sprint
So I'm trying to exclude the first four lines plus the blank line, and start scraping from the line that begins with Event 6...
Answer 0 (score: 1)
You can use a regular expression to split the text into lines and then pick the lines you want:
$str = curl_download('http://.../140207F006.htm');
$re = "/([^\n\r]+)/m";
preg_match_all($re, $str, $matches);
print_r($matches[1]);
Result:
Array
(
[0] => AllTrax Timing - Contractor License 4/22/2014 - 8:31 AM
[1] => Boland Athletics League 3 - 2/7/2014
[2] => Hosted by Maties AC
[3] => Coetzenburg, Stellenbosch
[4] =>
[5] => Event 6 Girls 14-15 200 Meter Sprint
[6] => ============================================================================
[7] => Name Age Team Finals Wind Points
[8] => ============================================================================
[9] => Finals
[10] => 1 Shan Fourie Bola 29.03 NWI 10
)
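One detail worth knowing about the pattern above: `[^\n\r]+` needs at least one character per match, so a completely empty line produces no entry at all and every later index shifts down by one. The blank entry at index 4 presumably contains whitespace. The following is a minimal sketch, using made-up sample text rather than the real page, that compares this with `preg_split`, which keeps empty lines in place:

```php
<?php
// Sample text (an assumption, not the real page) with a truly empty
// line between the header lines and the event line.
$text = "Hosted by Maties AC\nCoetzenburg, Stellenbosch\n\nEvent 6";

// The pattern from the answer: each match is a run of non-newline characters,
// so the empty line yields no match.
preg_match_all("/([^\n\r]+)/m", $text, $matches);

// Splitting on the line breaks instead keeps the empty line as ''.
$lines = preg_split('/\r\n|\r|\n/', $text);

echo count($matches[1]), ' vs ', count($lines), PHP_EOL; // prints "3 vs 4"
```

If the number of lines to skip is counted by eye from the page, the `preg_split` variant keeps that count stable whether or not the blank line contains spaces.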
To print only the lines from Event 6 onward (indices 5 through 10, six lines in this output), you can:
$matches = $matches[1];
$str = "";
for($i = 5; $i <= 10; $i++) {
    $str .= $matches[$i] . PHP_EOL; // Preserve the new line
}
echo $str;
Result:
Event 6 Girls 14-15 200 Meter Sprint
============================================================================
Name Age Team Finals Wind Points
============================================================================
Finals
1 Shan Fourie Bola 29.03 NWI 10
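
The loop above hardcodes the last index (10), so it only fits this exact output. As a rough sketch of two more general variants, assuming PHP 7 or later: `skip_lines` and `from_marker` are hypothetical helper names, not part of the original code, and `$sample` is shortened stand-in data for what `curl_download` returns. The first helper drops a fixed number of lines with `array_slice`; the second starts at the first occurrence of a marker such as `Event 6`, which avoids counting lines altogether.

```php
<?php
// Hypothetical helper: drop the first $skip lines and return the rest.
function skip_lines(string $text, int $skip): string
{
    // Split on \r\n, \r, or \n so empty lines keep their position.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    // array_slice keeps everything from index $skip to the end.
    return implode(PHP_EOL, array_slice($lines, $skip));
}

// Hypothetical helper: return the text starting at the first $marker.
function from_marker(string $text, string $marker): string
{
    $pos = strpos($text, $marker);
    // strpos returns false when the marker is absent; fall back to the full text.
    return $pos === false ? $text : substr($text, $pos);
}

// Shortened stand-in for the scraped <pre> contents (an assumption).
$sample = "AllTrax Timing\nBoland Athletics League 3\nHosted by Maties AC\n"
        . "Coetzenburg, Stellenbosch\n\nEvent 6 Girls 14-15 200 Meter Sprint\nFinals";

echo skip_lines($sample, 5), PHP_EOL;
echo from_marker($sample, 'Event 6'), PHP_EOL;
```

Both calls print the sample from the Event 6 line onward. With the real page, the skip count may need adjusting if the `<pre>` tag sits on a line of its own, and `from_marker` starts at the first place the marker text appears, even if that is in the middle of a line.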