PHP cURL code: skipping lines of the scraped output

Time: 2015-05-27 09:21:32

Tags: php curl

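I am using cURL to scrape an HTML page. It correctly extracts the data between the pre tags. However, I would like to skip the first five lines. Is there something I can add to the code to do this? Here is my code: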

<?php

function curl_download($Url){

    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    $start = strpos($output, '<pre>');
    $end = strpos($output, '</pre>', $start);
    $length = $end-$start;
    $output = substr($output, $start, $length);

    curl_close($ch);

    return $output;
}

print curl_download('http://athleticsnews.co.za/results/20140207BOLALeague3/140207F006.htm');

?>

This is what the HTML contains:

<pre>
AllTrax Timing - Contractor License                     4/22/2014 - 8:31 AM
                Boland Athletics League 3 - 2/7/2014                    
                        Hosted by Maties AC                             
                     Coetzenburg, Stellenbosch                          

Event 6  Girls 14-15 200 Meter Sprint

So I am trying to exclude the first four lines plus the blank line, and start scraping from the line that begins with Event 6...

1 answer:

Answer 0: (score: 1)

You can use a regular expression to split the output into lines and pick the ones you need:

$str = curl_download('http://.../140207F006.htm');
$re = "/([^\n\r]+)/m"; 
preg_match_all($re, $str, $matches);
print_r($matches[1]);

Result:

Array
(
    [0] =>  AllTrax Timing - Contractor License                     4/22/2014 - 8:31 AM
    [1] =>                     Boland Athletics League 3 - 2/7/2014                    
    [2] =>                             Hosted by Maties AC                             
    [3] =>                          Coetzenburg, Stellenbosch                          
    [4] =>  
    [5] => Event 6  Girls 14-15 200 Meter Sprint
    [6] => ============================================================================
    [7] =>     Name                     Age Team                    Finals  Wind Points
    [8] => ============================================================================
    [9] => Finals                                                                      
    [10] =>   1 Shan Fourie                  Bola                     29.03   NWI  10  
)
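
One detail of the pattern above is worth checking before relying on the indexes: `[^\n\r]+` only matches runs of one or more non-newline characters, so a completely empty line produces no entry at all. The blank line shows up at index 4 here only because it contains whitespace. A small sketch with made-up sample text (not the real page) illustrates the difference:

```php
<?php
// Hypothetical sample text: the second line is truly empty,
// the fourth line holds a single space.
$sample = "first\n\nthird\n \nfifth";

preg_match_all("/([^\n\r]+)/", $sample, $matches);
$lines = $matches[1];

// The empty line is dropped; the whitespace-only line is kept.
print_r($lines);
```

If the page ever contains truly empty lines, the index of the first wanted line will shift accordingly.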

To print only the lines from index 5 (the "Event 6" line) through index 10, you can:

$matches = $matches[1];
$str = "";
for($i = 5; $i <= 10; $i++) {
    $str .= $matches[$i] . PHP_EOL; // Preserve the new line
}
echo $str;

Result:

Event 6  Girls 14-15 200 Meter Sprint
============================================================================
    Name                     Age Team                    Finals  Wind Points
============================================================================
Finals                                                                      
  1 Shan Fourie                  Bola                     29.03   NWI  10  

Demo: http://ideone.com/ijPiP6
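
The loop above hard-codes 10 as the last index, which only fits this particular sample. A minimal sketch of an alternative, assuming the scraped text is already in a string: split it on line breaks with preg_split and drop the leading lines with array_slice, so the total number of lines does not matter. The helper name skip_lines and the shortened sample text are hypothetical and only for illustration.

```php
<?php
// Hypothetical helper: return $text without its first $count lines.
function skip_lines(string $text, int $count): string
{
    // Split on \r\n, \r or \n; empty lines are kept as empty entries.
    $lines = preg_split('/\r\n|\r|\n/', $text);

    // Keep everything from index $count to the end, whatever the length.
    return implode(PHP_EOL, array_slice($lines, $count));
}

// Shortened stand-in for the <pre> content shown in the question.
$sample = "AllTrax Timing\nBoland Athletics League 3\nHosted by Maties AC\n"
        . "Coetzenburg, Stellenbosch\n\nEvent 6  Girls 14-15 200 Meter Sprint\n"
        . "Finals";

echo skip_lines($sample, 5), PHP_EOL;
```

Note that the curl_download function from the question starts its substring at the position of the `<pre>` tag, so the returned string still begins with the tag itself. Depending on whether a line break follows the tag, the number of lines to skip may need adjusting.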