How to scrape whole lines from HTML

Time: 2016-05-12 17:34:19

Tags: php html-parsing

I am using the following to scrape an HTML document:

<?php
$html = file_get_contents('http://www.atletiek.co.za/atletiek.co.za/uitslae/2016ASASASeniors/160415F004.htm');
$tags = explode(' ', $html);

foreach ($tags as $tag) {
    // skip scripts
    if (strpos($tag, 'script') !== FALSE) continue;
    if (strpos($tag, 'head') !== FALSE) continue;
    if (strpos($tag, 'body') !== FALSE) continue;
    if (strpos($tag, 'FORM') !== FALSE) continue;
    if (strpos($tag, 'p') !== FALSE) continue;
    if (strpos($tag, 'bgcolor') !== FALSE) continue;
    if (strpos($tag, 'TYPE') !== FALSE) continue;
    if (strpos($tag, 'onClick') !== FALSE) continue;
    if (strpos($tag, '=') !== FALSE) continue;

    // get text
    $text = strip_tags(' ' . $tag);
    // only if text present remember
    if (trim($text) != '') $texts[] = $text;
}

print_r($texts);
?>

The HTML page looks like this:

<html>
<head>
<script language="JavaScript">
function myprint() {
window.parent.main.focus();
window.print();
}
</script>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<FORM>
&nbsp;&nbsp;&nbsp;&nbsp;<INPUT TYPE="button" onClick="history.go(0)" VALUE="Refresh">
</FORM>
<p>
<pre>
                   4/16/2016 - 20:28 PM
                2016 ASA SA Seniors - 4/15/2016 to 4/16/2016                
                           www.liveresults.co.za                            
                                Coetzenburg                                 

Event 136  Men 200 Meter Sprint
============================================================================
     SA Best: R 19.87  2015        Anaso Jobodwana                          
 Africa Best: C 19.68  1996        Frank Fredericks, NAM                    
 Olympics QS: O 20.50                                                       
Africa Sen Q: A 21.24                                                       
  World Lead: W 20.16                                                       
    Name                     Age Team                    Finals  Wind Points
============================================================================
Finals                                                                      
  1 Clarence Munyai           18 Agn                      20.73A  0.4   8   
  2 Ncincilli Titi            23 Agn-Ind                  20.89A  0.4   7   
  3 Hendrik Maartens          20 Afs                      21.02A  0.4   6   
  4 Roscoe Engel              27 Wpa                      21.07A  0.4   5   
  5 Malasela Senona           17 Agn                      21.20A  0.4   4   
  6 Kyle Appel                18 Wpa                      21.33   0.4   3   
  7 Ethan Noble               18 Wpa                      21.38   0.4   2   
  8 France Ntene              28 Lima                     21.43   0.4   1   
</pre>
</p>
</body>
</html>

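I use the // skip scripts part to ignore every tag except the <pre> section, which is the part I want to extract. I then split the text into pieces on spaces and place the output into fixed positions in a table. The problem is that the data in the HTML file is generated dynamically and keeps changing. Not by much, but some words change or extra spaces get inserted, and then the numbers show up in the wrong positions in the table. I need a way to keep some lines whole and break up the others, as shown below. Any ideas would be much appreciated.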

[1] => 2016 ASA SA Seniors
[2]=> Event 136  Men 200 Meter Sprint
[3] =>SA Best: R 19.87  2015 Anaso Jobodwana                          
[4] =>Africa Best: C 19.68  1996 Frank Fredericks, NAM                    
[5] =>Olympics QS: O 20.50                                                       
[6] =>Africa Sen Q: A 21.24                                                       
[7] =>World Lead: W 20.16                                                       
[8] => Name [9] => Age [10] => Team [11] =>Finals  [12] => Wind [13] => Points
[14] => 1 [15] => Clarence Munyai [16] => 18 [17] => Agn [18] => 20.73A [19] => 0.4 [20] => 8   
[21] => 2 [22] => Ncincilli Titi [23] => 23 [24] => Agn-Ind [25] => 20.89A [26] => 0.4 [27] => 7   
[28] => 3 [29] => Hendrik Maartens [30] => 20 [31] => Afs [32] => 21.02A [33] => 0.4 [34] => 6   
[35] => 4 [36] => Roscoe Engel [37] => 27 [38] => Wpa [39] => 21.07A [40] => 0.4 [41] => 5   
[42] => 5 [43] => Malasela Senona [44] => 17 [45] => Agn [46] => 21.20A [47] => 0.4 [48] => 4   
[49] => 6 [50] => Kyle Appel [51] => 18 [52] => Wpa [53] => 21.33 [54] => 0.4 [55] => 3   
[56] => 7 [57] => Ethan Noble [58] => 18 [59] => Wpa [60] => 21.38 [61] => 0.4 [62] => 2   
[63] => 8 [64] => France Ntene [65] => 28 [66] => Lima [67] => 21.43 [68] => 0.4 [69] => 1

1 answer:

Answer 0 (score: 2)

Use XPath:

$url = 'http://www.atletiek.co.za/atletiek.co.za/uitslae/2016ASASASeniors/160415F004.htm';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
libxml_use_internal_errors(true); // Prevent HTML errors from displaying
$doc = new DOMDocument();
$doc->loadHTML($html); // get the DOM

$xpath = new DOMXPath($doc); // start a new xPath on our DOM Object
$preBlock = $xpath->query('//pre'); // find all pre (we only got one here)

If you only want to embed the information:

echo '<pre>'.$preBlock->item(0)->nodeValue.'</pre>';
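
One caveat when embedding: nodeValue returns the decoded text, so any & or < in the block comes back as a raw character. A minimal sketch of re-escaping before output follows. The inline $html string is made-up sample input, not the real page, and the null check is an extra precaution I added.

```php
<?php
// Made-up sample input standing in for the downloaded page (assumption).
$html = '<html><body><pre>A &amp; B &lt;1&gt;</pre></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$pre = $xpath->query('//pre')->item(0); // null if the page has no <pre>

// nodeValue is decoded text, so escape it again before printing it as HTML.
$safe = $pre !== null ? htmlspecialchars($pre->nodeValue, ENT_QUOTES, 'UTF-8') : '';

echo '<pre>' . $safe . '</pre>';
```

For the results page shown in the question this probably makes no visible difference, since the block contains only plain text, but it keeps the output valid if the source ever contains markup characters.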

If you want to extract the data:

// take the first <pre> node, read its text content,
// and split it into lines
$preBlockString = explode("\n", $preBlock->item(0)->nodeValue);
$startResultBlock = false;
$results = [];
// traverse all rows
foreach ($preBlockString as $line) {
    // once the 'Finals' marker has been seen, collect every non-blank row
    if ($startResultBlock && trim($line) !== '') {
        $result = explode(' ', $line);
        // drop the empty entries produced by runs of spaces; compare against ''
        // because empty() would also drop a legitimate "0" value
        foreach ($result as $key => $value) {
            if ($value === '') unset($result[$key]);
        }
        $results[] = $result;
        // my first idea, list(...) = explode(' ', $line), does not work
        // because of all the space characters
    }

    // the line 'Finals' marks the start of the result rows
    if (trim($line) == 'Finals') {
        $startResultBlock = true;
    }
}
var_dump($results);

This produces a series of entries like:

array(8) {
    [2]=> string(1) "1" 
    [3]=> string(8) "Clarence" 
    [4]=> string(6) "Munyai" 
    [15]=> string(2) "18" 
    [16]=> string(3) "Agn" 
    [38]=> string(6) "20.73A" 
    [40]=> string(3) "0.4" 
    [43]=> string(1) "8"
}