I am using the following to scrape an HTML document:
<?php
$html = file_get_contents('http://www.atletiek.co.za/atletiek.co.za/uitslae/2016ASASASeniors/160415F004.htm');
$tags = explode(' ',$html);
foreach ($tags as $tag)
{
// skip scripts
if (strpos($tag,'script') !== FALSE) continue;
if (strpos($tag,'head') !== FALSE) continue;
if (strpos($tag,'body') !== FALSE) continue;
if (strpos($tag,'FORM') !== FALSE) continue;
if (strpos($tag,'p') !== FALSE) continue;
if (strpos($tag,'bgcolor') !== FALSE) continue;
if (strpos($tag,'TYPE') !== FALSE) continue;
if (strpos($tag,'onClick') !== FALSE) continue;
if (strpos($tag,'=') !== FALSE) continue;
// get text
$text = strip_tags(' '.$tag);
// only if text present remember
if (trim($text) != '') $texts[] = $text;
}
print_r($texts);
?>
The HTML page looks like this:
<html>
<head>
<script language="JavaScript">
function myprint() {
window.parent.main.focus();
window.print();
}
</script>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<FORM>
<INPUT TYPE="button" onClick="history.go(0)" VALUE="Refresh">
</FORM>
<p>
<pre>
4/16/2016 - 20:28 PM
2016 ASA SA Seniors - 4/15/2016 to 4/16/2016
www.liveresults.co.za
Coetzenburg
Event 136 Men 200 Meter Sprint
============================================================================
SA Best: R 19.87 2015 Anaso Jobodwana
Africa Best: C 19.68 1996 Frank Fredericks, NAM
Olympics QS: O 20.50
Africa Sen Q: A 21.24
World Lead: W 20.16
Name Age Team Finals Wind Points
============================================================================
Finals
1 Clarence Munyai 18 Agn 20.73A 0.4 8
2 Ncincilli Titi 23 Agn-Ind 20.89A 0.4 7
3 Hendrik Maartens 20 Afs 21.02A 0.4 6
4 Roscoe Engel 27 Wpa 21.07A 0.4 5
5 Malasela Senona 17 Agn 21.20A 0.4 4
6 Kyle Appel 18 Wpa 21.33 0.4 3
7 Ethan Noble 18 Wpa 21.38 0.4 2
8 France Ntene 28 Lima 21.43 0.4 1
</pre>
</p>
</body>
</html>
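I use the // skip scripts section to ignore every tag except the <pre> section, which is the part I want to extract. I then split the text on spaces and place the output into fixed positions in a table. The problem is that the data in the HTML file is generated dynamically and keeps changing. Not by much, but some words change or extra spaces are inserted, and then the numbers end up in the wrong positions in the table. I need a way to keep some lines whole and split the others, as shown below. Any ideas would be appreciated.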
[1] => 2016 ASA SA Seniors
[2]=> Event 136 Men 200 Meter Sprint
[3] =>SA Best: R 19.87 2015 Anaso Jobodwana
[4] =>Africa Best: C 19.68 1996 Frank Fredericks, NAM
[5] =>Olympics QS: O 20.50
[6] =>Africa Sen Q: A 21.24
[7] =>World Lead: W 20.16
[8] => Name [9] => Age [10] => Team [11] =>Finals [12] => Wind [13] => Points
[14] => 1 [15] => Clarence Munyai [16] => 18 [17] => Agn [18] => 20.73A [19] => 0.4 [20] => 8
[21] => 2 [22] => Ncincilli Titi [23] => 23 [24] => Agn-Ind [25] => 20.89A [26] => 0.4 [27] => 7
[28] => 3 [29] => Hendrik Maartens [30] => 20 [31] => Afs [32] => 21.02A [33] => 0.4 [34] => 6
[35] => 4 [36] => Roscoe Engel [37] => 27 [38] => Wpa [39] => 21.07A [40] => 0.4 [41] => 5
[42] => 5 [43] => Malasela Senona [44] => 17 [45] => Agn [46] => 21.20A [47] => 0.4 [48] => 4
[49] => 6 [50] => Kyle Appel [51] => 18 [52] => Wpa [53] => 21.33 [54] => 0.4 [55] => 3
[56] => 7 [57] => Ethan Noble [58] => 18 [59] => Wpa [60] => 21.38 [61] => 0.4 [62] => 2
[63] => 8 [64] => France Ntene [65] => 28 [66] => Lima [67] => 21.43 [68] => 0.4 [69] => 1
Answer 0 (score: 2)
Use XPath:
$url = 'http://www.atletiek.co.za/atletiek.co.za/uitslae/2016ASASASeniors/160415F004.htm';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
libxml_use_internal_errors(true); // Prevent HTML errors from displaying
$doc = new DOMDocument();
$doc->loadHTML($html); // get the DOM
$xpath = new DOMXPath($doc); // start a new xPath on our DOM Object
$preBlock = $xpath->query('//pre'); // find all pre (we only got one here)
If you only want to embed the information:
echo '<pre>'.$preBlock->item(0)->nodeValue.'</pre>';
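One caveat with the echo above: nodeValue returns the text with HTML entities already decoded, so any <, > or & in the results would be written back as raw markup. A minimal sketch of an escaped variant, assuming UTF-8 output; the helper name renderPre is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: wrap plain text in <pre>, escaping characters
// that have a special meaning in HTML (<, >, &, quotes).
function renderPre(string $text): string
{
    return '<pre>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// With the XPath result from above, the call would look like:
// echo renderPre($preBlock->item(0)->nodeValue);
```

For the sample page this should not change the visible output, since the results block contains no such characters; it only matters if the generated data ever does.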
If you want to extract the data:
// get the first of all the pre objects
// get the 'inner value'
// split them by newlines
$preBlockString = explode("\n",$preBlock->item(0)->nodeValue);
$startResultBlock = false;
$i = 0;
// traverse all rows
foreach ($preBlockString as $line){
// if we found the 'Finals' marker within the last row start fetching the results
if($startResultBlock){
$result = explode(' ', $line);
// drop all empty entries (originating from the runs of space characters);
// compare against '' because empty() would also drop a literal "0"
foreach ($result as $key => $value) if ($value === '') unset($result[$key]);
$results[] = $result;
// my first idea to use list does not work because of all the space characters
// list($results[$i]['number'], $results[$i]['name'], $results[$i]['age'], $results[$i]['team'], $results[$i]['finals'], $results[$i]['wind'], $results[$i]['points']) = explode(" ", $line);
$i++;
}
// if we found the word 'Finals' we set a marker for the upcoming rows
if(trim($line) == 'Finals'){
$startResultBlock = true;
}
}
var_dump($results);
This produces an array of entries such as:
array(8) {
[2]=> string(1) "1"
[3]=> string(8) "Clarence"
[4]=> string(6) "Munyai"
[15]=> string(2) "18"
[16]=> string(3) "Agn"
[38]=> string(6) "20.73A"
[40]=> string(3) "0.4"
[43]=> string(1) "8"
}
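Note that in the dump above the keys are not contiguous and the name is split into two entries, so the column positions still shift when a name has more or fewer words. A possible alternative, sketched here under the assumption that every result row follows the layout of the sample page (place, name, age, team, time, wind, points), is to match each row with a regular expression and keep the lines before 'Finals' whole. The function name parseResults and the array keys are hypothetical, chosen for illustration only:

```php
<?php
// Sketch: keep header lines whole, parse result rows with a regex.
// Assumed row layout: place, name (one or more words), age, team, time, wind, points.
function parseResults(string $preText): array
{
    $rowPattern = '/^(\d+)\s+(.+?)\s+(\d+)\s+(\S+)\s+(\S+)\s+([+-]?\d+(?:\.\d+)?)\s+(\d+)$/';
    $header = [];
    $rows = [];
    $inResults = false;

    // \R matches any newline sequence (\n, \r\n or \r)
    foreach (preg_split('/\R/', $preText) as $line) {
        $line = trim($line);
        // skip blank lines and the ===== separator lines
        if ($line === '' || preg_match('/^=+$/', $line)) {
            continue;
        }
        // the 'Finals' marker starts the result block
        if ($line === 'Finals') {
            $inResults = true;
            continue;
        }
        if (!$inResults) {
            // keep the line whole; collapse runs of spaces so extra padding does not matter
            $header[] = preg_replace('/\s+/', ' ', $line);
        } elseif (preg_match($rowPattern, $line, $m)) {
            $rows[] = [
                'place'  => (int) $m[1],
                'name'   => preg_replace('/\s+/', ' ', $m[2]),
                'age'    => (int) $m[3],
                'team'   => $m[4],
                'time'   => $m[5],          // kept as a string because of suffixes like "A"
                'wind'   => (float) $m[6],
                'points' => (int) $m[7],
            ];
        }
        // rows that do not match the assumed layout are silently skipped
    }

    return ['header' => $header, 'rows' => $rows];
}

// With the XPath result from above, the call would look like:
// $parsed = parseResults($preBlock->item(0)->nodeValue);
```

Because each field is matched by its shape rather than its position, extra spaces between columns should no longer move values around. Rows with a missing field (for example no wind reading) would not match this pattern and would need a looser expression.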