Parsing a website's HTML with XPath and PHP

Time: 2014-10-09 07:05:45

Tags: php html xpath domdocument

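I am currently trying to parse an HTML page from a website using XPath.

I need to get the results in the following format:

Show time:Show name

For example:

1.00PM:Ye Hai Mohabbatein

I am using the following code (as shown here) to get it, but it does not work.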

<?php

libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTMLFile("www.starplus.in/schedule.aspx");
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//table");
foreach ($nodes as $i => $node) {
echo "hy";
    echo "Node($i): ", $node->nodeValue, "\n";
}

?>

I would be very grateful if anyone could help me with this problem.

1 answer:

Answer 0: (score: 2)

Basically, just target the div/table that holds the show name and the time slot.

Rough example:

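// The request seems to fail when no user agent is sent, so set one explicitly
$ch = curl_init('http://www.starplus.in/schedule.aspx');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$page = curl_exec($ch);
if ($page === false) {
    // Stop early if the request itself failed
    die('Request failed: ' . curl_error($ch));
}

$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppress warnings caused by malformed HTML
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$shows = array();
$tables = $xpath->query("//div[@class='sech_div_bg']/table"); // target that table

foreach ($tables as $table) {
    // Row 1 holds the time slot, row 3 holds the show name
    $time_node = $xpath->query('./tr[1]/td/span', $table)->item(0);
    $show_node = $xpath->query('./tr[3]/td/span', $table)->item(0);
    if ($time_node === null || $show_node === null) {
        continue; // skip tables that do not match the expected layout
    }
    $time_slot = $time_node->nodeValue;
    $show_name = $show_node->nodeValue;
    $shows[] = array('time_slot' => $time_slot, 'show_name' => $show_name);
    echo "$time_slot - $show_name <br/>";
}

// echo '<pre>';
// print_r($shows);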
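
To try the XPath part without depending on the live site, the same queries can be run against an inline HTML string. The following is only a minimal sketch: the sample markup is a hypothetical stand-in for the real page (its actual structure is assumed from the queries above, not verified), and `extractSchedule` is a helper name invented for this illustration. It formats each entry as `time:show name`, the format asked for in the question.

```php
<?php
// Hypothetical sample markup; the real page's structure is an assumption.
$html = <<<HTML
<div class="sech_div_bg">
  <table>
    <tr><td><span>1.00PM</span></td></tr>
    <tr><td><img src="show.jpg" alt=""></td></tr>
    <tr><td><span>Ye Hai Mohabbatein</span></td></tr>
  </table>
</div>
<div class="sech_div_bg">
  <table>
    <tr><td><span>1.30PM</span></td></tr>
  </table>
</div>
HTML;

/**
 * Hypothetical helper: returns lines of the form "time:show name"
 * for every schedule table that has both a time and a name.
 */
function extractSchedule(string $html): array
{
    $dom = new DOMDocument;
    // Silence parser warnings for imperfect HTML, then restore the old setting
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $lines = [];
    foreach ($xpath->query("//div[@class='sech_div_bg']/table") as $table) {
        // Queries are relative to the current table (second argument)
        $time = $xpath->query('./tr[1]/td/span', $table)->item(0);
        $name = $xpath->query('./tr[3]/td/span', $table)->item(0);
        if ($time === null || $name === null) {
            continue; // incomplete table, e.g. the second one in the sample
        }
        $lines[] = trim($time->nodeValue) . ':' . trim($name->nodeValue);
    }
    return $lines;
}

foreach (extractSchedule($html) as $line) {
    echo $line, "\n"; // prints "1.00PM:Ye Hai Mohabbatein" for the sample above
}
```

The second table in the sample has no third row, so it is skipped rather than triggering an error on a missing node. Once the queries behave as expected on the sample, the string returned by `curl_exec` can be passed to the helper instead.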