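I wrote a function that uses cURL to connect to a website and fetch the day's TV listings. I want to extract the contents of the div elements with a given class from the HTML source.
Here is what I am using right now: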
<?php
function get_shows($channel_id, DateTime $dt, $skip_finished = true) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_USERAGENT => '',
        CURLOPT_TIMEOUT => 30,
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_HEADER => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
    ));
    $date = $dt->format('Y-m-d');
    $tz = $dt->getTimezone();
    $now = new DateTime('now', $tz);
    $today = $now->format('Y-m-d');
    $shows = array();
    for ($p = 0; $p <= 6; $p++) {
        $url = 'http://www.example.com/channels/tvlistings?date=' . $date;
        curl_setopt($ch, CURLOPT_URL, $url);
        echo $url;
    }
}
?>
The HTML source contains six blocks with the same class name, as you can see:
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 1</span><span class="desc"><p>description goes here</p></span>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 2</span><span class="desc"><p>description goes here</p></span>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 3</span><span class="desc"><p>description goes here</p></span>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 4</span><span class="desc"><p>description goes here</p></span>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 5</span><span class="desc"><p>description goes here</p></span>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 6</span><span class="desc"><p>description goes here</p></span>
What I want is to extract the title and description from the first of these blocks and ignore the rest.
Like this:
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 2</span><span class="desc"><p>description goes here</p></span>
Answer 0 (score: 0)
Assuming the HTML file is well formed (the sample provided is not), you can use XPath to extract the information you need.
For example:
$body = '<root>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 1</span>
<span class="desc">
<p>description goes here</p>
</span>
</div>
</div>
<div class="rowChannel">
<div class="colTimes">
<span class="title">some information 2</span>
<span class="desc">
<p>description goes here</p>
</span>
</div>
</div>
</root>';
// Strip whitespace between elements so childNodes holds only the two spans
$data = preg_replace("/>\s+</", "><", $body);
$dom = new DOMDocument();
// @ suppresses libxml parse warnings; note that the 'HTML-ENTITIES' target of
// mb_convert_encoding() is deprecated as of PHP 8.2
@$dom->loadHTML(mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8'));
$xpath = new DOMXPath($dom);
$elements = $xpath->query("//div[@class='colTimes']");
$listings = [];
foreach ($elements as $element) {
    // After the whitespace strip, child 0 is span.title and child 1 is span.desc
    $title = $element->childNodes->item(0)->nodeValue;
    $desc = $element->childNodes->item(1)->nodeValue;
    $listings[] = [
        'title' => $title,
        'desc' => $desc,
    ];
}
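Since the question only asks for the first block, here is a minimal sketch that narrows the XPath query to the first rowChannel div and looks the spans up by class rather than by child position. The helper name first_listing and the inline sample string are assumptions made for illustration, not part of the original code, and the XML-declaration prefix is a commonly used way to hint UTF-8 to loadHTML rather than the only option.

```php
<?php
// Hypothetical helper (not from the original answer): returns the title and
// description of the first rowChannel block, or null when none is found.
function first_listing(string $html): ?array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    // Prefixing an XML declaration hints UTF-8 to libxml's HTML parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // (...)[1] selects the first matching div in document order.
    $row = $xpath->query("(//div[@class='rowChannel'])[1]")->item(0);
    if ($row === null) {
        return null;
    }

    // Relative queries (leading dot) search only inside the selected row.
    $title = $xpath->query(".//span[@class='title']", $row)->item(0);
    $desc  = $xpath->query(".//span[@class='desc']", $row)->item(0);

    return [
        'title' => $title !== null ? trim($title->textContent) : '',
        'desc'  => $desc !== null ? trim($desc->textContent) : '',
    ];
}

// Assumed sample input, modelled on the markup in the question.
$sample = '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 1</span>'
    . '<span class="desc"><p>description goes here</p></span>'
    . '</div></div>'
    . '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 2</span>'
    . '<span class="desc"><p>description goes here</p></span>'
    . '</div></div>';

$first = first_listing($sample);
if ($first !== null) {
    echo $first['title'], ': ', $first['desc'], PHP_EOL;
}
```

Looking the spans up by class avoids depending on whitespace stripping, and because item(0) takes the first match in document order, this should also cope with markup where the later rowChannel divs end up nested inside the first one.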
Answer 1 (score: 0)
You can adapt this to your needs:
$file_contents = curl_exec($ch); // fetch the page contents
preg_match($s_searchFor, $file_contents, $matches); // match the element
$file_contents = $matches[1];
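The snippet above leaves $s_searchFor undefined. As a rough sketch, it could be a pattern like the one below. The pattern itself and the inline sample string (standing in for the curl_exec result) are assumptions for illustration only, and a regular expression like this is brittle against markup changes compared with a DOM parser.

```php
<?php
// Assumed sample standing in for the page returned by curl_exec().
$file_contents = '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 1</span>'
    . '<span class="desc"><p>description goes here</p></span>'
    . '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 2</span>'
    . '<span class="desc"><p>description goes here</p></span>';

// Hypothetical pattern: group 1 captures the title, group 2 the description.
// The s modifier lets . match newlines; .*? keeps each capture non-greedy.
$s_searchFor = '~<span class="title">(.*?)</span>\s*<span class="desc">\s*<p>(.*?)</p>~s';

$title = null;
$desc = null;
// preg_match() stops at the first match, so only the first block is read.
if (preg_match($s_searchFor, $file_contents, $matches) === 1) {
    $title = trim($matches[1]);
    $desc = trim($matches[2]);
    echo $title, ': ', $desc, PHP_EOL;
}
```

Checking the return value of preg_match before reading $matches avoids an undefined-index notice when the page layout changes and nothing matches.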