Extracting a list of titles from a div class

Date: 2016-05-02 18:01:37

Tags: php html screen-scraping

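I wrote a function that uses cURL to connect to a website and fetch the day's TV listings. I want to pull the contents of specific div classes out of the HTML source.

Here is what I am using at the moment: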

<?php

function get_shows($channel_id, DateTime $dt, $skip_finished = true) {

   $ch = curl_init();
   curl_setopt_array($ch, array(
      CURLOPT_USERAGENT => '',
      CURLOPT_TIMEOUT => 30,
      CURLOPT_CONNECTTIMEOUT => 30,
      CURLOPT_HEADER => false,
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_FOLLOWLOCATION => true,
      CURLOPT_MAXREDIRS => 5,
   ));

   $date = $dt->format('Y-m-d');
   $tz = $dt->getTimezone();

   $now = new DateTime('now', $tz);
   $today = $now->format('Y-m-d');

   $shows = array();  
   for($p=0;$p<=6;$p++) {
      $url = 'http://www.example.com/channels/tvlistings?date=' . $date;
      curl_setopt($ch, CURLOPT_URL, $url);
      echo $url;
   }
}
?>

The HTML source contains six divs with the same class name, as you can see:

<div class="rowChannel">
        <div class="colTimes">
             <span class="title">some information 1</span><span class="desc"><p>description goes here</p></span>


<div class="rowChannel">
        <div class="colTimes">

            <span class="title">some information 2</span><span class="desc"><p>description goes here</p></span>


<div class="rowChannel">
        <div class="colTimes">

            <span class="title">some information 3</span><span class="desc"><p>description goes here</p></span>


<div class="rowChannel">
        <div class="colTimes">

            <span class="title">some information 4</span><span class="desc"><p>description goes here</p></span>

<div class="rowChannel">
        <div class="colTimes">

            <span class="title">some information 5</span><span class="desc"><p>description goes here</p></span>

<div class="rowChannel">
        <div class="colTimes">

            <span class="title">some information 6</span><span class="desc"><p>description goes here</p></span>

What I want is to extract the list of titles and descriptions from the first class and ignore the others.

Like this:

<div class="rowChannel">
        <div class="colTimes">

            <span class="title">some information 2</span><span class="desc"><p>description goes here</p></span>

2 answers:

Answer 0: (score: 0)

Assuming the HTML is well formed (the sample provided is not), you can use XPath to extract the information you need.

For example:

$body = '<root>
   <div class="rowChannel">
      <div class="colTimes">
         <span class="title">some information 1</span>
         <span class="desc">
            <p>description goes here</p>
         </span>
      </div>
   </div>
   <div class="rowChannel">
      <div class="colTimes">
         <span class="title">some information 2</span>
         <span class="desc">
            <p>description goes here</p>
         </span>
      </div>
   </div>
</root>';

 // Strip whitespace between tags so that childNodes holds only the two spans;
 // otherwise item(0) could be a whitespace text node instead of the title.
 $data = preg_replace("/>\s+</", "><", $body);

 $dom = new DOMDocument();
 // @ suppresses libxml warnings, such as the one about the unknown <root> tag
 @$dom->loadHTML(mb_convert_encoding($data, 'HTML-ENTITIES', 'UTF-8'));
 $xpath = new DOMXPath($dom);
 $elements = $xpath->query("//div[@class='colTimes']");
 $listings = [];
 foreach ($elements as $element) {
      $title = $element->childNodes->item(0)->nodeValue; // <span class="title">
      $desc = $element->childNodes->item(1)->nodeValue;  // <span class="desc">

      $listings[] = [
           'title' => $title,
           'desc' => $desc
      ];
 }
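
The same idea can be pointed at the markup from the question, which is not well formed. The following is a minimal sketch, assuming libxml's repair of the unclosed div tags is acceptable and that the wanted block is the first rowChannel. The function name first_listing and the sample string are illustrative and not part of the original answer. It selects the spans by class instead of by child position, so whitespace between elements does not matter.

```php
<?php
// Hypothetical helper: returns the title and description of the first
// rowChannel block, or null when no such block or title span is found.
function first_listing(string $html): ?array
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect parser warnings instead of printing them; the sample markup
    // has unclosed <div> tags, which libxml repairs while parsing.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // (...)[1] picks the first match in document order.
    $row = $xpath->query("(//div[@class='rowChannel'])[1]")->item(0);
    if ($row === null) {
        return null;
    }

    // Relative queries (leading ".") search only inside the selected row.
    // Because the divs are never closed, later rows may end up nested in
    // the first one, so only the first span of each class is used.
    $title = $xpath->query(".//span[@class='title']", $row)->item(0);
    $desc  = $xpath->query(".//span[@class='desc']", $row)->item(0);
    if ($title === null) {
        return null;
    }

    return [
        'title' => trim($title->textContent),
        'desc'  => $desc !== null ? trim($desc->textContent) : '',
    ];
}

// Sample input shaped like the question's markup (divs left unclosed).
$sample = '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 1</span>'
    . '<span class="desc"><p>description goes here</p></span>'
    . '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 2</span>'
    . '<span class="desc"><p>description goes here</p></span>';

print_r(first_listing($sample));
```

To target a different block, change the index in the XPath expression, for example [2] for the second rowChannel. With unclosed divs the nesting produced by the parser may differ from what the page author intended, so the result is worth checking against the real page.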

Answer 1: (score: 0)

You can adapt this to your requirements:
$file_contents = curl_exec($ch); // fetch the page content
preg_match($s_searchFor, $file_contents, $matches); // match the element
$file_contents = $matches[1];
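
The snippet above leaves $s_searchFor undefined. One possible pattern is sketched below, assuming the title and desc spans appear exactly as in the question's markup (same attribute order and quoting). The pattern and the sample string are assumptions, and regex matching on HTML tends to break when the markup changes.

```php
<?php
// Assumed pattern: captures the text of the first title span and the
// contents of the desc span that follows it. The "s" flag lets "." match
// newlines in case the spans are split across lines.
$s_searchFor = '~<span class="title">(.*?)</span>\s*<span class="desc">(.*?)</span>~s';

// Stand-in for the string that curl_exec($ch) would return.
$file_contents = '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 1</span>'
    . '<span class="desc"><p>description goes here</p></span>'
    . '<div class="rowChannel"><div class="colTimes">'
    . '<span class="title">some information 2</span>'
    . '<span class="desc"><p>description goes here</p></span>';

$listing = null;
// preg_match() stops at the first match, so later rowChannel blocks are ignored.
if (preg_match($s_searchFor, $file_contents, $matches) === 1) {
    $listing = [
        'title' => trim($matches[1]),
        'desc'  => trim(strip_tags($matches[2])), // drop the inner <p> tags
    ];
}

print_r($listing);
```

Checking the return value of preg_match() before reading $matches avoids an undefined-index notice when the page layout changes and nothing matches.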