PHP and XPath query

Date: 2013-05-13 17:07:00

Tags: php xpath

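I need to extract some values, along with some raw HTML, from an HTML document. I thought of using XPath, but I can't get my query to work.

This is the markup I am working with: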

<div class="unit-id">
   <div class="title">
      some title-1
   </div>

   <div class="another-class">
      another class
   </div>
   <p>segwegw1<p>
   <p>segwegw1<p>
   <p>segwegw1<p>
   <p>segwegw1<p>
   <ul>
     <li>jfjfj</li>
     <li>jfjfj</li>
     <li>jfjfj</li>
   </ul>
</div>


<div class="unit-id">
   <div class="title">
      some title-2
   </div>
   <div class="another-class">
      some other class
   </div>
   <p>segwegw2<p>
   <p>segwegw2<p>
   <p>segwegw2<p>
   <p>segwegw2<p>
</div>


<div class="unit-id">
   <div class="title">
      some title-3
   </div>
   <div class="some-other-class">
      some other data
   </div>
   <p>segwegw3<p>
   <p>segwegw3<p>
   <p>segwegw3<p>
   <p>segwegw3<p>
</div>

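So I want the query to iterate over each div with the class unit-id and return the value of its title div, plus the rest of the HTML without the nested divs: only the p tags and the ul content that belong to that particular unit-id div. Then it should move on to the next iteration.

Is this possible? Can you give me an example of how to write this query? Is there a better way to do it?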

1 answer:

Answer 0: (score: 3)

This code does something similar to what you are looking for:

function get_content($data){
    $doc = new DOMDocument();
    //load HTML string into document object
    if ( ! @$doc->loadHTML($data)){
        return FALSE;
    }
    //create XPath object using the document object as the parameter
    $xpath = new DOMXPath($doc);
    $query = "//div[@class='unit-id']";
    //XPath queries return a NodeList
    $res = $xpath->query($query);
    $out = array();
    foreach ($res as $key => $node){
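        //subquery, evaluated relative to the current unit-id div
        $sub = $xpath->query('.//div[@class="title"]', $node);
        //fall back to an empty title if the unit has no title div
        $out[$key]['title'] = $sub->length ? trim($sub->item(0)->nodeValue) : '';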
        foreach ($node->getElementsByTagName('p') as $key2 => $value){
            $out[$key]['par'][$key2] = $value->nodeValue;
        }
        foreach ($node->getElementsByTagName('li') as $key2 => $value){
            $out[$key]['list'][$key2] = $value->nodeValue;
        }
    }
    return $out;
}
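
Since the question also asks for the raw HTML rather than just the text, here is a rough sketch of a variant that keeps the markup of the p and ul children. The function name get_content_html and the 'html' key are my own, purely for illustration. It assumes the p and ul elements are direct children of the unit-id div, and it relies on DOMDocument::saveHTML() accepting a node argument (PHP 5.3.6 or later). The exact whitespace in the serialized HTML may vary with the libxml2 version.

```php
<?php
// Hypothetical helper (not part of the code above): returns the title plus
// the raw HTML of the <p> and <ul> children of each unit-id div.
function get_content_html(string $data)
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($data);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false;
    }

    $xpath = new DOMXPath($doc);
    $out = array();
    foreach ($xpath->query("//div[@class='unit-id']") as $node) {
        // Title div that is a direct child of the current unit
        $title = $xpath->query('./div[@class="title"]', $node)->item(0);
        $html = '';
        // Only direct <p> and <ul> children; the nested divs are skipped
        foreach ($xpath->query('./p | ./ul', $node) as $child) {
            $html .= $doc->saveHTML($child);
        }
        $out[] = array(
            'title' => $title ? trim($title->nodeValue) : null,
            'html'  => $html,
        );
    }
    return $out;
}
```

Each entry then holds the title and one string of markup, which you can echo or store as needed.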

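Note that there is an error in your HTML: the closing paragraph tags are missing their slash. Each paragraph should end with </p>, not <p>.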
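
To see why the missing slash matters, you can count the p elements the parser produces for each form. This is only a sketch: count_paragraphs is a made-up helper name, and the exact count for the unclosed form may depend on the libxml2 version, so I am not stating it here. As far as I know, each stray <p> opens a new paragraph, so you should expect more nodes than you intended.

```php
<?php
// Hypothetical helper for illustration: counts the <p> elements the parser
// ends up with for a given HTML fragment.
function count_paragraphs(string $html): int
{
    $doc = new DOMDocument();
    // Keep parser warnings out of the output without using @
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc->getElementsByTagName('p')->length;
}

$closed   = '<div><p>one</p><p>two</p></div>';
$unclosed = '<div><p>one<p><p>two<p></div>';

// Compare the two counts; the properly closed form yields one node per paragraph
echo count_paragraphs($closed), "\n";
echo count_paragraphs($unclosed), "\n";
```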

Here is the output of get_content() once the paragraph tags are closed properly:

array
  0 => 
    array
      'title' => string 'some title-1' (length=12)
      'par' => 
        array
          0 => string 'segwegw1' (length=8)
          1 => string 'segwegw1' (length=8)
          2 => string 'segwegw1' (length=8)
          3 => string 'segwegw1' (length=8)
      'list' => 
        array
          0 => string 'jfjfj' (length=5)
          1 => string 'jfjfj' (length=5)
          2 => string 'jfjfj' (length=5)
  1 => 
    array
      'title' => string 'some title-2' (length=12)
      'par' => 
        array
          0 => string 'segwegw2' (length=8)
          1 => string 'segwegw2' (length=8)
          2 => string 'segwegw2' (length=8)
          3 => string 'segwegw2' (length=8)