Question

I have a web page with the following structure:

<html>
  <body>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
  </body>
</html>

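The page contains other content as well, but for the purposes of this question it is (more or less) irrelevant.

What I want to do is extract the <a> and <p> elements from each div with the class title. I have tried a number of approaches (simple-html-dom, XPath, regular expressions and so on), but my knowledge of PHP is limited and I am having trouble understanding them. A small push in the right direction would probably help me a lot.

So my question is: what would you use, and can you give me an example of how to use it? It doesn't have to be foolproof. Once I get the idea, I can do the rest myself.

Thanks.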

Answer 1

Yes, you can use DOMDocument in this particular case.

Here is a rough example:

$markup = "<html>
  <body>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
    <div class='title'>
      <a></a>
      <p></p>
    </div>
  </body>
</html>";

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);
$data = array(); // collected nodes, keyed by each child's position inside its div
$search = $xpath->query('//div[@class="title"]');
foreach($search as $node) {
    foreach($node->childNodes as $k => $child) {
        if(isset($child->tagName) && ($child->tagName == 'a' || $child->tagName == 'p')) {
            $data[$k][] = $child;
            // or $child->nodeValue if you want the innertext
        }
    }
}

echo '<pre>';
print_r($data);

Or something like this, if you expect the structure to always be the same:

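$data = array();
$search = $xpath->query('//div[@class="title"]');
foreach($search as $node) {
    // './a' is relative to $node; '//a' would search the whole document
    // and always return the first <a> on the page
    $a = $xpath->query('./a', $node)->item(0);
    $p = $xpath->query('./p', $node)->item(0);
    $data[] = array('a' => $a, 'p' => $p);
}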
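
For reference, here is a self-contained sketch of that second approach which collects text and link targets instead of node objects. The sample link text, the href values, the extra "featured" and "other" classes, and the $rows name are all made up for illustration. The class test is also loosened so that a div with several classes still matches, which the original question may or may not need:

```php
<?php
// Hypothetical sample markup; link targets and text are invented for this example.
$markup = <<<HTML
<html>
  <body>
    <div class="title">
      <a href="/first">First link</a>
      <p>First paragraph</p>
    </div>
    <div class="title featured">
      <a href="/second">Second link</a>
      <p>Second paragraph</p>
    </div>
    <div class="other">
      <a href="/ignored">Ignored link</a>
      <p>Ignored paragraph</p>
    </div>
  </body>
</html>
HTML;

// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Match "title" as a whole word inside the class attribute,
// so class="title featured" counts but class="subtitle" would not.
$divs = $xpath->query("//div[contains(concat(' ', normalize-space(@class), ' '), ' title ')]");

$rows = array();
foreach ($divs as $div) {
    // Paths starting with "./" are evaluated relative to $div.
    $a = $xpath->query('./a', $div)->item(0);
    $p = $xpath->query('./p', $div)->item(0);
    $rows[] = array(
        'href' => $a ? $a->getAttribute('href') : null,
        'a'    => $a ? trim($a->nodeValue) : null,
        'p'    => $p ? trim($p->nodeValue) : null,
    );
}

print_r($rows);
```

If the <a> or <p> can be nested deeper inside the div, './/a' and './/p' would find them at any depth while still staying inside the current div.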

Answer 2

You can also do this with plain PHP string functions. Here is a small snippet to help:

<?php
$filename = "nameofhtmlfile.html";
$contents = file_get_contents($filename);
// Only matches when the div is written exactly like this, with no whitespace between the tags
$new_contents = str_replace("<div class='title'><a></a><p></p></div>", "<div class='title'>         </div>", $contents);
file_put_contents($filename, $new_contents);
?>

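This PHP script reads the contents of the HTML file and edits it using PHP's string replacement functions. If your HTML file gets large, you may want to iterate over it line by line instead of copying everything into memory: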

$f = fopen("file", "r");
if ($f) {
    $needle = "<div class='title'><a></a><p></p></div>";
    // fgets returns false at end of file, so no separate feof check is needed
    while (($line = fgets($f)) !== false) {
        // stripos is case-insensitive, so the replacement uses str_ireplace to match
        if (stripos($line, $needle) !== false) {
            $line = str_ireplace($needle, "<div class='title'>         </div>", $line);
        }
        print $line;
    }
    fclose($f);
}
