PHP: traverse the DOM and get attributes and text

Time: 2017-12-23 20:49:22

Tags: php html domdocument

I am trying to get the href, the src, and the movie name for each item-holder-account container in the markup below. The result should be an array.

<div id="item_container">
    <div class="item-holder-account">
        <a href="movie1.html">
            <span class="rollover"></span>
            <img src="movie1.png" alt="">
            <h2 class="list-item-title">Movie 1 <span class="paragraph-end"></span></h2>
        </a>
    </div>

    <div class="item-holder-account">
        <a href="movie2.html">
            <span class="rollover"></span>
            <img src="movie2.png" alt="">
            <h2 class="list-item-title">Movie 2 <span class="paragraph-end"></span></h2>
        </a>
    </div>

    <div class="item-holder-account">
        <a href="movie3.html">
            <span class="rollover"></span>
            <img src="movie3.png" alt="">
            <h2 class="list-item-title">Movie 3 <span class="paragraph-end"></span></h2>
        </a>
    </div>
</div>

I have tried, but I am stuck. These are the values I expect for each container:

movie1.html
movie1.png
Movie 1

movie2.html
movie2.png
Movie 2

movie3.html
movie3.png
Movie 3

How can I solve this?

2 answers:

Answer 0 (score: 1)

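I would go with DOMXPath. Based on your example, you can query all div elements with the class item-holder-account and then extract the data you need from each one. The following script should do what you want: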

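<?php

// Path to the HTML file, passed as the first command-line argument.
$file = $argv[1];
$html = file_get_contents($file);
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$data = [];
// Select every div whose class attribute is exactly "item-holder-account".
foreach($xpath->query('//div[@class="item-holder-account"]') as $div) {
    // Each container holds one link that wraps the image and the title.
    foreach($div->getElementsByTagName('a') as $item) {
        $data[] = [
            'href' => $item->getAttribute('href'),
            'img' => $item->getElementsByTagName('img')->item(0)->getAttribute('src'),
            // nodeValue keeps the trailing space that precedes the empty span.
            'text' => $item->getElementsByTagName('h2')->item(0)->nodeValue,
        ];
    }
}

print_r($data);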

Result:

Array
(
    [0] => Array
        (
            [href] => movie1.html
            [img] => movie1.png
            [text] => Movie 1 
        )

    [1] => Array
        (
            [href] => movie2.html
            [img] => movie2.png
            [text] => Movie 2 
        )

    [2] => Array
        (
            [href] => movie3.html
            [img] => movie3.png
            [text] => Movie 3 
        )

)
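The same extraction can also be written with relative XPath queries only. The following is a minimal sketch, not part of the original answer, and it makes a few assumptions: the markup is inlined as a string instead of being read from a file, the class is matched as a whole token so that a container carrying additional classes would still be found, and normalize-space() is used to drop the trailing space from the title.

```php
<?php
// Sample markup copied from the question, inlined so the sketch is self-contained.
$html = <<<'HTML'
<div id="item_container">
    <div class="item-holder-account">
        <a href="movie1.html">
            <span class="rollover"></span>
            <img src="movie1.png" alt="">
            <h2 class="list-item-title">Movie 1 <span class="paragraph-end"></span></h2>
        </a>
    </div>
    <div class="item-holder-account">
        <a href="movie2.html">
            <span class="rollover"></span>
            <img src="movie2.png" alt="">
            <h2 class="list-item-title">Movie 2 <span class="paragraph-end"></span></h2>
        </a>
    </div>
    <div class="item-holder-account">
        <a href="movie3.html">
            <span class="rollover"></span>
            <img src="movie3.png" alt="">
            <h2 class="list-item-title">Movie 3 <span class="paragraph-end"></span></h2>
        </a>
    </div>
</div>
HTML;

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Match "item-holder-account" as a whole class token, even if other classes are present.
$containers = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " item-holder-account ")]'
);

$data = [];
foreach ($containers as $div) {
    // Queries starting with "." are evaluated relative to the context node.
    foreach ($xpath->query('.//a', $div) as $a) {
        $data[] = [
            'href' => $a->getAttribute('href'),
            // string() yields an empty string when no img is present.
            'img'  => $xpath->evaluate('string(.//img/@src)', $a),
            // normalize-space() trims and collapses whitespace in the title.
            'text' => $xpath->evaluate('normalize-space(.//h2)', $a),
        ];
    }
}

print_r($data);
```

Because evaluate() with string() and normalize-space() returns a plain string, a missing img or h2 produces an empty value here rather than a call on null, which may or may not be the behavior you want.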

Answer 1 (score: 0)

You can use a DOM parser such as PHP Simple HTML DOM Parser:

<?php
$str = '<div id="item_container">
        <div class="item-holder-account">
        <a href="movie1.html"> <span class="rollover"></span>
                              <img src="movie1.png" alt="">
                              <h2 class="list-item-title">Movie 1 <span class="paragraph-end"></span></h2>
          </a>
        </div>
        <div class="item-holder-account">
        <a href="movie2.html"> <span class="rollover"></span>
                              <img src="movie2.png" alt="">
                              <h2 class="list-item-title">Movie 2 <span class="paragraph-end"></span></h2>
          </a>
        </div>
        <div class="item-holder-account">
        <a href="movie3.html"> <span class="rollover"></span>
                              <img src="movie3.png" alt="">
                              <h2 class="list-item-title">Movie 3 <span class="paragraph-end"></span></h2>
          </a>
        </div>
        </div>';
require 'simple_html_dom.php';

$html = str_get_html($str);
$arr = array();
foreach($html->find('.item-holder-account') as $element){
    $subarr = array();
    foreach($element->find('a') as $a){
        $subarr[] = $a->href;
    }
    foreach($element->find('img') as $a){
        $subarr[] = $a->src;
    }
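    foreach($element->find('h2') as $a){
        // innertext returns the inner HTML, including the empty span;
        // $a->plaintext would return the text only.
        $subarr[] = $a->innertext;
    }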
    $arr[] = $subarr;
}
echo '<pre>';
var_dump($arr);
echo '</pre>'; 



/* output
array(3) {
  [0]=>
  array(3) {
    [0]=>
    string(11) "movie1.html"
    [1]=>
    string(10) "movie1.png"
    [2]=>
    string(43) "Movie 1 "
  }
  [1]=>
  array(3) {
    [0]=>
    string(11) "movie2.html"
    [1]=>
    string(10) "movie2.png"
    [2]=>
    string(43) "Movie 2 "
  }
  [2]=>
  array(3) {
    [0]=>
    string(11) "movie3.html"
    [1]=>
    string(10) "movie3.png"
    [2]=>
    string(43) "Movie 3 "
  }
}
*/
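The string(43) in the output above is consistent with innertext returning the markup of the heading rather than its text: "Movie 1 " is 8 characters and the empty span tag pair is 35. The difference between inner HTML and text can be reproduced without a third-party library. The sketch below is an illustration using PHP's built-in DOM extension, not Simple HTML DOM, and the variable names are made up for the example.

```php
<?php
// One heading copied from the question's markup.
$dom = new DOMDocument();
$dom->loadHTML('<h2 class="list-item-title">Movie 1 <span class="paragraph-end"></span></h2>');
$h2 = $dom->getElementsByTagName('h2')->item(0);

// Inner HTML: serialize every child node, markup included.
$innerHtml = '';
foreach ($h2->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

// Text only: textContent concatenates the text of all descendants.
$text = trim($h2->textContent);

var_dump($innerHtml, $text);
```

When the output is wrapped in a pre element and viewed in a browser, the span inside the inner HTML is rendered as markup and so does not appear, which would explain why the dump seems to show only the title.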