php - Simple HTML DOM - elements between other elements

Time: 2014-10-19 13:58:15

Tags: php html simple-html-dom

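I am trying to write a PHP script that scrapes a website and stores some of its elements in a database.

Here is my problem. The web page is written like this: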

<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>

<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

I want to get only the h2 elements and the p elements that contain the interesting text, not the p class="one_class" elements.

I tried this PHP code:

<?php
$numberP = 0;
foreach($html->find('p') as $p)
{
    $pIsOneClass = PIsOneClass($html, $p);

    if($pIsOneClass == false)
    {
        echo $p->outertext;
        $h2 = $html->find("h2", $numberP);
        echo $h2->outertext;
        $numberP++;
    }
}
?>

The PIsOneClass($html, $p) function is:

<?php
function PIsOneClass($html, $p)
{
    foreach($html->find("p.one_class") as $p_one_class)
    {
        if($p == $p_one_class)
        {
            return true;
        }
    }
    return false;
}
?> 

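It doesn't work, and I understand why, but I don't know how to fix it.

How do we say "I want every p without a class that sits between two h2 elements"?

Thanks a lot!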

2 answers:

Answer 0 (score: 0)

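From the Simple HTML DOM manual:

[attribute=value]

Matches elements that have the specified attribute with a certain value.

Or:

[!attribute]

Matches elements that don't have the specified attribute.

For the page in the question, a selector such as p[!class] should match only the paragraphs that have no class attribute.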

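Answer 1 (score: 0)

This task is easier with XPath, because you are scraping more than one kind of element and want to keep them in source order. You can use PHP's DOM library (which includes DOMXPath) to find and filter the elements you want: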

$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>

<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);

# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');

# iterate through the search results (a DOMNodeList of h2 and p elements), printing out node
# names and values
foreach ($interest as $i) {
    echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}

Output:

node h2, value: The title 1
node p, value:  Some interesting text 
node h2, value: The title 2
node p, value:  Some interesting text 
node p, value:  Some other interesting text 
node h2, value: The title 3
node p, value:  Some interesting text 

As you can see, the source order is preserved, and it is easy to eliminate the nodes you don't want.
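Since the question asks for the paragraphs that sit between two h2 elements so they can be stored in a database, the flat result list can also be grouped by heading. The following is a minimal sketch under stated assumptions, not part of the original answer: the function name groupParagraphsByHeading and the 'title' / 'paragraphs' array keys are hypothetical, and it relies on the XPath union returning nodes in document order, as the output above suggests.

```php
<?php
// Hypothetical helper (not from the original answer): collect each h2 title
// together with the unclassed <p> texts that follow it, in source order.
function groupParagraphsByHeading(string $html): array
{
    $dom = new DOMDocument;
    // Collect libxml warnings internally instead of printing them,
    // since scraped markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    $groups = [];

    // Same query as in the answer: all h2 elements plus every p
    // whose class attribute is not exactly "one_class".
    foreach ($xp->query('//h2 | //p[not(@class="one_class")]') as $node) {
        if ($node->nodeName === 'h2') {
            // A new heading starts a new group.
            $groups[] = ['title' => trim($node->nodeValue), 'paragraphs' => []];
        } elseif (count($groups) > 0) {
            // Attach the paragraph to the most recent heading;
            // paragraphs before the first h2 are skipped.
            $groups[count($groups) - 1]['paragraphs'][] = trim($node->nodeValue);
        }
    }

    return $groups;
}

$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>

<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>

<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';

$groups = groupParagraphsByHeading($html);
print_r($groups);
```

For the sample page this yields three groups, and the second one ("The title 2") holds two paragraphs. Each group could then be written to the database as one heading row with its related paragraph rows; a list of groups is used rather than an array keyed by title so that repeated headings do not overwrite each other.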