Question

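I'm trying to extract titles from a page. So far everything seems to work, but the results come back doubled. For example, when I grab the h3 titles, each one shows up twice in the output.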

Here is an example:

<span data-img-type='cvr' data-img-att-alt='Cover of Greek Mythology' data-img-size-xs='image.jpg'></span>
<h3> Cover of Greek Mythology </h3>

This returns:

Cover of Greek Mythology
Cover of Greek Mythology

I'm only targeting h3 elements, but they still show up twice. How can I remove the duplicate elements?

This is what I have so far:

$html = file_get_contents('https://example.com/');

$scriptDocument = new DOMDocument();

// Collect libxml warnings internally instead of printing them for malformed HTML.
libxml_use_internal_errors(true);

if (!empty($html)) {

    $scriptDocument->loadHTML($html);
    libxml_clear_errors();
    $scriptDOMXPath = new DOMXPath($scriptDocument);
    // Get all h3 elements that have a class attribute.
    $scriptRow = $scriptDOMXPath->query('//h3[@class]');
    // Print the text of each match, if there are any.
    if ($scriptRow->length > 0) {
        foreach ($scriptRow as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
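
The snippet alone does not show why each title appears twice. One possibility is that the fetched page really contains each h3 twice, for example in separate mobile and desktop blocks, but that is an assumption. Under that assumption, a minimal sketch of one way to drop repeats is to remember the normalized text of every heading already printed and skip any that was seen before. The inline `$html` string and the `title` class below are hypothetical stand-ins for the real page, and `$titles` is a name introduced only for this illustration.

```php
<?php
// Hypothetical sample markup standing in for the fetched page;
// the same heading deliberately appears twice.
$html = <<<HTML
<html><body>
<h3 class="title"> Cover of Greek Mythology </h3>
<div><h3 class="title"> Cover of Greek Mythology </h3></div>
<h3 class="title">Another Title</h3>
</body></html>
HTML;

libxml_use_internal_errors(true);
$scriptDocument = new DOMDocument();
$scriptDocument->loadHTML($html);
libxml_clear_errors();

$scriptDOMXPath = new DOMXPath($scriptDocument);

$titles = []; // unique heading texts, in document order
foreach ($scriptDOMXPath->query('//h3[@class]') as $row) {
    // Collapse runs of whitespace and trim, so that " Title " and "Title" compare equal.
    $text = trim(preg_replace('/\s+/', ' ', $row->nodeValue));
    // Skip empty headings and any text that was already collected.
    if ($text === '' || in_array($text, $titles, true)) {
        continue;
    }
    $titles[] = $text;
}

foreach ($titles as $title) {
    echo $title . "<br/>";
}
```

This compares headings by their text only, so two genuinely different headings that happen to share the same text would also be merged. If the duplicates come from a specific container instead, narrowing the XPath expression to a single container may be the cleaner fix.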

How to exclude doubled DOMDocument elements

0 answers: