Question

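I'm using Simple HTML DOM to scrape a website. The problem is that the text I want sits outside of any specific element. The only element it appears to be inside of is <div id="content">.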

<div id="content">
    <div class="image-wrap"></div>
    <div class="gallery-container"></div>
    <h3 class="name">Here is the Heading</h3>

    All the text I want is located here !!!

    <p> </p>
    <div class="snapshot"></div>
</div>

I'm guessing the webmaster messed up and the text was really supposed to be inside the <p> tag.

I've tried the following code, but it doesn't retrieve the text:

    $t = $scrape->find("div#content text",0);
    if ($t != null){
        $text = trim($t->plaintext);
    }

I'm still a beginner and learning. Can anyone help?

Answer 1

You're almost there. Use a test loop to display the contents of each text node and find the index of the text you want. For example:

// Find all text nodes under div#content
// ($scrape is the parsed document from the question)
$texts = $scrape->find('div#content text');

foreach ($texts as $key => $txt) {
    // Display each text node with its index and its parent's tag name
    echo "<br/>TEXT $key is ", $txt->plaintext, " -- in TAG ", $txt->parent()->tag;
}

You'll find that you should use index 4 instead of 0:

$scrape->find("div#content text",4);

If your text doesn't always have the same index, but you know it follows the h3 heading, you can use something like this:

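foreach ($texts as $key => $txt) {
    // Locate the text node that belongs to the h3 heading
    if ($txt->parent()->tag == 'h3') {
        // Grab the next text node from $texts, if there is one
        if (isset($texts[$key + 1])) {
            echo trim($texts[$key + 1]->plaintext);
        }
        // Stop after the first heading
        break;
    }
}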
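
If switching parsers is an option, the same "text right after the h3" idea can also be sketched with PHP's built-in DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. Treat this as a rough sketch under assumptions: `textAfterHeading` is a hypothetical helper name, and `$markup` is just the sample HTML from the question rather than a fetched page.

```php
<?php
// Assumption: sample markup copied from the question; a real page would be fetched instead.
$markup = <<<'HTML'
<div id="content">
    <div class="image-wrap"></div>
    <div class="gallery-container"></div>
    <h3 class="name">Here is the Heading</h3>

    All the text I want is located here !!!

    <p> </p>
    <div class="snapshot"></div>
</div>
HTML;

// Hypothetical helper (not part of Simple HTML DOM): returns the first
// non-blank text node that follows the h3 as a sibling inside div#content,
// or null when no such text node exists.
function textAfterHeading(string $markup): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally; real-world HTML is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text nodes that are later siblings of the h3, skipping whitespace-only ones.
    $nodes = $xpath->query(
        '//div[@id="content"]/h3/following-sibling::text()[normalize-space()]'
    );

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

echo textAfterHeading($markup), PHP_EOL;
```

The XPath predicate `[normalize-space()]` drops the whitespace-only text nodes between tags, so there is no index to hard-code.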
