Question

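I am using PHP to retrieve various web pages and load them into a DomDocument, but I am having trouble extracting the text from the leaf nodes.

For example, suppose I have the following: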

<html>
    <body>
        <div class="this_is_our_div_of_interest">
            <div>
                <div>
                    <p>Some text</p>
                    <div>Some <a href='#'>more</a> text</div>
                    <p>And <span><strong>another</strong></span> paragraph</p>
                </div>
                <p>Yay<p>
            </div>
            <div>
                <h4>abcd</ph4>
                xyz
            <div>
        </div>
        <div class="we_do_not_want_those_divs">
            <p>This text is not important to us</p>
        </div>
    </body>
</html>

As you can see, this is messy input, but the expected "echo" output is:

Some text
Some more text
And another paragraph
Yay
abcd
xyz

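Note the following about the output:

I only retrieve the output of a specific tag (in our case, this_is_our_div_of_interest)
The tree above is not a fixed format, since it comes from web pages whose content I cannot control. I only want the content of tags such as div and p, which appear to be the leaf nodes
Some tags need to be omitted, such as a, span and strong (others may be added to the list)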

Update: I use XPath to access the class. For example, the following line of code returns all descendants as separate nodes:

$nodes = $xpath->query("//div[@class='this_is_our_div_of_interest']/descendant::*");

Answer 1

You can do the following:

$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
// This only works if the target element has an id attribute
$id = $dom->getElementById('youNeedAnIdForThis');

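Now you can work with $id.

Unfortunately, DOMDocument has no getElementsByClassName method, but I found an implementation at http://pastebin.com/4qYMEGqV. With it, your code would look like: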

$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
// getElementsByClassName() is the user-defined helper from the pastebin link, not a built-in
$class = getElementsByClassName($dom, 'this_is_our_div_of_interest');

$class[0] should now hold the element you are looking for.
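If you would rather not depend on the pastebin helper, a class lookup can also be sketched with DOMXPath, which the question already uses. This is only a minimal sketch: `findByClass` is a hypothetical name, the inline markup is made up for the demo, and the class name is assumed to contain no quote characters.

```php
<?php
// Hypothetical helper (not part of DOMDocument): find elements by class name via XPath.
function findByClass(DOMDocument $dom, string $className): array
{
    $xpath = new DOMXPath($dom);
    // Pad the class attribute with spaces so that only whole class tokens match.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $className
    );
    return iterator_to_array($xpath->query($query));
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
// Demo markup (an assumption, not the asker's real page).
$dom->loadHTML(
    '<div class="box this_is_our_div_of_interest"><p>Some text</p></div>' .
    '<div class="this_is_our_div_of_interest_not"><p>Other</p></div>'
);
libxml_clear_errors();

$class = findByClass($dom, 'this_is_our_div_of_interest');
echo count($class), "\n";
```

Unlike the `@class='...'` comparison in the question, this also matches elements that carry additional classes, while still ignoring class names that merely start with the same string.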

After that, you could use strip_tags() if you only want the text.
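As a rough illustration of the strip_tags() idea, the sketch below serializes one node with saveHTML() and strips its tags, then compares the result with the node's textContent property. The markup is invented for the demo. The point it tries to show is that strip_tags() leaves HTML entities escaped, whereas textContent returns decoded text.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
// Demo markup (an assumption, not the asker's real page).
$dom->loadHTML('<div><p>Tom &amp; <strong>Jerry</strong></p></div>');
libxml_clear_errors();

$p = $dom->getElementsByTagName('p')->item(0);

// Serialize the node back to HTML, then remove the tags.
// Entities such as &amp; stay escaped in the result.
$viaStripTags = strip_tags($dom->saveHTML($p));

// textContent concatenates the text of all descendants, already decoded.
$viaTextContent = $p->textContent;

echo $viaStripTags, "\n";
echo $viaTextContent, "\n";
```

Either way, the inline tags (a, span, strong) disappear and their text is kept, but both approaches flatten a whole subtree into one string, so they do not by themselves give one line per paragraph.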

It may also be worth looking at the childNodes property of DOMNode: http://www.php.net/manual/en/class.domnode.php#domnode.props.childnodes
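One way to combine childNodes with the asker's "omit" list is to walk the tree and print one line per element that contains nothing but text and inline tags. The following is a sketch under stated assumptions, not a definitive solution: `INLINE_TAGS`, `isTextLeaf` and `collectLines` are hypothetical names, and the markup is a cleaned-up version of the question's example. On the original malformed markup, the result depends on how libxml repairs the tree.

```php
<?php
// Tags whose text is merged into the parent's line (the asker's omit list; extend as needed).
const INLINE_TAGS = ['a', 'span', 'strong'];

// Hypothetical helper: true when every descendant element is an inline tag.
function isTextLeaf(DOMElement $el): bool
{
    foreach ($el->getElementsByTagName('*') as $descendant) {
        if (!in_array($descendant->nodeName, INLINE_TAGS, true)) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: collect one line per leaf-like element or stray text node.
function collectLines(DOMNode $node, array &$lines): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement && !isTextLeaf($child)) {
            collectLines($child, $lines); // container: descend further
            continue;
        }
        if (!($child instanceof DOMElement || $child instanceof DOMText)) {
            continue; // skip comments and other node types
        }
        // Collapse runs of whitespace and drop empty results.
        $text = trim(preg_replace('/\s+/', ' ', $child->textContent));
        if ($text !== '') {
            $lines[] = $text;
        }
    }
}

$html = <<<'HTML'
<div class="this_is_our_div_of_interest">
  <div>
    <div>
      <p>Some text</p>
      <div>Some <a href="#">more</a> text</div>
      <p>And <span><strong>another</strong></span> paragraph</p>
    </div>
    <p>Yay</p>
  </div>
  <div>
    <h4>abcd</h4>
    xyz
  </div>
</div>
<div class="we_do_not_want_those_divs"><p>This text is not important to us</p></div>
HTML;

libxml_use_internal_errors(true); // real-world pages usually trigger parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$target = $xpath->query("//div[@class='this_is_our_div_of_interest']")->item(0);

$lines = [];
collectLines($target, $lines);
echo implode("\n", $lines), "\n";
```

Treating "leaf" as "contains only inline tags" is a design choice made for this sketch. Adding more tag names to `INLINE_TAGS` changes which elements are merged into their parent's line.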

Extracting text from leaf nodes using DomDocument in PHP
