Simple HTML DOM Parser - get all plaintext rather than the text of certain elements

Time: 2013-05-04 10:12:45

Tags: php parsing dom html-parsing web-scraping

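I tried all the solutions posted on a related question. Although that question is similar to mine, its solutions do not work for me.

I am trying to get the plain text that sits outside the <b> element but inside <div id="maindiv">.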

<div id=maindiv>
     <b>I don't want this text</b>
     I want this text
</div>

$part is the object that contains <div id="maindiv">. So far I have tried this:

$part->find('!b')->innertext;

The code above does not work. When I tried this instead:

$part->plaintext;

it returned all of the plain text, like this:

I don't want this text I want this text

I read the official documentation, but I could not find anything that solves this.

2 answers:

Answer 0: (score: 0)

Query:

$selector->query('//div[@id="maindiv"]/text()[2]')

Explanation:

//               - selects nodes regardless of their position in tree

div              - selects elements whose node name is 'div'

[@id="maindiv"]  - selects only those divs having the attribute id="maindiv"

/                - steps from the div to its child nodes

text()           - selects only text nodes

[2]              - selects the second text node (the first is whitespace)

                   Note! The actual position of the text node may depend on
                   your preserveWhiteSpace setting.

                   Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace

Example:

$html = <<<EOF
<div id="maindiv">
     <b>I dont want this text</b>
     I want this text
</div>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);

$selector = new DOMXPath($doc);

$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
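
Since the note above warns that the position of the text node can shift with whitespace handling, a variant that avoids the positional index may be more robust. The following is a minimal sketch, not part of the original answer: it assumes the same markup and uses the XPath 1.0 function normalize-space() as a predicate, so only text nodes that contain something other than whitespace are selected. The variable names $xpath and $texts are illustrative.

```php
<?php
$html = <<<EOF
<div id="maindiv">
     <b>I dont want this text</b>
     I want this text
</div>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// text()[normalize-space()] keeps only the direct text children of the div
// that are not whitespace-only, so no [2] index is needed.
$nodes = $xpath->query('//div[@id="maindiv"]/text()[normalize-space()]');

$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->nodeValue);
}

echo implode(' ', $texts), "\n"; // I want this text
```

Because text() only matches direct children of the div, the text inside <b> is never part of the result, whether or not a leading whitespace node exists.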

Answer 1: (score: 0)

Remove the <b> first:

$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text
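
The same "remove the unwanted element, then read what is left" idea can also be sketched without Simple HTML DOM. The example below is an assumption-laden illustration rather than this answer's method: it swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath), reuses the markup from the question, and the variable names $div and $remaining are made up for the sketch.

```php
<?php
$html = <<<EOF
<div id="maindiv">
     <b>I dont want this text</b>
     I want this text
</div>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);

$div = (new DOMXPath($doc))->query('//div[@id="maindiv"]')->item(0);

// Copy the live node list into an array first: removing nodes while
// iterating a live list can skip elements.
foreach (iterator_to_array($div->getElementsByTagName('b')) as $b) {
    $b->parentNode->removeChild($b);
}

// With every <b> gone, textContent holds only the remaining text.
$remaining = trim($div->textContent);
echo $remaining, "\n"; // I want this text
```

Unlike the XPath text() query, this approach modifies the parsed document, so it fits best when the <b> element is not needed afterwards.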