I have the following HTML document:
<div>
<span>Line 1</span>
<p>
<span class='inline'>This</span>
text should
<span class='inline'>be in</span>
one
<span class='inline'>line</span>
<span class='inline'>all together</span>
</p>
<em>
<span class='inline'>This</span>
line
<span class='inline'>too</span>
</em>
<a href="#">Line 4</a>
<div>
<p>
<span class='inline'>This fourth</span>
line
<span class='inline'>too</span>
</p>
</div>
<script type="text/javascript">//...</script>
<b></b>
</div>
The text that should be extracted:
Line 1
This text should be in one line all together
This line too
Line 4
This fourth line too
Currently I am using
//div//descendant::*[not(self::script)]/text()[string-length() > 0]
to extract the text.
This produces the following result:
Line 1
This
text should
be in
one
line
all together
This
line
too
Line 4
This fourth
line
too
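How can I combine the text when the 'inline' class is used? Or, if the 'inline' class is found in a child node, how can I use the parent node's text instead?
Note that this is just an example: the p and em tags may vary!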
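Answer 0 (score: 0)
Maybe you are looking at this from the wrong perspective. What stands out to me is that you want the text content of every child of the div element (which is also the root here), except for script tags and children that are empty: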
/div/*[name() != "script" and string-length(normalize-space())]
My XPath example also normalizes whitespace. For example, if <b></b> were <b> </b> or contained only line breaks, it would still count as empty.
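To make the whitespace point concrete, here is a minimal sketch on a made-up mini-document (the <i> element and its content are my own illustration, not part of the question). It only shows that a whitespace-only <b> is filtered out by the same predicate:

```php
<?php
// Hypothetical mini-document: <b> holds only whitespace, <i> holds text.
$xml = simplexml_load_string("<div><b> \n </b><i> kept </i><script>//...</script></div>");

// Same predicate as above: skip <script> and every element whose
// whitespace-normalized string value is empty.
$nodes = $xml->xpath('/div/*[name() != "script" and string-length(normalize-space())]');

// Collect the element names that survived the filter.
$names = array_map(function ($node) {
    return $node->getName();
}, $nodes);

var_dump($names); // only "i" remains
```

Inside the predicate, the number returned by string-length() is converted to a boolean because it is an operand of `and`, so a length of 0 means "exclude this element".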
Reading DOMNode::$textContent of each matched element and normalizing its whitespace yields the following result:
string(6) "Line 1"
string(44) "This text should be in one line all together"
string(13) "This line too"
string(6) "Line 4"
string(20) "This fourth line too"
Here is a quick PHP example demonstrating this:
<?php
$buffer = <<<XML
<div>
<span>Line 1</span>
<p>
<span class='inline'>This</span>
text should
<span class='inline'>be in</span>
one
<span class='inline'>line</span>
<span class='inline'>all together</span>
</p>
<em>
<span class='inline'>This</span>
line
<span class='inline'>too</span>
</em>
<a href="#">Line 4</a>
<div>
<p>
<span class='inline'>This fourth</span>
line
<span class='inline'>too</span>
</p>
</div>
<script type="text/javascript">//...</script>
<b></b>
</div>
XML;
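// Parse the markup; SimpleXML requires well-formed XML.
$xml = simplexml_load_string($buffer);
// Select every child of the root <div> except <script> and elements
// whose whitespace-normalized text is empty.
$result = $xml->xpath('/div/*[name() != "script" and string-length(normalize-space())]');
foreach ($result as $node) {
    // textContent concatenates the text of all descendant nodes.
    $text = dom_import_simplexml($node)->textContent;
    // Collapse each whitespace run to one space, then trim both ends.
    $text = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);
    var_dump($text);
}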
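The example above relies on the markup being well-formed XML. If the input is ordinary HTML that may not parse as XML, the same query can be run through DOMDocument::loadHTML and DOMXPath instead. The following is only a sketch under that assumption: the function name extractLines is hypothetical, and the query targets the first div in the document because loadHTML wraps the fragment in html and body elements.

```php
<?php
// Hypothetical helper: returns one whitespace-normalized string per
// non-empty, non-script child of the first <div> found in $html.
function extractLines(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '(//div)[1]/*[name() != "script" and string-length(normalize-space())]';

    $lines = [];
    foreach ($xpath->query($query) as $node) {
        // Collapse whitespace runs to one space and trim both ends.
        $lines[] = trim(preg_replace('/\s+/u', ' ', $node->textContent));
    }
    return $lines;
}

// Usage: pass the same markup that is stored in $buffer above.
// print_r(extractLines($buffer));
```

Because the query selects the children of the div rather than individual text nodes, it does not matter whether the wrapping element is a p, an em, or something else.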