XPath: combining node text based on a condition

Time: 2015-11-28 04:50:05

Tags: php xpath

I have the following HTML document:

<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    text should 
    <span class='inline'>be in</span>
    one 
    <span class='inline'>line</span>
    <span class='inline'>all together</span>
  </p>
  <em>
    <span class='inline'>This</span>
    line
    <span class='inline'>too</span>
  </em>
  <a href="#">Line 4</a>
  <div>
    <p>
      <span class='inline'>This fourth</span>
      line
      <span class='inline'>too</span>
    </p>
  </div>
  <script type="text/javascript">//...</script>
  <b></b>
</div>

The text that should be extracted:

Line 1
This text should be in one line all together
This line too
Line 4
This fourth line too

Currently I am using //div//descendant::*[not(self::script)]/text()[string-length() > 0] to extract the text.

This gives the following result:

Line 1
This
text should
be in
one
line
all together
This
line
too
Line 4
This fourth
line
too

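How can I combine the text when the "inline" class is used? Or, if the "inline" class is found inside a child node, how can I use the parent node's text instead?

Please note that this is only an example: the p and em tags may vary!

1 answer:

Answer 0 (score: 0)

Perhaps you are looking at this from the wrong perspective. What jumps out at me is that you are looking for the text content of every child of the div (here also the root) element, except for the script tag and any children that are empty: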

/div/*[name() != "script" and string-length(normalize-space())]

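My XPath example also normalizes whitespace. For example, if <b></b> were <b> </b>, or contained only some line breaks, it would still count as empty.

Reading DOMNode::$textContent and normalizing its whitespace then gives the following result: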
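To illustrate the normalization point, here is a minimal sketch on a made-up fragment. The markup and the $names variable are assumptions for illustration only and are not part of the question's document. It shows that an element containing only whitespace is filtered out in the same way as a truly empty one:

```php
<?php

// Hypothetical miniature document: the first two <b> elements are "empty"
// once whitespace is normalized; only the third carries text.
$xml = simplexml_load_string("<div><b></b><b> \n </b><b>kept</b><script>//...</script></div>");

// Same predicate as above: skip <script>, and skip any element whose
// whitespace-normalized string value has length 0.
$nodes = $xml->xpath('/div/*[name() != "script" and string-length(normalize-space())]');

$names = [];
foreach ($nodes as $node) {
    $names[] = $node->getName() . ':' . (string) $node;
}

var_dump($names); // a single entry: "b:kept"
```

Because normalize-space() is called without an argument, it works on the string value of the context node, that is, the concatenated text of the element and all of its descendants.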

string(6) "Line 1"
string(44) "This text should be in one line all together"
string(13) "This line too"
string(6) "Line 4"
string(20) "This fourth line too"

Here is a quick PHP example demonstrating this:

<?php

$buffer = <<<XML
<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    text should
    <span class='inline'>be in</span>
    one
    <span class='inline'>line</span>
    <span class='inline'>all together</span>
  </p>
  <em>
    <span class='inline'>This</span>
    line
    <span class='inline'>too</span>
  </em>
  <a href="#">Line 4</a>
  <div>
    <p>
      <span class='inline'>This fourth</span>
      line
      <span class='inline'>too</span>
    </p>
  </div>
  <script type="text/javascript">//...</script>
  <b></b>
</div>
XML;

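// Select the non-empty children of the root div, skipping <script>.
$xml = simplexml_load_string($buffer);
$result = $xml->xpath('/div/*[name() != "script" and string-length(normalize-space())]');
foreach ($result as $node) {
    // textContent concatenates the text of the element and all its descendants.
    $text = dom_import_simplexml($node)->textContent;
    // Collapse whitespace runs to a single space, then strip a leading/trailing space.
    $text = preg_replace(['(\s+)u', '(^\s|\s$)u'], [' ', ''], $text);
    var_dump($text);
}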
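
As a possible variation, the whitespace cleanup can also be left to XPath itself instead of preg_replace. The following is only a sketch under stated assumptions: it uses DOMDocument and DOMXPath rather than SimpleXML, a shortened copy of the question's markup, and the illustrative variable names $html and $lines.

```php
<?php

// Shortened, hypothetical version of the markup from the question.
$html = <<<XML
<div>
  <span>Line 1</span>
  <p>
    <span class='inline'>This</span>
    text should
    <span class='inline'>be in</span>
    one
  </p>
  <div>
    <p>
      <span class='inline'>This fourth</span>
      line
    </p>
  </div>
  <script type="text/javascript">//...</script>
  <b> </b>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);

$lines = [];
// Children of the root div that are not <script> and are not whitespace-only.
foreach ($xpath->query('/div/*[not(self::script) and normalize-space()]') as $node) {
    // normalize-space(.) trims the string value of the node and collapses
    // runs of whitespace into single spaces.
    $lines[] = $xpath->evaluate('normalize-space(.)', $node);
}

var_dump($lines);
// "Line 1", "This text should be in one", "This fourth line"
```

Note that XPath's normalize-space() only treats space, tab, carriage return and line feed as whitespace, so the preg_replace approach may still be the better fit if other whitespace characters need handling.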