Question

//*/text()[string-length() > 100]

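...almost works, but it also selects the script and style tags in the HTML document, and the text selection stops as soon as it hits a <br> or another tag.

I want to find elements that directly contain text, where the text is longer than 140 characters, and the whole element's text should be selected (sometimes the text is inside a span).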

Answer 1

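You need to understand the difference between text() nodes and string values in XPath.

text() selects text nodes in XPath. The br elements you mention, together with the text around them in the parent element, form mixed content: text() nodes and elements interleaved.
string() is an XPath function that returns the string value of its argument. To get a string that ignores the br elements, select the parent div and take its string value, either directly via string() or implicitly by using the expression in a context that converts it to a string.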
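
To make the distinction concrete, here is a minimal sketch in Python using only the standard library. The sample div is hypothetical (it is not from your document), and ElementTree has no text-node objects, so the text-node children are reconstructed from .text and .tail:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a div whose text is interrupted by a br element.
div = ET.fromstring("<div>first line<br/>second line</div>")

# ElementTree stores text-node children as .text (before the first child)
# and as each child's .tail (the text that follows that child).
text_nodes = [div.text] + [child.tail for child in div]
text_nodes = [t for t in text_nodes if t]

# The string value concatenates all descendant text in document order,
# which is what string(div) would return in XPath.
string_value = "".join(div.itertext())

print(text_nodes)    # ['first line', 'second line']
print(string_value)  # first linesecond line
```

Selecting text() gives two separate nodes, split at the br, while the string value of the div is a single string.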

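With that background, your statement,

I want to find elements that directly contain text, where the text is longer than 140 characters, and the whole element's text should be selected (sometimes the text is inside a span).

can be restated as

I want to find elements that have text() node children and whose string value has a length greater than 140.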

Let's look at some sample XML,

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

Let's reduce the threshold from 140 to 8 characters (a length greater than 7) to make it more manageable, and then

//*[text()][string-length() > 7]

captures the restated requirement and selects four elements:

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

<a>This is a <b>test</b> of mixed content.</a>

<c>asdf asdf asdf asdf</c>

<d>asdf asdf</d>

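Note that it did not select b, because the length of its string value ("test", 4 characters) is not greater than 7.

Note also that r was selected because of the whitespace-only text() nodes between its child elements. To eliminate such elements, add another predicate to text():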

//*[text()[normalize-space()]][string-length() > 7]

Then only a, c, and d will be selected.
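
If you want to check this outside an XPath engine, here is a rough sketch that emulates both expressions in plain Python. It assumes only the standard library, whose ElementTree does not support string-length() or normalize-space(), so the predicates are written by hand; select and its parameters are hypothetical names for illustration. With a full XPath 1.0 engine (lxml, for example) you could evaluate the expressions directly instead.

```python
import xml.etree.ElementTree as ET

XML = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""

def string_value(elem):
    """All descendant text concatenated, like string(.) in XPath."""
    return "".join(elem.itertext())

def text_children(elem):
    """Text-node children of elem: .text plus the .tail of each child."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]

def select(root, min_len, ignore_whitespace_only):
    selected = []
    for elem in root.iter():  # root and all descendants, in document order
        texts = text_children(elem)
        if ignore_whitespace_only:
            # Approximates the [normalize-space()] predicate on text().
            texts = [t for t in texts if t.strip()]
        if texts and len(string_value(elem)) > min_len:
            selected.append(elem.tag)
    return selected

root = ET.fromstring(XML)
print(select(root, 7, ignore_whitespace_only=False))  # ['r', 'a', 'c', 'd']
print(select(root, 7, ignore_whitespace_only=True))   # ['a', 'c', 'd']
```

The first call mirrors //*[text()][string-length() > 7]; the second mirrors the version with the extra normalize-space() predicate.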

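If you want just the text, in XPath 1.0 you can take the string value. Keep in mind that string() applied to a node-set returns the string value of only the first node in document order (here, a):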

string(//*[text()[normalize-space()]][string-length() > 7])

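If you want a collection of strings, in XPath 1.0 you have to iterate over the selected elements in the language that calls XPath, but in XPath 2.0 you can append a string() step at the end: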

//*[text()[normalize-space()]][string-length() > 7]/string()

to get a sequence of three separate strings:

This is a test of mixed content.
asdf asdf asdf asdf
asdf asdf
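
For the XPath 1.0 route, iterating in the calling language might look like the sketch below. It is an assumption-laden illustration: it uses Python's standard library only and hand-codes the selection (has_nonblank_text_child is a hypothetical helper), because ElementTree cannot evaluate the expression itself.

```python
import xml.etree.ElementTree as ET

XML = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""

def has_nonblank_text_child(elem):
    """True if elem has a text-node child that is not whitespace-only."""
    parts = [elem.text] + [child.tail for child in elem]
    return any(p and p.strip() for p in parts)

root = ET.fromstring(XML)

# Stand-in for //*[text()[normalize-space()]][string-length() > 7]
matches = [e for e in root.iter()
           if has_nonblank_text_child(e) and len("".join(e.itertext())) > 7]

# One string value per selected element, collected in the host language.
strings = ["".join(e.itertext()) for e in matches]

# string(node-set) in XPath 1.0 yields only the first of these.
first = strings[0] if strings else ""

for s in strings:
    print(s)
```

Here strings holds the same three values as the XPath 2.0 sequence, and first corresponds to what the string(...) call returns in XPath 1.0.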
