Question

I have this XPath expression, which I pass to HtmlCleaner:

 //table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img

My problem is that the markup varies: sometimes the /a/img elements are not present. So I want a single expression that selects all elements matching

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img

when /a/img is present, and

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]

when /a/img is not present.

Does anyone know how to do this? In another question I found something that looks like it might work for me:

descendant-or-self::*[self::body or self::span/parent::body]

but I don't understand it.

Thanks in advance.

Answer 1

Use:

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]
                       [not(a/img)] 

|

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]
                      /a/img

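In general, if we want to select one node-set ($ns1) when some condition $cond is true, and another node-set ($ns2) otherwise, this can be specified with the following single XPath expression: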

$ns1[$cond] | $ns2[not($cond)]

In this particular case, $ns1 is:

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]
                      /a/img

$ns2 is:

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]

and $cond is:

boolean( (//table[@class='StandardTable']
         /tbody/tr)
             [position()>1]
                       /td[2]
                          /a/img
        )
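
As a rough illustration of the general $ns1[$cond] | $ns2[not($cond)] form, here is a minimal Java sketch using the JDK's built-in JAXP XPath engine. The sample markup, the class name ConditionalNodeSetDemo and the helper methods are assumptions made up for this example; only the XPath fragments come from the answer. Because $cond here is an absolute path, it is a whole-document test: the expression returns all the images if any exist, and all the cells otherwise, which differs from the per-cell behaviour of the first expression in this answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConditionalNodeSetDemo {
    // $ns1 and $ns2 as given in the answer, written on one line each.
    static final String NS1 =
        "(//table[@class='StandardTable']/tbody/tr)[position()>1]/td[2]/a/img";
    static final String NS2 =
        "(//table[@class='StandardTable']/tbody/tr)[position()>1]/td[2]";
    // $ns1[$cond] | $ns2[not($cond)], with $cond = boolean($ns1).
    static final String EXPR =
        NS1 + "[boolean(" + NS1 + ")] | " + NS2 + "[not(" + NS1 + ")]";

    // Hypothetical helper: builds a well-formed table with a header row
    // followed by one data row per supplied second-cell content.
    static String table(String... secondCells) {
        StringBuilder sb = new StringBuilder(
            "<html><body><table class='StandardTable'><tbody>");
        sb.append("<tr><td>h1</td><td>h2</td></tr>");
        for (String cell : secondCells) {
            sb.append("<tr><td>x</td><td>").append(cell).append("</td></tr>");
        }
        return sb.append("</tbody></table></body></html>").toString();
    }

    // Evaluates EXPR and returns the element names of the selected nodes.
    static List<String> selectedNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPR, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One linked image somewhere in the table: only images come back.
        System.out.println(selectedNames(
            table("<a href='#'><img src='a.png'/></a>", "plain"))); // [img]
        // No images at all: every second cell comes back.
        System.out.println(selectedNames(table("plain", "also plain"))); // [td, td]
    }
}
```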

Answer 2

You can select the union of two mutually exclusive expressions (note the | union operator):

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img|
//table[@class='StandardTable']/tbody/tr[position()>1]/td[2][not(a/img)]

For any given td, when the first expression returns nodes the second does not, and vice versa, so you always get only the nodes you want.
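
To see the union behave per cell, the following is a minimal sketch that evaluates it with the JDK's standard JAXP classes against a small, well-formed sample. The sample markup, the class name UnionXPathDemo and the "img:"/"td:" labels are assumptions invented for the illustration; only the XPath expression itself comes from this answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnionXPathDemo {
    // Hypothetical sample: a header row, a row with a linked image,
    // and a row whose second cell holds plain text.
    static final String SAMPLE =
        "<html><body><table class='StandardTable'><tbody>"
      + "<tr><td>h1</td><td>h2</td></tr>"
      + "<tr><td>1</td><td><a href='#'><img src='a.png'/></a></td></tr>"
      + "<tr><td>2</td><td>plain</td></tr>"
      + "</tbody></table></body></html>";

    // The union from the answer: images where they exist, cells where they do not.
    static final String EXPR =
        "//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img"
      + " | "
      + "//table[@class='StandardTable']/tbody/tr[position()>1]/td[2][not(a/img)]";

    // Returns one label per selected node: "img:<src>" or "td:<text>".
    static List<String> select(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPR, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                out.add(e.getTagName().equals("img")
                    ? "img:" + e.getAttribute("src")
                    : "td:" + e.getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Expect one image (from the first data row) and one td (from the second).
        System.out.println(select(SAMPLE));
    }
}
```

The header row is skipped by position()>1, so the result holds exactly one node per data row.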

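From your comment on @Dimitre's answer, I see that HTMLCleaner does not fully support XPath 1.0. You don't really need it to. You only need HTMLCleaner to parse the malformed input. Once that is done, convert its output into a standard org.w3c.dom.Document and treat it as XML.

Here is a conversion example: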

TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

From here on, just use JAXP with whatever implementation you like:

XPath xpath = XPathFactory.newInstance().newXPath();
Node node = (Node) xpath.evaluate("/html/body/div/p[not(child::*)]", 
                       doc, XPathConstants.NODE);
System.out.println(node.getTextContent());

Output:

test

Answer 3

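This is ugly and may not even work, but the principle should be:

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2][exists( a/img )]/a/img | //table[@class='StandardTable']/tbody/tr[position()>1]/td[2][not( exists( a/img ) )]

The paths inside the predicates are relative (a/img, with no leading slash) so that they are evaluated against each td. exists() is an XPath 2.0 function; under XPath 1.0, the predicates [a/img] and [not(a/img)] express the same tests.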
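
A path that starts with / inside a predicate is evaluated from the document root rather than from the td being tested, so a/img and /a/img behave differently there. The sketch below counts the matching cells for each form using the JDK's default JAXP engine, which implements XPath 1.0 and therefore uses the plain [a/img] predicate instead of exists(). The sample markup and the class name PredicatePathDemo are assumptions made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PredicatePathDemo {
    // Hypothetical sample: a header row, one cell with a linked image, one plain cell.
    static final String SAMPLE =
        "<html><body><table class='StandardTable'><tbody>"
      + "<tr><td>h1</td><td>h2</td></tr>"
      + "<tr><td>1</td><td><a href='#'><img src='a.png'/></a></td></tr>"
      + "<tr><td>2</td><td>plain</td></tr>"
      + "</tbody></table></body></html>";

    // The second cell of every row after the first, as in the question.
    static final String CELLS =
        "//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]";

    // Returns how many nodes the given expression selects in SAMPLE.
    static int count(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            Double n = (Double) XPathFactory.newInstance().newXPath()
                .evaluate("count(" + expr + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(CELLS + "[a/img]"));      // 1: relative to each td
        System.out.println(count(CELLS + "[/a/img]"));     // 0: the root element is html, not a
        System.out.println(count(CELLS + "[not(a/img)]")); // 1: the plain-text cell
    }
}
```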

XPath expression: get elements even when they do not exist