Question

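With a 5 MB document, libxml2 takes 3 seconds to evaluate the following query. Is there any way to speed it up? I need the resulting node set for further processing, so count() and the like are not an option.

Thanks!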

```
descendant::text() | descendant::*
[
self::p or
self::h1 or
self::h2 or
self::h3 or
self::h4 or
self::h5 or
self::h6 or
self::dl or
self::dt or
self::dd or
self::ol or
self::ul or
self::li or
self::dir or
self::address or
self::blockquote or
self::center or
self::del or
self::div or
self::hr or
self::ins or
self::pre
]
```

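Edit

Following Jens Erat's suggestion to use descendant::node()[self::text() or self::p or ... (see the accepted answer) improved the speed significantly: from 2.865330s originally to 0.164336s, roughly 17 times faster.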
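To illustrate what the combined expression asks for, here is a rough sketch in Python using the standard library's xml.dom.minidom rather than libxml2: a single walk over the descendants in which every node is tested once. It only mirrors the semantics of the query and says nothing about how libxml2 evaluates it internally; `select_in_one_pass` and `WANTED` are made-up names, and the tag set is shortened.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Shortened, illustrative tag set; extend it with the full list from the question.
WANTED = {"p", "h1", "h2", "div"}

def select_in_one_pass(root, wanted=WANTED):
    """Return the descendants of `root` that are text nodes or wanted
    elements, in document order, using one pre-order traversal."""
    result = []
    # Push children in reverse so that the first child is popped first.
    stack = list(reversed(root.childNodes))
    while stack:
        node = stack.pop()
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            result.append(node)            # corresponds to self::text()
        elif node.nodeType == Node.ELEMENT_NODE:
            if node.tagName in wanted:     # corresponds to self::p or self::h1 or ...
                result.append(node)
            stack.extend(reversed(node.childNodes))
    return result

if __name__ == "__main__":
    doc = minidom.parseString("<body><div><p>a</p><span>b</span></div><h1>c</h1></body>")
    for n in select_in_one_pass(doc.documentElement):
        print(n.tagName if n.nodeType == Node.ELEMENT_NODE else repr(n.data))
```

The original query instead describes two descendant selections whose results are then merged into one node set, so the combined form gives the engine less to do on paper.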

Answer 1

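Benchmarking is very hard without a document to benchmark against.

Two ideas for optimization:

Use as few descendant:: axis steps as possible. They are expensive, and you can probably gain some speed here. You can combine the text() and element tests like this: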
```
descendant::node()[self::text() or self::h1 or self::h2]
```
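and extend it to cover all the elements (I kept the query short for readability).
Use string tests instead of node tests. They might be faster (probably not; see the comments on this answer). You still need to keep the text() test, of course.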
```
descendant::node()[self::text() or local-name(.) = 'h1' or local-name(.) = 'h2']
```
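Writing out either form for all 22 elements by hand is error-prone, so the expression can be generated instead. The following is a minimal sketch in Python; `BLOCK_TAGS`, `node_test_query` and `string_test_query` are hypothetical names, and the resulting string would be passed to whatever XPath API you use with libxml2.

```python
# Element names taken from the query in the question.
BLOCK_TAGS = [
    "p", "h1", "h2", "h3", "h4", "h5", "h6",
    "dl", "dt", "dd", "ol", "ul", "li",
    "dir", "address", "blockquote", "center",
    "del", "div", "hr", "ins", "pre",
]

def node_test_query(tags):
    """Single descendant step using self:: node tests (first idea)."""
    tests = ["self::text()"] + [f"self::{tag}" for tag in tags]
    return "descendant::node()[" + " or ".join(tests) + "]"

def string_test_query(tags):
    """Single descendant step using local-name() comparisons (second idea)."""
    tests = ["self::text()"] + [f"local-name(.) = '{tag}'" for tag in tags]
    return "descendant::node()[" + " or ".join(tests) + "]"

if __name__ == "__main__":
    # Short inputs reproduce the two expressions shown above.
    print(node_test_query(["h1", "h2"]))
    print(string_test_query(["h1", "h2"]))
    # Full expression for the question's element list.
    print(node_test_query(BLOCK_TAGS))
```

Generating both variants from the same list also makes it easy to time them against each other on your own document.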

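If you query the same document often, consider using a native XML database such as BaseX, eXist DB, Zorba, Marklogic, ... (the first three are free). They build indexes over your data and should be able to deliver results faster (and they support XPath 2.0 / XQuery, which makes development much easier). All of them have APIs for a large number of programming languages.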

Answer 2

Did you compile libxml2 with the --with-threads option enabled? If so, the most straightforward approach is to throw a faster processor and more cores at the problem.

Answer 3

Your query is equivalent to

```
(descendant::text() | descendant::p
    | descendant::h1  | descendant::h2  | descendant::h3 | descendant::h4  | descendant::h5 | descendant::h6
    | descendant::dl  | descendant::dt  | descendant::dd | descendant::ol  | descendant::ul | descendant::li
    | descendant::dir | descendant::address | descendant::blockquote | descendant::center
    | descendant::del | descendant::div | descendant::hr | descendant::ins | descendant::pre
)
```

but I could not measure any difference in speed between the two.

XPath query performance on large documents
