Question

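Suppose I am using a recursive loop for resilient discovery and location of DOM elements, one that should work across semi-structured, semi-uniform HTML DOM documents from a website.

For example, when scraping links from a site and running into small variations in their XPath location. Resilience is needed to allow flexible, uninterrupted crawling.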

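1) I know I want a link that sits in a certain region of the page, one that is distinguishable from the other regions (e.g. menu, footer, header, etc.).

2) It is distinguishable because it appears to be inside a table and a paragraph or container.

3) There can be an acceptable level of unexpected parents or children before the desired link mentioned in 1), but I do not know what they are. More unexpected elements means drifting further away from 1).

4) Identifying the link via an element's id, class, or any other unique attribute value is not desirable.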

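I think the following XPath sums it up:

//p/table/tr/td/a

On some pages there are variations of the XPath, but the result still qualifies as the desired link from 1):

//p/div/table/tr/td/a or //p/div/span/span/table/tr/td/b/a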

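I have used indentation to mimic each loop iteration.

(Should I use plural or singular? Children or child, parents or parent. I think singular makes sense, since only the immediate parent or child matters here.)

Top down: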

How many p's are there?
 How many of these p's have a table as a child? If none, search the next sublevel.
  How many of these tables have a tr as a child? If none, search the next sublevel.
   How many of these tr's have a td as a child? If none, search the next sublevel.
    How many of these td's have an a as a child?
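The top-down walk above could be sketched roughly as follows. This is only an illustration under some assumptions: the page is well-formed enough for the standard library's `xml.etree.ElementTree` to parse (real-world HTML would need an HTML parser), and the names `PAGE`, `descend` and `find_links_top_down` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: the three path variants from the question,
# plus a footer link that should never match.
PAGE = """<html><body>
<p><table><tr><td><a href="/direct">x</a></td></tr></table></p>
<p><div><table><tr><td><a href="/one-extra">x</a></td></tr></table></div></p>
<p><div><span><span><table><tr><td><b><a href="/four-extra">x</a></b></td></tr></table></span></span></div></p>
<div><a href="/footer">x</a></div>
</body></html>"""

def descend(node, chain, skips_left):
    """Yield elements reached by following the tag names in `chain` below
    `node`, stepping through at most `skips_left` unexpected elements."""
    if not chain:
        yield node
        return
    for child in node:
        if child.tag == chain[0]:
            # Expected child: move on to the next tag in the chain.
            yield from descend(child, chain[1:], skips_left)
        elif skips_left > 0:
            # Unexpected child: search the next sublevel, spending one skip.
            yield from descend(child, chain, skips_left - 1)

def find_links_top_down(root, max_unexpected=2):
    links = []
    for p in root.iter("p"):  # "How many p's are there?"
        links.extend(descend(p, ["table", "tr", "td", "a"], max_unexpected))
    return links

root = ET.fromstring(PAGE)
print([a.get("href") for a in find_links_top_down(root)])
# ['/direct', '/one-extra']
```

With a limit of 2, the `//p/div/span/span/table/tr/td/b/a` variant is rejected because it contains four unexpected elements (div, span, span, b); raising `max_unexpected` to 4 would accept it.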

Bottom up:

How many a's are there?
 How many of these a's have a td as a parent? If none, look up to the next superlevel.
  How many of these td's have a tr as a parent? If none, look up to the next superlevel.
   How many of these tr's have a table as a parent? If none, look up to the next superlevel.
    How many of these tables have a p as a parent? If none, look up to the next superlevel.

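Does it matter whether the search is top down or bottom up? I think top down is useless and inefficient if it turns out at the end of the loop that the desired anchor link is not found.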

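I think I would also measure how many unexpected parents or children are discovered in each iteration of the loop, and compare that against a preset constant that I am comfortable with, say no more than 2. If there are 3 or more unexpected parents or children before the desired anchor link is discovered, I would treat it as not being what I want.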
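A minimal sketch of the bottom-up variant with this kind of threshold might look like the following. It assumes markup that `xml.etree.ElementTree` can parse, and the names `PAGE`, `EXPECTED_ANCESTORS`, `count_unexpected` and `find_links_bottom_up` are hypothetical, chosen for the example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page with zero, one and four unexpected elements
# on the path, plus a footer link outside any p/table.
PAGE = """<html><body>
<p><table><tr><td><a href="/direct">x</a></td></tr></table></p>
<p><div><table><tr><td><a href="/one-extra">x</a></td></tr></table></div></p>
<p><div><span><span><table><tr><td><b><a href="/four-extra">x</a></b></td></tr></table></span></span></div></p>
<div><a href="/footer">x</a></div>
</body></html>"""

EXPECTED_ANCESTORS = ["td", "tr", "table", "p"]  # nearest ancestor first

def count_unexpected(link, parent_of, expected=EXPECTED_ANCESTORS):
    """Walk up from `link` and count the ancestors that are not the next
    expected tag. Return None if the root is reached before the whole
    expected chain has been seen."""
    remaining = list(expected)
    unexpected = 0
    node = parent_of.get(link)
    while node is not None and remaining:
        if node.tag == remaining[0]:
            remaining.pop(0)   # expected parent found
        else:
            unexpected += 1    # look up to the next superlevel
        node = parent_of.get(node)
    return unexpected if not remaining else None

def find_links_bottom_up(root, max_unexpected=2):
    # ElementTree elements have no parent pointer, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    links = []
    for a in root.iter("a"):  # "How many a's are there?"
        n = count_unexpected(a, parent_of)
        if n is not None and n <= max_unexpected:
            links.append(a)
    return links

root = ET.fromstring(PAGE)
print([a.get("href") for a in find_links_bottom_up(root)])
# ['/direct', '/one-extra']
```

Starting from the anchors means each candidate is checked with a single walk towards the root, and the count of unexpected ancestors is available directly for comparison against the preset constant.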

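Is this the right approach? It is just the first thing that came to mind. I apologize if this question is unclear; I have tried my best. I would love to get some input on this algorithm.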

Answer 1

It seems you want:

//p//table//a

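If you have a limit on the number of intermediate elements in the path, say no more than 2, then the expression //p//table//a would be modified to: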

//p[not(ancestor::*[3])]
      //table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
               /tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]

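This selects all a elements whose parent or grandparent is a td, whose parent is a tr, whose parent is a table, whose parent or grandparent is a p element that has fewer than 3 ancestor elements.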
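For comparison, the looser expression `//p//table//a` from the start of this answer can be tried with Python's standard library, as a rough illustration. `xml.etree.ElementTree` implements only a subset of XPath and has no `ancestor` axis, so the bounded expression would need a fuller XPath engine; the sample `PAGE` below is hypothetical and assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: three links inside p/table at different
# depths, and one footer link outside any table.
PAGE = """<html><body>
<p><table><tr><td><a href="/direct">x</a></td></tr></table></p>
<p><div><table><tr><td><a href="/one-extra">x</a></td></tr></table></div></p>
<p><div><span><span><table><tr><td><b><a href="/four-extra">x</a></b></td></tr></table></span></span></div></p>
<div><a href="/footer">x</a></div>
</body></html>"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, hence the leading ".".
links = root.findall(".//p//table//a")
hrefs = [a.get("href") for a in links]
print(hrefs)
# ['/direct', '/one-extra', '/four-extra']
```

The unbounded form accepts every depth, including the variant with four intermediate elements, which is why the bounded expression adds the ancestor checks.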
