Question

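I want to find the XPath of the nodes that form a text string in an HTML document. The language is Python (with lxml used to parse the document).

To illustrate the idea, consider the following document: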

<HTML>

<HEAD>
  <TITLE>sample document</TITLE>
</HEAD>

<BODY BGCOLOR="FFFFFF">
  <HR>
  <a href="http://google.com">Goog</a>
  <H1>This is one header</H1>
  <H2>This is a another Header</H2>
  <P>Travel from
    <P>
      <B>SFO to JFK</B>
      <BR>
      <B><I>on May 2, 2015 at 2:00 pm. For details go to confirm.com </I></B>
      <HR>
      <div style="color:#0000FF">
        <h3>Traveler <b> name </b> is
        <p> John Doe </p>
      </div>
.....

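Now, given the strings "SFO to JFK on May 2, 2015" and "Traveler name is John Doe", how can I get the XPath of the first node in the set of nodes that form each string? (The first node is enough if the whole node set is hard to obtain.)

Sample output: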

"SFO to JFK on May 2, 2015" -> /html/body/p/p/b
"Traveler name is John Doe" -> /html/body/p/p/div/h3

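As a follow-up: if we had a regular expression instead of the literal strings above, how would the problem be solved?

Note: as far as the Python implementation goes, I am approaching the problem as shown in the snippet below.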

import lxml.html as lh
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
from lxml import etree

elem_tree = lh.parse(StringIO(html_document))
xpath = etree.XPath(_the_xpath_here)
list_of_nodes = xpath(elem_tree)

Answer 1

You can try this approach:

import lxml.html as lh
from lxml import etree

# Parse the sample document (saved locally as Q12.html)
elem_tree = lh.parse("Q12.html")
input_string = ["SFO to JFK on May 2, 2015", "Traveler name is John Doe"]

# Innermost element whose normalized text contains the string, followed by its child elements
xpath = "//*[contains(normalize-space(.), '{0}') and not(.//*[contains(normalize-space(.), '{0}')])]/*"

for i in input_string:
    # Take the first matching child element
    node = elem_tree.xpath(xpath.format(i))[0]

    print('{0} -> {1}'.format(i, elem_tree.getpath(node)))

    #Output:
    #SFO to JFK on May 2, 2015 -> /html/body/p[2]/b[1]
    #Traveler name is John Doe -> /html/body/div/h3

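Brief explanation:

contains(normalize-space(.), '{0}'): filters for nodes that contain the text (one of the entries in input_string).
not(.//*[contains(normalize-space(.), '{0}')]): selects the node only if none of its descendants contains the text. In other words, it selects the innermost node that contains the text.
getpath(): "Returns a structural, absolute XPath expression to find the element."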
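
To see the two predicates working together, here is a minimal self-contained sketch. The inline HTML and the helper name innermost_path are made up for illustration and are not part of the original answer. It also passes the search text as an XPath variable ($text) rather than formatting it into the query, a substitution made here so that quotes in the text cannot break the expression.

```python
import lxml.html as lh

# Hypothetical sample markup, not the document from the question
html_document = """
<html><body>
  <div><p>Flight <b>SFO to JFK</b> departs soon</p></div>
  <div><h3>Traveler <b>name</b> is</h3></div>
</body></html>
"""

# Parse into an element tree so that getpath() is available
tree = lh.document_fromstring(html_document).getroottree()

# Elements whose normalized text contains $text while no descendant's does
QUERY = (
    "//*[contains(normalize-space(.), $text)"
    " and not(.//*[contains(normalize-space(.), $text)])]"
)

def innermost_path(tree, text):
    """Return the path of the first innermost element containing text, or None."""
    nodes = tree.xpath(QUERY, text=text)
    return tree.getpath(nodes[0]) if nodes else None

print(innermost_path(tree, "SFO to JFK"))
print(innermost_path(tree, "Traveler name is"))
```

The first string lives entirely inside one b element, so that element is returned. The second is spread across an h3 and its child b, so the h3 is the innermost element whose full text contains it.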

Update:

Replace the trailing /* in the xpath string with:

/descendant-or-self::*[contains('{0}', text()) or contains(text(), '{0}')]

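This works for both the posted HTML structure and the one you linked in the comment below. However, solving the general case, for HTML with characteristics different from those the sample exhibits, is beyond the scope of the XPath query in this answer.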
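
For the regular-expression follow-up in the question, one possible direction is the EXSLT regular-expressions extension that lxml exposes in XPath, swapping contains() for re:test(). The sketch below is an assumption about how that could be applied, not something taken from the original answer. The sample markup, the pattern, and the helper name regex_path are hypothetical.

```python
import lxml.html as lh
from lxml import etree

# Hypothetical sample markup, not the document from the question
html_document = """
<html><body>
  <p><b>SFO to JFK </b><i>on May 2, 2015 at 2:00 pm</i></p>
  <p>Unrelated paragraph</p>
</body></html>
"""

tree = lh.document_fromstring(html_document).getroottree()

# Same "innermost match" idea as before, with re:test() in place of contains()
find_innermost = etree.XPath(
    "//*[re:test(normalize-space(.), $pattern)"
    " and not(.//*[re:test(normalize-space(.), $pattern)])]",
    namespaces={"re": "http://exslt.org/regular-expressions"},
)

def regex_path(tree, pattern):
    """Return the path of the first innermost element matching pattern, or None."""
    nodes = find_innermost(tree, pattern=pattern)
    return tree.getpath(nodes[0]) if nodes else None

print(regex_path(tree, r"SFO to JFK on May \d{1,2}, \d{4}"))
```

Neither the b nor the i element matches the pattern on its own, so the query settles on their parent p, the smallest element whose combined text matches.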
