Question

I'm trying to parse a website to extract people's names and countries.

Sometimes the page looks like this:

<th>Inventors:</th>
    <td align="left" width="90%">
            <b>Harvey; John Christopher</b> (New York, NY)<b>, Cuddihy; James William</b> (New York, NY)
    </td>

I can get the country with:

//th[contains(text(), "Inventors:")]/following-sibling::td/b[contains(text(),";")]/following-sibling::text()

[(New York, NY), (New York, NY)]

Sometimes the page looks like this instead (here the country name is wrapped in its own b element):

<th>Inventors:</th>
    <td align="left" width="90%">
        <b>Harvey; John Christopher</b> (New York, <b>NY</b>)<b>, Cuddihy; James William</b> (New York, <b>NY</b>)
    </td>

In that case I can get the country with:

//th[contains(text(), "Inventors:")]/following-sibling::td/b[contains(text(),";")]/following-sibling::b

[NY, NY]

Now I'd like a single expression that gets the countries in both cases.

I tried:

//th[contains(text(), "Inventors:")]/following-sibling::td/b[contains(text(),";")]/following-sibling::*[self::text() or self::b]

but then I only get the b elements.

I also tried:

//.../following-sibling::text() | //.../following-sibling::b

but again I only get the b elements.

Any idea why this doesn't work as expected? Is there a way to get both kinds of entries?

Answer 1

You can use

string(//th[.="Inventors:"]/following-sibling::td)

This selects the string

Harvey; John Christopher (New York, NY), Cuddihy; James William (New York, NY)

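in both cases. You can then pull out the parts you need with the XPath 2.0 string/regex functions or, if only XPath 1.0 is available, with the string tools of the calling language.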
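
As a rough sketch of the "calling language" route, here is one way it could look in Python using only the standard library. Some assumptions: the fragment is well-formed enough to parse as XML, the td directly follows the th, and the sample strings `PLAIN` and `BOLD` plus the helper name `inventor_locations` are made up for illustration. ElementTree has no `string()` function, so `itertext()` stands in for taking the string value of the td.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample fragments mirroring the two page layouts in the question.
PLAIN = (
    '<th>Inventors:</th><td align="left" width="90%">'
    '<b>Harvey; John Christopher</b> (New York, NY)'
    '<b>, Cuddihy; James William</b> (New York, NY)</td>'
)
BOLD = (
    '<th>Inventors:</th><td align="left" width="90%">'
    '<b>Harvey; John Christopher</b> (New York, <b>NY</b>)'
    '<b>, Cuddihy; James William</b> (New York, <b>NY</b>)</td>'
)


def inventor_locations(fragment: str) -> list[str]:
    """Return the text inside each pair of parentheses in the Inventors cell."""
    # Wrap the fragment in a single root so it parses as XML.
    root = ET.fromstring(f"<tr>{fragment}</tr>")
    cells = list(root)
    # Look for a th reading "Inventors:" immediately followed by a td.
    for th, td in zip(cells, cells[1:]):
        if th.tag == "th" and (th.text or "").strip() == "Inventors:" and td.tag == "td":
            # itertext() concatenates all descendant text, so nested <b> tags vanish.
            text = "".join(td.itertext())
            return [m.strip() for m in re.findall(r"\(([^)]*)\)", text)]
    return []


print(inventor_locations(PLAIN))
print(inventor_locations(BOLD))
```

Because the markup is flattened before the regex runs, both layouts go through the same code path. If only the last part is wanted, splitting each result on the final comma would be one way to get it.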

Answer 2

You can also try it this way:

//th[contains(text(), "Inventors:")]
    /following-sibling::td/b[contains(text(),";")]
    /following-sibling::node()[not(self::b[contains(text(),";")])]

This selects all following sibling nodes, text nodes and elements alike, but skips the b elements that contain ";".
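
Note that `node()` matches text nodes as well as elements, whereas `*` matches elements only, which is likely why the `*[self::text() or self::b]` attempt in the question returned only b elements. To see what the expression above picks up, here is a small Python sketch that mimics it with a plain DOM walk. It is not an XPath engine: `xml.dom.minidom` has no XPath support, so the sibling traversal is written out by hand, and the sample strings and helper names are invented for illustration.

```python
from xml.dom import minidom

# Hypothetical sample fragments mirroring the two page layouts in the question.
PLAIN = (
    '<th>Inventors:</th><td align="left" width="90%">'
    '<b>Harvey; John Christopher</b> (New York, NY)'
    '<b>, Cuddihy; James William</b> (New York, NY)</td>'
)
BOLD = (
    '<th>Inventors:</th><td align="left" width="90%">'
    '<b>Harvey; John Christopher</b> (New York, <b>NY</b>)'
    '<b>, Cuddihy; James William</b> (New York, <b>NY</b>)</td>'
)


def first_text(node):
    """Data of the first text-node child, or '' (roughly what text() gives contains())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            return child.data
    return ""


def is_name_b(node):
    """True for a b element whose first text child contains ';'."""
    return (node.nodeType == node.ELEMENT_NODE
            and node.tagName == "b"
            and ";" in first_text(node))


def string_value(node):
    """Concatenated text of a text node or of an element's descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(c) for c in node.childNodes)


def following_non_name_nodes(fragment):
    """Text of every td child that follows a name b and is not itself a name b."""
    doc = minidom.parseString(f"<tr>{fragment}</tr>")
    pieces = []
    for th in doc.getElementsByTagName("th"):
        if "Inventors:" not in first_text(th):
            continue
        sibling = th.nextSibling
        while sibling is not None:  # following-sibling::td
            if sibling.nodeType == sibling.ELEMENT_NODE and sibling.tagName == "td":
                seen_name = False
                for node in sibling.childNodes:
                    if is_name_b(node):
                        seen_name = True
                    elif seen_name:
                        pieces.append(string_value(node))
            sibling = sibling.nextSibling
    return pieces


print(following_non_name_nodes(PLAIN))
print(following_non_name_nodes(BOLD))
```

In the second layout each location arrives split into several pieces (the text before the inner b, the b itself, and the closing parenthesis), so the results still need to be joined before use.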

How to get both following-sibling::text() and following-sibling::b?
