Question

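I've run into a strange problem with Nokogiri and XPath. I want to parse an HTML document and find all links by their href value and the anchor text they contain.

This is my XPath so far: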

    xpath = "//a[contains(text(), #{link['anchor_text']}) and @href='#{link['target_url']}']"
    a = doc.search(xpath)

This works fine as long as link['anchor_text'] is a string without numbers.

If I try to get the link with the anchor text "11example", the following error is thrown:

    Invalid expression: //a[contains(text(), 11example) and @href='http://www.example.com/']

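Maybe this is just a silly mistake, but I don't understand why this error occurs. If I put quotes around #{link['anchor_text']} in the XPath, nothing works at all.

Edit: here is a sample of the HTML: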

    <!DOCTYPE html>
    <head>
      <title>Example.com</title>
    </head>
    <body>
    <p>
    <strong>Here is some text</strong><br />
    <a href="example.com" target="_blank">11example</a>Some text here and there
    </p>
    <p>
    <strong>Another text</strong><br />
    <a href="example.com/test" target="_blank">example.com</a>Some text here and there
    </p>
    </body>

Edit 2: If I run these queries manually in the irb console, everything works as expected, but only if I put the text in quotes.

Thanks in advance!

Kind regards, madhippie

Answer 1

The short answer is that you are missing quotes around #{link['anchor_text']}, like the ones you already have around #{link['target_url']}. The full XPath should be

    xpath = "//a[contains(text(), '#{link['anchor_text']}') and @href='#{link['target_url']}']"
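One caveat with interpolating quoted values: XPath 1.0 string literals have no escape syntax, so anchor text that itself contains a quote character would break the expression again. The following is a minimal sketch of one way to guard against that. The helper names xpath_literal and link_xpath are hypothetical, not part of Nokogiri.

```ruby
# Hypothetical helper: turn a Ruby string into an XPath 1.0 string literal.
def xpath_literal(value)
  str = value.to_s
  # No single quotes inside: wrap in single quotes.
  return "'#{str}'" unless str.include?("'")
  # Single quotes but no double quotes: wrap in double quotes.
  return "\"#{str}\"" unless str.include?('"')

  # Both kinds present: split on single quotes and glue with concat().
  parts = str.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

# Hypothetical helper: build the link query from the answer above.
def link_xpath(anchor_text, target_url)
  "//a[contains(text(), #{xpath_literal(anchor_text)}) " \
    "and @href=#{xpath_literal(target_url)}]"
end

puts link_xpath('11example', 'http://www.example.com/')
```

The resulting string can be passed to doc.search just like the hand-written expression.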

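The reason it seems to work (or at least does not raise an error) when the text does not start with a number is that the unquoted string is interpreted as a node query. For example, with the anchor text example.com, Nokogiri looks for an <example.com> element inside the <a> element, converts it to a string, and checks whether the text node of the <a> element contains that string. If no such element exists (as in this case), it converts to the empty string, and every string contains the empty string, so the result of contains is always true.

As a demonstration, take this HTML: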

    <a href="example.com"><q>foo</q>example</a>
    <a href="example.com"><q>foo</q>foo</a>
    <a href="example.com">foo</a>

Then the query

    doc.search("//a[contains(text(), q)]")

does not match the first <a> tag, but does match the second and third.

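When the string starts with a number, it cannot be parsed as a node query, because names that start with a digit are not valid XML (or HTML) element names, so you get an error instead.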

XPath with contains() throws an error if the string starts with a number
