Question

I am trying to retrieve data from Amazon. Here is the URL:

http://www.amazon.com/Logitech-Wireless-Marathon-3-year-Battery/product-reviews/B003TG75EG/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending

This is a product review page. I found that the data sits between these two tags, as shown below:

<div style="margin-bottom:0.5em;">
395 of 405 people found the following review helpful
</div>

The problem is that other information also appears between these two tags. Does anyone have a good idea for retrieving this data?

Thanks.

Answer 1

Your question is unclear, but I am guessing you actually want to get back 395 rather than the full text.

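You can return the element like this (I think this is the better solution, because tag and class names can change easily, whereas the ID revMHRL will probably stay):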

//div[@id = "revMHRL"]/div/div/span[contains(@class, "a-size-small")][contains(@class, "a-color-secondary")]

To extract the number, you can use:

tokenize(normalize-space(//div[@id = "revMHRL"]/div/div/span[contains(@class, "a-size-small")][contains(@class, "a-color-secondary")]/text()), "\s+")[1]

This first removes leading and trailing whitespace, then tokenizes the string on whitespace and returns only the first token.
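
If you do not have an XPath 2.0 engine at hand, the two string steps are easy to reproduce in the host language. Here is a minimal Python sketch of what normalize-space and tokenize(..., "\s+")[1] do to the text; the function name first_token and the sample string are my own assumptions for illustration, not something from the page.

```python
import re

def first_token(text):
    """Mimic tokenize(normalize-space(text), "\\s+")[1] from XPath 2.0."""
    # normalize-space: trim both ends and collapse whitespace runs to one space
    normalized = " ".join(text.split())
    # tokenize on whitespace; XPath sequences are 1-based, Python lists 0-based
    tokens = re.split(r"\s+", normalized)
    return tokens[0]

# Hypothetical text node, padded with whitespace the way scraped HTML usually is
sample = "\n   395 of 405 people found the following review helpful\n"
print(first_token(sample))
```

The result is still a string, so wrap it in int() if you need a number.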

Answer 2

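I assume you want to extract the value from the first review. I also assume you only have XPath 1.0 functions rather than XPath 2.0, so the tokenize function is not available.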

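First, the expressions suggested so far depend too heavily on the page structure, which Amazon changes often. That means they could well fail within a few days. A better expression for selecting the required node is: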

//*[@id='revMH']/h3/following::node()[contains(text(),'people found the following review helpful')][1]

because Amazon is much less likely to change the text that is displayed to users.
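
The same idea (match on the visible text, not on the markup around it) can be sketched outside XPath as well. The snippet below uses only Python's standard html.parser; the class name HelpfulTextFinder and the HTML fragment are made up for illustration, and unlike the expression above it does not check that the text comes after the h3 inside revMH.

```python
from html.parser import HTMLParser

MARKER = "people found the following review helpful"

class HelpfulTextFinder(HTMLParser):
    """Collect text nodes that contain MARKER, in document order."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_data(self, data):
        if MARKER in data:
            # Collapse whitespace, similar to normalize-space()
            self.matches.append(" ".join(data.split()))

# Hypothetical fragment; the real page will look different
html = """
<div id="revMH"><h3>heading</h3>
  <div style="margin-bottom:0.5em;">
    395 of 405 people found the following review helpful
  </div>
</div>
"""

finder = HelpfulTextFinder()
finder.feed(html)
finder.close()
# Equivalent of the trailing [1] in the XPath: keep only the first match
first_match = finder.matches[0] if finder.matches else None
print(first_match)
```

Because nothing here refers to tag names or classes, a layout change alone should not break it; a change to the wording would.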

Once you have that, to extract 395 you can use:

substring-before(//*[@id='revMH']/h3/following::node()[contains(text(),'people found the following review helpful')][1]," of")

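If you want 395 of 405, just use substring-before(....., ' people') and then split the two numbers in your host language. You can even use translate to get text such as 395/405. Note that translate works character by character, so apply it to the part before ' people' and let it drop the f and the spaces:

translate(substring-before(normalize-space(//div[@id = "revMHRL"]/div/div/span[contains(@class, "a-size-small")][contains(@class, "a-color-secondary")]/text()), " people"), "of ", "/")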
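
For the "split it in your host language" part, here is a rough Python sketch. substring_before and xpath_translate are hypothetical helpers I wrote to imitate the XPath 1.0 functions of the same name (they are not library calls), and the input string is assumed to be the already normalized text node. Since translate maps single characters, the sketch applies it only to the "395 of 405" prefix, not to the whole sentence.

```python
import re

def substring_before(text, marker):
    """Like XPath substring-before(): empty string when marker is absent."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""

def xpath_translate(text, src, dst):
    """Like XPath 1.0 translate(): src[i] becomes dst[i]; src chars without a partner are dropped."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) in table:
            continue  # the first occurrence of a character in src wins
        table[ord(ch)] = dst[i] if i < len(dst) else None
    return text.translate(table)

text = "395 of 405 people found the following review helpful"

first_number = substring_before(text, " of")
counts = substring_before(text, " people")
# Split the two numbers in the host language
helpful, total = map(int, re.findall(r"\d+", counts))
# 'o' maps to '/', while 'f' and ' ' have no partner and are removed
ratio = xpath_translate(counts, "of ", "/")
print(first_number, helpful, total, ratio)
```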

Answer 3

Try this XPath:

//div[@class='a-section']/div[@class='a-row a-spacing-micro']/span[@class='a-size-small a-color-secondary']/text()
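
A quick way to try an expression of this shape is Python's standard xml.etree.ElementTree, which understands exact attribute predicates like the ones above. This is only a sketch under assumptions: the fragment is invented to fit the class names in the expression, ElementTree needs well-formed XML (a real Amazon page would need a proper HTML parser), and it has no text() step, so the text is read from the .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped to fit the expression above
fragment = """
<html><body>
  <div class="a-section">
    <div class="a-row a-spacing-micro">
      <span class="a-size-small a-color-secondary">395 of 405 people found the following review helpful</span>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(fragment)
# Same steps as the XPath, minus /text(), which ElementTree does not support
path = (".//div[@class='a-section']"
        "/div[@class='a-row a-spacing-micro']"
        "/span[@class='a-size-small a-color-secondary']")
texts = [span.text.strip() for span in root.findall(path)]
print(texts)
```

Keep in mind that @class='...' is an exact string comparison, so the match fails as soon as a class is added, removed, or reordered; contains(@class, ...) as in Answer 1 is more tolerant.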

How to retrieve the "how many people found this review helpful" data from Amazon?

3 answers: