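When scraping links in R with rvest or RSelenium, you can select them by specifying part of the HTML, for example the href within a given node. What if I come across the following link: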
<a href="www.website.com" data-tracking="click_body" data-tracking-
data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo" >
If I don't want the promo links, I would use the following snippet (with the XML and httr packages):
library(XML)
library(httr)
response <- GET(yourLink)
parsedoc <- htmlParse(response)
xpathSApply(parsedoc, "//a[@data-featured-name='listing_no_promo']",
xmlGetAttr, "href")
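What should the XPath be if I want the links whose attribute ends with the 'photo' part:
data-tracking- data='{"touch_point_button":"photo"}'
regardless of whether the link is a promo or not? My guess is that the curly braces are causing some trouble here.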
Answer 0 (score: 0)
I'm assuming your example link structure is actually as follows (where data-tracking-data is the actual attribute):
<a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-name="listing_no_promo">link</a>
Since I don't know which website you are working with, I recreated an HTML document by adding your link to the body of this page:
# I'm going to use the jsonlite, xml2 and magrittr packages
library(jsonlite)
library(xml2)
library(magrittr) # provides the %>% pipe used below
# This page
stack_url <- "https://stackoverflow.com/questions/40934644/xpath-for-element-whose-attribute-value-ends-with-a-specific-string"
# Your html element example
test_a <- '<a href="www.website.com" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo" >link</a>'
# read in stackoverflow page
raw_page <- read_html(stack_url)
# read in the element a
raw_a <- read_html(test_a)
# add the link element from example to raw_page
xml_add_child(raw_page, raw_a)
# This just shows that the tag you provided sits among multiple link elements, as I assume it would in your actual use
xml_find_all(raw_page,".//a") %>% tail()
{xml_nodeset (6)}
[1] <a href="https://www.facebook.com/officialstackoverflow/" class="-link">Facebook</a>
[2] <a href="https://twitter.com/stackoverflow" class="-link">Twitter</a>
[3] <a href="https://linkedin.com/company/stack-overflow" class="-link">LinkedIn</a>
[4] <a href="https://creativecommons.org/licenses/by-sa/3.0/" rel="license">cc by-sa 3.0</a>
[5] <a href="https://stackoverflow.blog/2009/06/25/attribution-required/" rel="license">attribution required</a>
[6] <a href="www.website.com" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-f ...
So our xml_document is now stored in raw_page, and we will use XPath to find what we want:
.//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]
# Our xpath pattern reads as:
#
# - .//a[ -> find all 'a' html elements where
# - attribute::*[contains(.,'{') or contains(.,'photo')] -> any(*) attribute containing either a '{' OR the string 'photo'
# - and @data-tracking -> and the element must have the attribute data-tracking, but it doesn't matter what the value is
# - ] -> end
In short: find all a elements that have a data-tracking attribute and have any attribute containing the word photo or the character {.
This results in:
our_xpath <- ".//a[attribute::*[contains(.,'{') or contains(.,'photo')] and @data-tracking]"
# Extract all of the matching elements using our xpath
# Get all the attribute values for data-tracking-data
# Parse from JSON
xml_find_all(raw_page,our_xpath) %>% xml_attr("data-tracking-data") %>% fromJSON()
I can't test this against your web page, but if you post the URL I'd be happy to make sure it works.
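If the page contains several matching links, calling fromJSON() on the whole attribute vector is likely to fail, since it expects a single JSON text. Here is a minimal sketch of one alternative: select the links by attribute, parse each value on its own, and filter in R. The markup, the /a, /b and /c paths and the variable names are made up for illustration, and the sketch assumes every data-tracking-data value is valid JSON with a touch_point_button field.

```r
library(xml2)
library(jsonlite)

# Hypothetical markup: three links, two of which carry the "photo" button value
html <- '<html><body>
<a href="www.website.com/a" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_no_promo">a</a>
<a href="www.website.com/b" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"title"}\'>b</a>
<a href="www.website.com/c" data-tracking="click_body" data-tracking-data=\'{"touch_point_button":"photo"}\' data-featured-name="listing_promo">c</a>
</body></html>'

doc <- read_html(html)

# Select every link that has the attribute at all; no brace matching is needed
links <- xml_find_all(doc, ".//a[@data-tracking-data]")

# Parse each attribute value separately, one JSON text per call
buttons <- vapply(
  xml_attr(links, "data-tracking-data"),
  function(x) fromJSON(x)$touch_point_button,
  character(1),
  USE.NAMES = FALSE
)

# Keep only the links whose parsed button value is "photo", promo or not
photo_hrefs <- xml_attr(links[buttons == "photo"], "href")
print(photo_hrefs)
```

Filtering on the parsed value avoids depending on how the JSON text happens to be laid out, at the cost of one fromJSON() call per link.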
Answer 1 (score: 0)
//*[ends-with(@data-tracking-data, '"photo"}')]/@href
In your example, this selects the href if data-tracking-data ends with the string "photo"}.
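One caveat: ends-with() is an XPath 2.0 function, while the xml2 and XML packages both rely on libxml2, which implements XPath 1.0, so the expression above will probably not evaluate there as written. A rough sketch of the usual XPath 1.0 workaround, which compares the tail of the attribute with substring() and string-length(), is below. The markup and variable names are hypothetical.

```r
library(xml2)

# Hypothetical markup: only the first link's attribute ends with "photo"}
html <- '<html><body>
<a href="www.website.com/a" data-tracking-data=\'{"touch_point_button":"photo"}\'>a</a>
<a href="www.website.com/b" data-tracking-data=\'{"touch_point_button":"photo","x":"1"}\'>b</a>
</body></html>'
doc <- read_html(html)

suffix <- '"photo"}'

# XPath 1.0 stand-in for ends-with(): take the last nchar(suffix) characters
# of the attribute and compare them with the suffix
xpath <- sprintf(
  ".//a[substring(@data-tracking-data, string-length(@data-tracking-data) - %d) = '%s']",
  nchar(suffix) - 1, suffix
)

hrefs <- xml_attr(xml_find_all(doc, xpath), "href")
print(hrefs)
```

The curly brace needs no escaping inside the XPath string literal; only the quote characters have to be balanced, which is why the suffix is wrapped in single quotes here.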