I want to get the addresses of all the links found on this website:
http://www.marktplaats.nl/z/bureaustoel.html?query=bureaustoel&currentPage=4
So I wrote the following code:
library(rvest)

create_links <- function(keyword, distance) {
  data.frame <- data.frame(character(), character(), stringsAsFactors = FALSE)
  postcode <- c(3511, 4000, 5000)
  var_website1 <- "http://www.marktplaats.nl/z.html?query="
  var_website2 <- "&postcode="
  var_website3 <- "&distance="
  for (i in 1:length(postcode)) {
    website <- paste0(var_website1, keyword, var_website2, postcode[i], var_website3, distance)
    html <- read_html(website)
    number <- html_nodes(html, "span data-url")
    print(number)
  }
}
However, the variable number does not contain the links. Printing it gives:
{xml_nodeset (0)}
If I view the page source of the website mentioned above, I see this (around line 2125):
....
<figure class="cell column-thumb ">
<span data-url="http://www.marktplaats.nl/a/huis-en-inrichting/bureaus-en-bureaustoelen/a1019445395-grote-voorraad-vitra-ea217-219-eames-bureaustoelen-tip.html?c=d97e27c274e75147b4afd0f5eb58c81b&previousPage=lr" class="thumb-placeholder-centered juiceless-link " title="Grote voorraad Vitra EA217/219 Eames bureaustoelen (TIP!)" data-cas-tracking="EjuiBbJSjW9sVgjIAyeExca2zcNmdAbNqxnHL3TgJyETa5q3TEDHkbBhIP9knzWvCulPWdoWiqXvcyfnSlrtRj2yyEXBtTYpthUSEEyz3_jWR5WtL1zsTojR6ptN1zVrghXZKtNjPuwefDWPO4kPTTU8raZkSZpQ0Az18CMyPs8bHLPPXWngYk4RFiRQKTi8nKsyBIq4dRTk1FDIm-rscC8MiYUK0WcmnCnF-fJEFbvwmTiI2VUwvg-VySDb9F48wEc9WcVLaD0amDazxXTK1TkM0T5jDK1oVnlC7t0fcm0xCiqHrJZCW5aIDq-RYxLgYl32mIz4pyskjhD2WnOXciaT5tAE_e61pRAWvUXMBEn4WRpS2aSdTfa9oNaPuF8W2j00kRrrIEPF-2miQv3JQATPxT4WpLurqOoXzNAfNccZHOJ6cKpgy5s7xn7AnylRm8PIh-GCos_L8FlxOIIC6BeYTveRK4M7ua3HwJCJXiJDwZ_uxvWSsOj7VWpRoLn-NFci_L2i_PbyQQbQP3bT0iMdxqoO2hV6OA3sa5rl4PyC5X92F70hiIvuUlE_nxkx_p0kq6hJCqJ54lfitU5ObgwqeO7U1mQloh8e_wFzlqC1pWuEFtnNa6t4H_aIz-HOqlHjcsAhWTqN8_zhG7MnMEr3h50Wg1a1kHgr6Sw6ckr6VQO1j2pvjZKD9KS7Hjy_v-gZrh8ggJ9qwdORv1OlUdQasEAniKExm4pCY2zKdTXB0Rqw7u_MVxp4FMr-W7UdBklWFpHQM8-vMaGYkGXrhKbRYTHXIHACby9fSca0xo_ixHOL77hRJj_SU-eKxswvvhfEgwH6g-iQlcb8mRqMs5W6CKhrPGWXMNHODgila1MCRobPPJNPrQ">
<div class="listing-image">
Any ideas on how to retrieve the links?
Answer 0 (score: 0)
You can do it like this, although you may need to filter out the advertisements:
library(rvest)
library(magrittr)
data.frame <- data.frame(character(), character(), stringsAsFactors = FALSE)
postcode <- c(3511, 4000, 5000)
keyword <- "bureaustoel"
distance <- 3000
var_website1 <- "http://www.marktplaats.nl/z.html?query="
var_website2 <- "&postcode="
var_website3 <- "&distance="
for (i in seq_along(postcode)) {
  # build the search URL for this postcode
  website <- paste0(var_website1, keyword, var_website2, postcode[i], var_website3, distance)
  cont <- website %>% read_html()
  # select the section that holds the search results
  cont <- cont %>% html_nodes(xpath = "/html/body/div[2]/div/div[4]/div[2]/section[2]")
  # count the entries in that section (this may include advertisements)
  number <- cont %>% html_children %>% length
  print(number)
  # with postcode = 3511, you get:
  # [1] 41
}
The tricky part is finding the xpath, which you can look up with Firebug in Firefox or with the developer tools in Chrome.
You can find another example of how to use rvest in this blog post: https://datashenanigan.wordpress.com/2015/04/30/using-rvest-and-dplyr-to-look-at-aviation-incidents/
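To get the link addresses themselves rather than a count, one option is to select the spans by attribute and read that attribute's value. The sketch below is only an illustration: it parses a small inline HTML string modeled on the snippet in the question instead of downloading the live page, and the two example URLs are made up. The selector "span data-url" looks for a <data-url> element nested inside a <span>, which is why it matches nothing, while "span[data-url]" matches spans that carry the attribute.

```r
library(rvest)

# Inline HTML standing in for the downloaded page.
# Assumption: the structure mirrors the snippet in the question; the URLs are hypothetical.
page <- read_html('<html><body>
  <figure class="cell column-thumb">
    <span data-url="http://www.marktplaats.nl/a/example-1.html" class="thumb-placeholder-centered juiceless-link" title="Chair 1"></span>
  </figure>
  <figure class="cell column-thumb">
    <span data-url="http://www.marktplaats.nl/a/example-2.html" class="thumb-placeholder-centered juiceless-link" title="Chair 2"></span>
  </figure>
  <span class="no-link">not a listing</span>
</body></html>')

# The selector from the question: a descendant element named data-url, so no match.
empty <- html_nodes(page, "span data-url")

# Attribute selector: every <span> that has a data-url attribute.
spans <- html_nodes(page, "span[data-url]")

# Pull the attribute value out of each matched node.
links <- html_attr(spans, "data-url")

# The same selection written as XPath.
links_xpath <- html_attr(html_nodes(page, xpath = "//span[@data-url]"), "data-url")

print(links)
```

On the real site the same two calls would be applied to the result of read_html(website) inside the loop. Whether the live markup still uses data-url on these spans is an assumption based on the source excerpt in the question.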