Using rvest

Date: 2015-11-12 13:16:37

Tags: r screen-scraping

I would like to get the addresses of all the links found on this website:

http://www.marktplaats.nl/z/bureaustoel.html?query=bureaustoel&currentPage=4

So I wrote the following code:

create_links <- function(keyword, distance) {

  data.frame <- data.frame(character(), character(), stringsAsFactors=F)
  postcode <- c(3511, 4000, 5000)

  var_website1 <- "http://www.marktplaats.nl/z.html?query="
  var_website2 <- "&postcode="
  var_website3 <- "&distance="

  for (i in 1:length(postcode)) {

    website <- paste0(var_website1, keyword, var_website2, postcode[i], var_website3, distance)

    html <- read_html(website)
    number <- html_nodes(html, "span data-url")
    print(number)
  }
}

However, the variable number does not contain the links. Printing it returns:

{xml_nodeset (0)}

If I view the page source of the website mentioned above, I see this (around line 2125):

....
<figure class="cell column-thumb ">
            <span data-url="http://www.marktplaats.nl/a/huis-en-inrichting/bureaus-en-bureaustoelen/a1019445395-grote-voorraad-vitra-ea217-219-eames-bureaustoelen-tip.html?c=d97e27c274e75147b4afd0f5eb58c81b&previousPage=lr" class="thumb-placeholder-centered juiceless-link " title="Grote voorraad Vitra EA217/219 Eames bureaustoelen (TIP!)" data-cas-tracking="EjuiBbJSjW9sVgjIAyeExca2zcNmdAbNqxnHL3TgJyETa5q3TEDHkbBhIP9knzWvCulPWdoWiqXvcyfnSlrtRj2yyEXBtTYpthUSEEyz3_jWR5WtL1zsTojR6ptN1zVrghXZKtNjPuwefDWPO4kPTTU8raZkSZpQ0Az18CMyPs8bHLPPXWngYk4RFiRQKTi8nKsyBIq4dRTk1FDIm-rscC8MiYUK0WcmnCnF-fJEFbvwmTiI2VUwvg-VySDb9F48wEc9WcVLaD0amDazxXTK1TkM0T5jDK1oVnlC7t0fcm0xCiqHrJZCW5aIDq-RYxLgYl32mIz4pyskjhD2WnOXciaT5tAE_e61pRAWvUXMBEn4WRpS2aSdTfa9oNaPuF8W2j00kRrrIEPF-2miQv3JQATPxT4WpLurqOoXzNAfNccZHOJ6cKpgy5s7xn7AnylRm8PIh-GCos_L8FlxOIIC6BeYTveRK4M7ua3HwJCJXiJDwZ_uxvWSsOj7VWpRoLn-NFci_L2i_PbyQQbQP3bT0iMdxqoO2hV6OA3sa5rl4PyC5X92F70hiIvuUlE_nxkx_p0kq6hJCqJ54lfitU5ObgwqeO7U1mQloh8e_wFzlqC1pWuEFtnNa6t4H_aIz-HOqlHjcsAhWTqN8_zhG7MnMEr3h50Wg1a1kHgr6Sw6ckr6VQO1j2pvjZKD9KS7Hjy_v-gZrh8ggJ9qwdORv1OlUdQasEAniKExm4pCY2zKdTXB0Rqw7u_MVxp4FMr-W7UdBklWFpHQM8-vMaGYkGXrhKbRYTHXIHACby9fSca0xo_ixHOL77hRJj_SU-eKxswvvhfEgwH6g-iQlcb8mRqMs5W6CKhrPGWXMNHODgila1MCRobPPJNPrQ">
                <div class="listing-image">

Any ideas on how to retrieve the links?

1 answer:

Answer 0 (score: 0)

You could do it like this; however, you may need to filter out the ads:

library(rvest)
library(magrittr)

data.frame <- data.frame(character(), character(), stringsAsFactors=F)
postcode <- c(3511, 4000, 5000)
keyword <- "bureaustoel"
distance <- 3000

var_website1 <- "http://www.marktplaats.nl/z.html?query="
var_website2 <- "&postcode="
var_website3 <- "&distance="

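for (i in seq_along(postcode)) {

  # Build the search URL for this postcode
  website <- paste0(var_website1, keyword, var_website2, postcode[i], var_website3, distance)

  cont <- website %>% read_html()

  # Select the section of the page addressed by this absolute xpath
  cont <- cont %>% html_nodes(xpath = "/html/body/div[2]/div/div[4]/div[2]/section[2]")

  # Count the children of that section (listings, possibly including ads)
  number <- cont %>% html_children %>% length
  print(number)
  # with postcode = 3511, you get:
  # [1] 41
}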

The tricky part is finding the xpath; you can look it up with Firebug in Firefox or with the developer tools in Chrome.
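If a position-based xpath turns out to be too brittle, selecting on the data-url attribute itself may work as well. The sketch below is only an illustration and has not been run against the live site: the inline HTML imitates the structure of the snippet in the question, and the example.com URLs are made-up placeholders. It also shows why the selector "span data-url" from the question returns an empty node set: CSS reads it as an element named data-url nested inside a span, whereas an attribute is selected with square brackets.

```r
library(rvest)

# Inline HTML standing in for the real page. The structure mirrors the
# snippet in the question; the URLs are hypothetical placeholders.
page <- read_html('<html><body>
  <figure class="cell column-thumb">
    <span data-url="http://www.example.com/a/listing-1.html" class="thumb-placeholder-centered" title="Chair 1"></span>
  </figure>
  <figure class="cell column-thumb">
    <span data-url="http://www.example.com/a/listing-2.html" class="thumb-placeholder-centered" title="Chair 2"></span>
  </figure>
  <span class="other">no data-url attribute here</span>
</body></html>')

# "span data-url" means: a <data-url> element inside a <span>.
# No such element exists, so the node set is empty.
no_match <- html_nodes(page, "span data-url")

# "span[data-url]" means: a <span> that has a data-url attribute.
nodes <- html_nodes(page, "span[data-url]")

# Read the attribute value from every matched node.
links <- html_attr(nodes, "data-url")

# The same selection written as a relative xpath.
links_xpath <- html_attr(html_nodes(page, xpath = "//span[@data-url]"), "data-url")

print(links)
```

On the real page you would pass the node set from read_html(website) instead of the inline string; whether the resulting vector still contains ads that need filtering would have to be checked against the actual markup.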

You can find another example of how to use rvest in this blog post: https://datashenanigan.wordpress.com/2015/04/30/using-rvest-and-dplyr-to-look-at-aviation-incidents/