Question

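I am trying to scrape a phone number from this page: http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53. The number can be selected with the rvest package using the selector .\'id_raw\'\::nth-child(1) span+ div strong (suggested by selectorGadget).

The problem is that the number only becomes available after clicking on its mask. So I need to open a session, perform a click, and then scrape the information.

EDIT: By the way, it is not a link, imho. Look at the source. I am having trouble because I am an ordinary R user, not a JavaScript programmer.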

Answer 1

You can grab the data embedded in the <li> tags, which tells the onclick handler what to do, and then fetch the data directly:

library(httr)
library(rvest)
library(purrr)
library(stringr)

URL <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

pg <- read_html(URL)

html_nodes(pg, "li.rel") %>%       # get the 'special' <li> tags
  html_attrs() %>%                 # extract all the attrs (they're non-standard)
  flatten_chr() %>%                # list to character vector
  keep(~grepl("rel \\{", .x)) %>%  # only want ones with 'hidden' secret data
  str_extract("(\\{.*\\})") %>%    # only get the data
  unique() %>%                     # there are duplicates
  map_df(function(x) {

    path <- str_match(x, "'path':'([[:alnum:]]+)'")[,2]                  # extract out the path
    id <- str_match(x, "'id':'([[:alnum:]]+)'")[,2]                      # extract out the id

    ajax <- sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", path, id)  # make the AJAX/XHR URL
    value <- content(GET(ajax))$value                                    # get the data

    data.frame(path=path, id=id, value=value, stringsAsFactors=FALSE)    # make a data frame

  }) 

## Source: local data frame [3 x 3]
## 
##           path    id       value
##          (chr) (chr)       (chr)
## 1        phone dX6wf 503 155 744
## 2        skype dX6wf    e.bobruk
## 3 communicator dX6wf     7686136
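
To make the extraction step easier to follow without hitting the network, here is a minimal base-R sketch of the same idea. The attribute strings in `attrs` are hypothetical, modeled only on what the regular expressions above expect (a class value containing `rel {` followed by `'path'` and `'id'` pairs), and `extract_field` is a helper name introduced for illustration. The real page markup may differ.

```r
# Hypothetical attribute values; the real ones come from html_attrs() on the page
attrs <- c(
  "link-phone rel {'path':'phone','id':'dX6wf'}",
  "some-other-class",
  "link-skype rel {'path':'skype','id':'dX6wf'}"
)

# Illustrative helper: pull the value of one quoted field out of each string
extract_field <- function(x, field) {
  pattern <- sprintf("'%s':'([[:alnum:]]+)'", field)
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

# Keep only the attributes that carry the embedded data
hidden <- grep("rel \\{", attrs, value = TRUE)

# Isolate the {...} payload from each remaining attribute
payload <- regmatches(hidden, regexpr("\\{.*\\}", hidden))

path <- extract_field(payload, "path")
id   <- extract_field(payload, "id")

# Build the AJAX/XHR URLs the same way the answer does
ajax_urls <- sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", path, id)
print(ajax_urls)
```

Each resulting URL would then be passed to `GET()` as in the code above; that request is left out here so the sketch runs offline.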

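After doing all that, I am pretty disappointed that the site does not have better terms of service/use. It is pretty obvious they really do not want you scraping this data.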

Answer 2

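Here is a solution using RSelenium (see the RSelenium introduction) and phantomjs.

~~However, I am not sure how usable it is, because it ran very slowly on my machine, and I am not a phantomjs or Selenium expert, so I do not know where speed improvements can be made. Beware...~~

Edit

I tried this again and the speed seems fine.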

library(RSelenium)
library(rvest)

## Terminal command to start selenium (on ubuntu)
## cd ~/selenium && java -jar selenium-server-standalone-2.48.2.jar

url <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

RSelenium::startServer()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(url)

# css <- ".cpointer:nth-child(1)" ## couldn't get this to work
xp <- "//div[@class='contactbox-indent rel brkword']"
webElem <- remDr$findElement(using = 'xpath', xp)
# webElem <- remDr$findElement(using = 'css selector', css)
webElem$clickElement()

## the page source now includes the clicked element
page_source <- remDr$getPageSource()[[1]]
pos <- regexpr('class=\\"xx-large', page_source)

## you could write a more intelligent regex, but this works for now
phone_number <- substr(page_source, pos + 11, pos + 21)
phone_number
# "503 155 744"

# remDr$close()
# remDr$closeServer()
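
As a possible take on the "more intelligent regex" mentioned in the code comments, the sketch below captures the digits that follow the `xx-large` element instead of relying on fixed character offsets. The `page_source` string here is a hypothetical stand-in for what `remDr$getPageSource()[[1]]` returns; the only thing taken from the answer is that the number sits in an element whose class starts with `xx-large`, so the rest of the assumed markup may not match the real page.

```r
# Hypothetical fragment standing in for the real page source after the click
page_source <- '<div class="contactbox"><strong class="xx-large">503 155 744</strong></div>'

# Match the opening tag whose class starts with xx-large, then capture a run
# of digits and spaces that begins and ends with a digit
pattern <- 'class="xx-large[^"]*"[^>]*>[[:space:]]*([0-9][0-9 ]*[0-9])'

hit <- regmatches(page_source, regexec(pattern, page_source))[[1]]

# hit[1] is the whole match, hit[2] the captured number (NA if nothing matched)
phone_number <- if (length(hit) >= 2) hit[2] else NA_character_
print(phone_number)
```

Capturing the number this way means the code keeps working if the length of the number or the surrounding attributes change, as long as the class name stays the same.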

How can I use R to scrape information from a web page that only appears after a click?
