Question

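I usually have no trouble scraping HTML tables with the read_html command from rvest, but one particular site is giving me problems. I want to scrape the second table on that page, yet what I end up with is a table with the correct column headers and no data. Any help would be much appreciated. Here is my workflow: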

# Dependencies
library(rvest)
library(pipeR)

# Scrape table from site
url2 <- "http://priceonomics.com/hotels/rankings/#airbnb-apartments-all"
data2 <- url2 %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="airbnb-apartments-all"]/table') %>%
  html_table(fill = TRUE)
data2 <- data2[[1]]

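Printing the result shows the headers only:

data2
[1] Rank City $        
<0 rows> (or 0-length row.names)

I used Google Chrome to identify the XPath. I also tried the following:

readHTMLTable(url2)

which yields:

$`NULL`
NULL

$`NULL`
NULL

$`NULL`
NULL

Finally, in case the site uses JavaScript, I tried R's RSelenium package, but I cannot seem to connect to the server properly.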
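
A header-only result like this is what one would expect if the server sends the table skeleton and a script fills in the rows later. The sketch below illustrates that situation on an inline HTML string; the markup in static_html is an assumption made up for illustration, not the site's actual source. It counts header cells and data rows separately, which is a quick way to check whether the rows exist in the static HTML at all.

```r
library(rvest)

# Assumed shape of the HTML sent before any script runs (invented for this sketch):
# the header row is present, the table body is empty.
static_html <- '
<div id="airbnb-apartments-all">
  <table>
    <thead><tr><th>Rank</th><th>City</th><th>$</th></tr></thead>
    <tbody></tbody>
  </table>
</div>'

page <- read_html(static_html)
tbl_node <- html_nodes(page, xpath = '//*[@id="airbnb-apartments-all"]/table')

# Count header cells, and rows that actually hold data cells (<td>).
n_header_cells <- length(html_nodes(tbl_node, xpath = ".//th"))
n_data_rows <- length(html_nodes(tbl_node, xpath = ".//tr[td]"))

cat("header cells:", n_header_cells, " data rows:", n_data_rows, "\n")
```

If the data-row count is zero for the real page as well, the rows have to come from somewhere else, such as a separate data file or a rendered browser session.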

Answer 1

All of the data is returned in a JSON file, and you can build the tables from it. For example, the first table:

library(jsonlite)
library(data.table)

# Fetch the JSON file that holds the data
appData <- fromJSON("http://priceonomics.com/static/js/hotels/all_data.json")

# Replicate the table: one row per city, with the price stored in x$air$apt$p
myDf <- data.frame(City = names(appData),
                   Price = sapply(appData, function(x) x$air$apt$p),
                   stringsAsFactors = FALSE)
setDT(myDf)

# Top ten cities by price
myDf[order(Price, decreasing = TRUE)][1:10]
#                  City Price
#  1:        Boston, MA 185.0
#  2:      New York, NY 180.0
#  3: San Francisco, CA 165.0
#  4:     Cambridge, MA 155.0
#  5:    Scottsdale, AZ 142.5
#  6:     Charlotte, NC 139.5
#  7:    Charleston, SC 139.5
#  8:     Las Vegas, NV 135.0
#  9:         Miami, FL 135.0
# 10:       Chicago, IL 130.0
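
The same extraction can be tried offline on a small inline JSON string, which also shows one way to cope with a city that lacks the price field. This is only a sketch: the nesting is assumed from the x$air$apt$p accessor above, the three prices are copied from the output above, "Nowhere, ZZ" is a made-up entry, and get_price is a hypothetical helper.

```r
library(jsonlite)

# Inline stand-in for all_data.json; structure assumed from x$air$apt$p.
# "Nowhere, ZZ" is an invented entry with no price, to exercise the NA path.
sample_json <- '{
  "Boston, MA":        {"air": {"apt": {"p": 185}}},
  "New York, NY":      {"air": {"apt": {"p": 180}}},
  "San Francisco, CA": {"air": {"apt": {"p": 165}}},
  "Nowhere, ZZ":       {"air": {}}
}'

app_data <- fromJSON(sample_json)

# Hypothetical helper: pull one price, returning NA when the field is absent.
get_price <- function(x) {
  p <- x$air$apt$p
  if (is.null(p)) NA_real_ else as.numeric(p)
}

prices <- data.frame(
  City = names(app_data),
  Price = vapply(app_data, get_price, numeric(1)),
  stringsAsFactors = FALSE,
  row.names = NULL
)

# Sort by price, highest first; cities without a price go last.
prices <- prices[order(prices$Price, decreasing = TRUE, na.last = TRUE), ]
print(prices)
```

Using vapply with numeric(1) rather than sapply keeps Price a plain numeric vector even when some entries are missing, since sapply would fall back to a list in that case.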

Answer 2

Thanks to @jdharrison for the reply above. I eventually got the Selenium approach to work as well. Here is my workflow:

# Load dependencies
devtools::install_github("ropensci/RSelenium", force = TRUE)
library(RSelenium)
library(XML)  # provides htmlParse() and readHTMLTable()

# Start the Selenium server and open a Chrome session.
# The path points to the local chromedriver binary, downloaded from
# https://sites.google.com/a/chromium.org/chromedriver/downloads
checkForServer(update = TRUE)
startServer(javaargs = "/users/name/folder/chromedriver")
remDr <- remoteDriver(browserName = "chrome")
remDr$open()

# Navigate to the URL, then read and parse the HTML tables into data frames
remDr$navigate("http://priceonomics.com/hotels/rankings/#airbnb-apartments-all")
doc <- htmlParse(remDr$getPageSource()[[1]])
doc <- readHTMLTable(doc)
data2 <- doc[[2]]  # the second table on the page
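
Once the browser has rendered the page, the string returned by remDr$getPageSource()[[1]] could presumably also be handed to rvest instead of the XML package. A minimal sketch of that parsing step follows, run on an invented page_source string so that it works without a Selenium session; the markup and the city names in it are placeholders, not data from the site.

```r
library(rvest)

# Stand-in for remDr$getPageSource()[[1]]; this markup is invented for illustration.
page_source <- '
<html><body>
  <div id="first"><table>
    <tr><th>Rank</th><th>City</th><th>$</th></tr>
    <tr><td>1</td><td>Example City A</td><td>100</td></tr>
  </table></div>
  <div id="airbnb-apartments-all"><table>
    <tr><th>Rank</th><th>City</th><th>$</th></tr>
    <tr><td>1</td><td>Example City B</td><td>200</td></tr>
    <tr><td>2</td><td>Example City C</td><td>150</td></tr>
  </table></div>
</body></html>'

# Parse every <table> in the rendered source, then keep the second one.
tables <- html_table(html_nodes(read_html(page_source), "table"))
data2 <- tables[[2]]
print(data2)
```

Selecting by position mirrors doc[[2]] above; selecting by the container id with an XPath expression would be less sensitive to tables being added or reordered.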

Web scraping with R's rvest package and RSelenium
