Scraping a website with an anchor in its URL using the RSelenium package in R

Time: 2015-02-12 18:15:23

Tags: r selenium anchor-scroll

I am new to R, and I am having trouble extracting data from the Forbes website.

My current code is:

library(XML)
url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data <- readHTMLTable(url)

However, the Forbes URL contains a "#" symbol. I downloaded the RSelenium package to parse the data I want, but I am not well versed in RSelenium.
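Some background on why the "#" matters: everything after "#" in a URL is a fragment, which the browser keeps to itself and does not send to the server. A plain HTTP fetch therefore requests only the base page, whatever page number the fragment names; presumably the page's own JavaScript reads the fragment and loads the matching rows. The base-R sketch below splits the URL from the question into those two parts. The helper `forbes_page_url` is hypothetical, introduced here only to show how the URLs for pages 1, 2, and so on could be built.

```r
# The URL from the question, split at the first "#".
url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"

base_url <- sub("#.*$", "", url)     # the part that is sent to the server
fragment <- sub("^[^#]*#", "", url)  # the part that stays in the browser

# Hypothetical helper: swap the page number inside the fragment.
forbes_page_url <- function(page) {
  new_fragment <- sub("page:[0-9]+", paste0("page:", page), fragment)
  paste0(base_url, "#", new_fragment)
}

# URLs for pages 1 to 3; these only make sense to a client that runs JavaScript.
page_urls <- vapply(1:3, forbes_page_url, character(1))
```

These URLs are useful only to something that executes the page's JavaScript, such as a browser driven by RSelenium.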

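Does anyone have advice on, or experience with, RSelenium? How can I use RSelenium to get the data from Forbes? Ideally, I would like to extract the data from pages 1, 2, and so on.

Thanks!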

2 answers:

Answer 0: (score: 4)

Alternatively, go the other way and use the API that populates the web page. This downloads all 2,000 companies at once.

library(httr)
library(RJSONIO)

# API endpoint that populates the list page
url <- "http://www.forbes.com/ajax/load_list/"
query <- list(type = "organization",
              uri = "global2000",
              year = "2014")
response <- httr::GET(url, query = query)

# Read the response body as text, then parse the JSON into a list of records
dat_string <- httr::content(response, as = "text", encoding = "UTF-8")
dat_list <- RJSONIO::fromJSON(dat_string, asText = TRUE)

# Each record is positional, so pull the fields of interest by index
df <- data.frame(rank = sapply(dat_list, "[[", 1),
                 company = sapply(dat_list, "[[", 3),
                 country = sapply(dat_list, "[[", 10),
                 sales = sapply(dat_list, "[[", 6),
                 profits = sapply(dat_list, "[[", 7),
                 assets = sapply(dat_list, "[[", 8),
                 market_value = sapply(dat_list, "[[", 9),
                 stringsAsFactors = FALSE)

# Sort by rank
df <- df[order(df$rank), ]
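The positional extraction in the code above can be hard to picture without the live response, so here is a small base-R sketch of the same idiom. The records are invented for illustration and are not the real Forbes field layout; only the `sapply(x, "[[", i)` pattern and the final sort mirror the answer.

```r
# Invented records standing in for the parsed JSON (not real Forbes data).
toy_list <- list(
  list(2, "ignored", "Company B"),
  list(1, "ignored", "Company A")
)

# sapply(x, "[[", i) takes the i-th element of every record.
toy <- data.frame(rank = sapply(toy_list, "[[", 1),
                  company = sapply(toy_list, "[[", 3),
                  stringsAsFactors = FALSE)

# Sort by rank, as in the answer.
toy <- toy[order(toy$rank), ]
```

Because the fields are addressed by position, the indices would need rechecking if the API ever changed the order of its fields.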

Answer 1: (score: 1)

This is a bit hacky, but here is my solution using rvest and read.delim...

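library(rvest)

url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
# The "#..." fragment is not sent to the server, so this fetches the base list page
a <- read_html(url) %>%
  html_nodes("#thelist") %>%
  html_text()

# Parse the node's text as tab-delimited lines, skipping the first 12 lines
con <- textConnection(a)
df <- read.delim(con, sep = "\t", header = FALSE, skip = 12, stringsAsFactors = FALSE)
close(con)

# Where V1 is empty, take the value from V3; then drop V2, V3, and the empty rows
df$V1[df$V1 == ""] <- df$V3[df$V1 == ""]
df$V2 <- df$V3 <- NULL
df <- subset(df, V1 != "")

# The remaining values cycle through six fields per company
df$index <- 1:nrow(df)
df2 <- data.frame(company = df$V1[df$index %% 6 == 1],
                  country = df$V1[df$index %% 6 == 2],
                  sales = df$V1[df$index %% 6 == 3],
                  profits = df$V1[df$index %% 6 == 4],
                  assets = df$V1[df$index %% 6 == 5],
                  market_value = df$V1[df$index %% 6 == 0])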
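Neither answer uses RSelenium, which is what the question asked about. The following is a minimal sketch of that route, not a tested solution. It assumes the RSelenium and XML packages are installed, that a Selenium server is already running on localhost port 4444 with Firefox available, and that the list is rendered as an HTML table once the page's JavaScript has run. The function name `scrape_forbes_page` and the fixed wait are illustrative choices, not part of either package.

```r
# Sketch only: needs the RSelenium and XML packages and a running Selenium server.
scrape_forbes_page <- function(page_url, wait_secs = 5,
                               host = "localhost", port = 4444L,
                               browser = "firefox") {
  # Connect to the (assumed) local Selenium server and start a browser session.
  remDr <- RSelenium::remoteDriver(remoteServerAddr = host, port = port,
                                   browserName = browser)
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)

  # A real browser handles the "#..." fragment, unlike a plain HTTP fetch.
  remDr$navigate(page_url)
  Sys.sleep(wait_secs)  # crude wait for the JavaScript to fill in the list

  # Parse the rendered HTML and return every table found on the page.
  src <- remDr$getPageSource()[[1]]
  doc <- XML::htmlParse(src, asText = TRUE)
  XML::readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Usage (commented out because it needs a live Selenium server):
# tables <- scrape_forbes_page("http://www.forbes.com/global2000/list/#page:2_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states")
```

Each call opens a fresh browser session, so changing only the page number in the fragment still triggers a full page load. Which element of the returned list holds the ranking would have to be checked against the live page.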