rvest | Web scraping: transforming data to long format

Time: 2015-10-17 08:08:45

Tags: r google-maps data.table dplyr transformation

While web scraping I ran into the following problem, and I think there may be a better solution than mine.

Given this data:

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"))

               query
1    Washington, USA
2 Frankfurt, Germany

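I want to query, for example, the Google Maps API and get back the formatted address. A single query can return more than one formatted address. The result should look like this: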

               query         formatted_address
1    Washington, USA       Washington, DC, USA
2    Washington, USA       Washington, UT, USA
3    Washington, USA Washington, VA 22747, USA
4    Washington, USA Washington, IA 52353, USA
5    Washington, USA Washington, GA 30673, USA
6    Washington, USA Washington, PA 15301, USA
7 Frankfurt, Germany        Frankfurt, Germany

What I currently do is:

require(RCurl)
require(rvest)
require(magrittr)

build_url <- function(x, base_url = "https://maps.googleapis.com/maps/api/geocode/xml?address="){
  paste0(base_url, RCurl::curlEscape(x))
}

l <- lapply(dat$query, function(q){
  formatted_address <- q %>% build_url %>% read_xml %>% xml_nodes("formatted_address") %>% xml_text
  data.frame(query = q, formatted_address)
})

do.call(rbind, l) # This can be done via data.table::rbindlist as well

Is there a better solution? Perhaps something more in data.table/dplyr style?

1 answer:

Answer 0: (score: 0)

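I have written the package googleway to access the Google Maps API with a valid API key (so if your data has more than 2,500 items, you can pay for an API key).

To get the address details, use google_geocode():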

library(googleway)

key <- "your_api_key"

dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"))

## To get all the data:
res <- apply(dat, 1, function(x){
  google_geocode(address = x["query"],
                 key = key)  ## use simplify = FALSE to return JSON
})

## to access the 'formatted address' part, see
res[[1]]$results$formatted_address
# [1] "Washington, DC, USA"       "Washington, UT, USA"       "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"

## so to get everything as a list
lapply(res, function(x){
  x$results$formatted_address
})

# [[1]]
# [1] "Washington, DC, USA"       "Washington, UT, USA"       "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
# 
# [[2]]
# [1] "Frankfurt, Germany"

## and to put back onto your original data.frame:
lst <- lapply(seq_along(res), function(x){
  data.frame(query = dat[x, "query"],
             formatted_address = res[[x]]$results$formatted_address)
})

data.table::rbindlist(lst)
#                 query         formatted_address
# 1:    Washington, USA       Washington, DC, USA
# 2:    Washington, USA       Washington, UT, USA
# 3:    Washington, USA Washington, VA 22747, USA
# 4:    Washington, USA Washington, IA 52353, USA
# 5:    Washington, USA Washington, GA 30673, USA
# 6:    Washington, USA Washington, PA 15301, USA
# 7: Frankfurt, Germany        Frankfurt, Germany
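
The same reshaping can also be done without building one data.frame per query. The following is a minimal base R sketch, not part of googleway: it repeats each query once per address returned and flattens the addresses into a single column. The `res` object here is a mocked stand-in (an assumption) with the same `$results$formatted_address` shape used above, so the example runs without an API key. The `NA` fallback for queries with no match is also an assumption about how you might want to handle empty results.

```r
# Mocked geocoding results with the same shape as `res` above
# (hypothetical stand-in so the example runs without an API key).
dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"),
                  stringsAsFactors = FALSE)
res <- list(
  list(results = data.frame(
    formatted_address = c("Washington, DC, USA", "Washington, UT, USA"),
    stringsAsFactors = FALSE)),
  list(results = data.frame(
    formatted_address = "Frankfurt, Germany",
    stringsAsFactors = FALSE))
)

# One character vector of addresses per query.
# A query with no addresses becomes a single NA so it is not dropped.
addresses <- lapply(res, function(x) {
  a <- x$results$formatted_address
  if (length(a) == 0) NA_character_ else a
})

# Repeat each query once per address it returned, then flatten.
long <- data.frame(
  query = rep(dat$query, lengths(addresses)),
  formatted_address = unlist(addresses),
  stringsAsFactors = FALSE
)
long
```

With the real `res` from google_geocode(), only the mocked `dat` and `res` definitions would be dropped; the last two steps stay the same. The resulting data.frame can be passed to data.table::as.data.table() or a dplyr pipeline if that style is preferred.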