While web scraping I ran into the following problem, and I think there may be a better solution.
Given this data:
dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"))
query
1 Washington, USA
2 Frankfurt, Germany
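I want to query, for example, the Google Maps API and get back the formatted addresses. A single query may return several formatted addresses. The result should look like this: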
query formatted_address
1 Washington, USA Washington, DC, USA
2 Washington, USA Washington, UT, USA
3 Washington, USA Washington, VA 22747, USA
4 Washington, USA Washington, IA 52353, USA
5 Washington, USA Washington, GA 30673, USA
6 Washington, USA Washington, PA 15301, USA
7 Frankfurt, Germany Frankfurt, Germany
What I currently do is:
require(RCurl)
require(rvest)
require(magrittr)
build_url <- function(x, base_url = "https://maps.googleapis.com/maps/api/geocode/xml?address="){
  paste0(base_url, RCurl::curlEscape(x))
}
l <- lapply(dat$query, function(q){
  formatted_address <- q %>% build_url %>% read_xml %>% xml_nodes("formatted_address") %>% xml_text
  data.frame(query = q, formatted_address)
})
do.call(rbind, l) # This can be done via data.table::rbindlist as well
Is there a better solution? Perhaps something more in data.table or dplyr style?
Answer 0 (score: 0)
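I've written the package googleway to access the Google Maps API with a valid API key (so if your data is larger than 2,500 items, you can pay for an API key).
To get the address details, use google_geocode():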
library(googleway)
key <- "your_api_key"
dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"))
## To get all the data:
res <- apply(dat, 1, function(x){
  google_geocode(address = x["query"],
                 key = key) ## use simplify = FALSE to return JSON
})
## to access the 'formatted address' part, see
res[[1]]$results$formatted_address
# [1] "Washington, DC, USA" "Washington, UT, USA" "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
## so to get everything as a list
lapply(res, function(x){
  x$results$formatted_address
})
# [[1]]
# [1] "Washington, DC, USA" "Washington, UT, USA" "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
#
# [[2]]
# [1] "Frankfurt, Germany"
## and to put back onto your original data.frame:
lst <- lapply(seq_along(res), function(x){
  data.frame(query = dat[x, "query"],
             formatted_address = res[[x]]$results$formatted_address)
})
data.table::rbindlist(lst)
# query formatted_address
# 1: Washington, USA Washington, DC, USA
# 2: Washington, USA Washington, UT, USA
# 3: Washington, USA Washington, VA 22747, USA
# 4: Washington, USA Washington, IA 52353, USA
# 5: Washington, USA Washington, GA 30673, USA
# 6: Washington, USA Washington, PA 15301, USA
# 7: Frankfurt, Germany Frankfurt, Germany
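If you would rather avoid building one data.frame per query, the same long table can probably be assembled in a single step with base R. The sketch below is only an illustration: `res` is a hand-written stand-in (an assumption, not real API output) that mimics the `$results$formatted_address` shape shown above, and `queries` and `addresses` are names introduced here for the example.

```r
# Hypothetical stand-in for the list returned by the google_geocode() calls;
# real responses contain many more fields than formatted_address.
queries <- c("Washington, USA", "Frankfurt, Germany")
res <- list(
  list(results = data.frame(
    formatted_address = c("Washington, DC, USA", "Washington, UT, USA"),
    stringsAsFactors = FALSE
  )),
  list(results = data.frame(
    formatted_address = "Frankfurt, Germany",
    stringsAsFactors = FALSE
  ))
)

# One character vector of formatted addresses per query
addresses <- lapply(res, function(x) x$results$formatted_address)

# Repeat each query once per address it produced, then flatten the list
out <- data.frame(
  query = rep(queries, lengths(addresses)),
  formatted_address = unlist(addresses),
  stringsAsFactors = FALSE
)
out
```

A query that returns no addresses contributes zero rows here, because `lengths()` reports 0 for it and `rep()` then drops that query.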