Question

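I am working with the Gowalla location-based check-in dataset, which has about 6.4 million check-ins. These check-ins cover 1.28 million unique locations. However, Gowalla only provides latitudes and longitudes, so I need to find the city, state, and country for each one. Based on another post on StackOverflow, I was able to write the R loop below, which queries OpenStreetMap and retrieves the relevant geographic details.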

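Unfortunately, it takes about 1 minute to process 125 rows, which means 1.28 million rows would take several days. Is there a faster way to find these details? Perhaps there is a package with built-in latitudes and longitudes of the world's major cities that can return the city name for a given latitude and longitude, so that I don't have to make online queries.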
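As a rough check on that estimate, the arithmetic can be written out in R using only the two figures quoted above (the variable names are illustrative):

```r
rows_total   <- 1.28e6  # unique locations, as stated in the question
rows_per_min <- 125     # observed throughput, as stated in the question

minutes <- rows_total / rows_per_min  # total minutes at that rate
days    <- minutes / (60 * 24)        # convert minutes to days

round(days, 1)
# [1] 7.1
```

So at the observed rate, a single sequential pass over all locations would take roughly a week.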

The venue table is a data frame with 3 columns: 1: vid (venueId), 2: lat (latitude), 3: long (longitude).

library(jsonlite)

# pre-allocate the result columns
venueTable$display_name <- NA_character_
venueTable$country <- NA_character_

for (i in seq_len(nrow(venueTable))) {
  # progress indicator: print the current value of i
  cat(paste(".", i, "."))

  # compose the reverse-geocoding query URL
  url <- paste0("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
                venueTable$lat[i], "&lon=", venueTable$long[i])
  x <- fromJSON(url)

  # copy over the fields of interest, skipping any that are missing
  if (!is.null(x$display_name)) venueTable$display_name[i] <- x$display_name
  if (!is.null(x$address$country)) venueTable$country[i] <- x$address$country
}

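I am using the jsonlite package in R, which turns the result of the JSON query, x, into an R object holding the various returned fields. So I can get the fields I need with x$display_name or x$address$city.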

My laptop is a Core i5 3230M with 8 GB RAM and a 120 GB SSD, running Windows 8.

Answer 1

Here is another approach that uses R's native spatial processing capabilities:

library(sp)
library(rgeos)
library(rgdal)

# world places shapefile
URL1 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places.zip"
fil1 <- basename(URL1)
if (!file.exists(fil1)) download.file(URL1, fil1)
unzip(fil1)

places <- readOGR("ne_10m_populated_places.shp", "ne_10m_populated_places",
                  stringsAsFactors=FALSE)

# some data from the other answer since you didn't provide any
URL2 <- "http://simplemaps.com/resources/files/world/world_cities.csv"
fil2 <- basename(URL2)
if (!file.exists(fil2)) download.file(URL2, fil2)

# we need the points from said dat
dat <- read.csv(fil2, stringsAsFactors=FALSE)
pts <- SpatialPoints(dat[,c("lng", "lat")], CRS(proj4string(places)))

# this is not necessary
# I just don't like the warning about longlat not being a real projection
robin <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
pts <- spTransform(pts, robin)
places <- spTransform(places, robin)

# compute the distance (this makes a pretty big matrix, so do it in chunks
# or row-by-row unless you have a ton of memory)
far <- gDistance(pts, places, byid=TRUE)

# find the closest one
closest <- apply(far, 1, which.min)

# map to the fields (you may want to map to other fields)
locs <- places@data[closest, c("NAME", "ADM1NAME", "ISO_A2")]

locs[sample(nrow(locs), 10),]

##              NAME        ADM1NAME ISO_A2
## 3274     Szczecin West Pomeranian     PL
## 1039     Balakhna      Nizhegorod     RU
## 1012       Chitre         Herrera     PA
## 3382     L'Aquila         Abruzzo     IT
## 1982       Dothan         Alabama     US
## 5159 Bayankhongor     Bayanhongor     MN
## 620        Deming      New Mexico     US
## 1907   Fort Smith        Arkansas     US
## 481      Dedougou        Mou Houn     BF
## 7169       Prague          Prague     CZ

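That takes about a minute (on my system) for roughly 7,500 points, so you are looking at a couple of hours rather than a day or more. You can do this in parallel and probably finish in under an hour.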
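The chunking and parallel ideas can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code timed above: it swaps gDistance on projected coordinates for a haversine (great-circle, spherical Earth) comparison on raw longitude/latitude, and the names nearest_in_chunk, nearest_place, chunk_size and cl are made up for this example.

```r
# Nearest reference place for one chunk of query points.
# Only a length(idx) x length(ref_lon) matrix is held in memory at a time.
nearest_in_chunk <- function(idx, pts_lon, pts_lat, ref_lon, ref_lat) {
  to_rad <- pi / 180
  lat1 <- pts_lat[idx] * to_rad
  lon1 <- pts_lon[idx] * to_rad
  lat2 <- ref_lat * to_rad
  lon2 <- ref_lon * to_rad
  # Haversine term for every (query, reference) pair. The great-circle
  # distance increases with this term, so its minimum marks the nearest place.
  a <- sin(outer(lat1, lat2, "-") / 2)^2 +
    outer(cos(lat1), cos(lat2)) * sin(outer(lon1, lon2, "-") / 2)^2
  apply(a, 1, which.min)
}

# Split the query points into chunks and process them one after another,
# or on a cluster if one is supplied (cl is an assumption of this sketch).
nearest_place <- function(pts_lon, pts_lat, ref_lon, ref_lat,
                          chunk_size = 5000, cl = NULL) {
  n <- length(pts_lon)
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  run <- if (is.null(cl)) {
    lapply
  } else {
    function(X, FUN, ...) parallel::parLapply(cl, X, FUN, ...)
  }
  res <- run(chunks, nearest_in_chunk, pts_lon = pts_lon, pts_lat = pts_lat,
             ref_lon = ref_lon, ref_lat = ref_lat)
  unlist(res, use.names = FALSE)
}

# Toy data (illustrative only): three reference cities, three query points
ref <- data.frame(name = c("London", "Paris", "New York"),
                  lon = c(-0.1276, 2.3522, -74.0060),
                  lat = c(51.5072, 48.8566, 40.7128),
                  stringsAsFactors = FALSE)
pts <- data.frame(lon = c(-1.2577, 4.8357, -75.1652),   # Oxford, Lyon,
                  lat = c(51.7520, 45.7640, 39.9526))   # Philadelphia

closest <- nearest_place(pts$lon, pts$lat, ref$lon, ref$lat, chunk_size = 2)
ref$name[closest]
# [1] "London"   "Paris"    "New York"
```

Because each chunk is independent, passing a cluster created with parallel::makeCluster() as cl spreads the chunks across cores (and parallel::stopCluster() releases it afterwards); that kind of cluster also works on Windows, where fork-based parallelism is not available.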

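For better place resolution, you could use a very lightweight shapefile of country or Admin 1 polygons, then use a second process to compute distances against higher-resolution points within those regions.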
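A toy version of that two-stage idea, in base R, might look like the sketch below. Everything in it is hypothetical: the two rectangular "regions" stand in for real country or Admin 1 polygons, the places table stands in for a higher-resolution gazetteer, and point_in_polygon and locate are helper names invented for this example (a ray-casting test and a haversine comparison are used in place of the sp/rgeos functions).

```r
# Ray-casting point-in-polygon test for a single point (x, y).
point_in_polygon <- function(x, y, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # does the edge from vertex j to vertex i straddle the horizontal line at y?
    if ((poly_y[i] > y) != (poly_y[j] > y)) {
      x_cross <- poly_x[i] + (y - poly_y[i]) * (poly_x[j] - poly_x[i]) /
        (poly_y[j] - poly_y[i])
      if (x < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Hypothetical coarse regions: two adjacent rectangles in lon/lat.
regions <- list(
  west = list(lon = c(-10, 0, 0, -10), lat = c(40, 40, 50, 50)),
  east = list(lon = c(0, 10, 10, 0),   lat = c(40, 40, 50, 50))
)

# Hypothetical reference places, each tagged with the region it belongs to.
places <- data.frame(name   = c("A", "B", "C"),
                     region = c("west", "west", "east"),
                     lon    = c(-8, -2, 5),
                     lat    = c(42, 48, 45),
                     stringsAsFactors = FALSE)

locate <- function(lon, lat) {
  # Stage 1: find the region whose polygon contains the point
  in_region <- vapply(regions,
                      function(r) point_in_polygon(lon, lat, r$lon, r$lat),
                      logical(1))
  hit <- names(regions)[in_region]
  if (length(hit) == 0) return(NA_character_)

  # Stage 2: nearest place among that region's candidates only
  cand <- places[places$region == hit[1], ]
  to_rad <- pi / 180
  a <- sin((cand$lat - lat) * to_rad / 2)^2 +
    cos(lat * to_rad) * cos(cand$lat * to_rad) *
    sin((cand$lon - lon) * to_rad / 2)^2
  cand$name[which.min(a)]
}

locate(-1, 47)
# [1] "B"
locate(1, 47)
# [1] "C"
```

The second call shows the effect of the first stage: place B is geographically closer to the point, but only places inside the containing region are considered, so the search stays small and the answer stays within the right administrative unit.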
