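I am trying to replicate the work done in the really cool nearest neighbor question, but for each area in my data frame rather than for the whole group.
I want to run that function for each areaname. I tried calling split, but the distance function will not accept a list.
My data, ncbaby (don't ask), looks like this: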
        id   printid   areaname latitude longitude
1  7912048 233502729        073 36.06241 -80.44229
2   735253 171241999 Area 12-06 35.54452 -78.75388
3  4325564  85564887 Area 12-04 35.49328 -78.73756
4  4222241  85461255 Area 12-06 35.53621 -78.75553
5 11997754 356053648 Area 12-04 35.49328 -78.73756
6 13444458 536073775 Area 12-06 35.53987 -78.74922
The closest I have gotten is:
splitfile <- split(ncbaby, ncbaby$precinctname)
c <- gDistance(splitfile, byid=TRUE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘is.projected’ for signature ‘"list"’
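The problem here is that it ends up returning only the last value. Any ideas? I am also open to modifying the function, since it seems it could be made more efficient.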
Answer 0 (score: 3)
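I checked the linked post and modified the idea a little. I think using apply() may not be a good idea for large datasets, so I would rather use a data.table-based approach. First, I converted my sample data to a SpatialPointsDataFrame. Then I split the data by a group variable (here, group). As Eddie suggested, I used lapply() together with data.table functions. gDistance() returns a two-dimensional matrix; I converted it to a data.table object so that the subsequent processing may be faster. I reshaped the dt object with melt() and removed all data points with distance = 0. Finally, I took the first row for each Var1. Note that Var1 here represents each row of the sample data, mydf. The last step was to add the new distance vector to the original data frame. I hope this helps.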
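The steps described above can also be sketched in base R alone. This is only an illustration under stated assumptions: dist() stands in for gDistance() (both give planar Euclidean distances on raw coordinates), and the pts data frame, its x and y columns, and the helper nearest_nonzero are hypothetical names, not part of the answer.

```r
# Hypothetical toy data: two groups of planar points (not the answer's mydf).
pts <- data.frame(
  group = c("A", "A", "A", "B", "B"),
  x     = c(0, 3, 3, 10, 10),
  y     = c(0, 4, 4,  0,  2)
)

# Smallest strictly positive distance from each point to another point
# of the same group; NA if the group has no such point.
nearest_nonzero <- function(df) {
  d <- as.matrix(dist(df[, c("x", "y")]))  # pairwise Euclidean distances
  d[d == 0] <- NA                          # drop self-distances and exact duplicates
  unname(apply(d, 1, function(r) {
    if (all(is.na(r))) NA_real_ else min(r, na.rm = TRUE)
  }))
}

# split() by group, compute per group, then unsplit() back into row order.
per_group    <- lapply(split(pts, pts$group), nearest_nonzero)
pts$distance <- unsplit(per_group, pts$group)
print(pts)
```

Because unsplit() puts the values back in the original row order, this variant does not depend on the data being sorted by group.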
DATA
   group user_id  latitude longitude
1    B23   85553 -34.44011  172.6954
2    B23   85553 -34.43929  172.6939
3    B23   85553 -34.43929  172.6939
4    B23   85553 -34.43851  172.6924
5    B23   57357 -34.42747  172.6778
6    B23   57357 -34.42747  172.6778
7    B23   57357 -34.42747  172.6778
8    B23   98418 -34.43119  172.7014
9    B23   98418 -34.43225  172.7023
10   B23   98418 -34.43224  172.7023
11   B23   98418 -34.43224  172.7023
12   B24   57357 -34.43647  172.7141
13   B24   57357 -34.43647  172.7141
14   B24   57357 -34.43647  172.7141
15   B24   98418 -34.43904  172.7172
16   B24   98418 -34.43904  172.7172
17   B24   98418 -34.43904  172.7172
18   B24   98418 -34.43925  172.7168
19   B24   98418 -34.43915  172.7169
20   B24   98418 -34.43915  172.7169
21   B24   98418 -34.43915  172.7169
22   B24   98418 -34.43915  172.7169
CODE
library(sp)
library(rgeos)
library(data.table)
# Copy the original
temp <- mydf
# Convert the data frame to a SpatialPointsDataFrame
coordinates(temp) <- ~longitude+latitude
# Split the data by the group variable
mylist <- split(temp, f = temp$group)
# For each element in mylist, compute the pairwise distance matrix with
# gDistance(), melt it to long format (columns Var1, Var2, value) and turn
# it into a data.table. Then drop rows with distance = 0 and keep the
# nearest remaining neighbor for each Var1 (Var1 is a row of mydf).
out <- rbindlist(
  lapply(mylist, function(x){
    # melt() on a matrix is the reshape2 method, so reshape2 must be installed
    d <- setDT(reshape2::melt(gDistance(x, byid = TRUE)))
    setorder(d, Var1, value)
    d <- d[value > 0]
    d <- d[, .SD[1], by = Var1]
    d
  })
)
# This assumes mydf is already ordered by group and that every row has a
# neighbor at non-zero distance, so that out has one row per row of mydf.
out <- cbind(mydf, distance = out$value)
# group user_id latitude longitude distance
#1 B23 85553 -34.44011 172.6954 1.743945e-03
#2 B23 85553 -34.43929 172.6939 1.661118e-03
#3 B23 85553 -34.43929 172.6939 1.661118e-03
#4 B23 85553 -34.43851 172.6924 1.661118e-03
#5 B23 57357 -34.42747 172.6778 1.836642e-02
#6 B23 57357 -34.42747 172.6778 1.836642e-02
#7 B23 57357 -34.42747 172.6778 1.836642e-02
#8 B23 98418 -34.43119 172.7014 1.369100e-03
#9 B23 98418 -34.43225 172.7023 1.456022e-05
#10 B23 98418 -34.43224 172.7023 1.456022e-05
#11 B23 98418 -34.43224 172.7023 1.456022e-05
#12 B24 57357 -34.43647 172.7141 3.862696e-03
#13 B24 57357 -34.43647 172.7141 3.862696e-03
#14 B24 57357 -34.43647 172.7141 3.862696e-03
#15 B24 98418 -34.43904 172.7172 3.245705e-04
#16 B24 98418 -34.43904 172.7172 3.245705e-04
#17 B24 98418 -34.43904 172.7172 3.245705e-04
#18 B24 98418 -34.43925 172.7168 1.393162e-04
#19 B24 98418 -34.43915 172.7169 1.393162e-04
#20 B24 98418 -34.43915 172.7169 1.393162e-04
#21 B24 98418 -34.43915 172.7169 1.393162e-04
#22 B24 98418 -34.43915 172.7169 1.393162e-04
Data from dput()
mydf <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("B23",
"B24"), class = "factor"), user_id = c(85553L, 85553L, 85553L,
85553L, 57357L, 57357L, 57357L, 98418L, 98418L, 98418L, 98418L,
57357L, 57357L, 57357L, 98418L, 98418L, 98418L, 98418L, 98418L,
98418L, 98418L, 98418L), latitude = c(-34.440114, -34.43929,
-34.43929, -34.438507, -34.427467, -34.427467, -34.427467, -34.431187,
-34.432254, -34.43224, -34.43224, -34.436472, -34.436472, -34.436472,
-34.439038, -34.439038, -34.439038, -34.439246, -34.439149, -34.439149,
-34.439149, -34.439149), longitude = c(172.695443, 172.693906,
172.693906, 172.692441, 172.677763, 172.677763, 172.677763, 172.701413,
172.702284, 172.702288, 172.702288, 172.71411, 172.71411, 172.71411,
172.717203, 172.717203, 172.717203, 172.716798, 172.716898, 172.716898,
172.716898, 172.716898)), .Names = c("group", "user_id", "latitude",
"longitude"), row.names = c(NA, -22L), class = "data.frame")
Answer 1 (score: 0)
Here is a solution that adapts the approach from the linked post to grouping by area. First, we define two functions:
library(sp)
library(rgeos)
nearest.neighbor <- function(lon,lat) {
  df <- data.frame(lon,lat)
  coordinates(df) <- ~lon+lat
  d <- gDistance(df, byid=TRUE)
  # remove the self distance from being considered and use which.min to find the nearest neighbor
  d[cbind(1:nrow(d),1:nrow(d))] <- NA
  min.d <- rbind(apply(d,1,function(x) {ind <- which.min(x); list(ind=ind,distance=x[ind])}))
}
order.by.ind <- function (x,ind) x[ind]
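The nearest.neighbor function closely follows the code in the linked post, but it returns a vector of lists. Each list contains the index of the nearest neighbor and the distance to that neighbor. The key point is that we want to compute the distances only once and return both the minimum distance and the corresponding index. Note that we remove the self-distance by setting the diagonal of d to NA and then use which.min to find the nearest neighbor, which avoids a full sort.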
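The diagonal-to-NA and which.min idea can be tried in isolation with base R. This is a minimal sketch, assuming planar coordinates; dist() is used in place of gDistance(), and the xy matrix and the nn data frame are hypothetical names for illustration.

```r
# Hypothetical planar points; dist() stands in for gDistance() here.
xy <- cbind(x = c(0, 1, 5, 5), y = c(0, 0, 0, 1))
d  <- as.matrix(dist(xy))

diag(d) <- NA                           # a point must not be its own neighbor
ind      <- apply(d, 1, which.min)      # which.min() ignores NA; one pass per row, no sort
distance <- d[cbind(seq_len(nrow(d)), ind)]  # pick the matching distance in each row

nn <- data.frame(ind = unname(ind), distance = distance)
print(nn)
```

Unlike a filter on distance > 0, this keeps exact duplicates as valid neighbors at distance 0, which matches the zero distances in this answer's result.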
The order.by.ind function simply reorders the input column x according to the index ind.
With these two functions, we can use mutate from the dplyr package to compute the desired columns, grouped by areaname:
library(dplyr)
result <- ncbaby %>% group_by(areaname) %>%
  mutate(min.d=nearest.neighbor(longitude, latitude)) %>%
  mutate_each(vars=c(id, printid, latitude, longitude),
              funs(order.by.ind, "order.by.ind", order.by.ind(.,ind=unlist(min.d)[c(TRUE,FALSE)]))) %>%
  mutate(distance=unlist(min.d)[c(FALSE,TRUE)]) %>%
  mutate(.Areaname=areaname) %>%
  select(-min.d)
newvars <- c('n.ID', 'n.printid', 'n.latitude', 'n.longitude', 'distance', '.Areaname')
colnames(result) <- c(colnames(ncbaby), newvars)
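Notes:
The first mutate creates a temporary column min.d that holds the list of ind and distance for the nearest neighbor. This is the nearest neighbor within the area, because we group_by areaname.
mutate_each creates a new column for each variable in vars by reordering that column according to ind. Note that we extract ind from min.d by unlisting it and then using [c(TRUE,FALSE)] to take the odd elements.
The second mutate creates the distance column by extracting distance from min.d. Again, this is done by unlisting and then using [c(FALSE,TRUE)] to take the even elements.
The last mutate adds the .Areaname column, which duplicates areaname in the result.
Finally, we remove the min.d column and set the column names of the resulting data frame as needed.
The result using your data is: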
print(result)
##Source: local data frame [7 x 11]
##Groups: areaname [3]
##
## id printid areaname latitude longitude n.ID n.printid n.latitude n.longitude distance .Areaname
## <int> <int> <fctr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <fctr>
##1 7912048 233502729 073 36.06241 -80.44229 7912049 233502730 36.06251 -80.44329 0.001004988 073
##2 735253 171241999 Area 12-06 35.54452 -78.75388 13444458 536073775 35.53987 -78.74922 0.006583168 Area 12-06
##3 4325564 85564887 Area 12-04 35.49328 -78.73756 11997754 356053648 35.49328 -78.73756 0.000000000 Area 12-04
##4 4222241 85461255 Area 12-06 35.53621 -78.75553 13444458 536073775 35.53987 -78.74922 0.007294635 Area 12-06
##5 11997754 356053648 Area 12-04 35.49328 -78.73756 4325564 85564887 35.49328 -78.73756 0.000000000 Area 12-04
##6 13444458 536073775 Area 12-06 35.53987 -78.74922 735253 171241999 35.54452 -78.75388 0.006583168 Area 12-06
##7 7912049 233502730 073 36.06251 -80.44329 7912048 233502729 36.06241 -80.44229 0.001004988 073
I added a new row for areaname="073" so that each area has at least two rows.
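The [c(TRUE,FALSE)] and [c(FALSE,TRUE)] indexing mentioned in the notes relies on R recycling a short logical index over the whole vector. A small sketch with made-up values (the min.d list below is hypothetical, not the output of nearest.neighbor on the real data):

```r
# Hypothetical per-row results: each element holds a neighbor index and a distance.
min.d <- list(
  list(ind = 2L, distance = 1.5),
  list(ind = 1L, distance = 1.5),
  list(ind = 2L, distance = 4.0)
)

flat <- unlist(min.d)   # interleaved: ind, distance, ind, distance, ...
                        # (unlist() coerces the integer indices to double)

ind      <- unname(flat[c(TRUE, FALSE)])  # recycled index keeps positions 1, 3, 5
distance <- unname(flat[c(FALSE, TRUE)])  # recycled index keeps positions 2, 4, 6
```

This only works because every element contributes exactly two values in the same order; a missing ind or distance would shift the pattern.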