I have two data.table objects in R, streets and crashes, described below:
head(streets)
link_id Lat Long
1: 706815684 44.13163 9.84736
2: 572513298 46.87760 15.77544
3: 974462021 41.86439 16.04506
4: 906821226 43.30472 11.59198
5: 537724528 46.30359 7.59026
6: 1062652524 44.83993 19.08552
and
head(crashes)
ID_SX Lat Long
1: rca89123 45.35955 9.64950
2: rca89654 37.07544 15.28659
3: rca83674 44.42947 8.89526
4: lcg55792 38.08756 13.53466
5: lcg11992 41.81531 12.45126
6: iix21744 38.02655 12.88128
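For each crash, I want to append to the crashes dataset the link_id of the street in the streets dataset that lies at the minimum distance (computed with the R geosphere package).
I tried this code snippet, but it failed: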
temp=crashes[streets(hdist=geosphere::distm(c(x.Long,x.Lat),c(i.Long,i.Lat),fun=distHaversine)),allow.cartesian=T]
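Note that the streets dataset is very large (about 9 million rows), while crashes is very small (about 400 rows). I believe that in R only data.table can handle this well, but I don't know how...
Thanks in advance for your support.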
Answer 0 (score: 1)
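To avoid a cartesian join of 9 M rows x 400 rows, we can try a non-equi join to narrow down the list of candidates.
The idea is to restrict the search to an "area of vicinity" around each crash site, by selecting only the streets whose Lat and Long lie within a given delta of the crash site. Then we only need to compute the distance to those nearby streets and pick the minimum.
Here is my attempt using the provided data: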
library(data.table)
# define +/- deltas (in degrees) for non-equi join ("area of vicinity")
d_lat <- 2.0
d_lon <- 2.0
streets[crashes[, .(ID_SX, Lat, Long,
                    # create lower and upper bounds
                    lb.lat = Lat - d_lat, ub.lat = Lat + d_lat,
                    lb.lon = Long - d_lon, ub.lon = Long + d_lon)],
        # non-equi join conditions
        on = .(Lat > lb.lat, Lat < ub.lat, Long > lb.lon, Long < ub.lon),
        # x.* are the street coordinates, i.* are the crash coordinates
        .(link_id, x.Lat, x.Long, ID_SX, i.Lat, i.Long)][
  # compute distance (in metres) for each row;
  # distHaversine is namespaced because geosphere is not attached
  , hdist := geosphere::distm(c(x.Long, x.Lat), c(i.Long, i.Lat),
                              fun = geosphere::distHaversine),
  by = .(link_id, ID_SX)][
  # find minimum for each crash site
  , .SD[which.min(hdist)], by = ID_SX]
      ID_SX   link_id    x.Lat   x.Long    i.Lat   i.Long     hdist
1: rca89123 706815684 44.13163  9.84736 45.35955  9.64950 137583.53
2: rca83674 706815684 44.13163  9.84736 44.42947  8.89526  82806.14
3: lcg11992 906821226 43.30472 11.59198 41.81531 12.45126 180146.65
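On the full 9-million-row table, grouping by every (link_id, ID_SX) pair just to call distm() once per row may be slow. As a sketch only, the haversine distance can instead be computed for all candidate rows in one vectorized call. The helper below, haversine_m, is a hypothetical name and is not part of geosphere or data.table. It is a base R implementation that assumes the same sphere radius as geosphere's default (6378137 m). geosphere::distHaversine() also accepts two-column longitude/latitude matrices and could be used instead.

```r
# Vectorized haversine distance in metres (hypothetical helper, base R only).
# r = 6378137 is assumed to match the default radius used by geosphere.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Crash rca89123 vs. street 706815684 from the sample data
d <- haversine_m(9.64950, 45.35955, 9.84736, 44.13163)
```

In the chain above, the per-row step could then be written as hdist := haversine_m(x.Long, x.Lat, i.Long, i.Lat) without the by clause.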
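Note that a street was not found within the "area of vicinity" for every crash site. This is because the sample data contain only a handful of streets.
For production use, d_lat and d_lon need to be tuned: as small as possible to reduce run time and memory consumption, but large enough to find a street for every crash site.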
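One way to pick starting values for d_lat and d_lon is to convert a search radius in kilometres into degrees. The following is a rough sketch, assuming a spherical Earth where one degree of latitude is about 111.32 km and a degree of longitude shrinks with the cosine of the latitude. The function name radius_to_deltas, the 50 km radius and the latitude of 47 degrees are illustrative assumptions, not values from the answer.

```r
# Convert a search radius in km to +/- degree deltas (hypothetical helper).
# Assumes ~111.32 km per degree of latitude on a spherical Earth.
radius_to_deltas <- function(radius_km, lat_deg) {
  km_per_deg <- 111.32
  d_lat <- radius_km / km_per_deg
  # a degree of longitude gets narrower away from the equator
  d_lon <- radius_km / (km_per_deg * cos(lat_deg * pi / 180))
  c(d_lat = d_lat, d_lon = d_lon)
}

# Illustrative: 50 km radius, evaluated at the largest latitude of interest
# so that the box is wide enough for every crash site
deltas <- radius_to_deltas(50, 47)
```

If some crash sites still come back without a match, the radius can be increased and the join rerun for those sites only.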
Data
library(data.table)
streets <- fread(
"i link_id Lat Long
1: 706815684 44.13163 9.84736
2: 572513298 46.87760 15.77544
3: 974462021 41.86439 16.04506
4: 906821226 43.30472 11.59198
5: 537724528 46.30359 7.59026
6: 1062652524 44.83993 19.08552", drop = 1L)
crashes <- fread(
"i ID_SX Lat Long
1: rca89123 45.35955 9.64950
2: rca89654 37.07544 15.28659
3: rca83674 44.42947 8.89526
4: lcg55792 38.08756 13.53466
5: lcg11992 41.81531 12.45126
6: iix21744 38.02655 12.88128", drop = 1L)