R function to calculate nearest-neighbor distance given [inconsistent] constraints?

Asked: 2016-10-10 16:56:05

Tags: r constraints distance nearest-neighbor euclidean-distance

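I have data on tree growth measurements (diameter and height) for trees with known X and Y coordinates. I would like to determine the distance to each tree's nearest neighbor that is equal or greater in size.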

I have seen other SE questions asking about nearest-neighbor calculations, but none of them specifies constraints on which nearest neighbor is searched for.

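Is there a function (or other workaround) that lets me determine the distance to a point's nearest neighbor, given that the nearest point must meet some criterion (e.g., its size must be equal to or greater than that of the point of interest)?

[A more complex set of constraints would be even more helpful...]

  • For my example: specifying that a tree must also be in the same plot as the tree of interest, or be the same species as the tree of interest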

1 answer:

Answer 0: (score: 2)

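I would use non-equi joins and data.table.

EDIT: (FYI, this requires data.table 1.9.7, which you can get from GitHub.)

EDIT2: Did it with a copy of the data.table, since it seemed to be joining on its own thresholds. I will fix this in the future, but for now this works.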

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1), 
              allow.cartesian = TRUE]

# Calculate the squared Euclidean distance (take sqrt() for the actual distance)
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]

The final dataset contains each pair, according to the given thresholds.
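As a cross-check on small inputs, the same constrained search can be written as a brute-force loop in base R. This is only a sketch and not part of the original answer: `nearest_within` and its `tol` argument are hypothetical names, and it returns the actual Euclidean distance (the square root of `dist` above) rather than the squared one.

```r
# Brute-force constrained nearest neighbour (base R only).
# `nearest_within` and `tol` are illustrative names, not from the answer.
nearest_within <- function(d, tol = 10) {
  out <- data.frame(id = d$id, nn_id = NA_integer_, nn_dist = NA_real_)
  for (k in seq_len(nrow(d))) {
    # Candidates: every other tree whose height is within +/- tol
    ok <- d$id != d$id[k] & abs(d$height - d$height[k]) <= tol
    if (!any(ok)) next                      # no valid neighbour: leave NA
    dd <- sqrt((d$x - d$x[k])^2 + (d$y - d$y[k])^2)
    dd[!ok] <- Inf                          # rule out non-candidates
    j <- which.min(dd)
    out$nn_id[k]   <- d$id[j]
    out$nn_dist[k] <- dd[j]
  }
  out
}

# Tiny made-up data set for illustration
toy <- data.frame(id = 1:4,
                  x = c(0, 1, 0, 5),
                  y = c(0, 0, 2, 5),
                  height = c(100, 130, 105, 300))
res <- nearest_within(toy)
res
```

In this toy data, trees 1 and 3 are each other's only valid neighbour (distance 2), while trees 2 and 4 have no candidate within the height tolerance and stay `NA`. The loop takes quadratic time, so it is meant for validating the join on a subsample rather than replacing it.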

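EDIT:

Additional variables:

If you want to join on additional parameters, this allows you to do so (it will probably be even faster if you additionally join on something like plot or species, since the cartesian join will be smaller).

Here is an example joining on two additional categorical variables (species and plot):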

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1,
                            species == species,
                            plot == plot),
              nomatch = NA,
              allow.cartesian = TRUE]

# Calculate the squared Euclidean distance (take sqrt() for the actual distance)
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final

> final
      id         x         y    height species plot  height.1 i.id       i.x       i.y  i.height        dist
  1:   3 0.4837348 0.4325731  91.53387       C    2 111.53387  486 0.5549221 0.4395687 101.53387 0.005116568
  2:  13 0.8267298 0.3137061  94.58949       C    2 114.58949  754 0.8408547 0.2305702 104.58949 0.007111079
  3:  29 0.2905729 0.4952757  89.52128       C    2 109.52128  333 0.2536760 0.5707272  99.52128 0.007054301
  4:  37 0.4534841 0.5249862  89.95493       C    2 109.95493   72 0.4807242 0.6056771  99.95493 0.007253044
  5:  63 0.1678515 0.8814829  84.77450       C    2 104.77450  289 0.1151764 0.9728488  94.77450 0.011122404
 ---                                                                                                        
994: 137 0.8696393 0.2226888  66.57792       C    2  86.57792  473 0.4467795 0.6881008  76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466       A    2 130.14466  338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821       A    2 133.61821  348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652  71.39953       C    3  91.39953  904 0.6954973 0.3690599  81.39953 0.112536771
998: 172 0.6834473 0.6221259  65.52187       A    2  85.52187  783 0.4400028 0.9526355  75.52187 0.168501816
> 

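Note: in the final result there are columns height and height.1. These appear to be produced by data.table's non-equi join and hold the lower and upper bounds (thresh2 and thresh1), respectively, rather than the tree's own height.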
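The column naming can be seen in isolation on a toy join. The sketch below uses made-up tables (`X`, `Y`, `lo`, `hi` and `a_orig` are all illustrative names) and assumes a data.table version with non-equi join support that behaves like the output shown above; keeping a copy of the join column before joining is one way to retain the original value.

```r
library(data.table)

# Toy tables; all names here are illustrative.
X <- data.table(id = 1:3, a = c(1, 5, 9))
Y <- data.table(id = 10L, lo = 4, hi = 10)

# Keep a copy, because the join column `a` is replaced by the bounds in the result
X[, a_orig := a]

res <- X[Y, on = .(a >= lo, a <= hi)]

# In the result, `a` holds Y$lo and `a.1` holds Y$hi; rename them for clarity
setnames(res, c("a", "a.1"), c("lo", "hi"))
res
```

After renaming, `a_orig` carries the matched row's own value, and `lo` and `hi` carry the bounds it was matched against.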

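Memory-efficient solution

One issue for @theforestecologist was that this requires a lot of memory

(in that case there were 42 additional columns being multiplied by the cartesian join, which caused memory problems).

However, we can do this in a more memory-efficient way by using .EACHI (I believe), since we no longer load the whole joined table into memory. That solution is as follows: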

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]


# for clarity inside the brackets, 
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2 

# Join on a range, must be a cartesian join, since there are many candidates 
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2, 
                                 height <= thresh1,
                                 species,
                                 plot),
              .(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
              by = .EACHI,
              nomatch = NA,
              allow.cartesian = TRUE]

# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]

> test
        id          x          y    height species plot i.id        i.x        i.y  i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
   1:    1 0.17622235 0.66547312  84.68450       B    2  965 0.17410840 0.63219350  93.60226         B      2   74.68450   94.68450           B        2
   2:    2 0.04523011 0.33813054  89.46288       B    2  457 0.07267547 0.35725229  88.42827         B      2   79.46288   99.46288           B        2
   3:    3 0.24096368 0.32649256 103.85870       C    3  202 0.20782303 0.38422814  94.35898         C      3   93.85870  113.85870           C        3
   4:    4 0.53160655 0.06636979 101.50614       B    1  248 0.47382417 0.01535036 103.74101         B      1   91.50614  111.50614           B        1
   5:    5 0.83426727 0.55380451 101.93408       C    3  861 0.78210747 0.52812487  96.71422         C      3   91.93408  111.93408           C        3

This way we should keep the total memory usage low.
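A complementary way to cut memory, sketched here under the assumption that the extra columns are not needed during the join itself, is to join on a slimmed-down table and attach the remaining columns afterwards. The names `trees`, `slim`, `extra1` and `extra2` are hypothetical stand-ins for a wider table, and this is not the approach the answer itself uses.

```r
library(data.table)

set.seed(1)
n <- 200
# `extra1` and `extra2` stand in for the many additional columns
trees <- data.table(id = 1:n, x = runif(n), y = runif(n),
                    height = rnorm(n, mean = 100, sd = 10),
                    extra1 = rnorm(n), extra2 = rnorm(n))

# Keep only the columns the join and the distance calculation need
slim <- trees[, .(id, x, y, height)]
slim_self <- copy(slim)[, `:=`(lo = height - 10, hi = height + 10)]

# Same range join as before, but on the narrow tables
pairs <- slim[slim_self, on = .(height >= lo, height <= hi),
              allow.cartesian = TRUE]
pairs[, dist2 := (x - i.x)^2 + (y - i.y)^2]

# Nearest valid neighbour per tree, excluding the tree itself
nn <- pairs[id != i.id,
            .(nn_id = i.id[which.min(dist2)], nn_dist = sqrt(min(dist2))),
            by = id]

# Attach the remaining columns once, after the cartesian step
out <- trees[nn, on = "id"]
```

The cartesian intermediate `pairs` then carries only the coordinate and height columns, however wide the original table is. Trees with no neighbour inside the height band are dropped from `nn` and therefore from `out`.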