我有已知X和树的树木生长测量值(直径和高度)的数据。 Y坐标。我想确定每棵树的最近邻居的距离是否等于或大于。
我已经看到其他SE问题询问最近邻居计算(例如,请参阅here,here,here,here等,但是none指定要搜索的最近邻居的约束。
是否有功能(或其他解决方法)允许我确定点的最近邻居的距离,因为最近点符合某些条件 (例如,大小必须等于或大于兴趣点)?
[一组更复杂的约束会更有帮助......]
答案 0 :(得分:2)
我使用非equijoins和data.table
编辑:( fyi,这需要data.table 1.9.7,你可以从github获得)
EDIT2:用data.table的副本做到了,因为它似乎正在加入自己的门槛。我将来会解决这个问题,但现在这个有用了。
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1),
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
根据给定的阈值
,最终数据集包含每对如果你想加入其他参数,这可以让你这样做,(如果你另外加入情节或物种之类的东西,它可能会更快,因为笛卡尔连接会更小)
以下是加入另外两个分类变量(物种和情节)的示例:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species == species,
plot == plot),
nomatch = NA,
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
id x y height species plot height.1 i.id i.x i.y i.height dist
1: 3 0.4837348 0.4325731 91.53387 C 2 111.53387 486 0.5549221 0.4395687 101.53387 0.005116568
2: 13 0.8267298 0.3137061 94.58949 C 2 114.58949 754 0.8408547 0.2305702 104.58949 0.007111079
3: 29 0.2905729 0.4952757 89.52128 C 2 109.52128 333 0.2536760 0.5707272 99.52128 0.007054301
4: 37 0.4534841 0.5249862 89.95493 C 2 109.95493 72 0.4807242 0.6056771 99.95493 0.007253044
5: 63 0.1678515 0.8814829 84.77450 C 2 104.77450 289 0.1151764 0.9728488 94.77450 0.011122404
---
994: 137 0.8696393 0.2226888 66.57792 C 2 86.57792 473 0.4467795 0.6881008 76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466 A 2 130.14466 338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821 A 2 133.61821 348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652 71.39953 C 3 91.39953 904 0.6954973 0.3690599 81.39953 0.112536771
998: 172 0.6834473 0.6221259 65.52187 A 2 85.52187 783 0.4400028 0.9526355 75.52187 0.168501816
>
注意:在最终答案中,有列height和height.1,后者似乎是由data.table的equi连接产生的,分别代表上下边界。
@theforestecologist的一个问题是,这需要大量的记忆,
(在这种情况下,有另外42列乘以笛卡尔连接,导致mem问题),
但是,我们可以通过使用.EACHI(我相信)以更高效的内存方式执行此操作。因为我们不会将整个表加载到内存中。该解决方案如下:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]
# for clarity inside the brackets,
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2
# Join on a range, must be a cartesian join, since there are many candidates
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species,
plot),
.(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
by = .EACHI,
nomatch = NA,
allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
id x y height species plot i.id i.x i.y i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
1: 1 0.17622235 0.66547312 84.68450 B 2 965 0.17410840 0.63219350 93.60226 B 2 74.68450 94.68450 B 2
2: 2 0.04523011 0.33813054 89.46288 B 2 457 0.07267547 0.35725229 88.42827 B 2 79.46288 99.46288 B 2
3: 3 0.24096368 0.32649256 103.85870 C 3 202 0.20782303 0.38422814 94.35898 C 3 93.85870 113.85870 C 3
4: 4 0.53160655 0.06636979 101.50614 B 1 248 0.47382417 0.01535036 103.74101 B 1 91.50614 111.50614 B 1
5: 5 0.83426727 0.55380451 101.93408 C 3 861 0.78210747 0.52812487 96.71422 C 3 91.93408 111.93408 C 3
这样我们应该保持较低的总内存使用率。