The title is hard to phrase, so this is best explained with an example...
Here I split the iris table into train / test sets of 140 and 10 rows respectively:
library(data.table)
set.seed(2017)
irisDT <- data.table(iris)
irisDT[, IrisId := .I]
train <- irisDT[sample(.N, 140)]
test <- irisDT[!train, on="IrisId"]
> train
Sepal.Length Sepal.Width Petal.Length Petal.Width Species IrisId
1: 6.0 3.0 4.8 1.8 virginica 139
2: 5.5 2.4 3.8 1.1 versicolor 81
3: 5.6 2.5 3.9 1.1 versicolor 70
4: 4.4 3.2 1.3 0.2 setosa 43
5: 6.8 3.0 5.5 2.1 virginica 113
---
136: 6.3 2.5 5.0 1.9 virginica 147
137: 5.8 2.6 4.0 1.2 versicolor 93
138: 5.1 3.5 1.4 0.3 setosa 18
139: 7.7 3.8 6.7 2.2 virginica 118
140: 6.5 3.0 5.5 1.8 virginica 117
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species IrisId
1: 5.4 3.4 1.7 0.2 setosa 21
2: 5.5 3.5 1.3 0.2 setosa 37
3: 5.0 3.5 1.3 0.3 setosa 41
4: 6.3 3.3 4.7 1.6 versicolor 57
5: 6.6 2.9 4.6 1.3 versicolor 59
6: 6.1 2.8 4.0 1.3 versicolor 72
7: 5.5 2.4 3.7 1.0 versicolor 82
8: 6.7 3.1 4.7 1.5 versicolor 87
9: 6.2 2.8 4.8 1.8 virginica 127
10: 6.9 3.1 5.4 2.1 virginica 140
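Now, for each test sample, I want:
- the mean Petal.Width of the train samples of the matching species
- the quantile of the test sample's Petal.Width relative to the train Petal.Width values of the matching species
ecdf_fun <- function(vals, x){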
# Returns the percentile of x in relation to vals
ecdf(vals)(x)
}
vals <- train[test, on="Species", allow.cartesian=TRUE][, list(
AvgPetal.Width = mean(Petal.Width), QuantilePetal.Width=ecdf_fun(Petal.Width, i.Petal.Width[1])
), keyby=i.IrisId]
test[vals, `:=`(AvgPetal.Width = i.AvgPetal.Width, QuantilePetal.Width = i.QuantilePetal.Width), on=c("IrisId"="i.IrisId")]
> test
Sepal.Length Sepal.Width Petal.Length Petal.Width Species IrisId AvgPetal.Width QuantilePetal.Width
1: 5.4 3.4 1.7 0.2 setosa 21 0.2468 0.6809
2: 5.5 3.5 1.3 0.2 setosa 37 0.2468 0.6809
3: 5.0 3.5 1.3 0.3 setosa 41 0.2468 0.8085
4: 6.3 3.3 4.7 1.6 versicolor 57 1.3244 0.9556
5: 6.6 2.9 4.6 1.3 versicolor 59 1.3244 0.5556
6: 6.1 2.8 4.0 1.3 versicolor 72 1.3244 0.5556
7: 5.5 2.4 3.7 1.0 versicolor 82 1.3244 0.1333
8: 6.7 3.1 4.7 1.5 versicolor 87 1.3244 0.9111
9: 6.2 2.8 4.8 1.8 virginica 127 2.0292 0.3125
10: 6.9 3.1 5.4 2.1 virginica 140 2.0292 0.6458
Is there a better way to do this with data.table?
Answer 0 (score: 2)
Well, there is...
setkey(train, Species, Petal.Width)
# get quantile
test[, q := {
pw = Petal.Width
train[.BY, on=names(.BY), findInterval(pw, Petal.Width)/.N]
}, by=Species]
# get mean
test[train[, mean(Petal.Width), by=Species], on=.(Species), m := i.V1 ]
I think avoiding the Cartesian join should let this scale without running out of memory.
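The quantile step relies on findInterval() over sorted values giving the same result as ecdf(). A minimal sketch of that equivalence, using made-up values rather than the actual iris rows (vals and x are illustrative names only):

```r
# Toy values standing in for one species' training Petal.Width (made up)
vals <- c(0.2, 0.2, 0.3, 0.4, 0.4, 0.5)
# Toy test values whose quantiles we want
x <- c(0.1, 0.2, 0.35, 0.5, 0.9)

# findInterval() requires a non-decreasing vector; setkey() on
# (Species, Petal.Width) provides that ordering in the answer above
sorted_vals <- sort(vals)

# Share of training values <= x, computed two ways
q_ecdf <- ecdf(vals)(x)
q_fi   <- findInterval(x, sorted_vals) / length(sorted_vals)

all.equal(q_ecdf, q_fi)
```

With its default arguments, findInterval() counts how many sorted values are less than or equal to each x, which is the numerator of the empirical CDF, so no per-row copy of the training values is needed.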
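How it works
.BY is a list containing the columns named in the by= argument, holding the current group's values. Because the join uses names(.BY), you don't need to rewrite the code if you change it to group by a different set of by= columns.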
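A small sketch of what .BY holds, on toy tables (dt, ref and their columns are hypothetical names chosen for illustration, not part of the question's data):

```r
library(data.table)

dt  <- data.table(g = c("a", "a", "b"), h = c(1L, 1L, 2L), v = c(10, 20, 30))
ref <- data.table(g = c("a", "a", "b"), w = c(1, 3, 5))

# .BY is a named list with one element per by= column,
# so names(.BY) follows whatever by= is set to
res1 <- dt[, .(cols = paste(names(.BY), collapse = ","), total = sum(v)), by = g]
res2 <- dt[, .(cols = paste(names(.BY), collapse = ","), total = sum(v)), by = .(g, h)]

# Using .BY as the i argument looks up the current group in another table,
# the same pattern as train[.BY, on=names(.BY), ...] above
dt[, ref_mean := ref[.BY, on = names(.BY), mean(w)], by = g]
```

Here res1 reports "g" and res2 reports "g,h" in the cols column, with no change to the j expression.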
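X[Y, on=, j] is the syntax for a join: the rows of Y are looked up in X using the columns given in on=, and j is evaluated on the matched rows.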
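For illustration, a minimal sketch of the join syntax on two toy tables (X, Y and their columns are made up for this example):

```r
library(data.table)

X <- data.table(id = c(1L, 2L, 3L), x = c(10, 20, 30))
Y <- data.table(id = c(2L, 3L, 4L), y = c("b", "c", "d"))

# Plain join: one result row per row of Y, with NA where X has no match
joined <- X[Y, on = "id"]

# Join with j: the expression is evaluated on the joined rows
total <- X[Y, on = "id", sum(x, na.rm = TRUE)]

# Update join: add a column to Y by reference; the i. prefix refers to
# columns of the table in the i position (here X), as with m := i.V1 above
Y[X, on = "id", x_val := i.x]
```

The update-join form modifies Y in place and never materialises the joined table, which is the property the answer uses to attach the per-species mean to test.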