Question

The title is hard to phrase, so this is best explained with an example.

Here I split the iris table into train and test sets of 140 and 10 rows respectively:

library(data.table)

set.seed(2017)
irisDT <- data.table(iris)
irisDT[, IrisId := .I]                      # row id, used to split and to join back
train <- irisDT[sample(.N, 140)]
test <- irisDT[!train, on="IrisId"]         # anti-join: the 10 rows not in train

> train
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species IrisId
  1:          6.0         3.0          4.8         1.8  virginica    139
  2:          5.5         2.4          3.8         1.1 versicolor     81
  3:          5.6         2.5          3.9         1.1 versicolor     70
  4:          4.4         3.2          1.3         0.2     setosa     43
  5:          6.8         3.0          5.5         2.1  virginica    113
 ---                                                                    
136:          6.3         2.5          5.0         1.9  virginica    147
137:          5.8         2.6          4.0         1.2 versicolor     93
138:          5.1         3.5          1.4         0.3     setosa     18
139:          7.7         3.8          6.7         2.2  virginica    118
140:          6.5         3.0          5.5         1.8  virginica    117

> test
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species IrisId
 1:          5.4         3.4          1.7         0.2     setosa     21
 2:          5.5         3.5          1.3         0.2     setosa     37
 3:          5.0         3.5          1.3         0.3     setosa     41
 4:          6.3         3.3          4.7         1.6 versicolor     57
 5:          6.6         2.9          4.6         1.3 versicolor     59
 6:          6.1         2.8          4.0         1.3 versicolor     72
 7:          5.5         2.4          3.7         1.0 versicolor     82
 8:          6.7         3.1          4.7         1.5 versicolor     87
 9:          6.2         2.8          4.8         1.8  virginica    127
10:          6.9         3.1          5.4         2.1  virginica    140

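Now, for each test sample, I want

the mean Petal.Width of the matching species in train
the quantile of the test row's Petal.Width relative to the train Petal.Width values of the matching species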
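
To make the two quantities concrete, here is a minimal base R sketch for a single test row. The numbers are made up for illustration (they are not iris values), and "quantile" is taken to mean the share of train values less than or equal to the test value, which is what ecdf() returns.

```r
# Hypothetical toy values, not taken from the iris data
train_widths <- c(1.0, 1.2, 1.3, 1.3, 1.5)  # train Petal.Width for one species
test_width <- 1.3                           # Petal.Width of one test row of that species

# Quantity 1: mean of the train values for the matching species
avg_width <- mean(train_widths)

# Quantity 2: share of train values <= the test value
quantile_width <- ecdf(train_widths)(test_width)
```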

Ugly solution

ecdf_fun <- function(vals, x){
  # Returns the percentile of x in relation to vals

  ecdf(vals)(x)
}

vals <- train[test, on="Species", allow.cartesian=TRUE][, list(
  AvgPetal.Width = mean(Petal.Width), QuantilePetal.Width=ecdf_fun(Petal.Width, i.Petal.Width[1])
  ), keyby=i.IrisId]
test[vals, `:=`(AvgPetal.Width = i.AvgPetal.Width, QuantilePetal.Width = i.QuantilePetal.Width), on=c("IrisId"="i.IrisId")]

> test
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species IrisId AvgPetal.Width QuantilePetal.Width
 1:          5.4         3.4          1.7         0.2     setosa     21         0.2468              0.6809
 2:          5.5         3.5          1.3         0.2     setosa     37         0.2468              0.6809
 3:          5.0         3.5          1.3         0.3     setosa     41         0.2468              0.8085
 4:          6.3         3.3          4.7         1.6 versicolor     57         1.3244              0.9556
 5:          6.6         2.9          4.6         1.3 versicolor     59         1.3244              0.5556
 6:          6.1         2.8          4.0         1.3 versicolor     72         1.3244              0.5556
 7:          5.5         2.4          3.7         1.0 versicolor     82         1.3244              0.1333
 8:          6.7         3.1          4.7         1.5 versicolor     87         1.3244              0.9111
 9:          6.2         2.8          4.8         1.8  virginica    127         2.0292              0.3125
10:          6.9         3.1          5.4         2.1  virginica    140         2.0292              0.6458

Is there a better way to do this with data.table?

Answer 1

Well, there's this:

# sort train by Species, then Petal.Width (findInterval needs sorted values)
setkey(train, Species, Petal.Width)

# get quantile
test[, q := {
  pw = Petal.Width
  train[.BY, on=names(.BY), findInterval(pw, Petal.Width)/.N]
}, by=Species]

# get mean
test[train[, mean(Petal.Width), by=Species], on=.(Species), m := i.V1 ]
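
The quantile step rests on findInterval(x, sorted_values) / n giving the same number as ecdf(sorted_values)(x), provided the values are sorted in increasing order (which the setkey call arranges within each species). A minimal base R check on made-up numbers, with hypothetical variable names:

```r
# Hypothetical values; findInterval requires them sorted in increasing order
vals <- sort(c(0.2, 0.1, 0.4, 0.2, 0.3))
x <- c(0.05, 0.2, 0.35, 0.4)

# Count of vals <= x, divided by the number of vals
via_find_interval <- findInterval(x, vals) / length(vals)

# Empirical CDF evaluated at x
via_ecdf <- ecdf(vals)(x)
```

Both vectors hold the share of vals that are less than or equal to each element of x.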

I'd guess that avoiding the Cartesian join lets this scale without running out of memory.
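
As a rough illustration of the memory point, the sketch below counts the rows the Cartesian join materialises: one per pair of a test row and a matching train row. The seed, the id column and the table names are assumptions made for this example only.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for this sketch
dt <- as.data.table(iris)[, id := .I]
train_dt <- dt[sample(.N, 140)]
test_dt <- dt[!train_dt, on = "id"]

# Rows built by the Cartesian join: one per (test row, matching train row) pair
joined_rows <- nrow(train_dt[test_dt, on = "Species", allow.cartesian = TRUE])

# The same count from per-species sizes, without building the join
train_n <- train_dt[, .(train_n = .N), by = Species]
test_n <- test_dt[, .(test_n = .N), by = Species]
expected_rows <- train_n[test_n, on = "Species", sum(train_n * test_n)]
```

The joined table grows with the product of the per-species sizes, whereas the grouped approach only looks at one species' train rows at a time.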

How it works

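.BY is a list holding the current group's values for the columns named in by=. Because the join is written in terms of .BY and names(.BY), the code does not need rewriting if you switch to a different set of by= columns.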
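
A small sketch of the .BY pattern on toy data may help. The tables lookup and dt and their columns are hypothetical; the point is only that .BY serves as the i of a join inside a grouped j.

```r
library(data.table)

# Hypothetical toy tables
lookup <- data.table(g = c("a", "b"), mult = c(10, 100))
dt <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))

# For each group, .BY is list(g = <current group value>); joining lookup on it
# pulls out that group's multiplier
dt[, scaled := v * lookup[.BY, on = names(.BY), mult], by = g]
```

Adding a column to by= would change names(.BY) along with it, so the inner join would pick up the new key without edits, assuming lookup has that column too.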

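X[Y, on=, j] is the syntax for a join: for each row of Y, it looks up the rows of X that match on the on= columns, then evaluates j on the result.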
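
A minimal sketch of the three join forms used in this thread, on hypothetical toy tables X and Y:

```r
library(data.table)

X <- data.table(k = c("a", "a", "b"), x = c(1, 2, 3))
Y <- data.table(k = c("a", "b", "c"), y = c(10, 20, 30))

# Plain join: one result row per match; a row of Y with no match in X gets NA
joined <- X[Y, on = "k"]

# Join with j: compute on the joined rows
sums <- X[Y, on = "k", .(k, total = x + y)]

# Update join: add a column to Y by reference, taken from per-key means of X;
# the i. prefix refers to columns of the table passed as i
Y[X[, .(mean_x = mean(x)), by = k], on = "k", mean_x := i.mean_x]
```

The last form mirrors the "get mean" step above: aggregate one table by key, then assign the result into the other table with := inside the join.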
