Question

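I want to select a subset of elements from a larger set that satisfies certain conditions. There are about 20 elements, each with several attributes. I want to select the five elements that have the smallest discrepancy from a fixed criterion on one attribute and that give the highest mean on another attribute.

Eventually, I want to apply this to multiple sets of 20 elements.

So far I have been able to identify the subsets "by hand", but in addition to returning the values themselves I would like to return the indices of those values.

Objectives:

I would like to find the set of five values of X1 that has the lowest discrepancy from a fixed value (55) and gives the largest value for the mean of X2.
I would like to do this for multiple sets.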

#####  generating example data
#####  this has five groups, each with two variables x1 and x2
set.seed(271828)

grp <- gl(5,20)
x1 <- round(rnorm(100,45, 12), digits=0)
x2 <- round(rbeta(100,2,4), digits = 2)
id <- seq(1,100,1)

#####  this is how the data would arrive for me to analyze
dat <- as.data.frame(cbind(id,grp,x1,x2))

The data will arrive in this format, with id serving as a unique identifier for each element.

#####  pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]

crit <- 55
x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))

mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2

out <- cbind(x, mean.x, k, y, mean.y)

#####  finding the sets with the least amount of discrepancy
pick <- out[ which(k == min(k)), ]
pick

#####  finding the sets with low discrepancy and high values of y (means of X2) by "hand"
sorted <- out[order(k), ]
head(sorted, n=20)

Looking at the values in pick, I can see that the values of X1 are:

> pick
                    mean.x  k                          mean.y
[1,] 55 47 48 48 52     50 25 0.62 0.08 0.31 0.18 0.54  0.346
[2,] 55 48 48 47 52     50 25 0.62 0.31 0.18 0.48 0.54  0.426

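I would like to return the id values of these elements so that I know I am picking elements 3, 8, 10, 11 and 18 (choosing set 2 because its discrepancy from crit is the same, but its mean of X2, mean.y, is higher).

Doing this "by hand" works for now, but it would be better to make it as "hands-off" as possible.

Any help is much appreciated.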

Answer 1

You are almost there. You can change the definition of sorted to

sorted <- out[order(k, -mean.y), ]

Then sorted[1,] (or sorted[1,,drop=FALSE] if you prefer) is your chosen set.
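
To see why the second key breaks the tie, here is a small sketch with made-up values (not the question's data). order() sorts by its first argument and consults later arguments only among ties, and negating a numeric key makes that key sort in decreasing order.

```r
# Toy values, hypothetical and for illustration only
k      <- c(25, 0, 25, 0)
mean.y <- c(0.30, 0.20, 0.45, 0.50)

# Sort by k ascending; among rows with equal k, by mean.y descending
ord <- order(k, -mean.y)
ord

# First entry: the row with the smallest k and, among those, the largest mean.y
ord[1]
```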

If you want the indices instead of, or in addition to, the values, you can include them from the start. Replace:

x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))

with

idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))

and later include idx in out.

Putting it all together:

#####  pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]

crit <- 55
idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))

mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2

out <- cbind(idx, x, mean.x, k, y, mean.y)

#####  finding the sets with the least amount of discrepancy and among
##### those the largest second mean
pick <- out[order(k, -mean.y)[1],,drop=FALSE]
pick

which gives

                                 mean.x  k                          mean.y
[1,] 3 8 10 11 18 55 48 48 47 52     50 25 0.62 0.31 0.18 0.48 0.54  0.426

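EDIT: I was asked for more explanation of idx. I wanted more room than a comment allows, so I am adding it to my answer. I will also address looping over the subsets.

idx is a matrix (15504 x 5) in which each row is a set of five row indices into the data frame. apply lets you do something to each row in turn (margin 1 means row by row). The function takes the values in a row, uses them to index the desired rows of dat.grp.1, and extracts the corresponding x1 values. I could equally have written dat.grp.1[i,"x1"] as dat.grp.1$x1[i]. In the result of apply, each row of idx becomes a column and the indexed values of dat.grp.1 run down the rows, so the whole thing needs to be transposed.

If you like, you can pull the loop apart to see how each step works. Turn the function into a non-anonymous one,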
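
A small sketch of that transpose, using a hypothetical three-element vector in place of dat.grp.1. apply() over rows returns one column per row of idx, so t() restores the one-row-per-combination layout. Indexing the vector by the whole matrix and reshaping appears to give the same result without apply(); that variant is my own suggestion, not something the question requires.

```r
# Hypothetical toy data (not the question's data)
x1  <- c(10, 20, 30)
idx <- t(combn(seq_along(x1), 2))        # 3 x 2: rows (1,2), (1,3), (2,3)

raw <- apply(idx, 1, function(i) x1[i])  # 2 x 3: one column per row of idx
x   <- t(raw)                            # 3 x 2: one row per combination

# Alternative without apply(): index by the matrix, then restore its shape
x.alt <- matrix(x1[idx], nrow = nrow(idx))

dim(raw)
x
x.alt
```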

f <- function(i) {dat.grp.1[i,"x1"]}

and pass it the rows of idx one at a time.

> f(idx[1,])
[1] 45 27 55 39 41
> f(idx[2,])
[1] 45 27 55 39 29
> f(idx[3,])
[1] 45 27 55 39 47
> f(idx[4,])
[1] 45 27 55 39 48

These are the rows that get bound together into x:

> head(x,4)
     [,1] [,2] [,3] [,4] [,5]
[1,]   45   27   55   39   41
[2,]   45   27   55   39   29
[3,]   45   27   55   39   47
[4,]   45   27   55   39   48

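As for looping over the subsets, the plyr library is very handy for this. The way you set things up (assigning the subset of interest to a variable and working with that) makes the conversion easy. Everything you did to produce the answer for one subset goes into a function, with that subset as the argument.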

find.best.set <- function(dat.grp.1) {
    crit <- 55
    idx <- t(combn(1:nrow(dat.grp.1), 5))
    x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
    y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))

    mean.x <- rowMeans(x)
    mean.y <- rowMeans(y)
    k <- (mean.x - crit)^2

    out <- cbind(idx, x, mean.x, k, y, mean.y)

    out[order(k, -mean.y)[1],,drop=FALSE]
}

This is basically what you had before, minus some unnecessary assignments.

Now wrap it in a plyr call.

library("plyr")
ddply(dat, .(grp), find.best.set)

which gives

  grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12  V13  V14  V15  V16  V17   V18
1   1  3  8 10 11 18 55 48 48 47  52  50  25 0.62 0.31 0.18 0.48 0.54 0.426
2   2  8 10 12 15 16 53 35 55 76  56  55   0 0.71 0.20 0.43 0.50 0.70 0.508
3   3  4 10 15 17 20 47 48 73 55  52  55   0 0.67 0.54 0.28 0.42 0.31 0.444
4   4  2 11 13 17 19 47 46 70 62  50  55   0 0.35 0.47 0.18 0.13 0.47 0.320
5   5  3  6 10 17 19 72 40 58 66  39  55   0 0.33 0.42 0.32 0.32 0.51 0.380

I don't know whether this is the best format for your results, but it mirrors the example you gave.
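
One thing to note about that output: columns V1 to V5 are row positions within each group (1 to 20), not the id values the question asked for. They coincide with the ids only for group 1. Below is a minimal sketch of a variant that reports ids instead. It assumes the same example data as in the question; the function name find.best.ids and its crit and size arguments are made up for illustration, and it uses base R's split() and lapply() in place of ddply().

```r
# Same example data as in the question
set.seed(271828)
grp <- gl(5, 20)
x1  <- round(rnorm(100, 45, 12), digits = 0)
x2  <- round(rbeta(100, 2, 4), digits = 2)
dat <- data.frame(id = seq_len(100), grp = grp, x1 = x1, x2 = x2)

# Hypothetical helper: returns the ids of the best set of `size` rows of d
find.best.ids <- function(d, crit = 55, size = 5) {
    idx    <- t(combn(nrow(d), size))    # one combination of row positions per row
    mean.x <- rowMeans(matrix(d$x1[idx], nrow = nrow(idx)))
    mean.y <- rowMeans(matrix(d$x2[idx], nrow = nrow(idx)))
    k      <- (mean.x - crit)^2
    best   <- order(k, -mean.y)[1]       # smallest k, then largest mean.y
    d$id[idx[best, ]]                    # translate row positions into ids
}

# One vector of five ids per group, in a list named by group
best.ids <- lapply(split(dat, dat$grp), find.best.ids)
best.ids
```

If a matrix is more convenient, do.call(rbind, best.ids) stacks the list into one row per group.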

Selecting and identifying a subset of elements based on criteria
