R data.table: update a column for the entire data using a mean computed on a subset

Date: 2014-07-23 14:02:36

Tags: r data.table

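I want to compute the mean of one column over a subset of the data and write that mean into a new column for the entire data.

Here is some code to make this clearer: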

t <- data.table(Label=c(0,1,0,1,1,1), x=c("aa","aa","aa","aa","bb","bb"), environment=c("train","train","test","test","train","test"))
t
   Label  x environment
1:     0 aa       train
2:     1 aa       train
3:     0 aa        test
4:     1 aa        test
5:     1 bb       train
6:     1 bb        test
setkey(t,x)
t[environment=="train",avg := mean(Label),by=c("x")]

t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test  NA
4:     1 aa        test  NA
5:     1 bb       train 1.0
6:     1 bb        test  NA

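The code above works, except that it does not update the rows where environment == "test". That is expected, because the mean is computed and assigned only on the subset that excludes them.

So I would like to keep the mean computed on the subset, but update the avg column for all rows, including the "test" ones.

The result should therefore be: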

t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test 0.5 # average calculated with train rows only
4:     1 aa        test 0.5 # average calculated with train rows only
5:     1 bb       train 1.0
6:     1 bb        test 1.0 # average calculated with train rows only

2 Answers:

Answer 0 (score: 5)

It seems this is what you are after:

t[environment == "train", avg := mean(Label), by = x][, avg := mean(avg, na.rm = TRUE), by = x]
t 

##   Label  x environment avg
## 1:     0 aa       train 0.5
## 2:     1 aa       train 0.5
## 3:     0 aa        test 0.5
## 4:     1 aa        test 0.5
## 5:     1 bb       train 1.0
## 6:     1 bb        test 1.0
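
The same result can probably be reached in a single grouped step by filtering inside j instead of in i. This is a minimal sketch, not part of the original answer; the table name dt is an assumption, chosen here to avoid masking base R's t() function.

```r
library(data.table)

# Same data as in the question; "dt" is a hypothetical name used only in this sketch.
dt <- data.table(
  Label       = c(0, 1, 0, 1, 1, 1),
  x           = c("aa", "aa", "aa", "aa", "bb", "bb"),
  environment = c("train", "train", "test", "test", "train", "test")
)

# Group by x, average only the train labels inside each group, and let the
# single resulting value be recycled to every row of the group (train and test).
dt[, avg := mean(Label[environment == "train"]), by = x]
dt
```

Because no row filter is used in i, every row receives a value. A group that has no train rows at all would get NaN, since the mean of an empty vector is NaN.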

Answer 1 (score: 2)

You can solve this with data.table alone, but the quickest and most convenient way I found to get the desired answer is to use the na.locf function from zoo:

require(data.table)
require(zoo)
t <- data.table(Label=c(0,1,0,1,1,1), x=c("aa","aa","aa","aa","bb","bb"), environment=c("train","train","test","test","train","test"))

t[environment=="train",avg := mean(Label),by=c("x")]
t[,avg:=na.locf(avg),by=c("x")]

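Just to show that it works, I added an extra, out-of-order test row with a Label value of 5 (which would make the per-group means very different if it were included). This is the output I get: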

 t <- data.table(Label=c(0,1,0,1,1,1,5), x=c("aa","aa","aa","aa","bb","bb","aa"), environment=c("train","train","test","test","train","test","test"))

 t[environment=="train",avg := mean(Label),by=c("x")]
 t[,avg:=na.locf(avg),by=c("x")]
 t
    Label  x environment avg
 1:     0 aa       train 0.5
 2:     1 aa       train 0.5
 3:     0 aa        test 0.5
 4:     1 aa        test 0.5
 5:     1 bb       train 1.0
 6:     1 bb        test 1.0
 7:     5 aa        test 0.5
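
Since na.locf carries the last non-NA value forward, the approach above appears to rely on a train row coming before the test rows within each group. As a sketch of an alternative that does not depend on row order, the train means can be computed into a small lookup table and joined back by x. The names dt and train_means are assumptions made for this illustration, and the on= argument requires data.table 1.9.6 or later.

```r
library(data.table)

# Same seven rows as above, reordered so that a test row comes first in group "aa".
# "dt" and "train_means" are hypothetical names used only in this sketch.
dt <- data.table(
  Label       = c(5, 0, 1, 0, 1, 1, 1),
  x           = c("aa", "aa", "aa", "aa", "aa", "bb", "bb"),
  environment = c("test", "train", "train", "test", "test", "train", "test")
)

# One row per group, holding the mean of the train labels only.
train_means <- dt[environment == "train", .(avg = mean(Label)), by = x]

# Update join: for every row of dt, look up its group's train mean by x.
# The i. prefix refers to the column coming from train_means.
dt[train_means, avg := i.avg, on = "x"]
dt
```

Groups that have no train rows are simply absent from train_means, so their avg stays NA.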