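I want to compute the mean of a column over a subset of the data, and write that mean into a new column for every row of the data.
Here is some code to make this clearer: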
t <- data.table(Label=c(0,1,0,1,1,1), x=c("aa","aa","aa","aa","bb","bb"), environment=c("train","train","test","test","train","test"))
t
   Label  x environment
1:     0 aa       train
2:     1 aa       train
3:     0 aa        test
4:     1 aa        test
5:     1 bb       train
6:     1 bb        test
setkey(t,x)
t[environment=="train",avg := mean(Label),by=c("x")]
t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test  NA
4:     1 aa        test  NA
5:     1 bb       train 1.0
6:     1 bb        test  NA
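The code above works, except that it does not update the rows where environment == "test". That is expected, because the mean is computed and assigned only on the subset that excludes those rows.
So I want to keep the mean computed from the "train" subset, but fill the avg column for all rows, including the "test" ones.
The result should be: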
t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test 0.5 # average calculated with train rows only
4:     1 aa        test 0.5 # average calculated with train rows only
5:     1 bb       train 1.0
6:     1 bb        test 1.0 # average calculated with train rows only
Answer 0 (score: 5)
It seems this is what you are after:
t[environment == "train", avg := mean(Label), by = x][, avg := mean(avg, na.rm = TRUE), by = x]
t
##    Label  x environment avg
## 1:     0 aa       train 0.5
## 2:     1 aa       train 0.5
## 3:     0 aa        test 0.5
## 4:     1 aa        test 0.5
## 5:     1 bb       train 1.0
## 6:     1 bb        test 1.0
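As an alternative sketch (not taken from the answer above), the same result can probably be reached without the intermediate NA values. The example below shows two variants: subsetting Label inside j so the per-group mean uses only the "train" rows, and computing the per-group train means first and writing them back with an update join. The names dt, train_means, and avg_join are illustrative assumptions; dt is used instead of t to avoid masking base::t.

```r
library(data.table)

# Same data as in the question, under the illustrative name `dt`.
dt <- data.table(
  Label = c(0, 1, 0, 1, 1, 1),
  x = c("aa", "aa", "aa", "aa", "bb", "bb"),
  environment = c("train", "train", "test", "test", "train", "test")
)

# Variant 1: within each group of x, average only the "train" labels.
# The single value is recycled to every row of the group.
dt[, avg := mean(Label[environment == "train"]), by = x]

# Variant 2: compute the train means per group, then update-join them back.
# `avg_join` is a hypothetical column name used only to compare both variants.
train_means <- dt[environment == "train", .(avg_join = mean(Label)), by = x]
dt[train_means, on = "x", avg_join := i.avg_join]

print(dt)
```

The two variants differ for a group that has no "train" rows at all: the first yields NaN (the mean of an empty vector), while the join leaves NA because no matching row exists in train_means.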
Answer 1 (score: 2)
You could solve this with data.table alone, but the quickest and most convenient way I found to get the desired answer is to use the na.locf function from zoo:
require(data.table)
require(zoo)
t <- data.table(Label=c(0,1,0,1,1,1), x=c("aa","aa","aa","aa","bb","bb"), environment=c("train","train","test","test","train","test"))
t[environment=="train",avg := mean(Label),by=c("x")]
t[,avg:=na.locf(avg),by=c("x")]
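Just to show that it works, I added an extra, out-of-order test row with a Label value of 5 (so that a mean taken over the whole group would be very different). This is the output I get: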
t <- data.table(Label=c(0,1,0,1,1,1,5), x=c("aa","aa","aa","aa","bb","bb","aa"), environment=c("train","train","test","test","train","test","test"))
t[environment=="train",avg := mean(Label),by=c("x")]
t[,avg:=na.locf(avg),by=c("x")]
t
   Label  x environment avg
1:     0 aa       train 0.5
2:     1 aa       train 0.5
3:     0 aa        test 0.5
4:     1 aa        test 0.5
5:     1 bb       train 1.0
6:     1 bb        test 1.0
7:     5 aa        test 0.5
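One caveat worth checking: in every example above, each group starts with a "train" row, so there is always an earlier value to carry forward. The sketch below assumes a hypothetical data set in which a "test" row comes first within a group, and fills in both directions by calling na.locf twice with na.rm = FALSE, the second time with fromLast = TRUE. The name dt and the row order are illustrative assumptions, not part of the original answer.

```r
library(data.table)
library(zoo)

# Hypothetical data: the first row of each group is a "test" row,
# so its avg is a leading NA after the first step.
dt <- data.table(
  Label = c(5, 0, 1, 1, 1),
  x = c("aa", "aa", "aa", "bb", "bb"),
  environment = c("test", "train", "train", "test", "train")
)

# Step 1: group mean over the "train" rows only; "test" rows stay NA.
dt[environment == "train", avg := mean(Label), by = x]

# Step 2: carry values forward, then backward, within each group.
# na.rm = FALSE keeps the vector length equal to the group size.
dt[, avg := na.locf(na.locf(avg, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE),
   by = x]

print(dt)
```

This relies on all non-NA avg values within a group being identical, which holds here because step 1 assigns a single mean per group.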