I am working with a large data set and have run into a problem with data cleaning. My data set looks like this:
data <- cbind (group = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
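I want to keep groups whose scores sum to 1 as they are, and delete entire groups whose scores sum to 0. For groups whose scores sum to more than 1 (for example, a sum of 3), I want to randomly select two of the group members with score equal to 1 and delete them from the group, so that a single scoring member remains. The data might then look like this: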
newdata <- cbind (group = c(1,1,1,3,3,4,4,4),
member = c(1,2,3,2,3,1,3,5),
score = c(0,1,0,0,1,0,1,0))
Can anyone help me get this done?
Answer 0 (score: 1)
I would define a function that does what you want, then use ddply and split by group.
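myfun <- function(x) {
  total <- sum(x$score)
  if (total == 1) {
    ## exactly one member scored: keep the group unchanged
    return(x)
  } else if (total == 0) {
    ## nobody scored: drop the whole group
    return(data.frame())
  } else {
    ## several members scored: randomly remove all but one of them
    ones <- which(x$score == 1)
    score.1 <- sample(ones, length(ones) - 1)
    return(x[-score.1, ])
  }
}
library(plyr)
## `data` is the matrix built in the question
ddply(as.data.frame(data), .(group), myfun)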
group member score
1 1 1 0
2 1 2 1
3 1 3 0
4 3 1 1
5 4 1 0
6 4 2 1
7 4 3 1
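If you would rather not depend on plyr, the same split-apply-combine idea can be sketched in base R. This is an illustrative variant rather than the code from the answer: `keep_one_scorer` and `clean_groups` are hypothetical names, and the sketch assumes scores are only 0 or 1.

```r
# Hypothetical helper: process the rows of a single group.
# Assumes the score column only contains 0 and 1.
keep_one_scorer <- function(x) {
  ones <- which(x$score == 1)
  if (length(ones) == 0) return(x[0, , drop = FALSE])  # no scorer: drop the group
  if (length(ones) == 1) return(x)                     # one scorer: keep as is
  # several scorers: pick one position to keep, remove the other scorers
  keep <- ones[sample.int(length(ones), 1)]
  x[-setdiff(ones, keep), , drop = FALSE]
}

# Hypothetical wrapper: split by group, apply the helper, stack the pieces.
clean_groups <- function(df) {
  pieces <- lapply(split(df, df$group), keep_one_scorer)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

dat <- data.frame(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
                  score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
set.seed(1)
res <- clean_groups(dat)
res
```

Which scoring member survives in groups 3 and 4 depends on the random draw, but group 2 is always removed and every remaining group keeps exactly one row with score 1.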
Answer 1 (score: 1)
I would write a function that combines the various operations. Here is one such function, heavily commented:
process <- function(x) {
  ## add a column holding each group's total score
  x <- within(x, sumScore <- ave(score, group, FUN = sum))
  ## drop the groups with sumScore == 0 (logical indexing, so no rows are
  ## lost when no such group exists)
  x <- x[x$sumScore != 0, , drop = FALSE]
  ## for one group with sumScore > 1:
  ## sample sumScore - 1 of the rows where score == 1L and remove them
  foo <- function(x) {
    scr <- unique(x$sumScore) ## sanity & take only 1 of the sumScore
    ## which of the group's observations have score == 1L
    want <- which(x$score == 1L)
    ## want to sample all bar one of these
    want <- sample(want, scr - 1)
    ## remove the selected rows & return
    x[-want, , drop = FALSE]
  }
  ## which rows belong to groups with sumScore > 1
  multi <- x$sumScore > 1L
  ## select only those rows, split them by group, lapply foo to each
  ## group, then rbind the resulting data frames together
  newX <- do.call(rbind,
                  lapply(split(x[multi, , drop = FALSE], x[multi, "group"]),
                         FUN = foo))
  ## bind the sampled groups back on to the groups that were left untouched
  newX <- rbind(x[!multi, , drop = FALSE], newX)
  ## remove row labels
  rownames(newX) <- NULL
  ## return the data without the sumScore column
  newX[, 1:3]
}
With your data:
dat <- data.frame(group = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
this gives:
> set.seed(42)
> process(dat)
group member score
1 1 1 0
2 1 2 1
3 1 3 0
4 3 1 1
5 3 2 0
6 4 1 0
7 4 3 1
8 4 5 0
which I think is what you wanted.
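A quick way to confirm that a result meets the requirement is to recompute the per-group totals. The sketch below copies the rows from the printed result above into a data frame (the name `res` is only for illustration) and checks that every remaining group sums to exactly 1.

```r
# Rows copied from the printed result above.
res <- data.frame(group  = c(1,1,1,3,3,4,4,4),
                  member = c(1,2,3,1,2,1,3,5),
                  score  = c(0,1,0,1,0,0,1,0))

# Per-group score totals; every remaining group should total exactly 1.
totals <- tapply(res$score, res$group, sum)
totals

# Stops with an error if any group violates the rule.
stopifnot(all(totals == 1))
```

The same check can be run on the output of any of the approaches in this thread, since it only depends on the `group` and `score` columns.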
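Update: In process() above, the internal function foo() could be rewritten to sample just 1 of the score == 1L rows to keep and drop the other score == 1L rows. I.e. replace foo() with the version below:
foo <- function(x) {
  ## which of the group's observations have score == 1L
  ones <- which(x$score == 1L)
  ## sample just one of these to keep (by position, so a single candidate is safe)
  keep <- ones[sample.int(length(ones), 1)]
  ## return the rows with score != 1L plus the one selected row
  x[sort(c(which(x$score != 1L), keep)), , drop = FALSE]
}
The two versions are essentially the same operation, but the foo() that selects just 1 row makes the intended behaviour explicit: we want to pick 1 row at random from those with score == 1L, rather than sample scr-1 values.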
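One caveat when sampling row positions with sample(): if its first argument is a single number that is at least 1, sample() draws from 1:x instead of returning that number. Inside process() this does not bite, because foo() is only applied to groups with at least two scoring rows, but it matters if the sampling step is reused elsewhere. A minimal sketch of a safer idiom follows; `pick_one` is a hypothetical helper name.

```r
# Hypothetical helper: choose one element of `idx` by position, so a
# length-one vector such as c(5L) yields 5 rather than a draw from 1:5.
pick_one <- function(idx) {
  idx[sample.int(length(idx), 1)]
}

pick_one(c(5L))          # always 5
pick_one(c(2L, 3L, 4L))  # one of 2, 3, 4
```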
Answer 2 (score: 0)
ugroups<-unique(data[,1])
scores<-sapply(ugroups,function(x){sum(data[,1]==x & data[,3]==1)})
data[data[,1]%in%ugroups[scores>0],]
....... etc
This will give you the cumulative score for each group, and so on.
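The per-group totals can also be computed with tapply(). The sketch below works on the matrix from the question and covers only the filtering step (dropping groups with no scoring member); the random removal of surplus scorers is not handled here. The names `totals` and `keep` are only for illustration.

```r
# The matrix from the question.
data <- cbind(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
              member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
              score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))

# Total score per group; names(totals) holds the group labels.
totals <- tapply(data[, "score"], data[, "group"], sum)
totals

# Keep only the rows whose group has a positive total.
keep <- as.numeric(names(totals)[totals > 0])
data[data[, "group"] %in% keep, ]
```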