Question

I have a large dataset like this:

 df <- data.frame(group = c(rep(1, 6), rep(5, 6)), score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14, 5))

      group score
 1      1    30
 2      1    10
 3      1    22
 4      1    44
 5      1     6
 6      1     5
 7      5    20
 8      5    35
 9      5     2
 10     5    60
 11     5    14
 12     5     5

...

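Within each group, I want to subtract each score from the one before it and, if the difference is greater than 30, remove the smaller score. For example, in group 1: 30 - 10 = 20 < 30, 10 - 22 = -12 < 30, 22 - 44 = -22 < 30, 44 - 6 = 38 > 30 (remove 6), 44 - 5 = 39 > 30 (remove 5), and so on. The expected output should look like this: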

      group score
 1     1    30
 2     1    10
 3     1    22
 4     1    44
 5     5    20
 6     5    35
 7     5    60

...

Does anyone know how to achieve this?

Answer 1

Like this?

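repeat {
  # per-group difference between each score and the one before it (0 for the first row);
  # assumes the rows of df are already ordered by group
  df$diff <- unlist(by(df$score, df$group, function(x) c(0, -diff(x))))
  # stop once no remaining difference is greater than 30
  if (all(df$diff <= 30)) break
  df <- df[df$diff <= 30, ]
}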
df$diff <- NULL
df
#    group score
# 1      1    30
# 2      1    10
# 3      1    22
# 4      1    44
# 7      5    20
# 8      5    35
# 10     5    60

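This (seems to...) require an iterative approach, because the "adjacent scores" change after a row is removed. Before 6 is removed, the differences are 44 - 6 > 30 but 6 - 5 < 30. After 6 is removed, the difference is 44 - 5 > 30.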

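So this computes the differences between consecutive rows by group (using by(...) and diff(...)), removes the offending rows, and repeats the process until no difference exceeds the threshold of 30.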
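To make the need for a second pass concrete, here is a small sketch using only the group 1 scores from the question. The variable names score and score2 are just for illustration.

```r
# Group 1 scores from the question
score <- c(30, 10, 22, 44, 6, 5)

# First pass: difference between each score and the one before it
-diff(score)
# 20 -12 -22  38   1   -> only 6 is flagged (44 - 6 = 38 > 30)

# Drop the flagged rows (the first row is always kept)
score2 <- score[c(0, -diff(score)) <= 30]
score2
# 30 10 22 44  5

# Second pass: 5 is now adjacent to 44 and gets flagged as well
-diff(score2)
# 20 -12 -22  39
```

A single call to diff() therefore misses the 5, which is why the loop keeps going until nothing else is flagged.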

Answer 2

It's not elegant, but it should work:

out = data.frame(group = numeric(), score = numeric())
#cycle through the groups
for(g in levels(as.factor(df$group))){
    temp = subset(df, df$group == g)
    #now go through the scores, starting from the first one as the reference
    left = temp$score[1]
    #seq_len(...)[-1] is empty for a one-row group, unlike seq(2, 1)
    for(s in seq_len(nrow(temp))[-1]){
        if(left - temp$score[s] > 30){#Test the condition
            temp$score[s] = NA #mark the smaller score for removal
        }else{
            left = temp$score[s] #otherwise this score becomes the new reference
        }
    }
    #Add only the rows without NAs to the out
    out = rbind(out, temp[which(!is.na(temp$score)),])
}

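There should be a way to do this using ave, but carrying the last kept value forward when the next one is removed because diff > 30 is tricky! I'd appreciate it if someone has a better solution.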
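One possible way to do this with ave, offered only as a sketch: move the "carry the last kept value" loop into a helper that returns a keep/drop flag per row, and let ave apply it group by group. The helper name keep_flags and its threshold argument are made up for this example and are not part of any package.

```r
df <- data.frame(group = c(rep(1, 6), rep(5, 6)),
                 score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14, 5))

# keep_flags() is a hypothetical helper: it walks through one group's scores,
# remembers the last kept score, and flags a score for removal when the
# last kept score exceeds it by more than `threshold`.
keep_flags <- function(x, threshold = 30) {
  keep <- rep(TRUE, length(x))
  left <- x[1]
  for (i in seq_along(x)[-1]) {
    if (left - x[i] > threshold) {
      keep[i] <- FALSE   # drop the smaller score, keep the old reference
    } else {
      left <- x[i]       # this score becomes the new reference
    }
  }
  keep
}

# ave() writes the flags back into a numeric vector, so TRUE/FALSE become 1/0
keep <- ave(df$score, df$group, FUN = keep_flags) == 1
result <- df[keep, ]
result
```

On the sample data this keeps rows 1, 2, 3, 4, 7, 8 and 10, matching the expected output in the question. Because ave puts each flag back in its original row position, this version does not depend on the rows being sorted by group.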

Answer 3

You could try

df
##    group score
## 1      1    30
## 2      1    10
## 3      1    22
## 4      1    44
## 5      1     6
## 6      1     5
## 7      5    20
## 8      5    35
## 9      5     2
## 10     5    60
## 11     5    14
## 12     5     5

tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(FALSE, -diff(x) > 30), simplify = TRUE)), ]
while (!identical(df, tmp)) {
    df <- tmp
    tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(FALSE, -diff(x) > 30), simplify = TRUE)), ]
}
tmp
##    group score
## 1      1    30
## 2      1    10
## 3      1    22
## 4      1    44
## 7      5    20
## 8      5    35
## 10     5    60

Removing rows based on subtraction results
