Aggregate based on the next row

Date: 2018-02-27 14:52:59

Tags: r dataframe aggregate

Suppose I have the following data frame:

Category = c("blue", "red",  "red", "blue", "blue", "blue", "red", "red", "red","blue", "red", "red","blue","blue","red","blue","red")
Purchase  = c(0,1,1,0,0,0,1,1,1,0,1,1,0,0,1,0,1)
Number  = c(1,1,1,1,2,2,2,2,2,1,1,2,2,2,2,2,2)
Id = c("a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b")
Country = c("NL","BE","BE","UK","UK","NL","UK","UK","UK","BE","NL","NL","BE","UK","UK","BE","NL")

df = data.frame(Id, Number,Category, Purchase, Country)
    > df
   Id Number Category Purchase Country
1   a      1     blue        0      NL
2   a      1      red        1      BE
3   a      1      red        1      BE
4   a      1     blue        0      UK
5   a      2     blue        0      UK
6   a      2     blue        0      NL
7   a      2      red        1      UK
8   a      2      red        1      UK
9   a      2      red        1      UK
10  b      1     blue        0      BE
11  b      1      red        1      NL
12  b      2      red        1      NL
13  b      2     blue        0      BE
14  b      2     blue        0      UK
15  b      2      red        1      UK
16  b      2     blue        0      BE
17  b      2      red        1      NL

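I want to aggregate rows where a red row is followed by another red row, grouped by Id and Number, so that Purchase is summed over those rows. My desired output is therefore: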

    > desired
   Id Number Category Purchase Country
1   a      1     blue        0      NL
2   a      1      red        2      BE
3   a      1     blue        0      UK
4   a      2     blue        0      UK
5   a      2     blue        0      NL
6   a      2      red        3      UK
7   b      1     blue        0      BE
8   b      1      red        1      NL
9   b      2      red        1      NL
10  b      2     blue        0      BE
11  b      2     blue        0      UK
12  b      2      red        1      UK
13  b      2     blue        0      BE
14  b      2      red        1      NL

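So the order in which the categories appear should be preserved, and only rows whose Category is "red" should be aggregated. In addition, my real data frame has several more columns, such as the Country column, that I also want to appear in the output, but I don't want to list all of these columns manually. I have tried the aggregate function and ddply, but I still haven't worked it out.

Can someone help me with this aggregation problem, where the order of the rows is taken into account?

3 answers:

Answer 0 (score: 3)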

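Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by the run-length id of the logical vector (Category == "red") together with 'Id', 'Number' and 'Category'. If a group has more than one element and all elements of 'Category' are 'red', return the sum of 'Purchase'; else return 'Purchase' unchanged.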

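library(data.table)
# Group by runs of (Category == "red") plus Id, Number and Category;
# sum Purchase only within red runs longer than one row.
setDT(df)[, .(Purchase = if (.N > 1 && all(Category == "red")) sum(Purchase)
            else Purchase), by = .(grp = rleid(Category == "red"), Id, Number, Category)
          ][, grp := NULL][]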
#    Id Number Category Purchase
# 1:  a      1     blue        0
# 2:  a      1      red        2
# 3:  a      1     blue        0
# 4:  a      2     blue        0
# 5:  a      2     blue        0
# 6:  a      2      red        3
# 7:  b      1     blue        0
# 8:  b      1      red        1
# 9:  b      2      red        1
#10:  b      2     blue        0
#11:  b      2     blue        0
#12:  b      2      red        1
#13:  b      2     blue        0
#14:  b      2      red        1
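To make the "run-length id" used for grouping more concrete, here is a small base R sketch that builds the same kind of numbering by hand with rle. The vector is_red holds the first nine values of Category == "red" from the example; the names is_red, runs and run_id are only illustrative.

```r
# First nine values of Category == "red" in the example data
is_red <- c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# rle() compresses the vector into runs of equal values
runs <- rle(is_red)

# Give every run its own integer id and expand back to one id per row
run_id <- rep(seq_along(runs$lengths), runs$lengths)
print(run_id)
# 1 2 2 3 3 3 4 4 4
```

Consecutive red rows share one id, so grouping on it (together with Id and Number) is what lets the sum apply only to adjacent reds.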

Answer 1 (score: 2)

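# temp is a counter that advances once per run of consecutive "red" rows
# and once for every other row, so the rows of a red run share one value.
df$temp = with(data = rle(as.character(df$Category)),
     cumsum(unlist(sapply(seq_along(values), function(i){
         if(values[i] == "red"){
             c(1, rep(0, lengths[i]-1))
         }else{
             rep(1, lengths[i])
         }}))))
# Sum Purchase over every combination of the remaining columns (temp included).
aggregate(Purchase~., df, sum)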
#   Id Number Category temp Purchase
#1   a      1     blue    1        0
#2   a      1      red    2        2
#3   a      1     blue    3        0
#4   a      2     blue    4        0
#5   a      2     blue    5        0
#6   a      2      red    6        3
#7   b      1     blue    7        0
#8   b      1      red    8        1
#9   b      2      red    8        1
#10  b      2     blue    9        0
#11  b      2     blue   10        0
#12  b      2      red   11        1
#13  b      2     blue   12        0
#14  b      2      red   13        1
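The question also asks that extra columns such as Country be carried through without naming them. A minimal base R sketch of one way to do that follows. It assumes that taking the values from the first row of each collapsed red run is acceptable, which the question does not state explicitly. The helper name collapse_red_runs is hypothetical and not part of any package.

```r
# Rebuild the example data from the question (strings kept as character).
df <- data.frame(
  Id       = rep(c("a", "b"), times = c(9, 8)),
  Number   = c(1,1,1,1,2,2,2,2,2,1,1,2,2,2,2,2,2),
  Category = c("blue","red","red","blue","blue","blue","red","red","red",
               "blue","red","red","blue","blue","red","blue","red"),
  Purchase = c(0,1,1,0,0,0,1,1,1,0,1,1,0,0,1,0,1),
  Country  = c("NL","BE","BE","UK","UK","NL","UK","UK","UK",
               "BE","NL","NL","BE","UK","UK","BE","NL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse consecutive "red" rows within the same
# Id/Number, summing Purchase and keeping the first row's other columns.
collapse_red_runs <- function(d) {
  is_red <- d$Category == "red"
  same_as_prev <- function(x) c(FALSE, head(x, -1) == tail(x, -1))
  # TRUE where a red row continues a red run in the same Id/Number
  continues <- is_red & c(FALSE, head(is_red, -1)) &
    same_as_prev(d$Id) & same_as_prev(d$Number)
  key <- cumsum(!continues)              # one key per output row
  out <- d[!duplicated(key), , drop = FALSE]
  out$Purchase <- as.vector(tapply(d$Purchase, key, sum))
  rownames(out) <- NULL
  out
}

result <- collapse_red_runs(df)
print(result)
```

On the example data this gives the 14 rows of the desired table, with Country included and the original row order preserved.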

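Answer 2 (score: 1)

Here is a way using dplyr.

First, I build a subgroup counter that increments whenever the colour changes; together with Id and Number, it defines the sub-data.frames.

Then I use do on the sub-data.frames that contain red to aggregate Purchase.

Then I clean up the grouping and the extra column.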

library(dplyr)

df %>%
  group_by(Id,Number,subgroup = cumsum(c(TRUE,head(Category,-1) != tail(Category,-1)))) %>%
  do({if(.$Category[1] == "red") aggregate(Purchase ~ .,.,sum) else .}) %>%
  ungroup %>%
  select(-subgroup) 

# # A tibble: 14 x 4
#        Id Number Category Purchase
#    <fctr>  <dbl>   <fctr>    <dbl>
#  1      a      1     blue        0
#  2      a      1      red        2
#  3      a      1     blue        0
#  4      a      2     blue        0
#  5      a      2     blue        0
#  6      a      2      red        3
#  7      b      1     blue        0
#  8      b      1      red        1
#  9      b      2      red        1
# 10      b      2     blue        0
# 11      b      2     blue        0
# 12      b      2      red        1
# 13      b      2     blue        0
# 14      b      2      red        1