[R]中的条件分组和汇总数据框

时间:2015-05-27 16:06:04

标签: r dplyr

我有一个这样的数据框:

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"), 
                 time = c(3.1,3.2,6.5,12.3, 3.2, 3.4), 
                 intensity = c(10, 20, 30, 40, 50, 60))
|ID | time| intensity|
|:--|----:|---------:|
|A  |  3.1|        10|
|A  |  3.2|        20|
|B  |  6.5|        30|
|B  | 12.3|        40|
|C  |  3.2|        50|
|C  |  3.4|        60|

当时差小于0.3时,我想通过ID 聚合值(和强度)。首先,我计算了这个时差:

df.2 <- df %>% 
        group_by(ID) %>% 
        mutate(time.diff = max(time) - min(time)) 

...导致:

|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  |  3.1|        10|       0.1|
|A  |  3.2|        20|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.2|        50|       0.2|
|C  |  3.4|        60|       0.2|

为了清楚起见,我希望得到的结果是:

|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  | 3.15|        30|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.3|       110|       0.2|

现在时间是综合观测的平均值,而强度是它们的总和。 ID&#34; B&#34;保持两个观察,因为它的时间差大于0.3。我尝试过使用dplyr,但总结总会丢掉其中一个&#34; B&#34;的观察结果,我想保留它们,我不知道该怎么做条件 _group_by_。

我感谢你的任何想法!!

4 个答案:

答案 0 :(得分:3)

data.table

的可能选项
library(data.table)
unique(setDT(df)[, time.diff := max(time)-min(time), ID][
   time.diff <= 0.3, c('time', 'intensity') := list(mean(time),
        sum(intensity)), ID]) 
#    ID  time intensity time.diff
#1:  A  3.15        30       0.1
#2:  B  6.50        30       5.8
#3:  B 12.30        40       5.8
#4:  C  3.30       110       0.2

或使用dplyr

library(dplyr)
df %>% 
   group_by(ID) %>%
   mutate(time.diff=max(time)-min(time), indx=all(time.diff<=0.3),
         intensity=ifelse(indx, sum(intensity), intensity),
         time=ifelse(indx, mean(time), time)) %>% 
   filter(!indx|row_number()==1) %>%
   select(-indx)
 #  ID  time intensity time.diff
 #1  A  3.15        30       0.1
 #2  B  6.50        30       5.8
 #3  B 12.30        40       5.8
 #4  C  3.30       110       0.2

答案 1 :(得分:3)

data.table解决方案的另一种变体:

setDT(df)[, time.diff := max(time) - min(time), by = ID
        ][, if (time.diff <= 0.3) 
                .(time = mean(time), intensity = sum(intensity))
            else .SD, by = .(ID, time.diff)]
#    ID time.diff  time intensity
# 1:  A       0.1  3.15        30
# 2:  B       5.8  6.50        30
# 3:  B       5.8 12.30        40
# 4:  C       0.2  3.30       110

答案 2 :(得分:1)

# get time.diff
df$time.diff <- ave(x = df$time,df$ID,FUN = function(x){max(x)-min(x)})

# new split variable to use with ID
df$cut <- cumsum(df$time.diff > .3)

# aggregate everything you need and ignore the cut variable
require(plyr)
ddply(df,c('cut','ID'),summarize,
      time = mean(time),
      intensity = sum(intensity),
      time.diff = mean(time.diff))[2:5]

答案 3 :(得分:1)

使用<p:commandLink action="#{verifyCredentials.save(klasse, modul)}">

sqldf

输出:

library(sqldf)
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif FROM df 
         GROUP BY ID 
         HAVING (MAX(time)-MIN(time))<0.3
         UNION
         SELECT ID, df.time, df.intensity, df2.dif
         FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
         FROM df 
         GROUP BY ID 
         HAVING (MAX(time)-MIN(time))>0.3) as df2
         LEFT JOIN df USING (ID)')