How do I group rows in a data frame while counting occurrences in one column and summing the values in another?

Time: 2016-11-19 18:03:38

Tags: r plyr

I am trying to modify my data frame:

  start end duration_time
1     1   2         2.438
2     2   1         3.901
3     1   2        18.037
4     2   3        85.861
5     3   4        83.922

and create something like this:

  start end duration_time weight
1     1   2        20.475      2
2     2   1         3.901      1
4     2   3        85.861      1
5     3   4        83.922      1

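That is, duplicate start-end combinations are collapsed into a single row, weight counts how many times each combination occurred, and duration_time is summed.

I already have part of it working, but I can't figure out how to compute the weight: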

library('plyr')

df <- read.table(header = TRUE, text = "start end duration_time
1     1   2         2.438
2     2   1         3.901
3     1   2        18.037
4     2   3        85.861
5     3   4        83.922")

ddply(df, c("start","end"), summarise, weight=? ,duration_time=sum(duration_time))

2 Answers:

Answer 0: (score: 1)

A base R option is aggregate:

# df is the data frame from the question; "." means "group by all other columns"
do.call(data.frame, aggregate(duration_time ~ ., df,
       FUN = function(x) c(duration_time = sum(x), weight = length(x))))
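
Why the do.call(data.frame, ...) wrapper is there may not be obvious: when FUN returns a named vector, aggregate() stores the results as a matrix inside a single duration_time column, and the wrapper flattens that matrix into ordinary columns. The sketch below assumes the sample data from the question; the shorter names inside FUN and the final renaming step are my own additions, not part of the answer above.

```r
# Sample data from the question.
df <- read.table(header = TRUE, text = "start end duration_time
1     1   2         2.438
2     2   1         3.901
3     1   2        18.037
4     2   3        85.861
5     3   4        83.922")

# aggregate() keeps the two values returned by FUN as a matrix column.
agg <- aggregate(duration_time ~ start + end, df,
                 FUN = function(x) c(sum = sum(x), weight = length(x)))

# Flatten the matrix column into two regular columns
# (duration_time.sum and duration_time.weight).
res <- do.call(data.frame, agg)

# Rename to match the layout requested in the question.
names(res) <- c("start", "end", "duration_time", "weight")
res
```

Here start + end is equivalent to the "." in the original formula, since those are the only other columns in df.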

Answer 1: (score: 0)

The simplest solution uses data.table:

library(data.table)
setDT(df)[, .(duration_time=sum(duration_time), wt = .N) , by  =c("start", "end")]

   start end duration_time wt
1:     1   2        20.475  2
2:     2   1         3.901  1
3:     2   3        85.861  1
4:     3   4        83.922  1
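
One thing to keep in mind with the call above is that setDT() converts df to a data.table by reference, so df itself is changed. If that is not wanted, a variant along these lines should work; it assumes the sample data from the question, and the names dt and res are only illustrative.

```r
library(data.table)

# Sample data from the question.
df <- read.table(header = TRUE, text = "start end duration_time
1     1   2         2.438
2     2   1         3.901
3     1   2        18.037
4     2   3        85.861
5     3   4        83.922")

# as.data.table() makes a copy, so df stays a plain data.frame.
dt <- as.data.table(df)

# .N is the number of rows in each (start, end) group.
res <- dt[, .(duration_time = sum(duration_time), weight = .N),
          by = .(start, end)]
res
```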

With dplyr and tidyr, you could try something like this:
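library(dplyr)
library(tidyr)
# Combine start and end into a single grouping key, e.g. "1_2"
df1 <- df %>% unite(by_var, start, end)
# Count rows per key, sum durations per key, split the key back into
# start/end, then reorder the columns to start, end, duration_time, n
df2 <- cbind(df1 %>% count(by_var), df1 %>% group_by(by_var) %>%
    summarise(duration_time = sum(duration_time)) %>%
    separate(by_var, c("start", "end")))[c(3, 4, 5, 2)]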

> df2
  start end duration_time n
1     1   2        20.475 2
2     2   1         3.901 1
3     2   3        85.861 1
4     3   4        83.922 1
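
The unite()/separate() round trip is probably not needed here, since group_by() accepts several columns and n() gives the group size. A minimal sketch of that shorter variant, assuming the sample data from the question (the name res is only illustrative):

```r
library(dplyr)

# Sample data from the question.
df <- read.table(header = TRUE, text = "start end duration_time
1     1   2         2.438
2     2   1         3.901
3     1   2        18.037
4     2   3        85.861
5     3   4        83.922")

# Group on both columns directly; n() counts the rows in each group.
res <- df %>%
  group_by(start, end) %>%
  summarise(duration_time = sum(duration_time), weight = n()) %>%
  ungroup()
res
```

Unlike the separate() approach, this keeps start and end as integers rather than turning them into character columns.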