How to cumulatively sum values in a vector in R

Time: 2014-01-29 02:38:21

Tags: r row cumulative-sum dplyr

I have a dataset that looks like this:

id  name    year    job    job2
1   Jane    1980    Worker  0
1   Jane    1981    Manager 1
1   Jane    1982    Manager 1
1   Jane    1983    Manager 1
1   Jane    1984    Manager 1
1   Jane    1985    Manager 1
1   Jane    1986    Boss    0
1   Jane    1987    Boss    0
2   Bob     1985    Worker  0
2   Bob     1986    Worker  0
2   Bob     1987    Manager 1
2   Bob     1988    Boss    0
2   Bob     1989    Boss    0
2   Bob     1990    Boss    0
2   Bob     1991    Boss    0
2   Bob     1992    Boss    0

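Here, job2 is a dummy variable indicating whether the person was a Manager in that year. I want to do two things with this dataset. First, for the Boss rows, I want to keep only the row for the year in which the person first became Boss. Second, I want to see the cumulative number of years a person has worked as a Manager, and store this in the variable cumu_job2. So I would like to have: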

id  name    year    job    job2 cumu_job2
1   Jane    1980    Worker  0   0
1   Jane    1981    Manager 1   1
1   Jane    1982    Manager 1   2
1   Jane    1983    Manager 1   3
1   Jane    1984    Manager 1   4
1   Jane    1985    Manager 1   5
1   Jane    1986    Boss    0   0
2   Bob     1985    Worker  0   0
2   Bob     1986    Worker  0   0
2   Bob     1987    Manager 1   1
2   Bob     1988    Boss    0   0

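I have changed my example and included the Worker position, because this better reflects what I want to do with my original dataset. The answers in this thread only work when the dataset contains only Managers and Bosses, so any suggestions on how to make this work would be great. I would be very grateful!!

5 answers:

Answer 0: (score: 21)

Here is a concise dplyr solution to the same problem.

Note: make sure stringsAsFactors = FALSE when reading in the data.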

library(dplyr)
dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

Output:

   id name year     job job2 cumu_job2
1   1 Jane 1980  Worker    0         0
2   1 Jane 1981 Manager    1         1
3   1 Jane 1982 Manager    1         2
4   1 Jane 1983 Manager    1         3
5   1 Jane 1984 Manager    1         4
6   1 Jane 1985 Manager    1         5
7   1 Jane 1986    Boss    0         0
8   2  Bob 1985  Worker    0         0
9   2  Bob 1986  Worker    0         0
10  2  Bob 1987 Manager    1         1
11  2  Bob 1988    Boss    0         0

Explanation

  1. Take the dataset
  2. Group by name and job
  3. Filter each group according to the condition
  4. Add the cumu_job2 column.
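
The four steps above can be run end to end with the question's data. This is a minimal sketch: the object name `dat` follows the answer, while the result name `res` and the trailing `ungroup()` are additions for illustration and not part of the original answer.

```r
library(dplyr)

# The question's data; stringsAsFactors = FALSE keeps job as character
dat <- read.table(text = "
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
", header = TRUE, stringsAsFactors = FALSE)

res <- dat %>%
  group_by(name, job) %>%                        # one group per person and position
  filter(job != "Boss" | year == min(year)) %>%  # all non-Boss rows, plus the first Boss year
  mutate(cumu_job2 = cumsum(job2)) %>%           # running total of job2 within each group
  ungroup()                                      # assumption: drop the grouping afterwards

print(res)
```

Because the cumulative sum is taken within each (name, job) group, Worker and Boss rows stay at 0 and only the Manager rows count upward.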

Answer 1: (score: 9)

Contributed by Matthew Dowle:

dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
     by = list(name, job)]

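Explanation

  1. Take the dataset
  2. Run the filter and add a column within each Subset of Data (.SD)
  3. Group by name and job

Old version:

    You have two different split-apply-combines here: one to get the cumulative jobs, and the other to get the first row of Boss status. Here is an implementation in data.table where we basically do each analysis separately (well, kind of) and then collect everything in one place with rbind. The main thing to note is the by=id piece, which means the other expressions are evaluated for each id grouping in the data; this is what was missing in your attempt.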

    library(data.table)
    dt <- as.data.table(df)
    dt[, cumujob:=0L]  # add column, set to zero
    dt[job2==1, cumujob:=cumsum(job2), by=id]  # cumsum for manager time by person 
    rbind(
      dt[job2==1],                     # this is just the manager portion of the data
      dt[job2==0, head(.SD, 1), by=id] # get first bossdom row
    )[order(id, year)]                 # order by id, year
    #       id name year     job job2 cumujob
    #   1:  1 Jane 1980 Manager    1       1
    #   2:  1 Jane 1981 Manager    1       2
    #   3:  1 Jane 1982 Manager    1       3
    #   4:  1 Jane 1983 Manager    1       4
    #   5:  1 Jane 1984 Manager    1       5
    #   6:  1 Jane 1985 Manager    1       6
    #   7:  1 Jane 1986    Boss    0       0
    #   8:  2  Bob 1985 Manager    1       1
    #   9:  2  Bob 1986 Manager    1       2
    #  10:  2  Bob 1987 Manager    1       3
    #  11:  2  Bob 1988    Boss    0       0
    

    Note that this assumes the table is sorted by year within each id, but if it isn't, that is easy enough to fix.
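
One way to apply that fix, sketched in base R: sort by id and then by year before taking the cumulative sum. The toy data frame `df_unsorted` is hypothetical and only illustrates the ordering step; with a data.table, `setorder(dt, id, year)` would do the same sort by reference.

```r
# Hypothetical toy data, deliberately out of order
df_unsorted <- data.frame(
  id   = c(2, 1, 1, 2),
  year = c(1986, 1981, 1980, 1985),
  job2 = c(1, 1, 0, 0)
)

# Sort by id, then by year within each id
df_sorted <- df_unsorted[order(df_unsorted$id, df_unsorted$year), ]

# The per-id cumulative sum now follows calendar order
df_sorted$cumu_job2 <- ave(df_sorted$job2, df_sorted$id, FUN = cumsum)
print(df_sorted)
```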


    Alternatively, you can achieve the same thing as follows:

    ans <- dt[, .I[job != "Boss" | year == min(year)], by=list(name, job)]
    ans <- dt[ans$V1]
    ans[, cumujob := cumsum(job2), by=list(name,job)] 
    

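    The idea is to get the row numbers where the condition matches (using .I, an internal variable), then subset dt on those row numbers (the $V1 part), and then simply perform the cumulative sum.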
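
To make the three steps visible, here is a self-contained sketch of the same .I approach on the question's data (including the Worker rows). The intermediate name `idx` and the column name `cumu_job2` are chosen here for illustration; the logic is otherwise the one described above.

```r
library(data.table)

dt <- as.data.table(read.table(text = "
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
", header = TRUE, stringsAsFactors = FALSE))

# Step 1: .I holds row numbers of dt; keep those matching the condition per group
idx <- dt[, .I[job != "Boss" | year == min(year)], by = list(name, job)]
print(idx)  # columns: name, job, V1 (the matching row numbers)

# Step 2: subset dt on those row numbers
ans <- dt[idx$V1]

# Step 3: cumulative sum of job2 within each (name, job) group
ans[, cumu_job2 := cumsum(job2), by = list(name, job)]
print(ans)
```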

Answer 2: (score: 3)

Here is a base R solution using within and ave. We assume the input is DF and that the data is sorted as in the question.

DF2 <- within(DF, {
    seq = ave(id, id, job, FUN = seq_along)
    job2 = (job == "Manager") + 0
    cumu_job2 = ave(job2, id, job, FUN = cumsum)
})
subset(DF2, job != 'Boss' | seq == 1, select = - seq)

Revised: now uses within.
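
For readers unfamiliar with ave, here is a small sketch of what it does in the solution above. The toy vectors `id` and `job2` are hypothetical and stand in for the corresponding columns of DF.

```r
# Hypothetical toy vectors: two people, one Manager flag per year
id   <- c(1, 1, 1, 2, 2)
job2 <- c(0, 1, 1, 1, 0)

# ave() splits job2 by id, applies FUN to each piece, and returns the
# results in the original positions
cumu <- ave(job2, id, FUN = cumsum)
print(cumu)

# With seq_along as FUN, the result is a within-group row counter
counter <- ave(id, id, FUN = seq_along)
print(counter)
```

In the answer, the grouping is by both id and job, so the counter restarts for every position a person holds, and `seq == 1` picks out the first Boss row.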

Answer 3: (score: 1)

I think this does what you want, although the data must be sorted the way you presented it.

my.df <- read.table(text = '
id  name    year    job    job2
1   Jane    1980    Worker  0
1   Jane    1981    Manager 1
1   Jane    1982    Manager 1
1   Jane    1983    Manager 1
1   Jane    1984    Manager 1
1   Jane    1985    Manager 1
1   Jane    1986    Boss    0
1   Jane    1987    Boss    0
2   Bob     1985    Worker  0
2   Bob     1986    Worker  0
2   Bob     1987    Manager 1
2   Bob     1988    Boss    0
2   Bob     1989    Boss    0
2   Bob     1990    Boss    0
2   Bob     1991    Boss    0
2   Bob     1992    Boss    0
', header = TRUE, stringsAsFactors = FALSE)

my.seq <- data.frame(rle(my.df$job)$lengths)

my.df$cumu_job2 <- as.vector(unlist(apply(my.seq, 1, function(x) seq(1,x))))

my.df2 <- my.df[!(my.df$job=='Boss' & my.df$cumu_job2 != 1),]
my.df2$cumu_job2[my.df2$job != 'Manager'] <- 0

   id name year     job job2 cumu_job2
1   1 Jane 1980  Worker    0         0
2   1 Jane 1981 Manager    1         1
3   1 Jane 1982 Manager    1         2
4   1 Jane 1983 Manager    1         3
5   1 Jane 1984 Manager    1         4
6   1 Jane 1985 Manager    1         5
7   1 Jane 1986    Boss    0         0
9   2  Bob 1985  Worker    0         0
10  2  Bob 1986  Worker    0         0
11  2  Bob 1987 Manager    1         1
12  2  Bob 1988    Boss    0         0
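
The run-length step used above can be seen in isolation in the following sketch. The toy vector `job` is hypothetical, and `sequence()` is used here as a shorter stand-in for the apply/seq construction; it is not what the answer itself calls.

```r
# Hypothetical toy vector of positions for one person
job <- c("Worker", "Manager", "Manager", "Boss", "Boss", "Boss")

runs <- rle(job)
print(runs$values)   # one entry per run of identical values
print(runs$lengths)  # how long each run is

# sequence() counts 1..n within each run
run_pos <- sequence(runs$lengths)

# Keep every non-Boss row, and only the first row of each Boss run
keep <- !(job == "Boss" & run_pos != 1)

# The counter is only meaningful for Manager rows
cumu_job2 <- ifelse(job == "Manager", run_pos, 0)
print(data.frame(job, run_pos, keep, cumu_job2))
```

Since rle only looks at consecutive values, two people whose adjacent rows share the same job would be merged into one run, which is why the data must be sorted as in the question.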

Answer 4: (score: 0)

@BrodieG's approach is better:

Data

dat <- read.table(text="id  name    year    job    job2
1   Jane    1980    Manager 1
1   Jane    1981    Manager 1
1   Jane    1982    Manager 1
1   Jane    1983    Manager 1
1   Jane    1984    Manager 1
1   Jane    1985    Manager 1
1   Jane    1986    Boss    0
1   Jane    1987    Boss    0
2   Bob     1985    Manager 1
2   Bob     1986    Manager 1
2   Bob     1987    Manager 1
2   Bob     1988    Boss    0
2   Bob     1989    Boss    0
2   Bob     1990    Boss    0
2   Bob     1991    Boss    0
2   Bob     1992    Boss    0", header=TRUE)

#The code:

inds1 <- rle(dat$job2)
inds2 <- cumsum(inds1[[1]])[inds1[[2]] == 1] + 1

ends <- cumsum(inds1[[1]])
starts <- c(1, head(ends + 1, -1))
inds3 <- mapply(":", starts, ends)
dat$id <- rep(1:length(inds3), sapply(inds3, length))
dat <- do.call(rbind, lapply(split(dat[, 1:5], dat$id ), function(x) {
    if(x$job2[1] == 0){ 
        x$cumu_job2 <- rep(0, nrow(x))
    } else { 
        x$cumu_job2 <- 1:nrow(x)
    }
    x
}))


keeps <- dat$job2 > 0
keeps[inds2] <- TRUE
dat2 <- data.frame(dat[keeps, ], row.names = NULL)
dat2

##    id name year     job job2 cumu_job2
## 1   1 Jane 1980 Manager    1         1
## 2   1 Jane 1981 Manager    1         2
## 3   1 Jane 1982 Manager    1         3
## 4   1 Jane 1983 Manager    1         4
## 5   1 Jane 1984 Manager    1         5
## 6   1 Jane 1985 Manager    1         6
## 7   2 Jane 1986    Boss    0         0
## 8   3  Bob 1985 Manager    1         1
## 9   3  Bob 1986 Manager    1         2
## 10  3  Bob 1987 Manager    1         3
## 11  4  Bob 1988    Boss    0         0
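
The index arithmetic in this answer is easier to follow on a short vector. The sketch below uses a hypothetical flag vector `job2` and shows how the run boundaries are derived from rle; the name `first_after` is introduced here for illustration and corresponds to inds2 in the answer.

```r
# Hypothetical flag vector: 1 = Manager year, 0 = any other year
job2 <- c(1, 1, 1, 0, 0, 1, 0)

runs   <- rle(job2)
ends   <- cumsum(runs$lengths)        # last row of each run
starts <- c(1, head(ends + 1, -1))    # first row of each run
print(data.frame(value = runs$values, start = starts, end = ends))

# Row immediately after each run of 1s, i.e. the first non-Manager row
# that follows a Manager spell
first_after <- ends[runs$values == 1] + 1
print(first_after)
```

Note that if the vector ended with a run of 1s, the last entry of `first_after` would point one past the end of the data, so this approach relies on every Manager spell being followed by another row.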