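I have a set of variables in my dataset, and I simply want to compute a running total (and a running mean) across all of these variables, based on all previous years.
To illustrate, this is what my data looks like, including the running-total variable I would like to generate.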
country year X1 X2 X3 X4 X5 running_total
Bahamas 1990 0 0 0 0 1 NA
Bahamas 1991 0 0 1 1 0 1
Bahamas 1992 1 1 0 0 1 3
Bahamas 1993 0 0 0 0 0 6
Bahamas 1994 1 1 0 1 1 6
Bahamas 1995 0 0 1 0 0 10
Bahamas 1996 0 1 0 1 0 11
Bahamas 1997 1 0 1 0 1 13
Bahamas 1998 0 1 0 1 0 16
Bahamas 1999 1 0 1 0 1 18
Bahamas 2000 0 1 0 1 0 21
Bahamas 2001 1 0 1 0 1 23
Bahamas 2002 0 1 0 1 0 26
Bahamas 2003 1 0 0 0 1 28
Bahamas 2004 0 0 0 1 0 30
Bahamas 2005 1 1 0 0 0 31
Bahamas 2006 0 0 1 1 1 33
Bahamas 2007 1 0 0 0 0 36
Bahamas 2008 0 0 1 1 1 37
Bahamas 2009 1 1 0 0 0 40
Bahamas 2010 0 0 1 1 1 42
Bahamas 2011 1 1 0 0 0 45
Bolivia 1990 0 0 0 0 0 NA
Bolivia 1991 0 0 1 1 0 0
Bolivia 1992 0 0 0 0 0 2
Bolivia 1993 0 0 1 0 0 2
Bolivia 1994 0 0 0 0 0 3
Bolivia 1995 0 0 0 0 0 3
Bolivia 1996 0 0 0 0 0 3
Bolivia 1997 0 0 0 0 0 3
Bolivia 1998 0 0 0 0 0 3
Bolivia 1999 0 0 0 0 0 3
Bolivia 2000 0 1 0 1 0 3
Bolivia 2001 0 0 0 0 0 5
Bolivia 2002 0 0 0 0 0 5
Bolivia 2003 0 0 0 0 0 5
Bolivia 2004 0 0 0 0 0 5
Bolivia 2005 0 0 0 0 0 5
Bolivia 2006 0 0 0 0 0 5
Bolivia 2007 0 0 0 0 0 5
Bolivia 2008 0 0 0 0 1 5
Bolivia 2009 0 0 0 0 0 6
Bolivia 2010 0 0 0 0 1 6
Bolivia 2011 0 0 0 0 0 7
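The first year, 1990, is NA. The running total for 1991 is based on 1990; the one for 1992 is based on 1990-1991; the one for 1993 on 1990-1992; the one for 1994 on 1990-1993; and so on until 2011. The same procedure then starts over for the next country.
I tried the code below, but it does not work the way I want. I obviously need to specify it better, but how?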
DF$csum <- ave(DF$X1, DF$X2,DF$X3,DF$X4,DF$X5,FUN=cumsum)
In addition, I would like to generate a running mean based on the same logic.
Any help here would be greatly appreciated!
structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bahamas", "Bolivia"), class = "factor"), year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L), X1 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X3 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X4 = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X5 = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), running_total = c(NA, 1L, 3L, 6L, 6L, 10L, 11L, 13L, 16L, 18L, 21L, 23L, 26L, 28L, 30L, 31L, 33L, 36L, 37L, 40L, 42L, 45L, NA, 0L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L)), .Names = c("country", "year", "X1", "X2", "X3", "X4", "X5", "running_total"), class = "data.frame", row.names = c(NA, -44L))
Answer 0 (score: 2)
library(data.table)
setDT(df)
df[, xt := X1+X2+X3+X4+X5]
df[, rt2 := shift(cumsum(xt)), by = country]
In fact, this can be done in a single line:
df[, rt3 := {xt=X1+X2+X3+X4+X5; shift(cumsum(xt))}, by = country]
# Or as Ryan points out:
df[, rt2 := shift(cumsum(Reduce(`+`, .SD))), by = country, .SDcols = grep('^X.*', names(df), value = TRUE)]
Full result:
country year X1 X2 X3 X4 X5 running_total xt rt2
1: Bahamas 1990 0 0 0 0 1 NA 1 NA
2: Bahamas 1991 0 0 1 1 0 1 2 1
3: Bahamas 1992 1 1 0 0 1 3 3 3
4: Bahamas 1993 0 0 0 0 0 6 0 6
5: Bahamas 1994 1 1 0 1 1 6 4 6
6: Bahamas 1995 0 0 1 0 0 10 1 10
7: Bahamas 1996 0 1 0 1 0 11 2 11
8: Bahamas 1997 1 0 1 0 1 13 3 13
9: Bahamas 1998 0 1 0 1 0 16 2 16
10: Bahamas 1999 1 0 1 0 1 18 3 18
11: Bahamas 2000 0 1 0 1 0 21 2 21
12: Bahamas 2001 1 0 1 0 1 23 3 23
13: Bahamas 2002 0 1 0 1 0 26 2 26
14: Bahamas 2003 1 0 0 0 1 28 2 28
15: Bahamas 2004 0 0 0 1 0 30 1 30
16: Bahamas 2005 1 1 0 0 0 31 2 31
17: Bahamas 2006 0 0 1 1 1 33 3 33
18: Bahamas 2007 1 0 0 0 0 36 1 36
19: Bahamas 2008 0 0 1 1 1 37 3 37
20: Bahamas 2009 1 1 0 0 0 40 2 40
21: Bahamas 2010 0 0 1 1 1 42 3 42
22: Bahamas 2011 1 1 0 0 0 45 2 45
23: Bolivia 1990 0 0 0 0 0 NA 0 NA
24: Bolivia 1991 0 0 1 1 0 0 2 0
25: Bolivia 1992 0 0 0 0 0 2 0 2
26: Bolivia 1993 0 0 1 0 0 2 1 2
27: Bolivia 1994 0 0 0 0 0 3 0 3
28: Bolivia 1995 0 0 0 0 0 3 0 3
29: Bolivia 1996 0 0 0 0 0 3 0 3
30: Bolivia 1997 0 0 0 0 0 3 0 3
31: Bolivia 1998 0 0 0 0 0 3 0 3
32: Bolivia 1999 0 0 0 0 0 3 0 3
33: Bolivia 2000 0 1 0 1 0 3 2 3
34: Bolivia 2001 0 0 0 0 0 5 0 5
35: Bolivia 2002 0 0 0 0 0 5 0 5
36: Bolivia 2003 0 0 0 0 0 5 0 5
37: Bolivia 2004 0 0 0 0 0 5 0 5
38: Bolivia 2005 0 0 0 0 0 5 0 5
39: Bolivia 2006 0 0 0 0 0 5 0 5
40: Bolivia 2007 0 0 0 0 0 5 0 5
41: Bolivia 2008 0 0 0 0 1 5 1 5
42: Bolivia 2009 0 0 0 0 0 6 0 6
43: Bolivia 2010 0 0 0 0 1 6 1 6
44: Bolivia 2011 0 0 0 0 0 7 0 7
country year X1 X2 X3 X4 X5 running_total xt rt2
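The question also asks for a running mean, which the code above does not cover. A minimal sketch of how it might be added with the same `shift()` idea follows. It assumes "running mean" means the mean of the yearly row sums over all previous years, and it uses a small toy table (`toy`, with made-up values) rather than the question's data.

```r
library(data.table)

# Toy data (hypothetical values, not the question's data set)
toy <- data.table(
  country = rep(c("A", "B"), each = 4),
  year    = rep(1990:1993, times = 2),
  X1      = c(0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L),
  X2      = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L)
)

# All columns whose names start with "X"
x_cols <- grep("^X", names(toy), value = TRUE)

# Row-wise sum over the X columns
toy[, xt := rowSums(.SD), .SDcols = x_cols]

# Lagged cumulative sum and lagged cumulative mean within each country.
# The first year of a country has no previous years, so it stays NA.
toy[, `:=`(
  running_total = shift(cumsum(xt)),
  running_mean  = shift(cumsum(xt) / seq_len(.N))
), by = country]

print(toy)
```

The sketch assumes the rows are already ordered by year within each country; if they are not, sorting first (for example with `setorder(toy, country, year)`) would be needed.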
Answer 1 (score: 2)
df = structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bahamas", "Bolivia"), class = "factor"), year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L), X1 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X3 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X4 = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X5 = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), running_total = c(NA, 1L, 3L, 6L, 6L, 10L, 11L, 13L, 16L, 18L, 21L, 23L, 26L, 28L, 30L, 31L, 33L, 36L, 37L, 40L, 42L, 45L, NA, 0L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L)), .Names = c("country", "year", "X1", "X2", "X3", "X4", "X5", "running_total"), class = "data.frame", row.names = c(NA, -44L))
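library(dplyr)
library(data.table)  # shift() comes from data.table

df <- df %>% mutate(sums = X1 + X2 + X3 + X4 + X5) %>%   # row-wise sum of the X columns
  group_by(country) %>% mutate(sum_shift = shift(sums),   # previous year's sum within each country
                               sum_shift = ifelse(is.na(sum_shift), 0, sum_shift),  # first year: NA -> 0
                               running_total = cumsum(sum_shift))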
head(df)
country year X1 X2 X3 X4 X5 running_total sums sum_shift
1: Bahamas 1990 0 0 0 0 1 0 1 0
2: Bahamas 1991 0 0 1 1 0 1 2 1
3: Bahamas 1992 1 1 0 0 1 3 3 2
4: Bahamas 1993 0 0 0 0 0 6 0 3
5: Bahamas 1994 1 1 0 1 1 6 4 0
6: Bahamas 1995 0 0 1 0 0 10 1 4
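This is a dplyr solution, but it is essentially the same as the data.table solution. We create a column that holds the sum across each row. Then we group by country, shift that sum down by one row, and take the cumulative sum. For the cumulative sum to work, the NAs have to be set to 0, so the first year of each country shows 0 rather than NA.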
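If the first year of each country should stay NA, as in the question's expected output, one possible variant is to lag the cumulative sum instead of cumulating the lagged sum. The sketch below uses only dplyr (`lag()` in place of `shift()`) and a small toy data frame with made-up values; the names `toy` and `res` are illustrative.

```r
library(dplyr)

# Toy data (hypothetical values, not the question's data set)
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(1990:1993, times = 2),
  X1      = c(0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L),
  X2      = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L)
)

res <- toy %>%
  mutate(sums = X1 + X2) %>%                     # row-wise sum of the X columns
  group_by(country) %>%
  mutate(running_total = lag(cumsum(sums))) %>%  # total of all previous years; NA in the first year
  ungroup()

print(res)
```

Because the NA only ever appears in the first position of each group, no NA-to-0 replacement step is needed here.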
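Answer 2 (score: 1)
A solution using dplyr and purrr. We can split the data frame by country, create the running_total column, and then combine the data frames again. Note that this solution does not require specifying individual column names such as X1 and X2. dat2 is the final output.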
library(dplyr)
library(purrr)
dat2 <- dat %>%
split(.$country) %>%
map_dfr(~mutate(.x,
running_total =
as.integer(lag(cumsum(rowSums(select(.x, starts_with("X"))))))))
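To compute the running mean, we can follow the same logic by adding another expression to the mutate call. Note that the cummean function comes from the dplyr package.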
dat2 <- dat %>%
split(.$country) %>%
map_dfr(~mutate(.x,
running_total =
as.integer(lag(cumsum(rowSums(select(.x, starts_with("X")))))),
running_mean =
lag(cummean(rowSums(select(.x, starts_with("X")))))))
Data
dat <- structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bahamas", "Bolivia"), class = "factor"), year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L), X1 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X3 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X4 = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X5 = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), running_total = c(NA, 1L, 3L, 6L, 6L, 10L, 11L, 13L, 16L, 18L, 21L, 23L, 26L, 28L, 30L, 31L, 33L, 36L, 37L, 40L, 42L, 45L, NA, 0L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L)), .Names = c("country", "year", "X1", "X2", "X3", "X4", "X5", "running_total"), class = "data.frame", row.names = c(NA, -44L))
dat$running_total <- NULL
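For comparison with the `ave()` attempt in the question, here is a rough base R sketch of the same idea. In that attempt the X columns were passed as grouping variables; here the only grouping variable is the country, and the values being cumulated are the row sums. The helper `lag1` and the toy data are hypothetical and only for illustration, and the rows are assumed to be sorted by year within each country.

```r
# Toy data (hypothetical values, not the question's data set)
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(1990:1993, times = 2),
  X1      = c(0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L),
  X2      = c(1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L)
)

# Row-wise sum over all columns whose names start with "X"
x_cols  <- grep("^X", names(toy), value = TRUE)
row_tot <- rowSums(toy[x_cols])

# Hypothetical helper: shift a vector down by one position, padding with NA
lag1 <- function(v) c(NA, head(v, -1))

# Group by country only; cumulate the row sums, then lag by one year
toy$running_total <- ave(row_tot, toy$country,
                         FUN = function(v) lag1(cumsum(v)))
toy$running_mean  <- ave(row_tot, toy$country,
                         FUN = function(v) lag1(cumsum(v) / seq_along(v)))

print(toy)
```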