Mean of data frame columns

Time: 2016-02-15 15:07:33

Tags: r dataframe

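I have a data.frame that contains data from different years for a set of observations. The column names are years, and a repeated year is identified by the year followed by ".1" (for example, 2008 and 2008.1).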

The dput() of the first observation of the data.frame is as follows:

structure(list(ID = 2174L, `1992` = 0L, `1993` = 0L, `1994` = 0L, 
    `1994.1` = 0L, `1995` = 0L, `1996` = 0L, `1997` = 0L, `1998` = 0L, 
    `1999` = 0L, `1997.1` = 0L, `1998.1` = 0L, `1999.1` = 0L, 
    `2000` = 0L, `2001` = 0L, `2002` = 0L, `2003` = 0L, `2000.1` = 0L, 
    `2001.1` = 0L, `2002.1` = 0L, `2003.1` = 0L, `2004` = 0L, 
    `2005` = 0L, `2006` = 0L, `2007` = 0L, `2008` = 0L, `2004.1` = 0L, 
    `2005.1` = 0L, `2006.1` = 0L, `2007.1` = 0L, `2008.1` = 0L, 
    `2009` = 0L, `2010` = 0L, `2011` = 0L, `2012` = 0L, `2013` = 0L, 
    altura_mean_30arc = 341, dist_p = -1239.46778549383, dist_capital = 310537.289055982, 
    municode = 428, slope = 0.109233340937795, dist_f = -54589.0213329769), .Names = c("ID", 
"1992", "1993", "1994", "1994.1", "1995", "1996", "1997", "1998", 
"1999", "1997.1", "1998.1", "1999.1", "2000", "2001", "2002", 
"2003", "2000.1", "2001.1", "2002.1", "2003.1", "2004", "2005", 
"2006", "2007", "2008", "2004.1", "2005.1", "2006.1", "2007.1", 
"2008.1", "2009", "2010", "2011", "2012", "2013", "altura_mean_30arc", 
"dist_p", "dist_capital", "municode", "slope", "dist_f"), row.names = 2174L, class = "data.frame")

I want to compute the mean of each year and its repeated year (for example, 2008 and 2008.1). To simplify the process, I tried a loop over each repeated year:

duplicated_years <- c("1994", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008")
duplicated_years2 <- str_c(duplicated_years, "1", sep = ".")
for(i in as.numeric(duplicated_years)){
  for(j in as.numeric(duplicated_years2)){
    df[, str_c(i, "mean", sep="_")] <- ((df$i + df$j) / 2)
  }
}

But the result is a set of new variables filled with NA. I know there is an alternative I could use instead, but I find the indexing very difficult.

1 answer:

Answer 0 (score: 3)

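When you are working in wide format and have many columns to operate on row by row, it is usually better (in R) to convert to long format and operate on a single column. Converting back to wide format afterwards (if needed) is then very simple.

For example, here is one way to find all the columns whose names contain a year: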

colindex <- grep("\\d{4}", names(df))
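
To see what that pattern picks up, here is a small sketch on a made-up vector of column names (the vector `nms` is hypothetical, modeled on the names in the question). It also shows the `sub()` call that strips the ".1" suffix so that a year and its repeat share one label:

```r
# Hypothetical column names, modeled on the question's data
nms <- c("ID", "1994", "1994.1", "1995", "slope", "altura_mean_30arc")

# Positions of names containing four consecutive digits
colindex <- grep("\\d{4}", nms)
colindex
# [1] 2 3 4

# Drop everything from the first dot onward: "1994.1" becomes "1994"
base_year <- sub("\\..*", "", nms[colindex])
base_year
# [1] "1994" "1994" "1995"
```

Note that "altura_mean_30arc" is not matched because it only contains two consecutive digits. If other columns could contain four digits in a row, an anchored pattern such as "^\\d{4}" would probably be safer.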

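Then, using data.table, we can select those columns (and ID too), melt them into long format, and compute the mean per ID/year while converting back to wide format.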

library(data.table)
dcast(melt(setDT(df)[, c(1L,  colindex),  with = FALSE], id = 1L), 
      ID ~ sub("\\..*", "", variable), value.var = "value", mean)
#      ID 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
# 1: 2174    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
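
If you would rather stay in base R, the same idea (group the year columns by their base year, then average within each group) can be sketched roughly as follows. This is not the data.table approach above, only an alternative under assumptions: the toy data frame `toy` and its values are made up, and the pattern assumes year columns are named exactly like "2008" or "2008.1".

```r
# Made-up toy data in the same shape as the question's data.frame
toy <- data.frame(ID = 1:2,
                  `2007` = c(1, 3),
                  `2008` = c(2, 4),
                  `2008.1` = c(4, 8),
                  slope = c(0.1, 0.2),
                  check.names = FALSE)

# Year columns: four digits, optionally followed by ".<digits>"
year_cols <- grep("^\\d{4}(\\.\\d+)?$", names(toy), value = TRUE)

# Base year for each column ("2008.1" becomes "2008")
base_year <- sub("\\..*", "", year_cols)

# For each base year, take the row-wise mean over its column(s)
means <- do.call(cbind, lapply(split(year_cols, base_year),
                               function(cols) rowMeans(toy[cols])))

# Put ID back in front of the averaged year columns
result <- data.frame(ID = toy$ID, means, check.names = FALSE)
result
```

Here `toy[cols]` indexes columns by their character names, which avoids the `df$i` form used in the question's loop: `$` looks for a column literally called "i" rather than using the value stored in `i`.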