How to calculate monthly averages of multiple columns in a data frame based on a Date column with missing data in R

Time: 2016-01-04 12:26:49

Tags: r if-statement dataframe average multiple-columns

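I have a data frame with a large number of columns, more than 4000. One column is the date and the rest are companies (the column names). The rows are daily observations covering more than 14 years, which comes to 164 months. I want to calculate the monthly average of every column based on the Date column. Most importantly, the average for a column (company) should be calculated only if that column has at least 15 observations in the month; otherwise NA should be returned.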

df<- Spread
Date             A             B    C
2000-01-04  0.062893082 0.030769231 NA
2000-01-05  0.062893082 0.015503876 NA
2000-01-06  0.062893082 NA          NA
2000-01-07  0.062893082 NA          NA
2000-01-10  0.062893082 NA          NA
2000-01-11  0.062893082 NA          NA
2000-01-12  0.062893082 NA          NA
2000-01-13  0.062893082 NA          NA
2000-01-14  0.062893082 NA          NA
2000-01-17  0.052910053 NA          NA
2000-01-18  0.031413613 NA          NA
2000-01-19  0.052910053 NA          NA
2000-01-20  0.051282051 NA          NA
2000-01-21  0.051282051 0.014184397 NA
2000-01-24  0.051282051 0.014184397 NA
2000-01-25  0.051282051 0.014184397 NA
2000-01-26  0.051282051 0.014184397 NA
2000-01-27  0.051282051 0.019914651 NA
2000-01-28  0.031088083 0.028571429 NA
2000-01-31  0.031088083 0.028571429 NA

My desired output

Monthly<- df
Month          A        B   C
Jan-2000    0.053656996 NA  NA

I would really appreciate your help. Also, any ideas on how to round these values to 4 decimal places, for example 0.062893082 to 0.0628?

1 Answer:

Answer 0 (score: 3)

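We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), then use format to extract the month-year (after converting to the Date class). This serves as the grouping variable. We loop over the columns (lapply(.SD, ...)) and, if the length of the non-NA elements is greater than or equal to 15, get the mean; else we return NA.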

library(data.table)
setDT(df1)[,lapply(.SD, function(x) if(length(na.omit(x)) >=15)
       mean(x, na.rm=TRUE) else NA_real_) ,
             by = .(Month= format(as.IDate(Date), '%b-%Y'))]
#      Month        A  B  C
#1: Jan-2000 0.053657 NA NA
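
The threshold rule described above can also be tried without any packages. The following is a minimal base R sketch, not part of the original answer: mean_if_enough and min_obs are hypothetical names, and the sample data is rebuilt compactly from the table in the question. It keys months with "%Y-%m" instead of '%b-%Y', on the assumption that chronological sorting matters once there are many months ('%b-%Y' strings sort alphabetically and month abbreviations depend on the locale).

```r
# Rebuild the sample data from the question: weekdays from 4 to 31 Jan 2000.
days <- seq(as.Date("2000-01-04"), as.Date("2000-01-31"), by = "day")
days <- days[!format(days, "%u") %in% c("6", "7")]  # drop Saturdays and Sundays
df1 <- data.frame(
  Date = format(days),
  A = c(rep(0.062893082, 9), 0.052910053, 0.031413613, 0.052910053,
        rep(0.051282051, 6), rep(0.031088083, 2)),
  B = c(0.030769231, 0.015503876, rep(NA, 11), rep(0.014184397, 4),
        0.019914651, rep(0.028571429, 2)),
  C = NA_real_
)

# Hypothetical helper: mean of the non-NA values, or NA when fewer than
# `min_obs` non-NA values are present.
mean_if_enough <- function(x, min_obs = 15) {
  if (sum(!is.na(x)) >= min_obs) mean(x, na.rm = TRUE) else NA_real_
}

# One row per month, one column per company; df1[-1] drops the Date column.
month_key <- format(as.Date(df1$Date), "%Y-%m")
monthly <- aggregate(df1[-1], by = list(Month = month_key), FUN = mean_if_enough)
print(monthly)  # A is about 0.05366; B (9 observations) and C (none) are NA
```

Keeping the rule in a named function makes the 15-observation cutoff easy to change, and the same function can be passed to lapply(.SD, ...) in the data.table call.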

A similar approach using dplyr is

library(dplyr)
df1 %>% 
    group_by(Month = format(as.Date(Date), '%b-%Y')) %>%
    summarise_each(funs( if(length(na.omit(.))>=15) 
                       mean(., na.rm=TRUE) else NA_real_), A:C)
#    Month        A     B     C
#     (chr)    (dbl) (dbl) (dbl)
#1 Jan-2000 0.053657    NA    NA
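
summarise_each() and funs() were deprecated in later dplyr releases. A possible equivalent with across() is sketched below, assuming dplyr 1.0.0 or newer. The toy data frame is invented for illustration (it is not the question's data), and -Date is used instead of A:C so that the selection would also cover thousands of company columns.

```r
library(dplyr)

# Invented toy data: 60 daily rows spanning January and February 2000.
toy <- data.frame(
  Date = as.character(seq(as.Date("2000-01-01"), as.Date("2000-02-29"), by = "day")),
  A = 1,                             # always observed
  B = c(rep(NA, 20), rep(2, 40))     # 11 values in January, 29 in February
)

monthly <- toy %>%
  group_by(Month = format(as.Date(Date), "%Y-%m")) %>%
  summarise(across(-Date,
                   ~ if (sum(!is.na(.x)) >= 15) mean(.x, na.rm = TRUE) else NA_real_))

print(monthly)  # B is NA for 2000-01 (only 11 values) and 2 for 2000-02
```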

Data

df1 <- structure(list(Date = c("2000-01-04", "2000-01-05", "2000-01-06", 
"2000-01-07", "2000-01-10", "2000-01-11", "2000-01-12", "2000-01-13", 
"2000-01-14", "2000-01-17", "2000-01-18", "2000-01-19", "2000-01-20", 
"2000-01-21", "2000-01-24", "2000-01-25", "2000-01-26", "2000-01-27", 
"2000-01-28", "2000-01-31"), A = c(0.062893082, 0.062893082, 
0.062893082, 0.062893082, 0.062893082, 0.062893082, 0.062893082, 
0.062893082, 0.062893082, 0.052910053, 0.031413613, 0.052910053, 
0.051282051, 0.051282051, 0.051282051, 0.051282051, 0.051282051, 
0.051282051, 0.031088083, 0.031088083), B = c(0.030769231, 0.015503876, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.014184397, 0.014184397, 
0.014184397, 0.014184397, 0.019914651, 0.028571429, 0.028571429
), C = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA)), .Names = c("Date", "A", "B", "C"
), class = "data.frame", row.names = c(NA, -20L))
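
On the four-decimal part of the question, which the answer does not address: rounding 0.062893082 to four places gives 0.0629, while the 0.0628 in the question's example corresponds to truncation. A small base R sketch of both follows, assuming positive values (trunc() cuts toward zero).

```r
x <- 0.062893082

rounded   <- round(x, 4)           # nearest value with 4 decimals: 0.0629
truncated <- trunc(x * 1e4) / 1e4  # extra digits cut off: 0.0628

print(c(rounded = rounded, truncated = truncated))
```

Either expression can be applied to the monthly result column by column once the averages have been computed.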