Mean across multiple date columns

Time: 2018-08-14 14:17:20

Tags: r dataframe mean

I have a large data frame in which some of the column names are dates, stored as character strings, for example:

name <- c("John ", "Jay", "Carla")
X3.12.2010 <- c(20, 10, 9)
X3.19.2010 <- c(19, 8, 44)
X3.26.2010 <- c(10, 100, 999)
X4.3.2010 <- c(8, 1, 23)
X4.10.2010 <- c(8, 10, 238)
X4.17.2010 <- c(28, 17, 27)
X4.24.2010 <- c(11, 12, 45)
g <- data.frame(name, X3.12.2010, X3.19.2010, X3.26.2010, X4.3.2010, X4.10.2010, X4.17.2010, X4.24.2010)

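However, I want the date columns in "yyyymm" format, and then I want the mean for each unique combination of date and name. I use the following code to convert the date columns: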

library(stringr)  # provides str_extract()

# Return the last n characters of x
substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

# Build "yyyy" + month from each name, padding single-digit months with a 0
colnames(g)[2:8] <- ifelse(nchar(sub(" X", "", paste(substrRight(colnames(g)[2:8], 4),str_extract(colnames(g)[2:8], "[^.]+")))) < 6, 
                                   sub(" X", 0, paste(substrRight(colnames(g)[2:8], 4),str_extract(colnames(g)[2:8], "[^.]+"))), 
                                   sub(" X", "", paste(substrRight(colnames(g)[2:8], 4),str_extract(colnames(g)[2:8], "[^.]+"))))
print(g)

   name 201003 201003 201003 201004 201004 201004 201004
1 John      20     19     10      8      8     28     11
2   Jay     10      8    100      1     10     17     12
3 Carla      9     44    999     23    238     27     45

My desired output is as follows:

   name X201003 X201004
1 John    16.33   13.75
2   Jay   39.33   10.00
3 Carla  350.67   83.25

Is there a way to produce this? Thanks.

1 answer:

Answer 0 (score: 1)

A comment on storing the data

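It is good practice not to have several columns with the same name. Duplicate names make columns ambiguous to refer to, so it is better to correct this at the source (i.e. wherever you get your data from). A long layout with one row per observation, such as the following built from part of your data, is easier to work with: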

d = data.frame(name = c("John", "Jay", "Carla","John", "Jay", "Carla","John", "Jay", "Carla"),
               month = c(201003, 201003, 201003,201003, 201003, 201003,201004, 201004, 201004),
               order = c(1,1,1,2,2,2,1,1,1),
               value = c(20,10,9,19,8,44,8,10,238))

#    name  month order value
# 1  John 201003     1    20
# 2   Jay 201003     1    10
# 3 Carla 201003     1     9
# 4  John 201003     2    19
# 5   Jay 201003     2     8
# 6 Carla 201003     2    44
# 7  John 201004     1     8
# 8   Jay 201004     1    10
# 9 Carla 201004     1   238
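
With the data in this long layout, the mean per name and month needs no reshaping at all. A minimal base R sketch, reusing the example data frame d from above (the name monthly_means is only illustrative):

```r
# Example data in long format, copied from the data frame d above
d <- data.frame(name  = c("John", "Jay", "Carla", "John", "Jay", "Carla", "John", "Jay", "Carla"),
                month = c(201003, 201003, 201003, 201003, 201003, 201003, 201004, 201004, 201004),
                order = c(1, 1, 1, 2, 2, 2, 1, 1, 1),
                value = c(20, 10, 9, 19, 8, 44, 8, 10, 238))

# One row per (name, month) combination, holding the mean of value
monthly_means <- aggregate(value ~ name + month, data = d, FUN = mean)
print(monthly_means)
```

For instance, John has two values in 201003 (20 and 19), so his row for that month holds their mean, 19.5.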

Solution to the posted problem

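In order to reshape, we first have to give your columns distinct names, and then extract the time at a later stage to group the data and calculate the mean: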

library(tidyverse)

# data.frame() makes duplicated column names unique (X201003, X201003.1, ...)
g = data.frame(g)

g %>%
  gather(time,value,-name) %>%                        # reshape data
  mutate(time = gsub('X([^.]+)|.', '\\1', time)) %>%  # get time from column names (everything between "X" and ".")
  group_by(name, time) %>%                            # for each name and time
  summarise(MEAN = mean(value)) %>%                   # get mean value
  ungroup() %>%                                       # forget the grouping
  spread(time, MEAN)                                  # reshape again

# # A tibble: 3 x 3
#   name    `201003` `201004`
#   <fct>      <dbl>    <dbl>
# 1 Carla      351.      83.2
# 2 Jay         39.3     10  
# 3 John       16.3     13.8
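
If you can go back to the original column names (X3.12.2010 and so on), the renaming step and the duplicate names can be avoided altogether. The following is only a sketch of one base R alternative, not the approach used above. It assumes every value column is named X<month>.<day>.<year>, and month_key, means and result are names made up for illustration:

```r
# Rebuild the question's original data (the trailing space in "John " is dropped here)
g <- data.frame(
  name       = c("John", "Jay", "Carla"),
  X3.12.2010 = c(20, 10, 9),
  X3.19.2010 = c(19, 8, 44),
  X3.26.2010 = c(10, 100, 999),
  X4.3.2010  = c(8, 1, 23),
  X4.10.2010 = c(8, 10, 238),
  X4.17.2010 = c(28, 17, 27),
  X4.24.2010 = c(11, 12, 45)
)

# Parse each column name as month.day.year and format it as "yyyymm"
month_key <- format(as.Date(names(g)[-1], format = "X%m.%d.%Y"), "%Y%m")

# Group the value columns by month, then take the row means within each group
means <- sapply(split.default(g[-1], month_key), rowMeans)

# data.frame() turns the names "201003"/"201004" into X201003/X201004
result <- data.frame(name = g$name, means)
print(result)
```

Because data.frame() makes the names syntactically valid, the result columns come out as X201003 and X201004, which is the layout asked for in the question.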