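In R I have a data.frame with several variables that have been measured monthly over several years. I would like to derive the monthly average of each variable (using all years). Ideally these new variables would all go into a new data.frame (carrying over the ID); below I am simply adding the new variables to the existing data.frame. The only way I currently know how to do this (below) seems quite laborious, and I was hoping there might be a smarter way to do it in R that does not require typing out every month and variable as I did below.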
# Example data.frame with only two years, two months, and two variables
# In the real data set there are always 12 months per year
# and there are at least four variables
df<- structure(list(ID = 1:4, ABC.M1Y2001 = c(10, 12.3, 45, 89), ABC.M2Y2001 = c(11.1,
34, 67.7, -15.6), ABC.M1Y2002 = c(-11.1, 9, 34, 56.5), ABC.M2Y2002 = c(12L,
13L, 11L, 21L), DEF.M1Y2001 = c(14L, 14L, 14L, 16L), DEF.M2Y2001 = c(15L,
15L, 15L, 12L), DEF.M1Y2002 = c(5, 12, 23.5, 34), DEF.M2Y2002 = c(6L,
34L, 61L, 56L)), .Names = c("ID", "ABC.M1Y2001", "ABC.M2Y2001","ABC.M1Y2002",
"ABC.M2Y2002", "DEF.M1Y2001", "DEF.M2Y2001", "DEF.M1Y2002",
"DEF.M2Y2002"), class = "data.frame", row.names = c(NA, -4L))
# list variable to average for ABC Month 1 across years
ABC.M1.names <- c("ABC.M1Y2001", "ABC.M1Y2002")
df <- transform(df, ABC.M1 = rowMeans(df[,ABC.M1.names], na.rm = TRUE))
# list variable to average for ABC Month 2 across years
ABC.M2.names <- c("ABC.M2Y2001", "ABC.M2Y2002")
df <- transform(df, ABC.M2 = rowMeans(df[,ABC.M2.names], na.rm = TRUE))
# and so forth for ABC
# ...
# list variables to average for DEF Month 1 across years
DEF.M1.names <- c("DEF.M1Y2001", "DEF.M1Y2002")
df <- transform(df, DEF.M1 = rowMeans(df[,DEF.M1.names], na.rm = TRUE))
# and so forth for DEF
# ...
Answer 0 (score: 2)
Here is a solution using the development version v1.8.11 of data.table (in which melt and cast methods are implemented for data.table):
require(data.table)
require(reshape2) # melt/cast builds on S3 generic from reshape2
dt <- data.table(df) # where df is your data.frame
dcast.data.table(melt(dt, id="ID")[, sum(value)/.N, list(ID,
gsub("Y.*$", "", variable))], ID ~ gsub)
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1: 1 -0.55 11.55 9.50 10.5
2: 2 10.65 23.50 13.00 24.5
3: 3 39.50 39.35 18.75 38.0
4: 4 72.75 2.70 25.00 34.0
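You then just need to cbind this to the original data.
Note that sum is a primitive, whereas mean is an S3 generic. Using sum(.)/length(.) is therefore preferable: if there are very many groups, dispatching the right method for mean once per group can become quite time-consuming. .N is a special variable in data.table that directly gives you the length of the group. Unlike the rowMeans(..., na.rm = TRUE) calls in the question, sum(value)/.N does not skip missing values.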
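If the data contain NA values, the grouped average can be written so that it skips them, matching the na.rm = TRUE behaviour in the question. The following is only a minimal sketch: it assumes a recent data.table release in which melt and dcast dispatch on data.table objects directly, and the toy table dt, the column name avg, and the grouping name grp are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical): one variable, one month, two years, with one NA
dt <- data.table(ID = 1:2,
                 ABC.M1Y2001 = c(10, 12.3),
                 ABC.M1Y2002 = c(-11.1, NA))

# Wide to long: one row per ID and original column
long <- melt(dt, id.vars = "ID", variable.name = "variable", value.name = "value")

# Strip the year suffix, then average within each ID/group while skipping NA
avg <- long[, .(avg = sum(value, na.rm = TRUE) / sum(!is.na(value))),
            by = .(ID, grp = sub("Y[0-9]{4}$", "", variable))]

# Back to wide: one column per variable/month combination
wide <- dcast(avg, ID ~ grp, value.var = "avg")
print(wide)
```

Dividing by sum(!is.na(value)) instead of .N keeps the denominator equal to the number of values that were actually summed.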
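Answer 1 (score: 1)
Here is a solution using reshape2 that is more automated when you have a lot of data, and that uses regular expressions to extract the variable names and months. This solution gives you a nice summary table.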
# Load required package
require(reshape2)
# Melt your wide data into long format
mdf <- melt(df , id = "ID" )
# Extract relevant variable names from the variable column
mdf$Month <- gsub( "^.*\\.(M[0-9]{1,2}).*$" , "\\1" , mdf$variable )
mdf$Var <- gsub( "^(.*)\\..*" , "\\1" , mdf$variable )
# Aggregate by month and variable
dcast( mdf , Var ~ Month , mean )
# Var M1 M2
#1 ABC 30.5875 19.275
#2 DEF 16.5625 26.750
Or, to be consistent with the other solutions and return a table by ID...
dcast( mdf , ID ~ Var + Month , mean )
# ID ABC_M1 ABC_M2 DEF_M1 DEF_M2
#1 1 -0.55 11.55 9.50 10.5
#2 2 10.65 23.50 13.00 24.5
#3 3 39.50 39.35 18.75 38.0
#4 4 72.75 2.70 25.00 34.0
Answer 2 (score: 1)
This is pretty straightforward in base R.
mean.names <- split(names(df)[-1], gsub('Y[0-9]{4}$', '', names(df)[-1]))
means <- lapply(mean.names, function(x) rowMeans(df[, x], na.rm = TRUE))
data.frame(df, means)
This appends the following four columns to your original data.frame:
ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 -0.55 11.55 9.50 10.5
2 10.65 23.50 13.00 24.5
3 39.50 39.35 18.75 38.0
4 72.75 2.70 25.00 34.0
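The question ideally wanted the averages in a new data.frame that carries only the ID. A small variation on the same base R idea could look like the sketch below. The toy data and the names vars, groups, and monthly are hypothetical, and drop = FALSE is added as a precaution for groups that happen to contain a single column, since rowMeans needs a two-dimensional input.

```r
# Toy data (hypothetical), same naming scheme as the question
df <- data.frame(ID = 1:2,
                 ABC.M1Y2001 = c(10, 12.3),
                 ABC.M1Y2002 = c(-11.1, 9),
                 DEF.M1Y2001 = c(14, 14))

# Group the measurement columns by their name without the year suffix
vars <- names(df)[-1]
groups <- split(vars, sub("Y[0-9]{4}$", "", vars))

# drop = FALSE keeps a data.frame even when a group has only one column
means <- lapply(groups, function(x) rowMeans(df[, x, drop = FALSE], na.rm = TRUE))

# New data.frame holding only the ID and the per-month averages
monthly <- data.frame(ID = df$ID, means)
print(monthly)
```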
Answer 3 (score: 1)
You can use Reshape from the {splitstackshape} package, and then use the plyr package, data.table, or base R to compute the means.
library(splitstackshape) # Reshape
library(plyr) # ddply
kk<-Reshape(df,id.vars="ID",var.stubs=c("ABC.M1","ABC.M2","DEF.M1","DEF.M2"),sep="")
> kk
ID AE DB time ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 NA NA 1 10.0 11.1 14.0 15
2 2 NA NA 1 12.3 34.0 14.0 15
3 3 NA NA 1 45.0 67.7 14.0 15
4 4 NA NA 1 89.0 -15.6 16.0 12
5 1 NA NA 2 -11.1 12.0 5.0 6
6 2 NA NA 2 9.0 13.0 12.0 34
7 3 NA NA 2 34.0 11.0 23.5 61
8 4 NA NA 2 56.5 21.0 34.0 56
ddply(kk[,c(1,5:8)],.(ID),colwise(mean))
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 -0.55 11.55 9.50 10.5
2 2 10.65 23.50 13.00 24.5
3 3 39.50 39.35 18.75 38.0
4 4 72.75 2.70 25.00 34.0
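For the base R route mentioned above, the averaging step could be done with aggregate once the data are in long format. This is only a sketch on a hand-built table shaped like the Reshape output (the toy kk below is hypothetical and has just two measurement columns).

```r
# Long-format data (hypothetical): one row per ID and year index
kk <- data.frame(ID = c(1, 2, 1, 2),
                 time = c(1, 1, 2, 2),
                 ABC.M1 = c(10, 12.3, -11.1, 9),
                 ABC.M2 = c(11.1, 34, 12, 13))

# Average each measurement column within ID; "time" is left out on purpose
res <- aggregate(cbind(ABC.M1, ABC.M2) ~ ID, data = kk, FUN = mean)
print(res)
```

Note that the formula interface of aggregate drops rows containing NA by default, which is not quite the same as computing each column's mean with na.rm = TRUE.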