My dataset in R looks like this:
LD.D LD.L LD.P
Y.1992.a1 67.89552605 33.21192862 90.7750688
Y.1992.a2 227.1370541 79.67211036 154.5165077
Y.1992.a3 94.5326718 24.72816922 151.665545
Y.1992.a4 106.8793485 56.07635245 100.6711004
Y.1992.a5 97.41402289 46.93434073 100.8787496
Y.1993.a1 150.045093 19.64290196 27.81953228
Y.1993.a2 106.5888189 21.38886866 84.82532249
Y.1993.a3 110.7493543 25.41765759 70.02222315
Y.1993.a4 237.1246502 16.43006029 75.17407065
Y.1993.a5 234.5403261 16.93082727 49.01639754
Y.1994.a1 94.5326718 24.72816922 151.665545
Y.1994.a2 106.8793485 56.07635245 100.6711004
Y.1994.a3 97.41402289 46.93434073 100.8787496
Y.1994.a4 150.045093 19.64290196 27.81953228
Y.1994.a5 106.5888189 21.38886866 84.82532249
I have five replicates for each year. How can I get the mean for each year (i.e. for 1992, 1993 and 1994)?
Answer 0 (score: 3)
You can do this with base R, or with specialized packages such as dplyr or data.table (which are more efficient when the dataset is very large).
df$Year <- gsub("^.\\.(\\d+)\\..*", "\\1", row.names(df)) # extract the year from the row names and store it in a new column `Year`
library(dplyr)
df %>%
group_by(Year) %>% # group by the Year variable
summarise_each(funs(mean=mean(., na.rm=TRUE))) # `summarise_each` applies the function (here `mean`) to each column of the dataset, or to a subset of columns if specified
# Source: local data frame [3 x 4]
# Year LD.D LD.L LD.P
#1 1992 118.7717 48.12458 119.70139
#2 1993 167.8096 19.96206 61.37151
#3 1994 111.0920 33.75413 93.17205
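In recent dplyr releases `summarise_each()` and `funs()` are deprecated. A minimal sketch of the same grouped mean with `across()`, assuming dplyr 1.0.0 or later is installed; the `toy` data frame and its values are made up for illustration and only mimic the shape of the question's data:

```r
library(dplyr)

# Hypothetical toy data in the same shape as the question
toy <- data.frame(
  LD.D = c(1, 3, 10, 20),
  LD.L = c(2, 4, 6, 8),
  row.names = c("Y.1992.a1", "Y.1992.a2", "Y.1993.a1", "Y.1993.a2")
)

# Pull the year out of the row names, as in the answer above
toy$Year <- sub("^Y\\.(\\d+)\\..*$", "\\1", rownames(toy))

# across(everything(), ...) covers every non-grouping column
res <- toy %>%
  group_by(Year) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
```

Here `res` has one row per year and keeps the original column names.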
Using data.table: convert the data.frame to a data.table with setDT, then use lapply to get the mean of each column of the Subset of Data.table (.SD). Specify the grouping variable with by = Year.
library(data.table)
setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=Year]
# Year LD.D LD.L LD.P
#1: 1992 118.7717 48.12458 119.70139
#2: 1993 167.8096 19.96206 61.37151
#3: 1994 111.0920 33.75413 93.17205
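Note that converting to a data.table drops the row names, so the year has to be extracted before the conversion or the row names have to be carried along. A small sketch of the second route, assuming the data.table package is installed; the `toy` data and the column name `id` are illustrative choices, not part of the original answer:

```r
library(data.table)

# Hypothetical toy data in the same shape as the question
toy <- data.frame(
  LD.D = c(1, 3, 10, 20),
  LD.L = c(2, 4, 6, 8),
  row.names = c("Y.1992.a1", "Y.1992.a2", "Y.1993.a1", "Y.1993.a2")
)

# keep.rownames = "id" copies the row names into a regular column named id
dt <- as.data.table(toy, keep.rownames = "id")

# Derive the grouping variable from that column
dt[, Year := sub("^Y\\.(\\d+)\\..*$", "\\1", id)]

# .SDcols restricts .SD to the numeric measurement columns
res <- dt[, lapply(.SD, mean, na.rm = TRUE), by = Year,
          .SDcols = c("LD.D", "LD.L")]
```

Restricting `.SDcols` keeps the character `id` column out of the `mean` call.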
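Or using base R. There are different ways to do this, such as aggregate, by, and split. Here is one using by. Use a regex (lookbehind) to get the Year. In this case, I also keep the Y prefix, because it does not affect the result.
# run this on the original df (numeric columns only), since colMeans needs numeric input
Year <- gsub("(?<=[0-9])\\..*$", "", row.names(df), perl=TRUE)
do.call(`rbind`,by(df, Year, FUN= colMeans, na.rm=TRUE))
# LD.D LD.L LD.P
#Y.1992 118.7717 48.12458 119.70139
#Y.1993 167.8096 19.96206 61.37151
#Y.1994 111.0920 33.75413 93.17205
The data used above: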
df <- structure(list(LD.D = c(67.89552605, 227.1370541, 94.5326718,
106.8793485, 97.41402289, 150.045093, 106.5888189, 110.7493543,
237.1246502, 234.5403261, 94.5326718, 106.8793485, 97.41402289,
150.045093, 106.5888189), LD.L = c(33.21192862, 79.67211036,
24.72816922, 56.07635245, 46.93434073, 19.64290196, 21.38886866,
25.41765759, 16.43006029, 16.93082727, 24.72816922, 56.07635245,
46.93434073, 19.64290196, 21.38886866), LD.P = c(90.7750688,
154.5165077, 151.665545, 100.6711004, 100.8787496, 27.81953228,
84.82532249, 70.02222315, 75.17407065, 49.01639754, 151.665545,
100.6711004, 100.8787496, 27.81953228, 84.82532249)), .Names = c("LD.D",
"LD.L", "LD.P"), class = "data.frame", row.names = c("Y.1992.a1",
"Y.1992.a2", "Y.1992.a3", "Y.1992.a4", "Y.1992.a5", "Y.1993.a1",
"Y.1993.a2", "Y.1993.a3", "Y.1993.a4", "Y.1993.a5", "Y.1994.a1",
"Y.1994.a2", "Y.1994.a3", "Y.1994.a4", "Y.1994.a5"))
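The other base R routes mentioned in this answer, aggregate and split, could look roughly like the sketch below. The `toy` data frame is a made-up stand-in with easy numbers, and the variable names are illustrative only:

```r
# Hypothetical toy data in the same shape as the question
toy <- data.frame(
  LD.D = c(1, 3, 10, 20),
  LD.L = c(2, 4, 6, 8),
  row.names = c("Y.1992.a1", "Y.1992.a2", "Y.1993.a1", "Y.1993.a2")
)

# Grouping vector built from the row names
year <- sub("^Y\\.(\\d+)\\..*$", "\\1", rownames(toy))

# aggregate(): returns a data frame with the grouping column first
res_agg <- aggregate(toy, by = list(Year = year), FUN = mean)

# split() + sapply(): one colMeans() call per year,
# transposed so that years are rows and variables are columns
res_split <- t(sapply(split(toy, year), colMeans))
```

`res_agg` is a data frame with a `Year` column, while `res_split` is a numeric matrix with the years as row names, so the choice mostly depends on what the next step expects.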
Answer 1 (score: 1)
Try aggregate, where DF is the data frame:
aggregate(DF, list(Year = gsub("^Y.|.[^.]*$", "", rownames(DF))), mean)
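The pattern in that call removes the leading `Y.` and the trailing `.aN` part, leaving only the year. The dots in it are unescaped, so they match any character; that is harmless for row names shaped like the ones in the question, but an escaped variant is a little stricter. A quick check on a few sample names (the variable names below are illustrative):

```r
# Sample row names in the question's format
rn <- c("Y.1992.a1", "Y.1993.a5", "Y.1994.a3")

# Pattern as written in the answer: "." matches any character
year_orig <- gsub("^Y.|.[^.]*$", "", rn)

# Escaped variant: "\\." matches only a literal dot
year_esc <- gsub("^Y\\.|\\.[^.]*$", "", rn)

# The two differ on a name without a replicate suffix:
# the unescaped trailing branch also swallows the year itself
edge_orig <- gsub("^Y.|.[^.]*$", "", "Y.1992")
edge_esc <- gsub("^Y\\.|\\.[^.]*$", "", "Y.1992")
```

For the three sample names both patterns give the same years; only the suffix-free name behaves differently.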