I have a data frame like this:
df <- structure(list(DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), .Label = c("04/23/90", "04/28/90", "05/03/95", "05/07/95"
), class = "factor"), JULIAN = c(113L, 113L, 113L, 113L, 113L,
113L, 118L, 118L, 118L, 118L, 118L, 118L, 123L, 123L, 123L, 123L,
123L, 123L, 127L, 127L, 127L, 127L, 127L, 127L), ID = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("AHFG-01", "AHFG-02",
"AHFG-03", "OIUR-01", "OIUR-02", "OIUR-03"), class = "factor"),
PERCENT = c(0L, 0L, 0L, 80L, 55L, 0L, 25L, 50L, 75L, 100L,
75L, 45L, 10L, 20L, 30L, 50L, 50L, 50L, 50L, 60L, 70L, 75L,
90L, 95L)), .Names = c("DATE", "JULIAN", "ID", "PERCENT"), class = "data.frame", row.names = c(NA,
-24L))
       DATE JULIAN      ID PERCENT
1  04/23/90    113 AHFG-01       0
2  04/23/90    113 AHFG-02       0
3  04/23/90    113 AHFG-03       0
4  04/23/90    113 OIUR-01      80
5  04/23/90    113 OIUR-02      55
6  04/23/90    113 OIUR-03       0
7  04/28/90    118 AHFG-01      25
8  04/28/90    118 AHFG-02      50
9  04/28/90    118 AHFG-03      75
10 04/28/90    118 OIUR-01     100
11 04/28/90    118 OIUR-02      75
12 04/28/90    118 OIUR-03      45
13 05/03/95    123 AHFG-01      10
14 05/03/95    123 AHFG-02      20
15 05/03/95    123 AHFG-03      30
16 05/03/95    123 OIUR-01      50
17 05/03/95    123 OIUR-02      50
18 05/03/95    123 OIUR-03      50
19 05/07/95    127 AHFG-01      50
20 05/07/95    127 AHFG-02      60
21 05/07/95    127 AHFG-03      70
22 05/07/95    127 OIUR-01      75
23 05/07/95    127 OIUR-02      90
24 05/07/95    127 OIUR-03      95
In this data frame, ID identifies replicates within sites. For example, AHFG-01 is replicate 1 and AHFG-02 is replicate 2, both at site AHFG. PERCENT is the percent completion.
I need to calculate two things:
1) The mean JULIAN at which PERCENT first exceeds 50, for each site
2) The mean JULIAN at which PERCENT first exceeds 50, across all sites
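I am a bit confused about the best way to proceed here. My approach would be:
1) Calculate the mean PERCENT for each site (derived from ID) on each DATE / JULIAN
2) For each site in each YEAR, determine the JULIAN at which the mean PERCENT first exceeds 50
3) Calculate the mean JULIAN from 2) for each site across years
4) Calculate the mean JULIAN from 2) across all sites and years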
For the sample data above, the final result I need, per site and for all sites combined, would look like this:
SITE JULIAN
AHFG 122.5
OIUR 120.5
JULIAN, all sites combined = 121.5
What I have done so far is first create the YEAR and SITE columns to work with:
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
df$SITE <- gsub("[^A-Za-z]", "", df$ID)
Then I can use aggregate to calculate the mean PERCENT per SITE, which is step 1 above:
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
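However, I am stuck at step 2 and beyond. Can anyone suggest a way to find the JULIAN at which the mean PERCENT first exceeds 50 for each SITE in each YEAR, and then average it per SITE across years and across all SITEs combined?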
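Solution:
This is a modified form of Henrik's excellent solution that worked for me. Note that Henrik's original solution does work, but my question was somewhat unclear about what I wanted (see the comments below). The version here applies the threshold to the per-site mean of PERCENT and counts a mean of exactly 50 as reaching it.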
# make year column
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')
# make new ID column (remove numbers for individuals)
df$SITE <- gsub("[^A-Za-z]", "", df$ID)
# Calculate average PERCENT for each SITE
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR,FUN=mean,data=df)
# order by SITE and JULIAN
df2 <- df2[order(df2$SITE, df2$JULIAN), ]
# within each YEAR and SITE, select first registration where PERCENT is 50 or more
df2 <- do.call(rbind,
               by(df2, list(df2$YEAR, df2$SITE), function(x){
                 x[x$PERCENT >= 50, ][1, ]
               }))
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# overall mean
mean(df2$JULIAN)
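The same four steps can also be written without do.call / by. The following is a minimal, self-contained sketch rather than the method used above: it rebuilds the sample data from the values in the question, filters the site means to days at or above 50, and takes the earliest JULIAN per SITE and YEAR with min. The names site_means, reached, first_day, per_site and overall are illustrative, and SITE is derived with sub() on the assumption that every ID has the form SITE-replicate.

```r
# Rebuild the sample data (values copied from the dput in the question).
df <- data.frame(
  DATE    = rep(c("04/23/90", "04/28/90", "05/03/95", "05/07/95"), each = 6),
  JULIAN  = rep(c(113L, 118L, 123L, 127L), each = 6),
  ID      = rep(c("AHFG-01", "AHFG-02", "AHFG-03",
                  "OIUR-01", "OIUR-02", "OIUR-03"), times = 4),
  PERCENT = c(0, 0, 0, 80, 55, 0,
              25, 50, 75, 100, 75, 45,
              10, 20, 30, 50, 50, 50,
              50, 60, 70, 75, 90, 95),
  stringsAsFactors = FALSE
)
df$YEAR <- format(as.Date(df$DATE, format = "%m/%d/%y"), "%Y")
df$SITE <- sub("-.*$", "", df$ID)  # assumes IDs look like "SITE-NN"

# Step 1: mean PERCENT per SITE on each sampling day.
site_means <- aggregate(PERCENT ~ SITE + JULIAN + YEAR, data = df, FUN = mean)

# Step 2: keep only days at or above the threshold, then take the earliest
# JULIAN per SITE and YEAR. Site-years that never reach 50 drop out here.
reached   <- site_means[site_means$PERCENT >= 50, ]
first_day <- aggregate(JULIAN ~ SITE + YEAR, data = reached, FUN = min)

# Steps 3 and 4: mean first day per SITE, and over all site-years.
per_site <- aggregate(JULIAN ~ SITE, data = first_day, FUN = mean)
overall  <- mean(first_day$JULIAN)

print(per_site)
print(overall)
```

On the sample data this gives 122.5 for AHFG, 120.5 for OIUR and 121.5 overall, matching the result requested in the question. One difference from the by-based version: a site-year that never reaches 50 is silently omitted here, whereas x[x$PERCENT >= 50, ][1, ] returns a row of NA for it.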
Answer 0 (score: 1)
Here is one possibility:
# order by SITE and DATE
df <- df[order(df$SITE, df$DATE), ]
# within each YEAR and SITE, select first registration where PERCENT exceeds 50
df2 <- do.call(rbind,
               by(df, list(df$YEAR, df$SITE), function(x){
                 x[x$PERCENT > 50, ][1, ]
               }))
df2
# DATE JULIAN ID PERCENT YEAR SITE
# 6 1990-04-28 118 AHFG-03 75 1990 AHFG
# 11 1995-05-07 127 AHFG-02 60 1995 AHFG
# 13 1990-04-23 113 OIUR-01 80 1990 OIUR
# 22 1995-05-07 127 OIUR-01 75 1995 OIUR
# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
# SITE JULIAN
# 1 AHFG 122.5
# 2 OIUR 120.0
# overall mean
mean(df2$JULIAN)
# [1] 121.25
Note that I do not get the same mean for OIUR as in your example.
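A possible explanation for the difference, assuming the expected values in the question were computed from per-site means: the code above applies the threshold to individual replicates with a strict > 50, while the expected 120.5 for OIUR appears to come from the site mean of PERCENT reaching 50 or more. The sketch below compares the two readings on the sample data. The helper first_julian and the column suffixes are hypothetical names used only for this illustration.

```r
# Rebuild the sample data (values copied from the dput in the question).
df <- data.frame(
  DATE    = rep(c("04/23/90", "04/28/90", "05/03/95", "05/07/95"), each = 6),
  JULIAN  = rep(c(113L, 118L, 123L, 127L), each = 6),
  ID      = rep(c("AHFG-01", "AHFG-02", "AHFG-03",
                  "OIUR-01", "OIUR-02", "OIUR-03"), times = 4),
  PERCENT = c(0, 0, 0, 80, 55, 0,
              25, 50, 75, 100, 75, 45,
              10, 20, 30, 50, 50, 50,
              50, 60, 70, 75, 90, 95),
  stringsAsFactors = FALSE
)
df$YEAR <- format(as.Date(df$DATE, format = "%m/%d/%y"), "%Y")
df$SITE <- sub("-.*$", "", df$ID)  # assumes IDs look like "SITE-NN"

# Hypothetical helper: earliest JULIAN per SITE and YEAR among the rows
# selected by the logical vector `keep`.
first_julian <- function(d, keep) {
  aggregate(JULIAN ~ SITE + YEAR, data = d[keep, ], FUN = min)
}

# (a) Threshold on individual replicates, strictly greater than 50.
by_replicate <- first_julian(df, df$PERCENT > 50)

# (b) Threshold on the per-site mean, 50 or more.
site_means   <- aggregate(PERCENT ~ SITE + JULIAN + YEAR, data = df, FUN = mean)
by_site_mean <- first_julian(site_means, site_means$PERCENT >= 50)

# Side-by-side comparison of the first day under each reading.
comparison <- merge(by_replicate, by_site_mean, by = c("SITE", "YEAR"),
                    suffixes = c(".replicate", ".site_mean"))
print(comparison)
```

On the sample data the two readings agree for AHFG (118 and 127) and differ only for OIUR: the replicate reading gives 113 and 127 (mean 120.0), while the site-mean reading gives 118 and 123 (mean 120.5).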