Calculate a mean based on a condition in another column

Date: 2014-03-18 18:43:16

Tags: r

I have a data frame like this:

df <- structure(list(DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
4L), .Label = c("04/23/90", "04/28/90", "05/03/95", "05/07/95"
), class = "factor"), JULIAN = c(113L, 113L, 113L, 113L, 113L, 
113L, 118L, 118L, 118L, 118L, 118L, 118L, 123L, 123L, 123L, 123L, 
123L, 123L, 127L, 127L, 127L, 127L, 127L, 127L), ID = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("AHFG-01", "AHFG-02", 
"AHFG-03", "OIUR-01", "OIUR-02", "OIUR-03"), class = "factor"), 
    PERCENT = c(0L, 0L, 0L, 80L, 55L, 0L, 25L, 50L, 75L, 100L, 
    75L, 45L, 10L, 20L, 30L, 50L, 50L, 50L, 50L, 60L, 70L, 75L, 
    90L, 95L)), .Names = c("DATE", "JULIAN", "ID", "PERCENT"), class = "data.frame", row.names = c(NA, 
-24L))

    DATE     JULIAN ID      PERCENT
1   04/23/90    113 AHFG-01 0
2   04/23/90    113 AHFG-02 0
3   04/23/90    113 AHFG-03 0
4   04/23/90    113 OIUR-01 80
5   04/23/90    113 OIUR-02 55
6   04/23/90    113 OIUR-03 0
7   04/28/90    118 AHFG-01 25
8   04/28/90    118 AHFG-02 50
9   04/28/90    118 AHFG-03 75
10  04/28/90    118 OIUR-01 100
11  04/28/90    118 OIUR-02 75
12  04/28/90    118 OIUR-03 45
13  05/03/95    123 AHFG-01 10
14  05/03/95    123 AHFG-02 20
15  05/03/95    123 AHFG-03 30
16  05/03/95    123 OIUR-01 50
17  05/03/95    123 OIUR-02 50
18  05/03/95    123 OIUR-03 50
19  05/07/95    127 AHFG-01 50
20  05/07/95    127 AHFG-02 60
21  05/07/95    127 AHFG-03 70
22  05/07/95    127 OIUR-01 75
23  05/07/95    127 OIUR-02 90
24  05/07/95    127 OIUR-03 95

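In this data frame, ID identifies replicates at different sites. For example, AHFG-01 is replicate 1 and AHFG-02 is replicate 2, both at site AHFG. PERCENT is the percent completion.

I need to calculate two things:
1) the mean JULIAN at which PERCENT first exceeds 50, for each site
2) the mean JULIAN at which PERCENT first exceeds 50, across all sites

I'm a little unsure of the best way to proceed. My approach would be:
1) calculate the mean PERCENT at each site (derived from ID) for each DATE/JULIAN
2) for each site in each YEAR, identify the JULIAN at which the mean PERCENT first exceeds 50
3) calculate the mean of the JULIAN values from 2) for each site across years
4) calculate the mean of the JULIAN values from 2) across all sites and years

For the dummy data above, the final result I need, by site and across sites, would look like this: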

SITE    JULIAN
AHFG    122.5
OIUR    120.5

JULIAN, all sites combined = 121.5
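
These expected values can be traced by hand. The sketch below assumes that "exceeds 50" is read as "50 or more" and that the three replicates of a site are averaged first, which is the reading that reproduces the numbers above. The helper `first_day()` is a hypothetical name used only for this illustration.

```r
# Mean PERCENT of the three replicates per site, year and Julian day
ahfg_1990 <- c("113" = mean(c(0, 0, 0)),    "118" = mean(c(25, 50, 75)))
ahfg_1995 <- c("123" = mean(c(10, 20, 30)), "127" = mean(c(50, 60, 70)))
oiur_1990 <- c("113" = mean(c(80, 55, 0)),  "118" = mean(c(100, 75, 45)))
oiur_1995 <- c("123" = mean(c(50, 50, 50)), "127" = mean(c(75, 90, 95)))

# Hypothetical helper: first Julian day whose site mean is 50 or more
first_day <- function(means) as.numeric(names(means)[means >= 50][1])

julian_ahfg <- c(first_day(ahfg_1990), first_day(ahfg_1995))  # 118, 127
julian_oiur <- c(first_day(oiur_1990), first_day(oiur_1995))  # 118, 123

mean(julian_ahfg)                    # 122.5
mean(julian_oiur)                    # 120.5
mean(c(julian_ahfg, julian_oiur))    # 121.5
```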

What I've done so far is to first create the columns YEAR and SITE to work with:

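# parse DATE and derive YEAR from it
df$DATE <- as.POSIXct(df$DATE, format = '%m/%d/%y')
df$YEAR <- format(df$DATE, format = '%Y')
# keep only the letters of ID, so that "AHFG-01" becomes "AHFG"
df$SITE <- gsub("[^A-Za-z]", "", df$ID)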
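
As a quick sanity check on the SITE labels, the sketch below strips the replicate suffix from a few sample IDs in two ways. It assumes the site code is always the letters before the hyphen. The vector `ids` is made up for illustration.

```r
# Sample IDs in the same pattern as the data (illustrative only)
ids <- c("AHFG-01", "AHFG-02", "OIUR-03")

# Option 1: drop everything from the hyphen onward
site_a <- sub("-.*$", "", ids)

# Option 2: delete every character that is not an ASCII letter
site_b <- gsub("[^A-Za-z]", "", ids)

site_a
site_b

# Replacing with "" rather than " " avoids trailing blanks in the labels
nchar(gsub("[^A-Za-z]", " ", ids))   # 7 7 7: three padding spaces each
```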

Then I can use aggregate to calculate the mean by SITE for step 1 above:

df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR, data = df, FUN = mean)

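However, I'm stuck at step 2 and beyond. Can anyone suggest a way to calculate the mean JULIAN at which PERCENT first exceeds 50, for each SITE across years, and for all SITEs combined across years?

Solution:

This is a modified form of Henrik's excellent solution that worked for me. Note that Henrik's original solution does work, but my question was a bit unclear about what I wanted (see the comments below).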

# make year column
df$DATE <- as.POSIXct(df$DATE, format='%m/%d/%y')
df$YEAR <- format(df$DATE, format='%Y')

# make SITE column (remove the replicate numbers from ID)
df$SITE <- gsub("[^A-Za-z]", "", df$ID)

# calculate mean PERCENT for each SITE on each JULIAN day in each YEAR
df2 <- aggregate(PERCENT ~ SITE + JULIAN + YEAR, data = df, FUN = mean)

# order by SITE and JULIAN
df2 <- df2[order(df2$SITE, df2$JULIAN), ]

# within each YEAR and SITE, select first registration where PERCENT is 50 or more
df2 <- do.call(rbind,
               by(df2, list(df2$YEAR, df2$SITE), function(x){
                 x[x$PERCENT >= 50, ][1, ]
               }))

# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)

# overall mean
mean(df2$JULIAN)
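
One caveat with the `[1, ]` idiom: if a YEAR and SITE combination never reaches the threshold, the subset is empty and `[1, ]` returns a row of `NA`s, so the overall `mean()` becomes `NA`. A minimal sketch of a guard follows. The helper `first_reaching()` and the `toy` data are hypothetical and not part of the original question.

```r
# Toy site means: site "B" never reaches 50 (illustrative data only)
toy <- data.frame(
  SITE    = c("A", "A", "B", "B"),
  YEAR    = c("1990", "1990", "1990", "1990"),
  JULIAN  = c(113, 118, 113, 118),
  PERCENT = c(20, 60, 10, 30)
)
groups <- split(toy, list(toy$YEAR, toy$SITE), drop = TRUE)

# Unguarded version: an empty subset indexed with [1, ] yields an all-NA row
naive <- do.call(rbind, lapply(groups, function(x) x[x$PERCENT >= 50, ][1, ]))
mean(naive$JULIAN)    # NA

# Hypothetical helper: return NULL when no row reaches the threshold,
# so that rbind() simply skips that group
first_reaching <- function(x, threshold = 50) {
  hit <- x[x$PERCENT >= threshold, , drop = FALSE]
  if (nrow(hit) == 0) NULL else hit[1, , drop = FALSE]
}
firsts <- do.call(rbind, lapply(groups, first_reaching))
mean(firsts$JULIAN)   # 118
```

Whether such groups should be dropped or reported as missing depends on the analysis. Dropping them silently changes which sites and years contribute to the mean.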

1 answer:

Answer 0 (score: 1)

Here is one possibility:

# order by SITE and DATE
df <- df[order(df$SITE, df$DATE), ]


# within each YEAR and SITE, select first registration where PERCENT exceeds 50
df2 <- do.call(rbind,
               by(df, list(df$YEAR, df$SITE), function(x){
                 x[x$PERCENT > 50, ][1, ]
               }))
df2
#          DATE JULIAN      ID PERCENT YEAR SITE
# 6  1990-04-28    118 AHFG-03      75 1990 AHFG
# 11 1995-05-07    127 AHFG-02      60 1995 AHFG
# 13 1990-04-23    113 OIUR-01      80 1990 OIUR
# 22 1995-05-07    127 OIUR-01      75 1995 OIUR


# calculate mean JULIAN per SITE
aggregate(JULIAN ~ SITE, data = df2, mean)
#   SITE JULIAN
# 1 AHFG  122.5
# 2 OIUR  120.0


# overall mean
mean(df2$JULIAN)
# [1] 121.25

Note that I don't get the same mean for OIUR as in your example.
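
The difference appears to come from how "first exceeds 50" is read. The sketch below rebuilds the example data in compact form and compares two readings: individual replicates with a strict `> 50`, and site means per sampling day with `>= 50`. The helper `first_julian()` is a hypothetical name, and the YEAR and SITE columns are typed in directly rather than derived from DATE and ID.

```r
# Compact rebuild of the example data (YEAR and SITE entered directly)
df <- data.frame(
  JULIAN  = rep(c(113, 118, 123, 127), each = 6),
  YEAR    = rep(c("1990", "1995"), each = 12),
  SITE    = rep(rep(c("AHFG", "OIUR"), each = 3), times = 4),
  PERCENT = c( 0,  0,  0,  80, 55,  0,
              25, 50, 75, 100, 75, 45,
              10, 20, 30,  50, 50, 50,
              50, 60, 70,  75, 90, 95)
)

# Hypothetical helper: first JULIAN per YEAR and SITE whose PERCENT passes `test`
first_julian <- function(d, test) {
  d <- d[order(d$SITE, d$YEAR, d$JULIAN), ]
  hits <- d[test(d$PERCENT), ]
  hits[!duplicated(hits[c("YEAR", "SITE")]), c("YEAR", "SITE", "JULIAN")]
}

# Reading A: individual replicates, strictly greater than 50
a <- first_julian(df, function(p) p > 50)

# Reading B: mean over replicates per site and day, 50 or more
site_means <- aggregate(PERCENT ~ SITE + JULIAN + YEAR, data = df, FUN = mean)
b <- first_julian(site_means, function(p) p >= 50)

mean_a <- aggregate(JULIAN ~ SITE, data = a, FUN = mean)  # AHFG 122.5, OIUR 120.0
mean_b <- aggregate(JULIAN ~ SITE, data = b, FUN = mean)  # AHFG 122.5, OIUR 120.5
mean(a$JULIAN)   # 121.25
mean(b$JULIAN)   # 121.5
```

Under reading A, OIUR already passes the threshold on day 113 in 1990 because a single replicate is at 80. Under reading B, the site mean on that day is 45, so the first qualifying day is 118.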