R: average per participant up to column_date

Time: 2017-01-16 23:20:26

Tags: r aggregate average

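I am new to R and would like to solve this problem: in the table below, I need to calculate each participant's average up to a specific column_date. For example, up to 2015-08-30 Peter got 4 out of 5 entries right, so a new column on the right needs to equal 4/5 for that row, and so on.

I did some calculations with aggregate, but I only got the overall mean for each participant group.

Thanks in advance!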

       Date Participant Right/Wrong
 2013-01-02       Peter           1
 2015-01-05    Caroline           1
 2015-02-03        Jack           0
 2015-03-05    Jennifer           0
 2015-03-09       Peter           1
 2016-04-14    Jennifer           0
 2015-04-16    Caroline           1
 2015-06-02    Jennifer           1
 2015-06-05       Peter           1
 2015-06-10    Caroline           0
 2015-07-10        Jack           1
 2015-08-01    Jennifer           0
 2015-08-05       Peter           0
 2015-07-14        Jack           1
 2015-08-30       Peter           1
 2015-12-14    Jennifer           1
 2015-12-24        Jack           1
 2015-12-27       Peter           1
 2015-12-30    Caroline           1

3 answers:

Answer 0: (score: 2)

Note: I have added the HTML table data below, since it has now been removed from your question.

library('XML')
doc <- htmlParse(xml_content)
df1 <- readHTMLTable(doc)
df1 <- df1[[1]]
df1$Date <- as.Date(as.character(df1$Date))
df1$Participant <- as.character(df1$Participant)
df1$`Right/Wrong` <- as.numeric(as.character(df1$`Right/Wrong`))

Using base R (no packages needed)

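a1 <- with(df1, 
           by(data = df1, 
              INDICES = Participant, 
              FUN = function(x) {
                x <- x[order(x$Date), ]  # running values need chronological order
                list(Participant = x$Participant,
                     Date = x$Date, 
                     cumsum = cumsum(x$`Right/Wrong`),
                     # running mean = running sum / number of entries so far
                     cummean = cumsum(x$`Right/Wrong`) / seq_len(nrow(x)))
              }))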

rownames(a1) <- NULL  # remove row names

do.call("rbind", lapply(a1, function(x) data.frame(x)))
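If the goal is a new column on the original rows rather than a separate table, a minimal base R sketch along the following lines may help. It assumes a small hypothetical data frame `scores` with columns `Date`, `Participant` and `Right` (names and rows chosen for illustration, not taken from the question), sorts it by date, and uses `ave()` to compute the share of correct entries so far within each participant.

```r
# Hypothetical sample data: a few rows modelled on the question's table
scores <- data.frame(
  Date = as.Date(c("2013-01-02", "2015-03-09", "2015-01-05",
                   "2015-06-05", "2015-08-05", "2015-04-16")),
  Participant = c("Peter", "Peter", "Caroline", "Peter", "Peter", "Caroline"),
  Right = c(1, 1, 1, 1, 0, 0)
)

# Running values only make sense in chronological order
scores <- scores[order(scores$Date), ]

# ave() applies FUN within each Participant group and returns a vector
# aligned with the rows of `scores`
scores$running_mean <- ave(scores$Right, scores$Participant,
                           FUN = function(x) cumsum(x) / seq_along(x))

print(scores)
```

Because `ave()` keeps the row alignment, the result can be assigned straight back as a column, which is closer to what the question describes than a per-group summary.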

Using the data.table library

library('data.table')
# Sort by Date so the running values follow chronological order, then divide
# the running sum by the running count (seq_len(.N)) to get the mean so far.
setDT(df1)[order(Date), .(cumsum = cumsum(`Right/Wrong`), cummean = cumsum(`Right/Wrong`) / seq_len(.N), Date), by = Participant]
#    Participant cumsum   cummean       Date
# 1:       Peter      1 1.0000000 2013-01-02
# 2:       Peter      2 1.0000000 2015-03-09
# 3:       Peter      3 1.0000000 2015-06-05
# 4:       Peter      3 0.7500000 2015-08-05
# 5:       Peter      4 0.8000000 2015-08-30
# 6:       Peter      5 0.8333333 2015-12-27
# 7:    Caroline      1 1.0000000 2015-01-05
# 8:    Caroline      2 1.0000000 2015-04-16
# 9:    Caroline      2 0.6666667 2015-06-10
# 10:    Caroline      3 0.7500000 2015-12-30
# 11:        Jack      0 0.0000000 2015-02-03
# 12:        Jack      1 0.5000000 2015-07-10
# 13:        Jack      2 0.6666667 2015-07-14
# 14:        Jack      3 0.7500000 2015-12-24
# 15:    Jennifer      0 0.0000000 2015-03-05
# 16:    Jennifer      1 0.5000000 2015-06-02
# 17:    Jennifer      1 0.3333333 2015-08-01
# 18:    Jennifer      2 0.5000000 2015-12-14
# 19:    Jennifer      2 0.4000000 2016-04-14

Data:

xml_content <- '<style type="text/css">
  .tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-031e">Date</th>
<th class="tg-031e">Participant</th>
<th class="tg-031e">Right/Wrong</th>
</tr>
<tr>
<td class="tg-031e">2013-01-02</td>
<td class="tg-031e">Peter</td>
<td class="tg-031e">1</td>
</tr>
<tr>
<td class="tg-031e">2015-01-05</td>
<td class="tg-031e">Caroline</td>
<td class="tg-031e">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-02-03</td>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">0</td>
</tr>
<tr>
<td class="tg-yw4l">2015-03-05</td>
<td class="tg-yw4l">Jennifer</td>
<td class="tg-yw4l">0</td>
</tr>
<tr>
<td class="tg-yw4l">2015-03-09</td>
<td class="tg-yw4l">Peter</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2016-04-14</td>
<td class="tg-yw4l">Jennifer</td>
<td class="tg-yw4l">0</td>
</tr>
<tr>
<td class="tg-yw4l">2015-04-16</td>
<td class="tg-yw4l">Caroline</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-06-02</td>
<td class="tg-yw4l">Jennifer</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-06-05</td>
<td class="tg-yw4l">Peter</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-06-10</td>
<td class="tg-yw4l">Caroline</td>
<td class="tg-yw4l">0</td>
</tr>
<tr>
<td class="tg-yw4l">2015-07-10</td>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-08-01</td>
<td class="tg-yw4l">Jennifer</td>
<td class="tg-yw4l">0</td>
</tr>
<tr>
<td class="tg-yw4l">2015-08-05</td>
<td class="tg-yw4l">Peter</td>
<td class="tg-yw4l">0</td>
</tr>
<tr>
<td class="tg-yw4l">2015-07-14</td>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-08-30</td>
<td class="tg-yw4l">Peter</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-12-14</td>
<td class="tg-yw4l">Jennifer</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-12-24</td>
<td class="tg-yw4l">Jack</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-12-27</td>
<td class="tg-yw4l">Peter</td>
<td class="tg-yw4l">1</td>
</tr>
<tr>
<td class="tg-yw4l">2015-12-30</td>
<td class="tg-yw4l">Caroline</td>
<td class="tg-yw4l">1</td>
</tr>
</table>'

Answer 1: (score: 0)

You can try:

Data

participants <- structure(list(Date = structure(c(1L, 2L, 3L, 4L, 5L, 19L, 6L,
7L, 8L, 9L, 10L, 12L, 13L, 11L, 14L, 15L, 16L, 17L, 18L), .Label = c("2013-01-02",
"2015-01-05", "2015-02-03", "2015-03-05", "2015-03-09", "2015-04-16",
"2015-06-02", "2015-06-05", "2015-06-10", "2015-07-10", "2015-07-14",
"2015-08-01", "2015-08-05", "2015-08-30", "2015-12-14", "2015-12-24",
"2015-12-27", "2015-12-30", "2016-04-14"), class = "factor"),
    Participant = structure(c(4L, 1L, 2L, 3L, 4L, 3L, 1L, 3L,
    4L, 1L, 2L, 3L, 4L, 2L, 4L, 3L, 2L, 4L, 1L), .Label = c("Caroline",
    "Jack", "Jennifer", "Peter"), class = "factor"), Right.Wrong = c(1L,
    1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L,
    1L, 1L, 1L)), .Names = c("Date", "Participant", "Right.Wrong"
), class = "data.frame", row.names = c(NA, -19L))

Code:

# dplyr
# install.packages('dplyr')
library(dplyr)

participants %>%
  mutate(Date = as.Date(Date)) %>%                  # factor -> Date, no time zone needed
  group_by(Participant) %>%
  dplyr::filter(Date <= as.Date('2015-08-30')) %>%  # keep entries up to the cutoff date
  summarise(Right.Wrong = mean(Right.Wrong))


# Or base R
participants$Date <- as.Date(participants$Date)
aggregate(Right.Wrong ~ Participant, data = participants, 
          subset = participants$Date <= as.Date('2015-08-30'), 
          FUN = mean)

Both of these should produce something like the following:

 Participant Right.Wrong
 Caroline    0.6666667  
 Jack        0.6666667  
 Jennifer    0.3333333  
 Peter       0.8000000  
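When the same summary is needed for more than one cutoff date, the filtering and aggregation can be wrapped in a small function. The sketch below is only illustrative: `mean_until()` is a hypothetical helper, and `sample_data` is made-up data that reuses the column names of `participants`.

```r
# Hypothetical helper: mean of Right.Wrong per participant for rows dated
# on or before `cutoff`
mean_until <- function(data, cutoff) {
  keep <- data[data$Date <= as.Date(cutoff), ]
  aggregate(Right.Wrong ~ Participant, data = keep, FUN = mean)
}

# Small made-up sample using the same column names as `participants`
sample_data <- data.frame(
  Date = as.Date(c("2015-01-05", "2015-02-03", "2015-07-10",
                   "2015-06-10", "2015-12-24", "2015-12-30")),
  Participant = c("Caroline", "Jack", "Jack", "Caroline", "Jack", "Caroline"),
  Right.Wrong = c(1, 0, 1, 0, 1, 1)
)

# Apply the helper for several cutoff dates at once
cutoffs <- c("2015-03-01", "2015-08-30", "2015-12-31")
results <- lapply(cutoffs, function(d) mean_until(sample_data, d))
names(results) <- cutoffs
print(results)
```

Each element of `results` is a two-column data frame like the one shown above, one per cutoff date.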

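Answer 2: (score: 0)

You can use the subset and aggregate functions. For your data:

First, subset the data frame to the dates you want. Use <= so that entries on the cutoff date itself are included, as in the question's example:

df2 <- subset(yourData, as.Date(Date) <= as.Date("2015-08-30"))

Second, you can see how many points each participant has up to this date:

Points <- aggregate(df2$`Right/Wrong`, by = list(Participant = df2$Participant), FUN = sum)

Or, if you want the mean:

Points <- aggregate(df2$`Right/Wrong`, by = list(Participant = df2$Participant), FUN = mean)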
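The points, the number of entries, and the mean can also be collected in one `aggregate()` call by letting the summary function return a named vector. This is a minimal sketch on made-up data: the data frame `quiz` and its column `Right` are hypothetical names used to keep the formula short.

```r
# Made-up sample data with a syntactic column name (`Right`)
quiz <- data.frame(
  Date = as.Date(c("2013-01-02", "2015-03-09", "2015-08-05",
                   "2015-08-30", "2015-12-27", "2015-02-03")),
  Participant = c("Peter", "Peter", "Peter", "Peter", "Peter", "Jack"),
  Right = c(1, 1, 0, 1, 1, 0)
)

# Keep entries on or before the cutoff date
before <- subset(quiz, Date <= as.Date("2015-08-30"))

# Return three statistics per participant from a single pass
stats <- aggregate(Right ~ Participant, data = before,
                   FUN = function(x) c(points = sum(x),
                                       entries = length(x),
                                       mean = mean(x)))

# aggregate() stores the three values as one matrix column; flatten it
# into ordinary columns named Right.points, Right.entries, Right.mean
stats <- do.call(data.frame, stats)
print(stats)
```

Having the entry count next to the mean makes it easy to verify a value such as 3/4 by hand.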