How to write a loop in R that calculates the mean and median of rows and adds them to a data frame

Date: 2014-10-06 01:49:27

Tags: r for-loop dataframe

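I have a data frame with subjects in rows and variables in columns. The problem is that each subject has 40 rows (because each subject did 40 trials), so there is not one row per subject.

I want a new data frame with one row per subject, whose columns contain the mean and median of some of the variables.

Unfortunately I am relatively new to R and to programming, and I have never managed to get a for loop to work, which I guess is what I need here.

Can anyone suggest a way to do this?

Here is my data.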

   Subject Trial               File Int Target Synchrony corr_ans Risposta ACC   RT
8        1     8   sinc2_lab579.wmv ASD   sinc        si        1        5   1 1418
9        1     9 asinc12_lab612.wmv ASD  asinc        no        0        1   1 1313
10       1    10  asinc9_lab879.wmv ASD  asinc        no        0        1   1 1460
11       1    11   asinc3_con13.wmv  TD  asinc        no        0        2   1 2330
12       1    12   sinc11_con13.wmv  TD   sinc        si        1        3   0 2003
13       1    13   sinc4_lab879.wmv ASD   sinc        si        1        5   1 2334

Thanks, Mauro

4 Answers:

Answer 0 (score: 2)

Following up on Ananda Mahto's suggestion, here is a simple example using the aggregate function:

> y
     [,1]     [,2] [,3]
[1,] 417.0761 3.656920    1
[2,] 549.2227 1.279305    1
[3,] 617.8346 2.676573    2
[4,] 445.3850 3.646215    2
[5,] 451.8529 4.337643    1
[6,] 391.7912 3.995142    2


# get mean and median by group (column 3 of y)
y.mean   <- aggregate(y[,1:2], by=list(y[,3]), mean)
y.median <- aggregate(y[,1:2], by=list(y[,3]), median)

# merge y.mean and median by group, and label with suffix
y.summary <- merge(y.mean, y.median,by='Group.1', suffixes=c('mean','median'))

# print out the result (the merged object is y.summary, not summary)
print(y.summary)

  Group.1   V1mean   V2mean V1median V2median
1       1 472.7172 3.091289 451.8529 3.656920
2       2 485.0036 3.439310 445.3850 3.646215
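
For comparison, the same per-group summary can be written as an explicit for loop, which is what the question asks about. This is only a minimal sketch: the toy data frame `dat` and the result column names are invented for illustration, and aggregate() remains the shorter route.

```r
# Hypothetical toy data: 2 subjects, 3 trials each
dat <- data.frame(
  Subject = c(1, 1, 1, 2, 2, 2),
  RT      = c(1418, 2003, 2334, 1313, 1460, 2330)
)

subjects <- unique(dat$Subject)

# Pre-allocate the result: one row per subject
res <- data.frame(
  Subject   = subjects,
  RT_mean   = NA_real_,
  RT_median = NA_real_
)

for (i in seq_along(subjects)) {
  rows <- dat$Subject == subjects[i]        # trials belonging to this subject
  res$RT_mean[i]   <- mean(dat$RT[rows])
  res$RT_median[i] <- median(dat$RT[rows])
}

print(res)

# aggregate() gives the same means without the loop
agg <- aggregate(RT ~ Subject, data = dat, FUN = mean)
```

Pre-allocating `res` and filling it by index avoids growing a data frame inside the loop, which is a common source of trouble for beginners.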

Answer 1 (score: 2)

Since I did not know what your data frame looks like, I created some simple sample data. Here is another approach, using dplyr.

#sample data frame
id <- rep(1:10, each = 40)
rt <- runif(400, 0.1, 1.5)
rt2 <- runif(400, 0.1,1.7)
foo <- data.frame(id, rt, rt2, stringsAsFactors = FALSE)

library(dplyr)

foo %>%
    group_by(id) %>%
    summarise_each(funs(mean = mean(., na.rm = TRUE),
                   median = median(., na.rm = TRUE)))

#   id   rt_mean  rt2_mean rt_median rt2_median
#1   1 0.7217723 0.8612916 0.6722035  0.8950618
#2   2 0.7374311 0.8930941 0.6821156  0.8767759
#3   3 0.8419620 0.7738735 0.8913319  0.7270914
#4   4 0.8388703 1.0013907 0.7652657  1.1188743
#5   5 0.8680372 0.8122654 0.8801511  0.6933033
#6   6 0.8141279 0.9359209 0.9551427  0.9362919
#7   7 0.8091938 0.8359638 0.8469513  0.7844926
#8   8 0.7366915 0.7522470 0.7680704  0.6833661
#9   9 0.7470820 0.7840083 0.6487139  0.7460022
#10 10 0.7998107 0.6379467 0.8203582  0.5896608

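Update: having seen in the comments what you were trying to do with aggregate(), you can do the same kind of thing with dplyr. In this case you get the mean and median of Risposta and RT.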

mydf %>%
    filter(Synchrony == "si") %>% # subset data with si only
    group_by(Subject) %>%
    summarise_each(funs(mean = mean(., na.rm = TRUE),
                        median = median(., na.rm = TRUE)),
                        Risposta, RT)

#  I am missing the last row of the data here. So, the results should be
#  slightly different with the full data set.
#  Subject Risposta_mean RT_mean Risposta_median RT_median
#1       1             4  1710.5               4    1710.5
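
In current dplyr releases, summarise_each() and funs() are deprecated in favour of across(). A rough equivalent of the call above might look like the sketch below; it assumes dplyr 1.0.0 or later, and `mydf2` is a made-up toy data frame that only borrows the question's column names.

```r
library(dplyr)  # across() and .groups need dplyr >= 1.0.0

# Hypothetical toy data reusing the question's column names
mydf2 <- data.frame(
  Subject   = c(1L, 1L, 1L, 2L, 2L),
  Synchrony = c("si", "no", "si", "si", "si"),
  Risposta  = c(5L, 1L, 3L, 4L, 2L),
  RT        = c(1418L, 1313L, 2003L, 1500L, 1700L)
)

res <- mydf2 %>%
  filter(Synchrony == "si") %>%            # keep the "si" trials only
  group_by(Subject) %>%
  summarise(
    across(c(Risposta, RT),
           list(mean   = ~ mean(.x, na.rm = TRUE),
                median = ~ median(.x, na.rm = TRUE))),
    .groups = "drop"
  )

print(res)
```

Note that across() orders the result columns by variable (Risposta_mean, Risposta_median, RT_mean, RT_median) rather than by function.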

Answer 2 (score: 2)

Using data.table with @jazzurro's sample data:

> library(data.table)
> foodt = data.table(foo)
> foodt[,list(mean.rt=mean(rt), median.rt=median(rt), mean.rt2=mean(rt2), median.rt2=median(rt2)),by=id]
    id   mean.rt median.rt  mean.rt2 median.rt2
 1:  1 0.8370809 0.7547919 0.8533929  0.8363765
 2:  2 0.8050453 0.8131681 0.9579030  1.0284944
 3:  3 0.8221798 0.8210501 0.9458442  1.0073267
 4:  4 0.8491232 0.8463559 0.9728266  0.9574839
 5:  5 0.7617457 0.7176411 0.9349860  0.9857195
 6:  6 0.5956108 0.4745952 0.9008883  0.9105738
 7:  7 0.8396380 0.7679036 0.8994247  0.9631958
 8:  8 0.7882674 0.7532493 0.8935340  0.8600171
 9:  9 0.8827633 0.9542983 0.9341739  0.8908895
10: 10 0.7579038 0.7140594 0.9200357  0.8963950

Because set.seed was not used, the results differ from @jazzurro's.
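
To make such sample data reproducible, the random number generator can be seeded before the draws. A minimal sketch, assuming an arbitrary seed value of 123 (any fixed integer would do):

```r
set.seed(123)  # arbitrary seed, chosen only for illustration
id  <- rep(1:10, each = 40)
rt  <- runif(400, 0.1, 1.5)
rt2 <- runif(400, 0.1, 1.7)
foo <- data.frame(id, rt, rt2, stringsAsFactors = FALSE)

# Re-seeding with the same value reproduces the same draws
set.seed(123)
rt_again <- runif(400, 0.1, 1.5)
identical(rt, rt_again)
```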

Answer 3 (score: 1)

If there are many columns to aggregate, another option in data.table is:

 library(data.table) # using data.table_1.9.5, though it should work with earlier versions

 nm1 <- c("Risposta", "RT") # subset of `colnames` of `mydf` from which `mean`, `median` etc are calculated. 

If you need the mean and median of the above columns for a subset of the dataset, i.e. mydf$Synchrony=='si', then

 setDT(mydf)[Synchrony=='si', as.list(unlist(lapply(.SD, function(x)
            list(mean=mean(x, na.rm=TRUE), median=median(x, na.rm=TRUE))))),
                           by=Subject,.SDcols=nm1] 

 #   Subject Risposta.mean Risposta.median  RT.mean RT.median
#1:       1      4.333333               5 1918.333      2003

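In the code above, setDT(mydf) converts the data.frame object to a data.table. The logical condition Synchrony=='si' then restricts the computation to the rows where it is TRUE. .SD stands for Subset of Data.table. Because we specify .SDcols=nm1, lapply(.SD, ...) loops over only the columns named in nm1. If you have more than one function, combine them in a list, as in list(mean=mean(x,na.rm=TRUE), median=median(x,na.rm=TRUE)), and finally wrap the result in as.list(unlist(lapply(...))) to get the output in wide format.

Data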
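
The steps described above can be looked at one at a time. The sketch below is illustrative only: `dt` is a made-up toy table that borrows the question's column names, and `mean_median` is a hypothetical helper name.

```r
library(data.table)

# Hypothetical toy data reusing the question's column names
dt <- data.table(
  Subject   = c(1L, 1L, 1L, 2L, 2L),
  Synchrony = c("si", "no", "si", "si", "si"),
  Risposta  = c(5L, 1L, 3L, 4L, 2L),
  RT        = c(1418L, 1313L, 2003L, 1500L, 1700L)
)
nm1 <- c("Risposta", "RT")

# Hypothetical helper: both statistics for one column
mean_median <- function(x) list(mean   = mean(x, na.rm = TRUE),
                                median = median(x, na.rm = TRUE))

# Step 1: what lapply(.SD, ...) sees for a single group (Subject 1, "si" rows)
one_group <- dt[Subject == 1L & Synchrony == "si", nm1, with = FALSE]
nested <- lapply(one_group, mean_median)   # nested list: column -> statistic

# Step 2: unlist() flattens it to a named vector, as.list() makes
# each element a separate column
flat <- as.list(unlist(nested))
names(flat)

# All steps together, once per Subject
res <- dt[Synchrony == "si",
          as.list(unlist(lapply(.SD, mean_median))),
          by = Subject, .SDcols = nm1]
print(res)
```

The dotted column names (for example RT.mean) come from unlist(), which joins the outer and inner list names with a period.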

mydf <-  structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L), Trial = 8:13, 
File = c("sinc2_lab579.wmv", "asinc12_lab612.wmv", "asinc9_lab879.wmv", 
"asinc3_con13.wmv", "sinc11_con13.wmv", "sinc4_lab879.wmv"
), Int = c("ASD", "ASD", "ASD", "TD", "TD", "ASD"), Target = c("sinc", 
"asinc", "asinc", "asinc", "sinc", "sinc"), Synchrony = c("si", 
"no", "no", "no", "si", "si"), corr_ans = c(1L, 0L, 0L, 0L, 
1L, 1L), Risposta = c(5L, 1L, 1L, 2L, 3L, 5L), ACC = c(1L, 
1L, 1L, 1L, 0L, 1L), RT = c(1418L, 1313L, 1460L, 2330L, 2003L, 
2334L)), .Names = c("Subject", "Trial", "File", "Int", "Target", 
"Synchrony", "corr_ans", "Risposta", "ACC", "RT"), class = "data.frame", row.names = c("8", 
"9", "10", "11", "12", "13"))