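I have a data frame with subjects in the rows and variables in the columns. The problem is that there are 40 rows per subject (because each subject completed 40 trials), so a row does not correspond to a single subject.
I would like a new data frame in which each row contains one subject and the columns contain the mean and median of some of the variables.
Unfortunately I am relatively new to R and to programming in general. I have never managed to get a for loop working, and I suppose I need something like that here.
Can anyone suggest an approach?
Here is my data.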
   Subject Trial               File Int Target Synchrony corr_ans Risposta ACC   RT
8        1     8   sinc2_lab579.wmv ASD   sinc        si        1        5   1 1418
9        1     9 asinc12_lab612.wmv ASD  asinc        no        0        1   1 1313
10       1    10  asinc9_lab879.wmv ASD  asinc        no        0        1   1 1460
11       1    11   asinc3_con13.wmv  TD  asinc        no        0        2   1 2330
12       1    12   sinc11_con13.wmv  TD   sinc        si        1        3   0 2003
13       1    13   sinc4_lab879.wmv ASD   sinc        si        1        5   1 2334
Thanks, Moro
Answer 0 (score: 2)
Following up on Ananda Mahto's suggestion, here is a simple example using the aggregate function:
> y
[,1] [,2] [,3]
[1,] 417.0761 3.656920 1
[2,] 549.2227 1.279305 1
[3,] 617.8346 2.676573 2
[4,] 445.3850 3.646215 2
[5,] 451.8529 4.337643 1
[6,] 391.7912 3.995142 2
# get mean and median by group (column 3 of y)
y.mean <- aggregate(y[,1:2], by=list(y[,3]), mean)
y.median <- aggregate(y[,1:2], by=list(y[,3]), median)
# merge y.mean and median by group, and label with suffix
y.summary <- merge(y.mean, y.median,by='Group.1', suffixes=c('mean','median'))
# print out result
print(y.summary)
Group.1 V1mean V2mean V1median V2median
1 1 472.7172 3.091289 451.8529 3.656920
2 2 485.0036 3.439310 445.3850 3.646215
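The same aggregate() idea can be sketched directly on the columns from the question. This is only a sketch under assumptions: the six rows are the ones posted in the question (all from Subject 1), only the columns needed are kept, and the do.call(data.frame, ...) flattening step is an addition for illustration, not part of the answer above.

```r
# Six rows copied from the question's sample data (only the columns needed here)
mydf <- data.frame(
  Subject  = c(1L, 1L, 1L, 1L, 1L, 1L),
  Risposta = c(5L, 1L, 1L, 2L, 3L, 5L),
  RT       = c(1418L, 1313L, 1460L, 2330L, 2003L, 2334L)
)

# One row per Subject; each summarised variable comes back as a
# two-column matrix (mean, median) inside the result
res <- aggregate(cbind(Risposta, RT) ~ Subject, data = mydf,
                 FUN = function(x) c(mean = mean(x), median = median(x)))

# Flatten the matrix columns into ordinary columns
# (Risposta.mean, Risposta.median, RT.mean, RT.median)
res <- do.call(data.frame, res)
print(res)
```

With the full data set (40 trials per subject), the same call should return one row per subject, with no merge step needed.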
Answer 1 (score: 2)
Since I don't know what your data frame looks like, I created some simple sample data. Here is another approach, using dplyr.
#sample data frame
id <- rep(1:10, each = 40)
rt <- runif(400, 0.1, 1.5)
rt2 <- runif(400, 0.1,1.7)
foo <- data.frame(id, rt, rt2, stringsAsFactors = FALSE)
library(dplyr)
foo %>%
group_by(id) %>%
summarise_each(funs(mean = mean(., na.rm = TRUE),
median = median(., na.rm = TRUE)))
# id rt_mean rt2_mean rt_median rt2_median
#1 1 0.7217723 0.8612916 0.6722035 0.8950618
#2 2 0.7374311 0.8930941 0.6821156 0.8767759
#3 3 0.8419620 0.7738735 0.8913319 0.7270914
#4 4 0.8388703 1.0013907 0.7652657 1.1188743
#5 5 0.8680372 0.8122654 0.8801511 0.6933033
#6 6 0.8141279 0.9359209 0.9551427 0.9362919
#7 7 0.8091938 0.8359638 0.8469513 0.7844926
#8 8 0.7366915 0.7522470 0.7680704 0.6833661
#9 9 0.7470820 0.7840083 0.6487139 0.7460022
#10 10 0.7998107 0.6379467 0.8203582 0.5896608
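Note that summarise_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, the same summary can probably be written with across(), roughly as sketched below. The set.seed(1) call is an addition for reproducibility and is not in the answer above, and the result columns come out grouped by variable (rt_mean, rt_median, rt2_mean, rt2_median) rather than by statistic.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so that reruns give the same numbers

# Same shape as the sample data above: 10 subjects, 40 trials each
foo <- data.frame(id  = rep(1:10, each = 40),
                  rt  = runif(400, 0.1, 1.5),
                  rt2 = runif(400, 0.1, 1.7))

res <- foo %>%
  group_by(id) %>%
  # apply both summary functions to rt and rt2; columns are named {column}_{function}
  summarise(across(c(rt, rt2),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE))))

print(res)
```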
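Update: Having seen in the comments what you were trying to do with aggregate(), you can do the same kind of thing with dplyr. In this case, you get the mean and median of Risposta and RT.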
mydf %>%
filter(Synchrony == "si") %>% # subset data with si only
group_by(Subject) %>%
summarise_each(funs(mean = mean(., na.rm = TRUE),
median = median(., na.rm = TRUE)),
Risposta, RT)
# I am missing the last row of the data here. So, the results should be
# slightly different with the full data set.
# Subject Risposta_mean RT_mean Risposta_median RT_median
#1 1 4 1710.5 4 1710.5
Answer 2 (score: 2)
Using data.table with @jazzurro's sample data:
> library(data.table)
> foodt = data.table(foo)
> foodt[,list(mean.rt=mean(rt), median.rt=median(rt), mean.rt2=mean(rt2), median.rt2=median(rt2)),by=id]
id mean.rt median.rt mean.rt2 median.rt2
1: 1 0.8370809 0.7547919 0.8533929 0.8363765
2: 2 0.8050453 0.8131681 0.9579030 1.0284944
3: 3 0.8221798 0.8210501 0.9458442 1.0073267
4: 4 0.8491232 0.8463559 0.9728266 0.9574839
5: 5 0.7617457 0.7176411 0.9349860 0.9857195
6: 6 0.5956108 0.4745952 0.9008883 0.9105738
7: 7 0.8396380 0.7679036 0.8994247 0.9631958
8: 8 0.7882674 0.7532493 0.8935340 0.8600171
9: 9 0.8827633 0.9542983 0.9341739 0.8908895
10: 10 0.7579038 0.7140594 0.9200357 0.8963950
Because set.seed was not used, the results differ from @jazzurro's.
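To illustrate the set.seed point, here is a minimal base R sketch. The seed value 42 is arbitrary and chosen only for this example. Fixing the seed before generating the sample data makes runif() return the same draws each time, so two people running the same code would see the same summaries.

```r
# Same seed, same sequence of random draws
set.seed(42)
a <- runif(5, 0.1, 1.5)

set.seed(42)
b <- runif(5, 0.1, 1.5)

same_draws <- identical(a, b)
print(same_draws)
```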
Answer 3 (score: 1)
If there are many columns to aggregate, another option in data.table is:
library(data.table) # using data.table_1.9.5, though it should work with earlier versions
nm1 <- c("Risposta", "RT") # subset of `colnames` of `mydf` from which `mean`, `median` etc are calculated.
If you need the mean and median of the above columns for a subset of the dataset, i.e. the rows where mydf$Synchrony=='si', then
setDT(mydf)[Synchrony=='si', as.list(unlist(lapply(.SD, function(x)
list(mean=mean(x, na.rm=TRUE), median=median(x, na.rm=TRUE))))),
by=Subject,.SDcols=nm1]
# Subject Risposta.mean Risposta.median RT.mean RT.median
#1: 1 4.333333 5 1918.333 2003
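In the code above, setDT(mydf) converts the data.frame object to a data.table. The logical condition Synchrony=='si' then restricts the computation to the rows where that condition is TRUE. .SD stands for Subset of Data.table. Because we specify .SDcols, lapply(.SD, ..) iterates over a list containing only the columns named in nm1 (.SDcols=nm1). If you have more than one function, combine them in a list, i.e. list(mean=mean(x,na.rm=TRUE), median=median(x,na.rm=TRUE)), and finally apply unlist(lapply(..)) followed by as.list( to get the result in wide format.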
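The flattening step described above can be seen in isolation with base R. This is only an illustrative sketch: cols is a hypothetical stand-in for what .SD holds for Subject 1 after filtering on Synchrony=='si', using the values from the posted data.

```r
# Hypothetical stand-in for .SD: the two selected columns for the 'si' rows
cols <- list(Risposta = c(5L, 3L, 5L), RT = c(1418L, 2003L, 2334L))

# lapply() returns a nested list: one (mean, median) pair per column
nested <- lapply(cols, function(x)
  list(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE)))

# unlist() flattens the nesting and joins the names with ".",
# as.list() turns the vector back into a list, one element per output column
flat <- as.list(unlist(nested))
print(names(flat))
```

Because data.table builds one column per list element returned in j, the flattened list is what produces the wide result with columns such as Risposta.mean and RT.median.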
Sample data used above:
mydf <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L), Trial = 8:13,
File = c("sinc2_lab579.wmv", "asinc12_lab612.wmv", "asinc9_lab879.wmv",
"asinc3_con13.wmv", "sinc11_con13.wmv", "sinc4_lab879.wmv"
), Int = c("ASD", "ASD", "ASD", "TD", "TD", "ASD"), Target = c("sinc",
"asinc", "asinc", "asinc", "sinc", "sinc"), Synchrony = c("si",
"no", "no", "no", "si", "si"), corr_ans = c(1L, 0L, 0L, 0L,
1L, 1L), Risposta = c(5L, 1L, 1L, 2L, 3L, 5L), ACC = c(1L,
1L, 1L, 1L, 0L, 1L), RT = c(1418L, 1313L, 1460L, 2330L, 2003L,
2334L)), .Names = c("Subject", "Trial", "File", "Int", "Target",
"Synchrony", "corr_ans", "Risposta", "ACC", "RT"), class = "data.frame", row.names = c("8",
"9", "10", "11", "12", "13"))