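R: Create subsets of rows (based on a column's values) for further analysis

Time: 2014-01-22 01:10:55

Tags: r subset

I'm very new to R, so this may be a silly question. Please bear with me...

We assessed participants' attention in a study. Each participant completed 365 trials in one of two conditions; we recorded their responses, accuracy, etc. The data currently look like this, with the first row holding the column headers: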

participant_id  trial  condition  accuracy  etc.
 101            1         0        1       ... 
 101            2         0        1       ...
 101            3         0        0       ...
 102            1         3        1       ...
 102            2         3        0       ...

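I want to compute the overall mean accuracy for the first and the last 120 trials. Note: of the 365 trials, the first five were practice trials only. So I would like descriptives (mean, standard deviation, etc.) of overall accuracy for trials 6-125 (first 120) and 246-365 (last 120).

I tried splitting the data with the subset() command, but I'm not sure it is appropriate. I'm also not sure of the best way to compute my means.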

#split data.sub into first and last 120 trials

data.sub120=subset(data.sub, data.sub$trial== 6:125)
data.sub120last=subset (data.sub, data.sub$trial== 246:365)
stat.desc (data.sub120,data.sub120last)

Any help would be greatly appreciated - sorry if I'm wasting anyone's time, still learning!

Thanks!

4 Answers:

Answer 0: (score: 1)

library(plyr)

# ddply takes a data.frame, splits by a variable, applies a fn,
# and returns everything back to a data.frame
results <- ddply(data.sub, .(participant_id), function(x) {
     # order the data by trial number
     x <- arrange(x, trial)
     # Take rows 6-125 (the first 120 non-practice trials), and only columns 3 and 4 
     # since they are the only numeric ones in your example above, 
     # and apply the mean function to each column
     # turn that into a data.frame
     result <- data.frame(t(apply(x[6:125, c(3,4)], 2, mean)))
     # add the participant ID
     result$participant_id <- unique(x$participant_id)
     result
    })

Answer 1: (score: 1)

You can subset using inequalities:

## creating data for demonstration purposes

demo.data <- data.frame(participant.id = c(rep(101, 365), rep(102, 365), rep(103, 365)),
                        trial = c(1:365, 1:365, 1:365),
                        accuracy = rbinom(365*3, 1, 0.5))

## getting the first 120 trials
data.sub120 <- demo.data[demo.data$trial>5 & demo.data$trial<126,]

##getting the last 120 trials
data.sub120last <- demo.data[demo.data$trial>245 & demo.data$trial<366,]

##taking the means
mean(data.sub120$accuracy)
mean(data.sub120last$accuracy)
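
The subset() calls in the question go wrong because `==` compares element by element and recycles the shorter vector, so `trial == 6:125` does not test whether a trial falls in that range. A minimal sketch of the difference, using a made-up vector (the toy `trial` values and the 6:7 window are illustrative assumptions, not the asker's data); `%in%` is the base R membership operator:

```r
# Hypothetical toy data: one participant, trials 1 to 10
trial <- 1:10

# `==` recycles 6:7 along `trial` (6, 7, 6, 7, ...) and compares position by position
eq_hits <- trial == 6:7

# `%in%` asks, for each trial, whether it appears anywhere in the set
in_hits <- trial %in% 6:7

which(eq_hits)  # positions where the recycled comparison happened to match
which(in_hits)  # positions of the trials that are actually in 6:7
```

With the question's data, the same idea would read `subset(data.sub, trial %in% 6:125)`.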

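Answer 2: (score: 1)

I find it good practice to create a variable that describes the subsets and to store it with the data for future use. You will thank yourself later for being able to reproduce most of your analysis (bonus points for naming variables in a way that is intrinsically meaningful to you).

First, let's create a basic factor based on your criteria and attach it to your dataset: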

# cut() intervals are right-closed: (0,5], (5,125], (125,245], (245,365]
mydata$trialsplit <- cut(mydata$trial, c(0,5,125,245,365), 
                    labels=c("Practice","First120","Middle","Last120"))
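
Because cut() builds right-closed intervals by default (each interval includes its upper break but not its lower one), it is worth tabulating the result to confirm that each group holds the expected trials. A minimal check, assuming a hypothetical single participant with trials 1 to 365:

```r
# Hypothetical single participant with trials 1..365
trial <- 1:365

# Right-closed intervals: (0,5], (5,125], (125,245], (245,365]
trialsplit <- cut(trial, breaks = c(0, 5, 125, 245, 365),
                  labels = c("Practice", "First120", "Middle", "Last120"))

# Count the trials in each group and show the trial range each group covers
table(trialsplit)
tapply(trial, trialsplit, range)
```

If the counts come out as 5, 120, 120 and 120, the break points match the intended windows.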

I'm also a fan of the plyr package, so I'll use it in a way similar to Maiasaura. If you only need a summary table, you can do the following:

library(plyr)
ddply(mydata, .(trialsplit), summarize, 
      mean_condition = mean(condition),
      sd_condition = sd(condition),
      mean_accuracy = mean(accuracy),
      sd_accuracy = sd(accuracy)
)

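If you want to attach the information to your data rather than produce a summary, change the word "summarize" to "transform".

With the cut variable saved, running statistical tests on the data is now very simple: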
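
To make the difference concrete, here is a small sketch on made-up data (the six-row `mydata` below is a hypothetical stand-in, not the asker's dataset). It assumes the plyr package is installed:

```r
library(plyr)

# Hypothetical toy data: 2 groups of 3 trials each
mydata <- data.frame(
  trialsplit = factor(rep(c("First120", "Last120"), each = 3)),
  accuracy   = c(1, 1, 0, 0, 0, 1)
)

# summarize: one row per group
ddply(mydata, .(trialsplit), summarize, mean_accuracy = mean(accuracy))

# transform: original rows are kept, with the group mean repeated on every row
with_means <- ddply(mydata, .(trialsplit), transform,
                    mean_accuracy = mean(accuracy))
with_means
```

The summarize call returns two rows, while the transform call returns all six rows with an extra mean_accuracy column.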

# Does accuracy change from the first 120 to the last 120 trials?

t.test(mydata$accuracy[mydata$trialsplit == "First120"],
       mydata$accuracy[mydata$trialsplit == "Last120"])
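
Note that the test above treats every trial as an independent observation, although each participant contributes trials to both blocks. One alternative, offered only as a sketch, is to average within each participant first and then run a paired test on those per-participant means. The simulated data, the seed, and the names `per_person`, `first` and `last` are assumptions for illustration:

```r
# Hypothetical simulated data: 3 participants x 365 trials
set.seed(1)
mydata <- data.frame(
  participant_id = rep(101:103, each = 365),
  trial          = rep(1:365, times = 3),
  accuracy       = rbinom(365 * 3, 1, 0.5)
)
mydata$trialsplit <- cut(mydata$trial, breaks = c(0, 5, 125, 245, 365),
                         labels = c("Practice", "First120", "Middle", "Last120"))

# One mean accuracy per participant per block
per_person <- aggregate(accuracy ~ participant_id + trialsplit,
                        data = mydata, FUN = mean)

# Within each block the rows are ordered by participant, so the vectors line up
first <- per_person$accuracy[per_person$trialsplit == "First120"]
last  <- per_person$accuracy[per_person$trialsplit == "Last120"]

# Compare first and last block within participant
t.test(first, last, paired = TRUE)
```

With only three simulated participants the test has little power; the point is the shape of the analysis, not the result.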

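Answer 3: (score: 1)

Here is another solution, in the same vein as Brandson's, using the data.table package. It is faster than plyr, and I find its syntax for aggregation problems more intuitive. See the documentation for further reference.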

demo.data <- data.frame(participant.id = c(rep(101, 365), rep(102, 365), rep(103, 365)),
                        trial = c(1:365, 1:365, 1:365),
                        condition = letters[1:5],
                        accuracy = rbinom(365*3, 1, 0.5))

require("data.table")
DT <- data.table(demo.data)

# cut() intervals are right-closed, so these breaks give trials 1-5, 6-125, 126-245, 246-365
DT$fc_trial <- cut(DT$trial, breaks = c(0, 5, 125, 245, 365),
                   labels = c("Practice","First120","Middle","Last120"))

result <- DT[,j=list(mean_accuracy = mean(accuracy),
                     sd_accuracy = sd(accuracy)
                     )
             , by = fc_trial]
print(result)

# Example output (accuracy is simulated without a seed, so values differ on each run):
#    fc_trial mean_accuracy sd_accuracy
# 1: Practice     0.6000000   0.5070926
# 2: First120     0.5151515   0.5004602
# 3:   Middle     0.5833333   0.4936928
# 4:  Last120     0.4677871   0.4996615