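I have a data frame with three variables: treatment, dose, and outcome (positive or negative). I have multiple observations for each treatment and dose. I am trying to output a contingency table that collapses the data to show the number of positive outcomes for each combination of treatment and dose, along with the number of observations. For example: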
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1
The desired output is:
treatment dose outcome n
control 0 1 4
treatmentA 1 2 3
treatmentA 2 3 3
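I have been playing with this all day and have not had much luck beyond getting the frequency of each outcome for each observation. I would appreciate any suggestions (including pointers to the relevant parts of the R manual and/or examples).
Thanks!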
Answer 0 (score: 5)
Here is a solution using the wonderful data.table package:
library(data.table)
x <- data.table(read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE))
x[, list(outcome = sum(outcome), count = .N), by = 'treatment,dose']
which yields
treatment dose outcome count
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
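As a rough sketch of what .N contributes to the call above: it is data.table's built-in symbol for the number of rows in the current group. The example below rebuilds the question's data by hand; the names dt and counts are hypothetical and chosen only for illustration.

```r
library(data.table)

# Question's example data, rebuilt by hand (dt is a hypothetical name)
dt <- data.table(
  treatment = c(rep("control", 4), rep("treatmentA", 6)),
  dose      = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
  outcome   = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)
)

# Without `by`, .N is the total number of rows
total <- dt[, .N]

# With `by`, .N is evaluated once per group; the result column is named N
counts <- dt[, .N, by = .(treatment, dose)]
counts
```

Here .(treatment, dose) is data.table's shorthand for list(treatment, dose), so it groups the same way as the 'treatment,dose' string used above.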
Answer 1 (score: 4)
If you don't want to use an extra library as suggested in the other answers, you can try the following.
> df
treatment dose outcome
1 control 0 0
2 control 0 0
3 control 0 0
4 control 0 1
5 treatmentA 1 0
6 treatmentA 1 1
7 treatmentA 1 1
8 treatmentA 2 1
9 treatmentA 2 1
10 treatmentA 2 1
> dput(df)
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("control", "treatmentA"), class = "factor"),
dose = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), outcome = c(0L,
0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), .Names = c("treatment",
"dose", "outcome"), class = "data.frame", row.names = c(NA, -10L
))
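Now we use the aggregate function to get the number of observations and the sum of the outcome column for each treatment and dose: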
> nObs <- aggregate(outcome ~ treatment + dose, data = df, length)
> sObs <- aggregate(outcome ~ treatment + dose, data = df, sum)
Rename the aggregated columns appropriately:
> names(nObs) <- c('treatment', 'dose', 'count')
> names(sObs) <- c('treatment', 'dose', 'sum')
> nObs
treatment dose count
1 control 0 4
2 treatmentA 1 3
3 treatmentA 2 3
> sObs
treatment dose sum
1 control 0 1
2 treatmentA 1 2
3 treatmentA 2 3
Use merge to combine the two results. By default it joins on all columns that share a name, which in this case are treatment and dose.
> result <- merge(nObs, sObs)
> result
treatment dose count sum
1 control 0 4 1
2 treatmentA 1 3 2
3 treatmentA 2 3 3
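To make the join columns visible rather than implicit, the same merge can be written with an explicit by argument. This is only a sketch on the question's data, rebuilt here so the block runs on its own; the names implicit and explicit are hypothetical.

```r
# Question's example data, rebuilt by hand
df <- data.frame(
  treatment = c(rep("control", 4), rep("treatmentA", 6)),
  dose      = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
  outcome   = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)
)

nObs <- aggregate(outcome ~ treatment + dose, data = df, FUN = length)
sObs <- aggregate(outcome ~ treatment + dose, data = df, FUN = sum)
names(nObs)[3] <- "count"
names(sObs)[3] <- "sum"

# Default: merge() joins on the column names the two data frames share
implicit <- merge(nObs, sObs)

# Same join, with the key columns spelled out
explicit <- merge(nObs, sObs, by = c("treatment", "dose"))

identical(implicit, explicit)
```

Spelling out by guards against an accidental join on some other column that happens to share a name in both data frames.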
Answer 2 (score: 3)
If I understand correctly, this is straightforward with the data.table library. First, load the library and read the data:
library(data.table)
data <- read.table(header=TRUE, text="
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1")
Next, create a data.table with the treatment and dose columns as the table key (index).
data <- data.table(data, key="treatment,dose")
Then summarize using data.table syntax.
data[, list(outcome=sum(outcome), n=length(outcome)), by=list(treatment,dose)]
treatment dose outcome n
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
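For readers unfamiliar with keys, here is a small sketch of what setting one does, using the question's data rebuilt by hand. The names dt and sub are hypothetical. Note that grouping with by does not require a key; the key mainly sorts the table and enables fast keyed lookups.

```r
library(data.table)

# Question's example data, rebuilt by hand (dt is a hypothetical name)
dt <- data.table(
  treatment = c(rep("control", 4), rep("treatmentA", 6)),
  dose      = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
  outcome   = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)
)

# setkey() sorts the table by reference and records the key columns
setkey(dt, treatment, dose)
key(dt)

# Keyed lookup: all rows with treatment "treatmentA" and dose 2
sub <- dt[.("treatmentA", 2L)]
sub
```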
Answer 3 (score: 2)
# read in your example data as `x`
x <- read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE)
# load the sql data frame library
library(sqldf)
# create a new table of all unique `treatment` and `dose` columns,
# summing the `outcome` column and
# counting the number of records in each combo
y <- sqldf( 'SELECT treatment, dose ,
sum( outcome ) as outcome ,
count(*) as n
FROM x
GROUP BY treatment, dose' )
# check the results
y
Answer 4 (score: 2)
Here are two more options (even though the data.table approach clearly wins on conciseness of syntax).
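The first uses ave inside within. ave applies a function to a variable (the first argument) grouped by one or more other variables. After removing the now-unnecessary outcome column, we wrap the output in unique.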
unique(within(df, {
SUM <- ave(outcome, treatment, dose, FUN = sum)
COUNT <- ave(outcome, treatment, dose, FUN = length)
rm(outcome)
}))
# treatment dose COUNT SUM
# 1 control 0 4 1
# 5 treatmentA 1 3 2
# 8 treatmentA 2 3 3
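To see why the unique call is needed, it may help to look at what ave returns on its own. The sketch below rebuilds the question's data by hand; s is a hypothetical name.

```r
# Question's example data, rebuilt by hand
df <- data.frame(
  treatment = c(rep("control", 4), rep("treatmentA", 6)),
  dose      = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
  outcome   = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)
)

# ave() returns one value per input row: each row gets its group's sum
s <- ave(df$outcome, df$treatment, df$dose, FUN = sum)
s
length(s) == nrow(df)
```

Because every row in a group carries the same summary value, the rows become exact duplicates once outcome is dropped, and unique reduces them to one row per group.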
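The second base R solution is very similar to @geektrader's answer, except that it computes sum and length in a single call to aggregate. There is a "drawback", though: because of cbind, the outcome "column" in the resulting data.frame is actually a matrix. Look at the result of str to see what I mean.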
temp <- aggregate(outcome ~ treatment + dose, df,
function(x) cbind(sum(x), length(x)))
str(temp)
# 'data.frame': 3 obs. of 3 variables:
# $ treatment: Factor w/ 2 levels "control","treatmentA": 1 2 2
# $ dose : int 0 1 2
# $ outcome : int [1:3, 1:2] 1 2 3 4 3 3
colnames(temp$outcome) <- c("SUM", "COUNT")
temp
# treatment dose outcome.SUM outcome.COUNT
# 1 control 0 1 4
# 2 treatmentA 1 2 3
# 3 treatmentA 2 3 3
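I call this storage structure a "drawback" mainly because you may not get the result you expect when you try to access the data the way you are probably used to.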
temp$outcome.SUM
# NULL
temp$outcome
# SUM COUNT
# [1,] 1 4
# [2,] 2 3
# [3,] 3 3
Instead, you have to access it like this:
temp$outcome[, "SUM"] ## or temp$outcome[, 1]
# [1] 1 2 3
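One possible way around the matrix column, offered as a sketch rather than as part of the approach above, is to pass the columns of the aggregated result to data.frame one by one with do.call, which expands a matrix column into ordinary columns. The data is rebuilt by hand here, the summary function returns a named vector instead of using cbind, and flat is a hypothetical name.

```r
# Question's example data, rebuilt by hand
df <- data.frame(
  treatment = c(rep("control", 4), rep("treatmentA", 6)),
  dose      = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
  outcome   = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)
)

# A named vector result gives the matrix column its column names directly
temp <- aggregate(outcome ~ treatment + dose, data = df,
                  FUN = function(x) c(SUM = sum(x), COUNT = length(x)))

# data.frame() splits the matrix column into separate, regular columns
flat <- do.call(data.frame, temp)
names(flat)
flat$outcome.SUM
```

After flattening, the $ accessor works on outcome.SUM and outcome.COUNT like on any other column.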