I have some data in the following format:
header1 header2
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "candy"
"nocandy" "candy"
"candy" "candy"
etc...
I imported it with:
candytext <- read.table("candytest.txt", header=TRUE)
I want to run a chi-squared test to see whether the two groups differ.
When I call table(candytext), I get this:
         header2
header1   candy nocandy
  candy     112      39
  nocandy     4      82
But if I run summary(candytext), I get something like this:
    header1       header2
 candy  :151   candy  :116
 nocandy: 86   nocandy:121
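As you can see, the two tables have different formats. I can run the chi-squared test on the first table but not on the second, even though the summary table is closer to what I need as input for chisq.test(). The cross-tabulation from table() treats the data as paired, but my data are not paired. If they were paired, that would be fine and I could run McNemar's test on the output of table(candytext), but they are not. So how do I create a 2×2 matrix that looks like the summary table without typing it in by hand? I realise I could copy the summary table into a matrix manually, but I would like to know how to do this with a function in R.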
Thanks!
Answer 0 (score: 1)
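Here I use lapply to get the summary of each column of df1, assuming the columns are of class factor. From the post, I guess that is the case. Calling do.call(data.frame, ...) on the resulting list converts it to a data.frame.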
do.call(data.frame,lapply(df1, summary)) #in case a matrix output is needed, just replace `data.frame` with `cbind`
# header1 header2
#candy 1 3
#nocandy 5 3
summary(df1)
# header1 header2
#candy :1 candy :3
#nocandy:5 nocandy:3
If you only need selected columns out of the many columns in a dataset:
nm1 <- paste0("header",1:2) #names of columns to do the summary
do.call(`cbind`, lapply(df1[nm1], summary))
# header1 header2
#candy 1 3
#nocandy 5 3
You can also do this with data.table:
library(data.table)
DT <- setDT(df1)[, lapply(.SD, summary)] #or
#DT <- setDT(df1)[, lapply(.SD, table)]
DT
# header1 header2
#1: 1 3
#2: 5 3
chisq.test(DT)
# Pearson's Chi-squared test with Yates' continuity correction
#data: DT
#X-squared = 0.375, df = 1, p-value = 0.5403
#Warning message:
#In chisq.test(DT) : Chi-squared approximation may be incorrect
Data used in this answer:
df1 <- structure(list(header1 = structure(c(2L, 2L, 2L, 2L, 2L, 1L), .Label = c("candy",
"nocandy"), class = "factor"), header2 = structure(c(2L, 2L,
2L, 1L, 1L, 1L), .Label = c("candy", "nocandy"), class = "factor")), .Names = c("header1",
"header2"), row.names = c(NA, -6L), class = "data.frame")
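Since the question asks for a reusable function, the approach above could be wrapped roughly as follows. This is only a sketch: `count_matrix` is a hypothetical helper name, not part of base R. It uses `table()` on columns converted with `factor()` so that it also works when the columns are character vectors. That matters because, since R 4.0.0, `read.table()` no longer converts strings to factors by default, and `summary()` on a character column does not return counts.

```r
# Hypothetical helper (not part of base R): per-column level counts as a matrix.
count_matrix <- function(df, cols = names(df)) {
  # Use one shared set of levels so every column yields the same rows,
  # even if a level happens to be absent from one of the columns.
  lv <- sort(unique(unlist(lapply(df[cols], as.character))))
  sapply(df[cols], function(x) table(factor(x, levels = lv)))
}

# Same six rows as the sample data in the question
df1 <- data.frame(
  header1 = c("nocandy", "nocandy", "nocandy", "nocandy", "nocandy", "candy"),
  header2 = c("nocandy", "nocandy", "nocandy", "candy", "candy", "candy"),
  stringsAsFactors = FALSE
)

m <- count_matrix(df1, c("header1", "header2"))
m
```

The resulting matrix can be passed straight to `chisq.test(m)`; with only six rows it will warn that the approximation may be inaccurate, just as in the output above.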
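Answer 1 (score: 1)
It sounds like you want to treat the columns as independent samples. If so, this is probably not the best data structure. But you can do: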
#sample data
candytext<-read.table(text='header1 header2
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "candy"
"nocandy" "candy"
"candy" "candy"', header=T)
#summarize
do.call(cbind, lapply(candytext, table))
# header1 header2
# candy 1 3
# nocandy 5 3
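One way to make the "independent samples" reading explicit is to reshape the data into long format, with one column naming the group and one holding the outcome, and then cross-tabulate. A minimal sketch, assuming the two columns really are separate groups; the names `long`, `group`, and `outcome` are illustrative only and do not come from the original post:

```r
candytext <- read.table(text = 'header1 header2
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "candy"
"nocandy" "candy"
"candy" "candy"', header = TRUE)

# Stack the columns: each observation gets a group label (its column name)
# and an outcome; as.character() handles both factor and character columns.
long <- data.frame(
  group   = rep(names(candytext), each = nrow(candytext)),
  outcome = unlist(lapply(candytext, as.character), use.names = FALSE)
)

# Rows are outcomes, columns are groups: the same 2x2 layout as above
tab <- table(long$outcome, long$group)
tab
```

In this layout the group is an ordinary variable, so it would extend naturally to more than two groups or to groups of unequal size.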
Answer 2 (score: 1)
Try:
> dd = data.frame(sapply(candytext, summary))
> dd
header1 header2
candy 1 3
nocandy 5 3
> chisq.test(dd)
Pearson's Chi-squared test with Yates' continuity correction
data: dd
X-squared = 0.375, df = 1, p-value = 0.5403
Warning message:
In chisq.test(dd) : Chi-squared approximation may be incorrect
To select 2 columns from a data frame with many columns:
> cc = cbind(summary(candytext$header1), summary(candytext$header2))
> cc
[,1] [,2]
candy 1 3
nocandy 5 3
> chisq.test(cc)
Pearson's Chi-squared test with Yates' continuity correction
data: cc
X-squared = 0.375, df = 1, p-value = 0.5403
Warning message:
In chisq.test(cc) : Chi-squared approximation may be incorrect
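The warning above appears because the expected cell counts for the six-row sample are small. A minimal sketch of checking this and, as one common alternative for 2×2 tables with small counts, running Fisher's exact test with `fisher.test()`. This goes beyond what the answer itself proposes, and on the full data set from the question the chi-squared test may well be adequate:

```r
# Same counts as the six-row sample above
cc <- matrix(c(1, 5, 3, 3), nrow = 2,
             dimnames = list(c("candy", "nocandy"), c("header1", "header2")))

# Inspect the expected counts behind the warning (all are below 5 here)
res <- suppressWarnings(chisq.test(cc))
res$expected

# Fisher's exact test does not rely on the large-sample approximation
ft <- fisher.test(cc)
ft$p.value
```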
Used in the form below, table and summary give the same result:
> cbind(table(candytext$header1), table(candytext$header2))
[,1] [,2]
candy 1 3
nocandy 5 3
> cbind(summary(candytext$header1), summary(candytext$header2))
[,1] [,2]
candy 1 3
nocandy 5 3
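Relatedly, the per-column counts are simply the margins of the cross-tabulation that `table()` returns for the whole data frame, so they can also be recovered from that output. A minimal sketch using the counts reported in the question; the object names `xt` and `m` are illustrative:

```r
# Cross-tabulation reported in the question (rows: header1, columns: header2)
xt <- matrix(c(112, 4, 39, 82), nrow = 2,
             dimnames = list(header1 = c("candy", "nocandy"),
                             header2 = c("candy", "nocandy")))

# Row sums are the header1 counts; column sums are the header2 counts
m <- cbind(header1 = rowSums(xt), header2 = colSums(xt))
m

chisq.test(m)
```

With the real data, `xt` would presumably be `table(candytext)` itself rather than a hand-built matrix, since `rowSums()` and `colSums()` accept a two-way table directly.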