R Chi-Squared table format

Asked: 2014-10-14 03:35:16

Tags: r statistics

I have some data in the following format:

header1    header2
"nocandy"  "nocandy"
"nocandy"  "nocandy"
"nocandy"  "nocandy"
"nocandy"    "candy"
"nocandy"    "candy"
"candy"    "candy"
etc...

I imported it with candytext <- read.table("candytest.txt", header=TRUE). I want to run a chi-squared test to see whether the two groups differ. When I call table(candytext), I get a result like this:

         header2
header1   candy nocandy
  candy     112      39
  nocandy     4      82

But if I run summary(candytext), I get something like this:

    header1       header2   
 candy  :151   candy  :116  
 nocandy: 86   nocandy:121 

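As you can see, the two tables are formatted differently. I can run a chi-squared test on the first table but not on the second, yet the summary table is closer to what I need to pass to chisq.test(). The output of table(candytext) cross-tabulates the two columns row by row, which assumes the data are paired, but they are not. If they were paired that would be fine, and I could run McNemar's test on the output of table(candytext). So how can I create a 2×2 matrix that looks like the summary table without typing it in by hand? I realize I could copy the summary table into a matrix, but I would like to know how to do this in R with a function.

Thanks!

3 answers: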

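Answer 0 (score: 1)

Here I get the summary of each column of df1 with lapply, assuming the columns are of class factor. From the post, I guess that is the case. Calling do.call(data.frame, ...) on the resulting list converts it to a data.frame.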

  do.call(data.frame,lapply(df1, summary)) #in case a matrix output is needed, just replace `data.frame` with `cbind`
  #          header1 header2
  #candy         1       3
  #nocandy       5       3


  summary(df1)
  #   header1     header2 
  #candy  :1   candy  :3  
  #nocandy:5   nocandy:3  

If you only need selected columns out of the many in a dataset:

  nm1 <- paste0("header",1:2) #names of columns to do the summary
   do.call(`cbind`, lapply(df1[nm1], summary))
   #        header1 header2
   #candy         1       3
   #nocandy       5       3

You could also compute the summary with data.table:

  library(data.table)
  DT <- setDT(df1)[, lapply(.SD, summary)]   #or

 #DT <- setDT(df1)[, lapply(.SD, table)] 
  DT
   #    header1 header2
   #1:       1       3
   #2:       5       3

 chisq.test(DT)

 #    Pearson's Chi-squared test with Yates' continuity correction

  #data:  DT
  #X-squared = 0.375, df = 1, p-value = 0.5403

  #Warning message:
  #In chisq.test(DT) : Chi-squared approximation may be incorrect
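
The approach above relies on the columns being factors. As a minimal sketch, assuming R 4.0.0 or later (where read.table() no longer converts strings to factors by default), summary() on a character column reports Length/Class/Mode instead of counts, so the factor conversion has to be requested explicitly. The inline data below simply repeats the six rows from the question.

```r
# Same six rows as in the question, read from inline text.
# stringsAsFactors = TRUE makes the columns factors in any R version.
candytext <- read.table(text = 'header1 header2
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "candy"
"nocandy" "candy"
"candy" "candy"', header = TRUE, stringsAsFactors = TRUE)

# If the data frame was already read with character columns,
# converting in place has the same effect:
candytext[] <- lapply(candytext, factor)

# With factor columns, summary() returns per-level counts for each column.
counts <- do.call(cbind, lapply(candytext, summary))
counts
```

Using table() instead of summary() avoids the issue, since table() also counts character vectors.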

Data

df1 <- structure(list(header1 = structure(c(2L, 2L, 2L, 2L, 2L, 1L), .Label = c("candy", 
"nocandy"), class = "factor"), header2 = structure(c(2L, 2L, 
2L, 1L, 1L, 1L), .Label = c("candy", "nocandy"), class = "factor")), .Names = c("header1", 
"header2"), row.names = c(NA, -6L), class = "data.frame")

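Answer 1 (score: 1)

It sounds like you want to treat the columns as independent samples. If so, this probably isn't the best data structure. But you can do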

#sample data
candytext<-read.table(text='header1    header2
 "nocandy"  "nocandy"
 "nocandy"  "nocandy"
 "nocandy"  "nocandy"
 "nocandy"    "candy"
 "nocandy"    "candy"
 "candy"    "candy"', header=T)

#summarize
do.call(cbind, lapply(candytext, table))
#         header1 header2
# candy         1       3
# nocandy       5       3
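
One way to read the remark about data structure is that independent samples are often easier to handle in "long" form, with one column naming the group and one holding the observed value. The following is only a sketch of that idea; the names long, group and value are illustrative and do not appear in the original post.

```r
# Same six rows as in the question.
candytext <- read.table(text = 'header1 header2
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "nocandy"
"nocandy" "candy"
"nocandy" "candy"
"candy" "candy"', header = TRUE)

# Stack the columns: one row per observation, labelled by its source column.
long <- data.frame(
  group = rep(names(candytext), each = nrow(candytext)),
  value = unlist(lapply(candytext, as.character), use.names = FALSE)
)

# Cross-tabulating value by group gives the same 2x2 counts
# as the per-column summary.
tab <- table(long$value, long$group)
tab

# Same test as in the answers; with counts this small R warns that the
# chi-squared approximation may be incorrect.
res <- chisq.test(tab)
res
```

In this layout the pairing of rows in the original file plays no role, which matches the stated assumption that the two columns are independent samples.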

Answer 2 (score: 1)

Try:

> dd = data.frame(sapply(candytext, summary))
> dd
        header1 header2
candy         1       3
nocandy       5       3

> chisq.test(dd)                
        Pearson's Chi-squared test with Yates' continuity correction                                                    

data:  dd                                                                                                               
X-squared = 0.375, df = 1, p-value = 0.5403                                                                             

Warning message:                                                                                                        
In chisq.test(dd) : Chi-squared approximation may be incorrect                                                          
>                                                                               

To select 2 columns from a data frame that has more columns:

> cc = cbind(summary(candytext$header1), summary(candytext$header2))

> cc
        [,1] [,2]
candy      1    3
nocandy    5    3

> chisq.test(cc)

        Pearson's Chi-squared test with Yates' continuity correction

data:  cc
X-squared = 0.375, df = 1, p-value = 0.5403

Warning message:
In chisq.test(cc) : Chi-squared approximation may be incorrect

Used in the following form, table and summary give identical results:

> cbind(table(candytext$header1), table(candytext$header2))
        [,1] [,2]
candy      1    3
nocandy    5    3
> 
> cbind(summary(candytext$header1), summary(candytext$header2))
        [,1] [,2]
candy      1    3
nocandy    5    3
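
Since the question asks for a function, the steps shown in the answers could be wrapped up roughly as follows. This is a sketch only: column_counts is a hypothetical helper name, and fixing a shared set of levels is an added precaution (not part of the original answers) so that the rows still line up when a value never occurs in one of the columns.

```r
# Hypothetical helper: per-column counts as a matrix suitable for chisq.test().
column_counts <- function(df, cols = names(df)) {
  # Union of all observed values, so every column is counted
  # against the same set of levels in the same order.
  lv <- sort(unique(unlist(lapply(df[cols], as.character))))
  do.call(cbind, lapply(df[cols], function(x) {
    table(factor(as.character(x), levels = lv))
  }))
}

# Example with the six rows from the question.
candy_df <- data.frame(
  header1 = c("nocandy", "nocandy", "nocandy", "nocandy", "nocandy", "candy"),
  header2 = c("nocandy", "nocandy", "nocandy", "candy", "candy", "candy")
)

m <- column_counts(candy_df)
m

# The resulting matrix can be passed to chisq.test() directly.
res <- chisq.test(m)
res
```

The cols argument covers the case of picking two columns out of a wider data frame, for example column_counts(candy_df, c("header1", "header2")).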