R

Time: 2019-11-17 17:16:46

Tags: r model linear-regression correlation categorical-data

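I want to analyze the association between categorical input variables and a binomial (0/1) response variable, but I'm not sure how to organize the data or which analysis would be appropriate.

Here is my data table (the variables are explained below):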

species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data<-data.frame(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss) #data.frame rather than cbind, so the 0/1 columns stay numeric and data$loss works
data #check data table

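Data table explanation

I list the individual species, and the following columns are their annotated feeding types. A 1 in a given column means yes and a 0 means no. Some species have several feeding types, while others have only one. The response variable I'm interested in is "loss", which indicates the loss of a trait. I'd like to know whether any of the feeding types predicts, or is associated with, the "loss" state.

Thoughts

I'm not sure whether there is a good way to include feeding type as a single categorical variable with multiple levels. I don't think I can organize it as one variable with values such as "scavenge", "dung", "pred", etc., because some species have more than one feeding type. That is why I split the feeding types into separate columns and coded each one as 1 (yes) or 0 (no). At the moment I'm thinking of trying a log-linear analysis, but the examples I have found don't use comparable data... I'd be glad of any suggestions.

Thanks very much for any help or for pointing me in the right direction!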

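1 answer:

Answer 0: (score: 1)

The sample is too small: you have 4 species with loss == 1 and 18 with loss == 0. You will run into problems fitting a full logistic regression (i.e. one that includes all the variables). I suggest using Fisher's exact test to test each feeding habit for association with loss: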

library(dplyr)
library(purrr)
library(tibble) # provides add_column()

# function for the fisher test: tabulates x against y and returns
# the p-value, the odds ratio and its confidence interval
FISHER <- function(x, y){
  FT <- fisher.test(table(x, y))

  data.frame(
    pvalue = FT$p.value,
    oddsratio = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}
# define variables to test
FEEDING <- c("scavenge","dung","pred","nectar","plant","blood","mushroom")
# we loop through and test association between each variable and "loss"
# (`data` must be a data.frame here; a cbind() matrix does not support data$loss)
results <- data[,FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)

You get one row of results for each feeding habit:

> results
       var      pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465    0.002943469       2.817560
2     dung 1.000000000 1.1582683    0.017827686      20.132849
3     pred 0.263157895 0.0000000    0.000000000       3.189217
4   nectar 0.535201640 0.0000000    0.000000000       5.503659
5    plant 0.002597403       Inf    2.780171314            Inf
6    blood 1.000000000 0.0000000    0.000000000      26.102285
7 mushroom 0.337662338 5.0498688    0.054241930     467.892765

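pvalue is the p-value from fisher.test. Roughly speaking, an odds ratio > 1 means the variable is positively associated with loss. Of all the variables, plant shows the strongest association, which you can check: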

> table(loss,plant)
    plant
loss  0  1
   0 18  0
   1  1  3
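
As a cross-check, the plant p-value can be reproduced from this table alone. The sketch below is a minimal illustration that types in the four counts by hand; the names `tab` and `ft` are made up for this example and are not part of the code above.

```r
# Counts copied from the loss-by-plant table:
# rows are loss = 0/1, columns are plant = 0/1
tab <- matrix(c(18, 0,
                 1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(loss = c("0", "1"), plant = c("0", "1")))

ft <- fisher.test(tab)
ft$p.value  # two-sided p-value for the 2x2 table

# With these margins (3 plant feeders, 4 losses, 22 species) the observed table
# is the least likely one, so the two-sided p-value is just its hypergeometric
# probability: all 3 plant feeders fall among the 4 species with loss = 1
choose(4, 3) * choose(18, 0) / choose(22, 3)
```

Both expressions should agree with the pvalue reported for plant in `results`.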

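All three species with plant = 1 have loss = 1, and they account for three of the four losses. So for your current dataset, I think this is the best you can do. You should collect a larger sample to see whether the association still holds.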
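
To make the sample-size concern from the start of this answer concrete, here is a minimal sketch that counts the events and fits a logistic regression on plant alone. It is an illustration under stated assumptions, not a recommended model: the two vectors are copied from the question, the variable names are chosen for this example, and `glm()` is used with a single predictor only to show the problem.

```r
# Vectors copied from the question: 1 = yes, 0 = no
loss  <- c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
plant <- c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)

n_events     <- sum(loss == 1)  # species with the trait lost
n_non_events <- sum(loss == 0)  # species without the loss
n_predictors <- 7               # the seven feeding types

# Events available per candidate predictor in a full model
events_per_predictor <- n_events / n_predictors

# Every plant feeder shows the loss, so the maximum-likelihood estimate of the
# plant coefficient is not finite. glm() may warn, and whatever estimate and
# standard error it reports for plant should not be interpreted.
# Warnings are suppressed only to keep this sketch quiet.
fit <- suppressWarnings(glm(loss ~ plant, family = binomial))
summary(fit)$coefficients
```

With fewer than one event per predictor and a predictor that separates the outcome this cleanly, testing each feeding type on its own with Fisher's exact test is the more defensible choice for these data.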