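Correlation among multiple categorical variables

Time: 2016-08-08 17:44:18

Tags: r correlation categorical-data

Question updated!!

I have 15 columns of categorical variables and I want the correlation between them. The dataset has more than 20,000 rows and looks like this: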

state | job | hair_color | car_color | marital_status
NY    | cs  | brown      | blue      | s
FL    | mt  | black      | blue      | d
NY    | md  | blond      | white     | m
NY    | cs  | brown      | red       | s

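Note that the first and last rows repeat NY, cs and s. That is the kind of pattern I want to find: NY and cs are highly correlated. I need to rank the combinations of values across columns. The point is counting co-occurrences such as NY with cs, that is, finding how many times values like NY and blond appear in the same row. I need to do this for all values, row by row. I hope the question makes sense now.

I tried using cor() in R, but since these are categorical variables, the function does not work. How can I use this dataset to find the correlation between them?

1 answer:

Answer 0 (score: 0)

You can refer to Ways to calculate similarity. Suppose your data is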

d <- structure(list(state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("FL", 
"NY"), class = "factor"), job = structure(c(2L, 1L, 4L, 3L, 2L
), .Label = c("bs", "cs", "md", "mt"), class = "factor"), hair_color = structure(c(3L, 
3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"), 
    car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue", 
    "red", "white"), class = "factor"), marital_status = structure(c(3L, 
    1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")), .Names = c("state", 
"job", "hair_color", "car_color", "marital_status"), class = "data.frame", row.names = c(NA, 
-5L))

Data:

> d
  state job hair_color car_color marital_status
1    NY  cs      brown      blue              s
2    FL  bs      brown       red              d
3    FL  mt      black      blue              d
4    NY  md      blond     white              m
5    NY  cs      brown       red              s

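We can compute the "dissimilarity" between observations:

library(cluster)
# All columns are factors, so daisy() switches to Gower's distance
# automatically, whatever metric is requested (hence "Metric : mixed" in the output).
daisy(d, metric = "euclidean")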

Output:

> daisy(d, metric = "euclidean")
Dissimilarities :
    1   2   3   4
2 0.8            
3 0.8 0.6        
4 0.8 1.0 1.0    
5 0.2 0.6 1.0 0.8

Metric :  mixed ;  Types = N, N, N, N, N 
Number of objects : 5
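
When every column is nominal and nothing is missing, the Gower dissimilarity reported here reduces to the share of columns in which two rows differ. A minimal base-R sketch to check individual entries by hand, assuming the same five rows as in the answer (the helper name mismatch_share is made up for illustration):

```r
# Same five observations as in the answer, rebuilt as plain character columns.
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fraction of columns where rows i and j disagree.
mismatch_share <- function(df, i, j) {
  mean(unlist(df[i, ]) != unlist(df[j, ]))
}

mismatch_share(d, 1, 5)  # rows 1 and 5 differ only in car_color
mismatch_share(d, 2, 4)  # rows 2 and 4 differ in every column
```

The two calls correspond to the entries 0.2 (rows 1 and 5) and 1.0 (rows 2 and 4) in the matrix.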

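This tells us that observations 1 and 5 are the most similar (lowest dissimilarity, 0.2). With many observations it is clearly not feasible to inspect the dissimilarity matrix by eye, but we can pick out the pairs that fall below some threshold, e.g.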

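out <- daisy(d, metric = "euclidean")
# Enumerate row pairs in the same order as the lower triangle stored in `out`
pairs <- expand.grid(2:5, 1:4)
pairs <- pairs[pairs[,1] > pairs[,2],]  # keep only Var1 > Var2 (10 pairs, one per entry of `out`)
similars <- pairs[which(out<.8),]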

With a threshold of 0.8, this gives

> similars
  Var1 Var2
4    5    1
6    3    2
8    5    2
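
The dissimilarities above compare whole rows. For the value-pair counts the question describes (how often NY and cs share a row), one possible base-R sketch is to cross-tabulate every pair of columns and rank the combinations by frequency. This is not part of the original answer; the helper name pair_counts and the output column names are made up for illustration, and the data are the five example rows from the answer.

```r
# Same five observations as in the answer, as plain character columns.
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how often each value of column `a`
# appears in the same row as each value of column `b`.
pair_counts <- function(df, a, b) {
  tab <- as.data.frame(table(df[[a]], df[[b]]), stringsAsFactors = FALSE)
  names(tab) <- c("value_a", "value_b", "n")
  tab <- tab[tab$n > 0, ]          # drop combinations that never occur
  rownames(tab) <- NULL
  data.frame(col_a = a, col_b = b, tab, stringsAsFactors = FALSE)
}

# Every unordered pair of columns (105 pairs for 15 columns).
col_pairs <- combn(names(d), 2, simplify = FALSE)

all_counts <- do.call(
  rbind,
  lapply(col_pairs, function(p) pair_counts(d, p[1], p[2]))
)

# Rank value combinations, most frequent first.
all_counts <- all_counts[order(-all_counts$n), ]
head(all_counts)
```

Raw counts favour values that are common overall, so on the full dataset it may be worth comparing each count with what the marginal frequencies alone would predict.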