I have 15 columns of categorical variables and I want the correlations between them. The dataset has more than 20,000 rows and looks like this:
state | job | hair_color | car_color | marital_status
NY | cs | brown | blue | s
FL | mt | black | blue | d
NY | md | blond | white | m
NY | cs | brown | red | s
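Note that NY, cs and s repeat in the first and last rows. I want to find that kind of pattern: NY and cs are highly correlated. I need to rank the combinations of values across columns. To be clear, this is not about counting NY or cs on their own. It is about counting how many times NY and blond appear in the same row, and I need to do this for all values, row by row. I hope the question makes sense now.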
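I tried using cor() in R, but because these are categorical variables the function does not work. How can I find the correlations between them in this dataset?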
Answer 0 (score: 0)
You can refer to "Ways to calculate similarity". Suppose your data is
d <- structure(list(state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("FL",
"NY"), class = "factor"), job = structure(c(2L, 1L, 4L, 3L, 2L
), .Label = c("bs", "cs", "md", "mt"), class = "factor"), hair_color = structure(c(3L,
3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"),
car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue",
"red", "white"), class = "factor"), marital_status = structure(c(3L,
1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")), .Names = c("state",
"job", "hair_color", "car_color", "marital_status"), class = "data.frame", row.names = c(NA,
-5L))
Data:
> d
  state job hair_color car_color marital_status
1    NY  cs      brown      blue              s
2    FL  bs      brown       red              d
3    FL  mt      black      blue              d
4    NY  md      blond     white              m
5    NY  cs      brown       red              s
We can compute the "dissimilarity" between observations:
library(cluster)
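# every column is a factor, so daisy() falls back to Gower's coefficient
# (reported as "mixed") whatever the metric argument says
daisy(d, metric = "euclidean")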
Output:
> daisy(d, metric = "euclidean")
Dissimilarities :
    1   2   3   4
2 0.8
3 0.8 0.6
4 0.8 1.0 1.0
5 0.2 0.6 1.0 0.8

Metric : mixed ; Types = N, N, N, N, N
Number of objects : 5
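When every column is categorical, Gower's dissimilarity reduces to the fraction of columns in which two rows disagree. The sketch below checks two entries of the matrix by hand. It rebuilds the example data with data.frame() so it runs on its own, and the helper name mismatch_fraction is made up for illustration.

```r
# Rebuild the example data so this block is self-contained.
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: share of columns in which rows i and j disagree.
mismatch_fraction <- function(df, i, j) {
  m <- as.matrix(df)   # character matrix, one row per observation
  mean(m[i, ] != m[j, ])
}

mismatch_fraction(d, 1, 5)   # rows 1 and 5 differ only in car_color
mismatch_fraction(d, 2, 4)   # rows 2 and 4 differ in every column
```

The two calls should reproduce the 0.2 and 1.0 entries of the matrix printed above.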
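The dissimilarity matrix tells us that observations 1 and 5 are the most similar (dissimilarity 0.2). With many observations it is not feasible to inspect the matrix by eye, but we can pick out the pairs below some threshold, for example: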
out <- daisy(d, metric = "euclidean")
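# enumerate (row, column) index pairs and keep only the lower triangle (Var1 > Var2),
# so the rows of pairs line up with the order in which the dist object stores its values
pairs <- expand.grid(2:5, 1:4)
pairs <- pairs[pairs[,1] > pairs[,2],]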
similars <- pairs[which(out<.8),]
With a threshold of 0.8, this gives
> similars
  Var1 Var2
4    5    1
6    3    2
8    5    2
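The question also asks how often two values, such as NY and cs, occur in the same row. That can be counted directly with table() on each pair of columns. The following is a minimal sketch under that reading of the question, not part of the approach above. The helper name cooccurrence_counts and its output column names are made up for illustration, and the example data is rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained.
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every pair of columns, count how often each
# pair of values appears in the same row, then rank by that count.
cooccurrence_counts <- function(df) {
  col_pairs <- combn(names(df), 2, simplify = FALSE)
  counts <- lapply(col_pairs, function(p) {
    # two-way contingency table, flattened to one row per value pair
    tab <- as.data.frame(table(df[[p[1]]], df[[p[2]]]),
                         stringsAsFactors = FALSE)
    data.frame(col1 = p[1], value1 = tab$Var1,
               col2 = p[2], value2 = tab$Var2,
               n = tab$Freq, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, counts)
  out <- out[out$n > 0, ]      # drop combinations that never occur
  out[order(-out$n), ]         # most frequent combinations first
}

ranked <- cooccurrence_counts(d)
head(ranked)
```

With 15 columns this loops over 105 column pairs, which should be manageable for 20,000 rows. The count n can be divided by nrow(d) to get a relative frequency if that is easier to compare.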