我有一个在线卡片分类活动的数据集。向参与者提供随机的卡片子集(来自更大的一组),并要求他们创建他们认为彼此相似的卡片组。参与者能够创建他们喜欢的任意数量的组,并根据他们想要的名称命名组。
示例数据集如下所示:
Data <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), Card = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 2L, 3L, 5L, 7L, 9L, 10L, 11L, 12L, 13L, 14L,
1L, 3L, 4L, 5L, 6L, 7L, 8L, 12L, 13L, 14L), .Label = c("A", "B",
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N"), class = "factor"),
Group = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 3L, 5L, 2L, 5L,
1L, 2L, 1L, 3L, 1L, 4L, 4L, 2L, 3L, 1L, 1L, 2L, 1L, 2L, 3L,
2L, 1L, 2L, 2L, 3L), .Label = c("Cat1", "Cat2", "Cat3", "Cat4",
"Cat5"), class = "factor")), .Names = c("Subject", "Card",
"Group"), class = "data.frame", row.names = c(NA, -30L))
根据这些数据,我想创建一个相似度矩阵,理想情况下,将项目组合在一起的总计数的比例或百分比。
像这样:
计数:
A B C D E F G H I J K L M N
A 0 0 1 1 0 0 1 0 0 0 0 0 0
B 0 0 0 1 0 0 0 2 0 0 0 0 1
C 0 0 0 0 1 2 0 0 0 0 2 1 0
D 1 0 0 0 0 0 1 0 0 0 0 0 0
E 1 1 0 0 0 1 0 1 0 0 1 1 1
F 0 0 1 0 0 1 0 0 0 0 0 0 1
G 0 0 2 0 1 1 0 0 0 0 1 2 0
H 1 0 0 1 0 0 0 0 1 0 0 0 0
I 0 2 0 0 1 0 0 0 0 0 0 0 1
J 0 0 0 0 0 0 0 1 0 1 0 0 0
K 0 0 0 0 0 0 0 0 0 1 0 0 0
L 0 0 2 0 1 0 1 0 0 0 0 1 0
M 0 0 1 0 1 0 2 0 0 0 0 1 0
N 0 1 0 0 1 1 0 0 1 0 0 0 0
每个主题都以不同的方式命名其组,因此无法按组进行索引。
除了计数之外,我还想生成一个相似性矩阵,该矩阵报告参与者的百分比,这些参与者被提交了一对特定的Cards
,将这两个Cards
分组一起。
从示例数据集中,结果如下:
A B C D E F G H I J K L M N
A 0 0 50 50 0 0 50 0 0 0 0 0 0
B 0 0 0 50 0 0 0 100 0 0 0 0 100
C 0 0 0 0 50 67 0 0 0 0 100 50 0
D 50 0 0 0 0 0 50 0 0 0 0 0 0
E 50 50 33 0 0 33 0 50 0 0 33 50 50
F 0 0 50 0 0 50 0 0 0 0 0 0 100
G 0 0 67 0 33 50 0 0 0 0 50 100 0
H 50 0 0 50 0 0 0 0 100 0 0 0 0
I 0 100 0 0 50 0 0 0 0 0 0 0 100
J 0 0 0 0 0 0 0 100 0 100 0 0 0
K 0 0 0 0 0 0 0 0 0 100 0 0 0
L 0 0 100 0 33 0 50 0 0 0 0 50 0
M 0 0 50 0 50 0 100 0 0 0 0 50 0
N 0 100 0 0 50 100 0 0 100 0 0 0 0
任何建议都将不胜感激!
编辑: 虽然以下答案适用于示例数据。它似乎不适用于我在此处发布的实际数据:https://www.dropbox.com/s/mhqwyok0nmvt3g9/Sim_Example.csv?dl=0
例如,在那些数据中,我手动计算了22对&#34;飞机&#34;和#34;机场&#34;,这将是~55%。但下面的答案得出的数字为12%和60%
答案 0 :(得分:1)
根据OP的要求澄清
编辑解决方案步骤1.处理数据以创建卡对&amp;他们是否被任何用户组合在一起:
library(tidyverse); library(data.table)
Data.matrix <- Data %>%
# convert data into list of data frames by subject
split(Data$Subject) %>%
# for each subject, we create all pair combinations based on the subset cards he
# received, & note down whether he grouped the pair into the same group
# (assume INTERNAL group naming consistency. i.e. if subject 1 uses group names such
# as "cat", "dog", "rat", they are all named exactly so, & we don't worry about
# variations / typos such as "cat1.5", "dgo", etc.)
lapply(function(x){
data.frame(V1 = t(combn(x$Card, 2))[,1],
V2 = t(combn(x$Card, 2))[,2],
G1 = x$Group[match(t(combn(x$Card, 2))[,1], x$Card)],
G2 = x$Group[match(t(combn(x$Card, 2))[,2], x$Card)],
stringsAsFactors = FALSE) %>%
mutate(co.occurrence = 1,
same.group = G1==G2) %>%
select(-G1, -G2)}) %>%
# combine the list of data frames back into one, now that we don't worry about group
# names, & calculate the proportion of times each pair is assigned the same group,
# based on the total number of times they occurred together in any subject's
# subset.
rbindlist() %>%
rowwise() %>%
mutate(V1.sorted = min(V1, V2),
V2.sorted = max(V1, V2)) %>%
ungroup() %>%
group_by(V1.sorted, V2.sorted) %>%
summarise(co.occurrence = sum(co.occurrence),
same.group = sum(same.group)) %>%
ungroup() %>%
rename(V1 = V1.sorted, V2 = V2.sorted) %>%
mutate(same.group.perc = same.group/co.occurrence * 100) %>%
# now V1 ranges from A:M, where V2 ranges from B:N. let's complete all combinations
mutate(V1 = factor(V1, levels = sort(unique(Data$Card))),
V2 = factor(V2, levels = sort(unique(Data$Card)))) %>%
complete(V1, V2, fill = list(NA))
> Data.matrix
# A tibble: 196 x 5
V1 V2 co.occurrence same.group same.group.perc
<fctr> <fctr> <dbl> <int> <dbl>
1 A A NA NA NA
2 A B 1 0 0
3 A C 2 0 0
4 A D 2 1 50
5 A E 2 1 50
6 A F 2 0 0
7 A G 2 0 0
8 A H 2 1 50
9 A I 1 0 0
10 A J 1 0 0
# ... with 186 more rows
# same.group is the number of times a card pair has been grouped together.
# same.group.perc is the percentage of users who grouped the card pair together.
步骤2.为count&amp; amp;创建单独的矩阵。百分比:
# spread count / percentage respectively into wide form
Data.count <- Data.matrix %>%
select(V1, V2, same.group) %>%
spread(V2, same.group, fill = 0) %>%
remove_rownames() %>%
column_to_rownames("V1") %>%
as.matrix()
Data.perc <- Data.matrix %>%
select(V1, V2, same.group.perc) %>%
spread(V2, same.group.perc, fill = 0) %>%
remove_rownames() %>%
column_to_rownames("V1") %>%
as.matrix()
步骤3.将上三角矩阵转换为对称矩阵(注意:我刚刚发现了一个更短且更整洁的解决方案here):
# fill up lower triangle to create symmetric matrices
Data.count[lower.tri(Data.count)] <- t(Data.count)[lower.tri(t(Data.count))]
Data.perc[lower.tri(Data.perc)] <- t(Data.perc)[lower.tri(t(Data.perc))]
# ALTERNATE to previous step
Data.count <- pmax(Data.count, t(Data.count))
Data.perc <- pmax(Data.perc, t(Data.perc))
步骤4.摆脱对角线,因为没有必要将卡与自身配对:
# convert diagonals to NA since you don't really need them
diag(Data.count) <- NA
diag(Data.perc) <- NA
步骤5.验证结果:
> Data.count
A B C D E F G H I J K L M N
A NA 0 0 1 1 0 0 1 0 0 0 0 0 0
B 0 NA 0 0 1 0 0 0 2 0 0 0 0 1
C 0 0 NA 0 1 1 2 0 0 0 0 2 1 0
D 1 0 0 NA 0 0 0 1 0 0 0 0 0 0
E 1 1 1 0 NA 0 1 0 1 0 0 1 1 1
F 0 0 1 0 0 NA 1 0 0 0 0 0 0 1
G 0 0 2 0 1 1 NA 0 0 0 0 1 2 0
H 1 0 0 1 0 0 0 NA 0 1 0 0 0 0
I 0 2 0 0 1 0 0 0 NA 0 0 0 0 1
J 0 0 0 0 0 0 0 1 0 NA 1 0 0 0
K 0 0 0 0 0 0 0 0 0 1 NA 0 0 0
L 0 0 2 0 1 0 1 0 0 0 0 NA 1 0
M 0 0 1 0 1 0 2 0 0 0 0 1 NA 0
N 0 1 0 0 1 1 0 0 1 0 0 0 0 NA
> Data.perc
A B C D E F G H I J K L M N
A NA 0 0 50 50 0 0 50 0 0 0 0 0 0
B 0 NA 0 0 50 0 0 0 100 0 0 0 0 100
C 0 0 NA 0 33 50 67 0 0 0 0 100 50 0
D 50 0 0 NA 0 0 0 50 0 0 0 0 0 0
E 50 50 33 0 NA 0 33 0 50 0 0 50 50 50
F 0 0 50 0 0 NA 50 0 0 0 0 0 0 100
G 0 0 67 0 33 50 NA 0 0 0 0 50 100 0
H 50 0 0 50 0 0 0 NA 0 100 0 0 0 0
I 0 100 0 0 50 0 0 0 NA 0 0 0 0 100
J 0 0 0 0 0 0 0 100 0 NA 100 0 0 0
K 0 0 0 0 0 0 0 0 0 100 NA 0 0 0
L 0 0 100 0 50 0 50 0 0 0 0 NA 50 0
M 0 0 50 0 50 0 100 0 0 0 0 50 NA 0
N 0 100 0 0 50 100 0 0 100 0 0 0 0 NA