Question

I have a list in which each element is a vector of text attributes.

> list
[[1]]
[1] "attribute 1"
[2] "attribute 2"
[3] "attribute 3"

[[2]]
[1] "attribute 4"
[2] "attribute 5"
[3] "attribute 6"

[[3]]
[1] "attribute 1"
[2] "attribute 2"

[[4]]
[1] "attribute 4"
[2] "attribute 5"
[3] "attribute 6"

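What is the simplest classification or clustering algorithm I can apply to group these elements by the similarity of their text attributes?
The result should look like this: class 1 is [1, 3] and class 2 is [2, 4].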

Answer 1

Idea

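You can use hclust on a distance matrix. To do so, first convert your data to a matrix, compute the pairwise distances, and then run hierarchical clustering on that distance matrix.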

Code

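# Example data: four sets of text attributes
l <- list(paste("attribute", 1:3),
          paste("attribute", 4:6),
          paste("attribute", 1:2),
          paste("attribute", 4:6))
# All distinct attributes that occur anywhere in the list
allElem <- sort(unique(unlist(l)))
# Incidence matrix: one row per list element, one column per attribute (1 = present, 0 = absent)
incidM <- do.call(rbind, lapply(l, function(x) as.numeric(allElem %in% x)))
colnames(incidM) <- allElem
rownames(incidM) <- paste("Set", seq_len(NROW(incidM)))
# Distances between rows (dist defaults to Euclidean), then hierarchical clustering
dM <- dist(incidM)
hc <- hclust(dM)
# Draw the dendrogram
plot(hc)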

Explanation

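First, you create a matrix whose rows correspond to the elements of the list and whose columns correspond to the unique attribute values in the list. Each entry is 1 if the corresponding list element contains that attribute and 0 otherwise.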

incidM
#       attribute 1 attribute 2 attribute 3 attribute 4 attribute 5 attribute 6
# Set 1           1           1           1           0           0           0
# Set 2           0           0           0           1           1           1
# Set 3           1           1           0           0           0           0
# Set 4           0           0           0           1           1           1

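Then you compute the distance matrix between the rows and run hierarchical clustering on it. Finally, you can plot the result, and you will see that Sets 1 & 3 are similar, as are Sets 2 & 4.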
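The dendrogram shows the grouping visually, but the question asks for explicit class labels. One way to get them, as a minimal sketch rather than part of the original answer, is to cut the tree with cutree. The choice k = 2 is an assumption taken from the two classes the question expects, and the names groups and classes are introduced here for illustration only.

```r
# Rebuild the incidence matrix and the clustering from the answer
l <- list(paste("attribute", 1:3),
          paste("attribute", 4:6),
          paste("attribute", 1:2),
          paste("attribute", 4:6))
allElem <- sort(unique(unlist(l)))
incidM <- do.call(rbind, lapply(l, function(x) as.numeric(allElem %in% x)))
rownames(incidM) <- paste("Set", seq_len(NROW(incidM)))
hc <- hclust(dist(incidM))

# Cut the tree into k groups (k = 2 is assumed from the question)
groups <- cutree(hc, k = 2)

# List the indices of the list elements that fall into each group
classes <- split(seq_along(l), groups)
print(classes)
```

With this data, the first class contains elements 1 and 3 and the second contains elements 2 and 4. If the number of classes is not known in advance, cutree also accepts a height threshold through its h argument instead of k.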
