Question

For example, I have a list of lists of words:

[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes','orange'],
 ['potatoes','rice']]

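The real set is much larger. I want to put words that usually occur together into the same cluster. So in this case the clusters would be ['apple', 'banana', 'orange'] and ['rice', 'potatoes'].
What is the best way to achieve this kind of clustering?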

Answer 1

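I think it is more natural to treat the problem as a graph.

For example, you can treat apple as node 0 and banana as node 1; the first list then indicates an edge between nodes 0 and 1.

So first convert the labels to numbers: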

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit(['apple','banana','orange','rice','potatoes'])
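To make the numbering concrete, here is a minimal standard-library sketch of the mapping that LabelEncoder produces: the unique labels are sorted and numbered from 0. The names word_to_id and id_to_word are illustrative only and are not part of scikit-learn.

```python
# Stand-in for LabelEncoder: unique labels, sorted, numbered from 0.
words = ['apple', 'banana', 'orange', 'rice', 'potatoes']

word_to_id = {word: idx for idx, word in enumerate(sorted(set(words)))}
id_to_word = {idx: word for word, idx in word_to_id.items()}

print(word_to_id)
# {'apple': 0, 'banana': 1, 'orange': 2, 'potatoes': 3, 'rice': 4}
```

Because the labels are sorted, potatoes gets 3 and rice gets 4, even though rice comes first in the input list.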

Now:

l=[['apple','banana'],
 ['apple','orange'],
 ['banana','orange'],
 ['rice','potatoes'], #I deleted orange as edge is between 2 points, you can  transform the triple to 3 pairs or think of different solution
 ['potatoes','rice']]
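Instead of deleting orange, the triple can be expanded into its three pairs, as the comment suggests. A minimal sketch using itertools.combinations; the helper name to_pairs is hypothetical:

```python
from itertools import combinations

groups = [['apple', 'banana'],
          ['apple', 'orange'],
          ['banana', 'orange'],
          ['rice', 'potatoes', 'orange'],
          ['potatoes', 'rice']]

def to_pairs(group):
    """Return every unordered pair of distinct words in one group."""
    return list(combinations(sorted(set(group)), 2))

# Flatten all groups into one list of two-word edges.
pairs = [pair for group in groups for pair in to_pairs(group)]

print(to_pairs(['rice', 'potatoes', 'orange']))
# [('orange', 'potatoes'), ('orange', 'rice'), ('potatoes', 'rice')]
```

Each resulting pair can then be encoded and added as an edge in the same way as the two-word lists.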

Convert the labels to numbers:

edges=[le.transform(x) for x in l]

>>edges

[array([0, 1], dtype=int64),
array([0, 2], dtype=int64),
array([1, 2], dtype=int64),
array([4, 3], dtype=int64),
array([3, 4], dtype=int64)]

Now build the graph and add the edges:

import networkx as nx #graphs package
G=nx.Graph() #create the graph and add edges
for e in edges:
    G.add_edge(e[0],e[1])

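Now you can use the connected_components function to find the connected vertices. (The connected_component_subgraphs function from older NetworkX releases was removed in NetworkX 2.4.)

components = nx.connected_components(G) #yields one set of nodes per connected component
comp_dict = {idx: sorted(int(n) for n in comp) for idx, comp in enumerate(components)}
print(comp_dict)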

Output:

{0: [0, 1, 2], 1: [3, 4]}

Or:

print([le.inverse_transform(v) for v in comp_dict.values()])

Output:

[array(['apple', 'banana', 'orange']), array(['potatoes', 'rice'])]

which are your 2 clusters.
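For readers without scikit-learn or NetworkX at hand, the same connected-components idea can be sketched with a small union-find over the words themselves. This swaps in a different technique from the answer's NetworkX code; the function name connected_word_groups is hypothetical.

```python
def connected_word_groups(word_lists):
    """Group words that are linked, directly or indirectly, by sharing a list."""
    parent = {}

    def find(word):
        # Follow parent links to the representative, shortening the path as we go.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for words in word_lists:
        if not words:
            continue
        root = find(words[0])
        for other in words[1:]:
            parent[find(other)] = root  # merge the two groups

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), []).append(word)
    return sorted(sorted(group) for group in groups.values())

trimmed = [['apple', 'banana'], ['apple', 'orange'], ['banana', 'orange'],
           ['rice', 'potatoes'], ['potatoes', 'rice']]
original = trimmed[:3] + [['rice', 'potatoes', 'orange'], ['potatoes', 'rice']]

print(connected_word_groups(trimmed))
# [['apple', 'banana', 'orange'], ['potatoes', 'rice']]
print(connected_word_groups(original))
# [['apple', 'banana', 'orange', 'potatoes', 'rice']]
```

The second call shows the limitation of plain connected components: once orange is kept in the rice and potatoes list, a single shared word is enough to merge everything into one group.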

Answer 2

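Looking for frequent itemsets would make more sense.

If you cluster such short word sets, everything will typically be connected at just a few levels: nothing in common, one element in common, two in common. That is too coarse to be used for clustering. You will get either everything connected or nothing connected, and the results may be highly sensitive to data changes and ordering.

So abandon the paradigm of partitioning the data, and look for frequent combinations instead.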
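As a rough illustration of the idea, the simplest frequent itemsets are pairs that co-occur in at least some minimum number of lists. The sketch below counts pairs with the standard library only; the threshold min_support = 2 is an assumption chosen for this toy data, and a real analysis would likely use a dedicated algorithm such as Apriori or FP-Growth.

```python
from collections import Counter
from itertools import combinations

baskets = [['apple', 'banana'],
           ['apple', 'orange'],
           ['banana', 'orange'],
           ['rice', 'potatoes', 'orange'],
           ['potatoes', 'rice']]

# Count in how many lists each unordered pair of words appears together.
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(set(basket)), 2))

min_support = 2  # assumed threshold: keep pairs seen in at least 2 lists
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}

print(frequent_pairs)
# {('potatoes', 'rice'): 2}
```

On this tiny sample only rice and potatoes pass the threshold; with the much larger real data set, the counts should separate habitual combinations from one-off ones more clearly.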

Answer 3

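So, after a lot of googling, I found that I can't really use clustering techniques, because I lack feature variables on which the words could be clustered. If I build a table recording how often each word occurs with every other word (in effect a Cartesian product), what I get is really an adjacency matrix, and clustering does not work well on it.

So the solution I was looking for is graph community detection. I used the igraph library (or rather its Python wrapper, python-igraph) to find the clusters, and it runs very well and fast.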
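The answer does not say which igraph algorithm was used, so the following does not reproduce it. It is only a standard-library sketch of modularity, the score that several community detection methods try to maximize, applied to hand-picked candidate partitions. Using co-occurrence counts as edge weights is an assumption, and the function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

word_lists = [['apple', 'banana'],
              ['apple', 'orange'],
              ['banana', 'orange'],
              ['rice', 'potatoes', 'orange'],
              ['potatoes', 'rice']]

# Weighted edges: number of lists containing both words (assumed weighting).
weights = Counter()
for words in word_lists:
    weights.update(combinations(sorted(set(words)), 2))

def modularity(partition):
    """Weighted modularity of a partition, given as a list of word groups."""
    total = sum(weights.values())
    community_of = {w: i for i, group in enumerate(partition) for w in group}
    internal = Counter()  # edge weight inside each community
    degree = Counter()    # summed weighted degree of each community
    for (a, b), w in weights.items():
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)

fruit_split = [['apple', 'banana', 'orange'], ['rice', 'potatoes']]
other_split = [['apple', 'banana'], ['orange', 'rice', 'potatoes']]
one_group = [['apple', 'banana', 'orange', 'rice', 'potatoes']]

for name, part in [('fruit_split', fruit_split),
                   ('other_split', other_split),
                   ('one_group', one_group)]:
    print(name, round(modularity(part), 4))
# fruit_split 0.2041
# other_split 0.1224
# one_group 0.0
```

Unlike plain connected components, this score prefers the split the question asks for even though orange also appears alongside rice and potatoes.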

More info:

A similar question: https://stats.stackexchange.com/questions/142297/finding-natural-groups-clusters-in-an-undirected-graph-over-several-undirect
Community detection in graphs: https://arxiv.org/pdf/0906.0612.pdf
A basic description of the various algorithms: What are the differences between community detection algorithms in igraph?
