Question

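I am trying to run a cluster analysis on a dataset but cannot get any useful insight out of it. Example: I have a set of 50 variables (rows) across 100 resources (columns). Each resource has some variables as strengths and others as weaknesses. I coded strengths as 1 and weaknesses as 2. Because each resource may have only 10 variables as strengths and 5 as weaknesses, the remaining variables are coded as zero. Now I want to find clusters of resources that share strengths and weaknesses.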

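I transposed the dataset so that the resources are in rows, then used hierarchical clustering and k-means. The k-means plot showed too much overlap between the clusters, so I used hierarchical clustering only. I also replaced 1 (strength) with +10 and 2 (weakness) with -10 to see whether the clustering algorithms would react differently, but that still did not help much.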

Any input on improving this approach, or on alternative ways to tackle the problem?

Thanks a lot!

Answer 1

The following code should help you create dummy/binary variables.

# setting.g is assumed to be a vector of labels such as "Strength" / "Weakness"
settingStrength <- as.numeric(setting.g == "Strength")
settingWeakness <- as.numeric(setting.g == "Weakness")
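
If the data are stored as described in the question (0 = not mentioned, 1 = strength, 2 = weakness), the same idea can be applied to the whole matrix at once. The following is only a minimal sketch: the matrix `x`, its size, and the `_S` / `_W` column suffixes are hypothetical and not taken from the question's data.

```r
set.seed(1)
# Hypothetical data: 6 resources (rows) x 4 variables (columns), coded 0/1/2
x <- matrix(sample(0:2, 24, replace = TRUE), nrow = 6,
            dimnames = list(paste0("res", 1:6), paste0("var", 1:4)))

# One 0/1 indicator matrix per level
strength <- (x == 1) * 1
weakness <- (x == 2) * 1
colnames(strength) <- paste0(colnames(x), "_S")
colnames(weakness) <- paste0(colnames(x), "_W")

# Resources stay in rows; each variable becomes two dummy columns
dummies <- cbind(strength, weakness)
dim(dummies)
```

Coding the two levels as separate 0/1 columns avoids treating "weakness" (2) as twice a "strength" (1), which is what a distance computed on the raw codes would do.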

I ran the hierarchical clustering on columns 3 and 4 of your dataset. You cannot cluster 100 dimensions and plot them in a two-dimensional plot; you have to reduce the dimensionality first. You are right to choose hierarchical clustering, because k-means requires that you know the number of clusters in advance, and you do not.

CLUSTER <- hclust(dist(YOURDATASET[, 3:4]))
plot(CLUSTER)
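
As a possible variation, the tree can also be built from a distance meant for 0/1 data and then cut into groups. This is a sketch under assumptions: the `dummies` matrix is random illustration data, and both the average linkage and the choice of 3 groups are arbitrary.

```r
set.seed(1)
# Hypothetical 0/1 dummy matrix: 20 resources x 15 dummy columns
dummies <- matrix(rbinom(20 * 15, 1, 0.3), nrow = 20)

# "binary" in stats::dist is a Jaccard-type distance: among the positions
# where at least one record is 1, the share where only one of them is 1
d <- dist(dummies, method = "binary")
hc <- hclust(d, method = "average")

# Cut the tree into 3 groups (3 is an arbitrary choice for illustration)
groups <- cutree(hc, k = 3)
table(groups)
```

`plot(hc)` draws the dendrogram from the distance matrix alone, so it does not depend on how many columns went into `dist()`.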

However, if you use k-means, you should not rely on the plot alone. In the following I have chosen three clusters.

KMEANSCLUSTER <- kmeans(YOURDATASET[, 3:4], 3)
KMEANSCLUSTER$cluster

Now you should see a vector of length 15 (the length of your data) with the values 1, 2 and 3. Each value indicates whether the corresponding row belongs to cluster "1", cluster "2" or cluster "3".
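
Beyond the label vector, the fitted object holds a few other components worth checking. A minimal sketch on made-up data: `dat` is hypothetical, and `nstart = 25` is just a common way to reduce the dependence on the random start.

```r
set.seed(42)
# Hypothetical data: 15 rows, 2 numeric columns
dat <- matrix(rnorm(30), ncol = 2)

# nstart > 1 reruns k-means from several random starts and keeps the best fit
km <- kmeans(dat, centers = 3, nstart = 25)

km$cluster         # cluster label (1, 2 or 3) for each of the 15 rows
table(km$cluster)  # cluster sizes
km$centers         # cluster means in the original columns
```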

Answer 2

The problem with clustering binary data (as well as low-cardinality and dummy-coded categorical data) is that it is binary information.

Methods such as k-means are designed for continuous variables, where the mean is meaningful and almost every distance is unique.

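With binary data, everything changes at once. You have many duplicate records. You have records that differ in 1 position, in 2 positions, and so on. In your case they can differ in at most 30 positions, so you have 31 levels of similarity.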
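
The limited number of similarity levels is easy to check. A small sketch, assuming hypothetical records with 30 binary columns (random illustration data, not the question's):

```r
set.seed(1)
# Hypothetical data: 100 records x 30 binary columns
b <- matrix(rbinom(100 * 30, 1, 0.3), nrow = 100)

# On 0/1 data, Manhattan distance counts the positions where two records differ
d <- dist(b, method = "manhattan")

length(d)                     # 4950 pairwise distances
length(unique(as.vector(d)))  # at most 31 distinct values (0 to 30)
```

With thousands of pairs sharing a few dozen distance values, ties are everywhere, and the order in which a clustering algorithm breaks them can shape the result.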

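The solution is usually to move from clustering to an itemset mining view. This is not fundamentally different from clustering, but it starts from the binary assumption: an item is either present in a transaction or it is not. The resulting itemsets then correspond to frequent combinations, e.g. rows that have A and B tend to have C.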

Try using frequent itemsets and association rules.
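
To illustrate the idea without extra packages, the sketch below computes support and confidence for item pairs only, in base R. The `items` matrix, the forced co-occurrence of A and B, and the 0.3 threshold are all made up for illustration. For larger itemsets a dedicated implementation such as `apriori()` from the arules package would be the more usual route.

```r
set.seed(1)
# Hypothetical transactions: 50 rows x 5 items (1 = item present)
items <- matrix(rbinom(250, 1, 0.4), nrow = 50,
                dimnames = list(NULL, LETTERS[1:5]))
# Make A and B co-occur often so there is something to find
items[1:25, c("A", "B")] <- 1

# Support of each pair = share of rows containing both items
support <- crossprod(items) / nrow(items)

# Confidence of the rule "row item -> column item": P(column item | row item)
confidence <- support / diag(support)

min_support <- 0.3  # arbitrary threshold for illustration
idx <- which(support >= min_support & upper.tri(support), arr.ind = TRUE)
data.frame(item1 = rownames(support)[idx[, 1]],
           item2 = colnames(support)[idx[, 2]],
           support = support[idx])
```

In the question's setting, each strength or weakness dummy column would play the role of an item and each resource the role of a transaction.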

Cluster analysis in R with dummy-coded variables
