Question

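I'm new to R and am working with a data set that includes nominal, ordinal and metric data, so I am using the Gower distance. In the next step I pass this distance to hclust(x, method="complete") to create clusters based on it.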

Now I would like to know how to give the variables different weights in the Gower distance. The documentation says:

daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(), weights = rep.int(1, p))

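So there is a way, but I'm unsure about the syntax (weights = ...). The documentation for weights and for rep.int didn't help, and I haven't found any other useful explanation.

I'd be glad if someone could help.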

Answer 1

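Not sure if this is what you're after, but...

Suppose you have 5 variables, e.g. 5 columns in a data frame or matrix. Then weights would be a vector of length 5 containing the weights for the corresponding columns.

The notation weights=rep.int(1,p) in the documentation just means that the default value of weights is a vector of length p filled with 1s, i.e. all the weights are equal to 1. Elsewhere in the documentation it explains that p is the number of columns.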
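As a quick check of that default, here is a minimal sketch on a made-up data frame (the column names and values are purely illustrative, not from the OP's data). It compares the result without weights to the result with an explicit all-ones vector. It also uses a constant vector of 2s, on the assumption that the weights enter Gower's formula as a weighted average, so only their relative sizes should matter.

```r
library(cluster)

# Hypothetical toy data: one nominal and two numeric columns (p = 3)
x <- data.frame(
  group  = factor(c("a", "b", "a", "c")),
  height = c(1.2, 3.4, 2.2, 5.0),
  width  = c(10, 40, 20, 30)
)
p <- ncol(x)

d_default <- daisy(x, metric = "gower")                           # weights omitted
d_ones    <- daisy(x, metric = "gower", weights = rep.int(1, p))  # explicit default
d_twos    <- daisy(x, metric = "gower", weights = rep.int(2, p))  # same ratios, doubled

# Compare the raw dissimilarities, ignoring attributes
same_as_default <- isTRUE(all.equal(as.vector(d_default), as.vector(d_ones)))
scale_invariant <- isTRUE(all.equal(as.vector(d_default), as.vector(d_twos)))
print(c(same_as_default = same_as_default, scale_invariant = scale_invariant))
```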

Also, note that daisy(...) produces a dissimilarity matrix. This is what you pass to hclust(...). So if x is a data frame or matrix with five columns of variables, then:

d  <- daisy(x, metric="gower", weights=c(1,2,3,4,5))
hc <- hclust(d, method="complete")
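To see what the weights actually do to a single distance, here is a small worked sketch on a hypothetical two-column data frame (one factor, one numeric). It assumes the usual Gower treatment: a nominal column contributes 0 or 1, a numeric column contributes the absolute difference divided by the column's range, and the per-variable contributions are combined as a weighted average.

```r
library(cluster)

# Hypothetical toy data: one nominal and one numeric column
x <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = c(1, 3, 5, 9)
)
w <- c(1, 3)  # size counts three times as much as colour

d <- daisy(x, metric = "gower", weights = w)

# Manual computation for rows 1 and 2
d_colour <- 1                                   # "red" vs "blue": different levels
d_size   <- abs(1 - 3) / diff(range(x$size))    # 2 / 8 = 0.25
manual   <- (w[1] * d_colour + w[2] * d_size) / sum(w)

from_daisy <- as.matrix(d)[1, 2]
print(c(manual = manual, daisy = from_daisy))
```

With equal weights the same pair would sit at (1 + 0.25) / 2, so putting more weight on size pulls these two rows closer together.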

Edit (in response to OP's comment)

The code below shows how the clustering depends on the weights.

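clust.anal <- function(df, w, h) {
  library(cluster)   # provides daisy()
  d  <- daisy(df, metric="gower", weights=w)
  hc <- hclust(d, method="complete")
  clust <- cutree(hc, h=h)
  # label the plot with the weights passed in (w), not a global variable
  plot(hc, sub=paste("weights=", paste(w, collapse=",")))
  rect.hclust(hc, h=h, border="red")   # cut at the same height as cutree()
  invisible(clust)   # return the cluster memberships
}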

# ExampleClusterData.csv is the OP's file; the weight vectors below assume 7 columns
df <- read.table("ExampleClusterData.csv", sep=";", header=TRUE)
df[1] <- factor(df[[1]])
df[2] <- factor(df[[2]])
# weights increase with col number...
wts <- c(1,2,3,4,5,6,7)
clust.anal(df, wts, h=0.8)

# weights decrease with col number...
wts <- c(7,6,5,4,3,2,1)
clust.anal(df, wts, h=0.8)
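
Since ExampleClusterData.csv isn't available here, the same comparison can be sketched on the flower data set that ships with the cluster package (18 observations, 8 variables of mixed type). The helper cluster_with() and the choice of k = 3 clusters are illustrative assumptions, not part of the original example. The cross-table shows how many observations change cluster when the weights are reversed.

```r
library(cluster)

data(flower)        # mixed-type example data from the cluster package
p <- ncol(flower)

# Hypothetical helper: Gower distance with weights w, then complete linkage
cluster_with <- function(w, k = 3) {
  d  <- daisy(flower, metric = "gower", weights = w)
  hc <- hclust(d, method = "complete")
  cutree(hc, k = k)
}

increasing <- cluster_with(seq_len(p))        # weights 1, 2, ..., p
decreasing <- cluster_with(rev(seq_len(p)))   # weights p, ..., 2, 1

# Rows: clusters under increasing weights; columns: under decreasing weights
agreement <- table(increasing, decreasing)
print(agreement)
```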

How to weight variables with Gower distance in R
