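I am using H2O (specifically H2O Flow) for K-means clustering. I checked the "standardize" checkbox to make sure it will "standardize columns before computing distances". It trains fine, and I examined the results, which show "within_cluster_sum_of_squares". My question is: is "within_cluster_sum_of_squares" the distance before or after standardization? It seems like it should be showing the distance after standardization, but the distances I see are large and look as if they were computed before standardization (though I am not sure). Any ideas? Thanks.
Answer 0 (score: 0)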
When you select standardize for K-Means in Flow, it does standardize the columns before computing the distances.
So, to answer your question, "within_cluster_sum_of_squares" is computed after standardization is performed.
One reason your metric value may seem too big is that you may have been expecting the H2O-3 K-Means standardize option to perform normalization (e.g. normalize = x / ||x||) rather than standardization (e.g. standardize = (x - mean) / sd).
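To make the difference concrete, here is a minimal sketch in plain Python (not H2O code). The helper names `standardize` and `normalize` are made up for illustration, and the use of the sample standard deviation (n - 1) is an assumption. It shows that a standardized column has a sum of squares of about n - 1, so it grows with the number of rows, while a normalized column always has a sum of squares of 1.

```python
import math
import statistics

def standardize(column):
    """Center to mean 0 and scale to unit standard deviation: (x - mean) / sd."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample SD (n - 1); an assumption for this sketch
    return [(x - mean) / sd for x in column]

def normalize(column):
    """Scale to unit Euclidean length: x / ||x||."""
    norm = math.sqrt(sum(x * x for x in column))
    return [x / norm for x in column]

column = [10.0, 20.0, 30.0, 40.0, 50.0]

z = standardize(column)
u = normalize(column)

# Sum of squares on each scale.
sum_sq_standardized = sum(v * v for v in z)  # equals n - 1 for a sample-SD z-score
sum_sq_normalized = sum(v * v for v in u)    # equals 1 by construction

print(sum_sq_standardized, sum_sq_normalized)
```

So with many rows and several columns, a sum of squares on the standardized scale can still be a large number, even though every individual value is "small".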
From the K-Means documentation, here is the overview of the standardization option:
standardize: Enable this option to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is enabled by default.
Note: If standardization is enabled, each column of numeric data is centered and scaled so that its mean is zero and its standard deviation is one before the algorithm is used. At the end of the process, the cluster centers on both the standardized scale (centers_std) and the de-standardized scale (centers) are displayed. To de-standardize the centers, the algorithm multiplies by the original standard deviation of the corresponding column and adds the original mean. Enabling standardization is mathematically equivalent to using h2o.scale in R with center = TRUE and scale = TRUE on the numeric columns. Therefore, there will be no discernible difference if standardization is enabled or not for K-Means, since H2O calculates unstandardized centroids.
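The de-standardization step described in the note can be sketched as follows. This is an illustration in plain Python on toy data, not H2O's implementation: `destandardize` and `within_ss` are hypothetical helpers, the sample standard deviation is an assumption, and all rows are placed in a single cluster to keep it short. It also compares the within-cluster sum of squares on the two scales, which is the quantity the question asks about.

```python
import statistics

def destandardize(center_std, means, sds):
    """Map a standardized center back to the original scale: c * sd + mean."""
    return [c * sd + m for c, m, sd in zip(center_std, means, sds)]

def within_ss(rows, center):
    """Sum of squared Euclidean distances from each row to the center."""
    return sum(sum((x - c) ** 2 for x, c in zip(row, center)) for row in rows)

# Toy data: two numeric columns on very different scales.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
columns = list(zip(*rows))
means = [statistics.mean(col) for col in columns]
sds = [statistics.stdev(col) for col in columns]  # sample SD; an assumption

# Standardize every column: (x - mean) / sd.
rows_std = [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Treat all rows as one cluster; its center is the column-wise mean.
center_std = [statistics.mean(col) for col in zip(*rows_std)]
center = destandardize(center_std, means, sds)

ss_standardized = within_ss(rows_std, center_std)  # computed on the standardized scale
ss_raw = within_ss(rows, center)                   # computed on the original scale

print(center_std, center, ss_standardized, ss_raw)
```

On this toy data, the sum of squares on the original scale is dominated by the second column, while on the standardized scale both columns contribute equally. Comparing the two for your own data is one way to check which scale a reported value is on.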