Question

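I'm using Spark 2.0.2. I have data partitioned by day. I want to cluster the different partitions independently of each other and then compare the cluster centers (calculate the distances between them) to see how the clustering changes over time.

I apply exactly the same preprocessing (scaling, one-hot encoding, etc.) to each partition. I use a predefined pipeline for this, which works perfectly in a "normal" learning and prediction setting. But when I want to calculate the distances between the cluster centers, the corresponding vectors from different partitions have different sizes (different dimensions).

Some code snippets:

The preprocessing pipeline is built as follows: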

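import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}

val protoIndexer = new StringIndexer().setInputCol("protocol").setOutputCol("protocolIndexed").setHandleInvalid("skip")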
val serviceIndexer = new StringIndexer().setInputCol("service").setOutputCol("serviceIndexed").setHandleInvalid("skip")
val directionIndexer = new StringIndexer().setInputCol("direction").setOutputCol("directionIndexed").setHandleInvalid("skip")

val protoEncoder = new OneHotEncoder().setInputCol("protocolIndexed").setOutputCol("protocolEncoded")
val serviceEncoder = new OneHotEncoder().setInputCol("serviceIndexed").setOutputCol("serviceEncoded")
val directionEncoder = new OneHotEncoder().setInputCol("directionIndexed").setOutputCol("directionEncoded")

val scaleAssembler = new VectorAssembler().setInputCols(Array("duration", "bytes", "packets", "tos", "host_count", "srv_count")).setOutputCol("scalableFeatures")
val scaler = new StandardScaler().setInputCol("scalableFeatures").setOutputCol("scaledFeatures")
val featureAssembler = new VectorAssembler().setInputCols(Array("scaledFeatures", "protocolEncoded", "urgent", "ack", "psh", "rst", "syn", "fin", "serviceEncoded", "directionEncoded")).setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(protoIndexer, protoEncoder, serviceIndexer, serviceEncoder, directionIndexer, directionEncoder, scaleAssembler, scaler, featureAssembler))
pipeline.write.overwrite().save(config.getString("pipeline"))

Define k-means, load the predefined preprocessing pipeline, and append k-means to the pipeline:

import org.apache.spark.ml.clustering.{KMeans, KMeansModel}

val kmeans = new KMeans().setK(40).setTol(1.0e-6).setFeaturesCol("features")
val pipelineStages = Pipeline.load(config.getString("pipeline")).getStages
val pipeline = new Pipeline().setStages(pipelineStages ++ Array(kmeans))

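Load the data partitions, compute the features, fit the pipeline, get the k-means model, and print the size of the first cluster center as an example: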

(1 to 7 by 1).map { day =>
  val data = sparkContext.textFile("path/to/data/" + day + "/")
  val rawFeatures = data.map(extractFeatures....).toDF(featureHeaders: _*)
  val model = pipeline.fit(rawFeatures)

  val kmeansModel = model.stages(model.stages.size - 1).asInstanceOf[KMeansModel]
  println(kmeansModel.clusterCenters(0).size)
}

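For different partitions, the cluster centers have different dimensions (but the same dimension for each of the 40 clusters within a partition). So I can't calculate the distances between them. I would expect them all to be the same size (namely the dimension of my Euclidean space, which is 13, because I have 13 features). But I get strange numbers that I don't understand.

I saved the extracted feature vectors to a file to check them. Their format looks suspicious. Every feature is present.

Any ideas what I'm doing wrong, or whether I have a misunderstanding? Thanks!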

Answer 1

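Setting aside the fact that KMeans is not a good choice for processing categorical data, your code doesn't guarantee:

- The same index-to-feature relationship between batches. StringIndexer assigns labels by frequency: the most frequent string is encoded as 0 and the least frequent as numLabels - 1.
- The same number of indices between batches and, as a result, the same shape of the one-hot-encoded and assembled vectors. The size of the encoded vector equals the number of unique labels, adjusted according to the value of the dropLast parameter of OneHotEncoder.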
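
As a rough illustration of the first point, here is a plain-Scala simulation of frequency-based indexing. It is not Spark's implementation: the helper name frequencyIndex, the sample data, and the alphabetical tie-break are all made up for this sketch.

```scala
object FrequencyIndexSketch {
  // Simulates the rule described above: labels are ordered by descending
  // frequency and the most frequent label gets index 0.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def frequencyIndex(values: Seq[String]): Map[String, Int] =
    values
      .groupBy(identity)
      .map { case (label, occurrences) => label -> occurrences.size }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .zipWithIndex
      .toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical "protocol" values for two different days.
    val day1 = Seq("tcp", "tcp", "tcp", "udp", "udp", "icmp")
    val day2 = Seq("udp", "udp", "udp", "tcp")

    // Day 1 mapping: tcp -> 0, udp -> 1, icmp -> 2
    println(frequencyIndex(day1))
    // Day 2 mapping: udp -> 0, tcp -> 1 (no index for icmp at all)
    println(frequencyIndex(day2))
  }
}
```

The same string ends up at a different index on each day, and day 2 has one label fewer, so position 0 of the encoded vector does not mean the same thing in both batches.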

As a result, the encoded vectors may have different dimensions and different interpretations between batches.
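
To see where the unexpected sizes can come from, here is a small arithmetic sketch of the assembled "features" vector from the question. It assumes the six flag columns (urgent, ack, psh, rst, syn, fin) are plain numeric scalars and that dropLast is left at its default of true; the per-day label counts are hypothetical.

```scala
object FeatureDimensionSketch {
  // Width of one one-hot-encoded column: one slot per label,
  // minus one when dropLast is enabled.
  def encodedSize(numLabels: Int, dropLast: Boolean = true): Int =
    if (dropLast) numLabels - 1 else numLabels

  // Width of the final assembled vector for the pipeline in the question:
  // 6 scaled numeric features + 6 flag columns (assumed scalar)
  // + the three encoded categorical columns.
  def featureDimension(protocols: Int, services: Int, directions: Int,
                       dropLast: Boolean = true): Int = {
    val scaledFeatures = 6
    val flagColumns = 6
    scaledFeatures + flagColumns +
      encodedSize(protocols, dropLast) +
      encodedSize(services, dropLast) +
      encodedSize(directions, dropLast)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical day with 3 protocols, 20 services, 2 directions:
    println(featureDimension(3, 20, 2)) // 12 + 2 + 19 + 1 = 34
    // Hypothetical day with 2 protocols, 17 services, 2 directions:
    println(featureDimension(2, 17, 2)) // 12 + 1 + 16 + 1 = 30
  }
}
```

So the dimension of a cluster center depends on how many distinct categorical values happen to appear in that day's data, not only on the number of input columns.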

If you want consistent encoding, you'll need a persistent dictionary mapping that ensures consistent indexing between batches.
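
One possible way to get such a fixed mapping is to build a StringIndexerModel directly from an explicit label array instead of fitting a StringIndexer on every batch. The sketch below is only an illustration: the label list, the object and method names, and the local SparkSession are assumptions, and in practice the labels would be loaded from wherever you persist the dictionary.

```scala
import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.sql.SparkSession

object FixedDictionarySketch {
  // Hypothetical, hand-maintained dictionary of protocol labels.
  val protocolLabels: Array[String] = Array("tcp", "udp", "icmp")

  // A StringIndexerModel built from explicit labels maps labels(i) to index i,
  // no matter how frequent each label is in the batch being transformed.
  def fixedProtocolIndexer(): StringIndexerModel =
    new StringIndexerModel(protocolLabels)
      .setInputCol("protocol")
      .setOutputCol("protocolIndexed")
      .setHandleInvalid("skip") // drop rows whose protocol is not in the dictionary

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("fixed-dictionary-sketch")
      .getOrCreate()
    import spark.implicits._

    val indexer = fixedProtocolIndexer()

    // Two hypothetical daily batches with different label frequencies.
    val day1 = Seq("tcp", "tcp", "udp").toDF("protocol")
    val day2 = Seq("udp", "udp", "icmp", "tcp").toDF("protocol")

    // In both batches: tcp -> 0.0, udp -> 1.0, icmp -> 2.0
    indexer.transform(day1).distinct().show()
    indexer.transform(day2).distinct().show()

    spark.stop()
  }
}
```

Such a model could take the place of protoIndexer (and likewise for the service and direction columns) in the pipeline stages. Since the indexed column then always describes the same full label set, the encoded vectors should come out the same width for every day, though that is worth verifying on your Spark version. A related option is to fit the preprocessing stages once on a reference dataset and reuse the fitted PipelineModel for every day, fitting only KMeans per day.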
