I started playing with k-means clustering in pyspark (v 1.6.2), using the example below, which contains mixed variable types:
# Import libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.mllib.clustering import KMeansModel

# Create sample DF
sample = sqlContext.createDataFrame([["a@email.com", 12000, "M"],
                                     ["b@email.com", 43000, "M"],
                                     ["c@email.com", 5000, "F"],
                                     ["d@email.com", 60000, "M"]],
                                    ["email", "income", "gender"])
I use StringIndexer, OneHotEncoder, and VectorAssembler to handle the categorical attributes, as shown below:
# Indexers encode the string columns
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in sample.columns if x != 'income']

# One-hot encode the indexed categories
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in sample.columns if x != 'income']

# Assemble multiple columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x) for x in sample.columns if x != 'income'] + ['income'],
    outputCol="features")
This piece confirms that the transformations work:
pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(sample)
indexed = model.transform(sample)
indexed.show()
I know I can run k-means on this transformed DF like so:
kmeans = KMeans() \
    .setK(2) \
    .setFeaturesCol("features") \
    .setPredictionCol("prediction")
kmeans_transformer = kmeans.fit(indexed)
oo = kmeans_transformer.transform(indexed)
oo.select('email', 'income', 'gender', 'features',
          'prediction').show(truncate=False)
+-----------+------+------+-------------------------+----------+
|email |income|gender|features |prediction|
+-----------+------+------+-------------------------+----------+
|a@email.com|12000 |M |[0.0,1.0,0.0,1.0,12000.0]|1 |
|b@email.com|43000 |M |(5,[3,4],[1.0,43000.0]) |0 |
|c@email.com|5000 |F |(5,[0,4],[1.0,5000.0]) |1 |
|d@email.com|60000 |M |[0.0,0.0,1.0,1.0,60000.0]|0 |
+-----------+------+------+-------------------------+----------+
But what I would like to work out is:
1) How can I do the same with pyspark.mllib.clustering.KMeansModel, so that I can determine the optimal (lowest-cost) value of K, in line with the KMeans.train and computeCost functions in the pyspark generic example? (The first sketch below shows the kind of search I mean.)
2) How can I get the cluster centers back on the original scale (i.e., as the "M"/"F" labels rather than their encoded values)? (The second sketch below shows roughly what I am after.)
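To make (1) concrete, here is a rough sketch of the elbow-style search I have in mind (MLlibKMeans, feature_rdd, and costs are names I made up; I am assuming the assembled features column can be fed straight to the mllib API as an RDD of vectors in 1.6.2):

from pyspark.mllib.clustering import KMeans as MLlibKMeans

# Pull the assembled vectors out of the DF as a plain RDD for the mllib API
feature_rdd = indexed.select('features').rdd.map(lambda row: row[0]).cache()

# Train one model per candidate K and record its WSSSE via computeCost
costs = {}
for k in range(2, 5):
    km = MLlibKMeans.train(feature_rdd, k, maxIterations=10, seed=1)
    costs[k] = km.computeCost(feature_rdd)

# Inspect the cost curve and pick K at the "elbow"
for k, cost in sorted(costs.items()):
    print(k, cost)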
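And for (2), this is roughly the manual back-translation I am after, done for the gender column only (the vector positions and the gender_labels lookup are my own guesses based on the feature layout in the table above, and feature_rdd is reused from the sketch above):

# Refit at the chosen K (2 here), reusing feature_rdd from the sketch above
km = MLlibKMeans.train(feature_rdd, 2, maxIterations=10, seed=1)

# Recover the index -> original label mapping from the transformed DF
gender_labels = dict(indexed.select('idx_gender', 'gender').distinct()
                            .rdd.map(lambda r: (int(r[0]), r[1])).collect())

for i, center in enumerate(km.clusterCenters):
    # Assumed layout, matching the table above: enc_email fills
    # positions 0-2, enc_gender position 3, income position 4
    gender_idx = 0 if center[3] >= 0.5 else 1  # dropLast one-hot: 1.0 <=> index 0
    print("cluster %d: gender=%s, income=%.0f"
          % (i, gender_labels[gender_idx], center[4]))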
PySpark version 1.6.2