PySpark K-means with categorical variables

Asked: 2017-05-21 00:08:12

Tags: pyspark cluster-analysis apache-spark-mllib

I started out with k-means clustering in pyspark (v 1.6.2) using the following example, which contains mixed variable types:

# Import libraries
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.mllib.clustering import KMeansModel

# Create sample DF with mixed variable types
sample = sqlContext.createDataFrame(
    [["a@email.com", 12000, "M"],
     ["b@email.com", 43000, "M"],
     ["c@email.com", 5000, "F"],
     ["d@email.com", 60000, "M"]],
    ["email", "income", "gender"])

I handle the categorical attributes with StringIndexer, OneHotEncoder, and VectorAssembler, as follows:

# Indexers encode the string columns as category indices
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in sample.columns if x != "income"]

# One-hot encode the indexed categories
encoders = [
    OneHotEncoder(inputCol="idx_{0}".format(x), outputCol="enc_{0}".format(x))
    for x in sample.columns if x != "income"]

# Assemble multiple columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["enc_{0}".format(x)
               for x in sample.columns if x != "income"] + ["income"],
    outputCol="features")

This confirms that the transformations work as expected:

pipeline = Pipeline(stages=string_indexers + encoders + [assembler])
model = pipeline.fit(sample)
indexed = model.transform(sample)

indexed.show()

I know I can then run k-means on this transformed DF like so:

kmeans = KMeans() \
    .setK(2) \
    .setFeaturesCol("features") \
    .setPredictionCol("prediction")

kmeans_transformer = kmeans.fit(indexed)
oo = kmeans_transformer.transform(indexed)

oo.select('email', 'income', 'gender', 'features',
          'prediction').show(truncate=False)

+-----------+------+------+-------------------------+----------+
|email      |income|gender|features                 |prediction|
+-----------+------+------+-------------------------+----------+
|a@email.com|12000 |M     |[0.0,1.0,0.0,1.0,12000.0]|1         |
|b@email.com|43000 |M     |(5,[3,4],[1.0,43000.0])  |0         |
|c@email.com|5000  |F     |(5,[0,4],[1.0,5000.0])   |1         |
|d@email.com|60000 |M     |[0.0,0.0,1.0,1.0,60000.0]|0         |
+-----------+------+------+-------------------------+----------+

But I would like to know:

1) How can I do the same with pyspark.mllib.clustering.KMeansModel in order to determine the optimal (lowest-cost) value of K (in line with the KMeans.train and computeCost functions from the pyspark generic example)? I have put a rough attempt in the first sketch below.

2) How can I get the cluster centers back on the original scale (that is, as "M" or "F" labels rather than the encoded representation)? See my second sketch below.
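
For 1), since I'm on PySpark 1.6.2 (where, as far as I can tell, the spark.ml pipeline still produces pyspark.mllib.linalg vectors), my rough idea is to feed the assembled features column straight into mllib's KMeans.train and sweep K with computeCost. This is only a sketch of what I have in mind, with placeholder parameters for my 4-row toy data, and I'm not sure it's the right way to bridge the two APIs:

from pyspark.mllib.clustering import KMeans as MllibKMeans

# In 1.6 the "features" column already holds mllib vectors (I believe),
# so the rows can be mapped straight to an RDD for KMeans.train
features_rdd = indexed.select("features").rdd.map(lambda row: row.features)
features_rdd.cache()

# Sweep candidate K values and record the within-set sum of squared errors
costs = {}
for k in range(2, 4):  # placeholder range; the toy DF has only 4 rows
    mllib_model = MllibKMeans.train(features_rdd, k, maxIterations=10, seed=1)
    costs[k] = mllib_model.computeCost(features_rdd)

for k in sorted(costs):
    print("K = {0}: cost = {1:.2f}".format(k, costs[k]))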

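For 2), my current understanding is that a cluster center is just the mean of the encoded feature vectors, so each one-hot slot holds the fraction of the cluster's members in that category, and the original labels can be read back from the fitted StringIndexerModel stages. Here is a sketch of that idea, hard-coded to the slot layout [enc_email (slots 0-2), enc_gender (slot 3), income (slot 4)] visible in the output above (OneHotEncoder drops the last category by default); I don't know whether there is a more general, idiomatic way:

# Centers come back in the 5-dimensional encoded space
centers = kmeans_transformer.clusterCenters()

# The fitted StringIndexerModel stages carry the label order used for
# encoding; stage order follows the pipeline (email first, then gender)
email_labels = model.stages[0].labels
gender_labels = model.stages[1].labels

for i, center in enumerate(centers):
    # Slot 3 is the share of the first gender label; the dropped last
    # label accounts for the remainder
    gender_mix = {gender_labels[0]: center[3],
                  gender_labels[1]: 1.0 - center[3]}
    print("cluster {0}: income ~ {1:.0f}, gender mix {2}".format(
        i, center[4], gender_mix))

That still gives me proportions per category rather than a single "M" or "F" label per center, which is essentially what I'm asking about.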

0 Answers:

No answers yet