How to name KMeans clusters in PySpark

Time: 2018-07-27 07:41:02

Tags: python-3.x apache-spark pyspark apache-spark-sql apache-spark-ml

I have the following code:

%pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
(trainingData, testData) = dataFrame.randomSplit([0.7, 0.3])
assembler = VectorAssembler(inputCols = ["PetalLength", "PetalWidth", "SepalLength", "SepalWidth"], outputCol="features")
kmeans = KMeans().setK(3).setSeed(101010)
pipeline = Pipeline(stages=[assembler, kmeans])
modelKMeans = pipeline.fit(dataFrame)

When I run this:

predictions = modelKMeans.transform(testData)
z.show(predictions)

In the prediction column I would like to see "Iris-setosa" instead of 0, "Iris-versicolor" instead of 1, and "Iris-virginica" instead of 2. Is that possible?

1 answer:

Answer 0: (score: 0)

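KMeans is a clustering algorithm, not a classification algorithm, so it has no notion of which class each cluster corresponds to. The cluster indices 0, 1 and 2 are arbitrary. If you want "Iris-setosa" instead of 0, you first have to check that the "Iris-setosa" group really corresponds to cluster 0, and you cannot know that before the model has been fitted. Once you know the correspondence, you can create a new column with a mapping from cluster index to name.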
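A minimal sketch of that idea follows. It assumes the data has a column holding the true species (called "Species" here, which the question does not confirm), and both helper names are hypothetical. The first helper picks the most frequent species in each cluster; the second attaches the resulting name as a new column using `create_map`.

```python
from collections import Counter, defaultdict


def majority_label_per_cluster(pairs):
    """Return {cluster_id: most frequent label} from (cluster_id, label) pairs."""
    counts = defaultdict(Counter)
    for cluster_id, label in pairs:
        counts[cluster_id][label] += 1
    return {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}


def add_cluster_name(predictions, mapping,
                     prediction_col="prediction", output_col="cluster_name"):
    """Add a column that translates the cluster index through `mapping`."""
    # Imported here so the voting helper above can be tried without Spark.
    from pyspark.sql import functions as F

    # create_map expects alternating key, value columns.
    entries = []
    for cluster_id, name in mapping.items():
        entries += [F.lit(cluster_id), F.lit(name)]
    mapping_expr = F.create_map(*entries)
    return predictions.withColumn(output_col, mapping_expr[F.col(prediction_col)])


# Possible usage in the notebook (the "Species" column name is an assumption):
# pairs = predictions.select("prediction", "Species").collect()
# mapping = majority_label_per_cluster(pairs)
# z.show(add_cluster_name(predictions, mapping))
```

Two clusters can end up with the same majority species if the clustering does not separate the classes cleanly, so it is worth inspecting the mapping before relying on it.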
