I have the following code:
%pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
(trainingData, testData) = dataFrame.randomSplit([0.7, 0.3])
assembler = VectorAssembler(inputCols = ["PetalLength", "PetalWidth", "SepalLength", "SepalWidth"], outputCol="features")
kmeans = KMeans().setK(3).setSeed(101010)
pipeline = Pipeline(stages=[assembler, kmeans])
modelKMeans = pipeline.fit(dataFrame)
When I run this:
predictions = modelKMeans.transform(testData)
z.show(predictions)
In the prediction column I would like to see "Iris-setosa" instead of 0, "Iris-versicolor" instead of 1, and "Iris-virginica" instead of 2. Is that possible?
Answer 0 (score: 0)
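KMeans is a clustering algorithm, not a classification algorithm, so it has no notion of which species each cluster corresponds to. Cluster IDs are assigned arbitrarily. If you want "Iris-setosa" instead of 0, you first have to check whether the "Iris-setosa" group actually corresponds to cluster 0. You cannot know this before the model has been fitted. Once you have established the correspondence, you can create a new column using a mapping.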
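One way to do this is sketched below. It is a sketch under assumptions, not the only approach: it assumes the DataFrame carries the true species in a column named "Species" (the question does not name that column), and it picks the most frequent species in each cluster as that cluster's name. The helper names `majority_label_per_cluster` and `add_cluster_names`, and the output column `predictedSpecies`, are hypothetical. The sample pairs are made-up stand-ins for collected rows.

```python
from collections import Counter, defaultdict
from itertools import chain


def majority_label_per_cluster(pairs):
    """Map each cluster id to the label that occurs most often in it.

    `pairs` is an iterable of (cluster_id, label) tuples.
    """
    counts = defaultdict(Counter)
    for cluster_id, label in pairs:
        counts[cluster_id][label] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


def add_cluster_names(predictions, mapping,
                      cluster_col="prediction", out_col="predictedSpecies"):
    """Return `predictions` with an extra string column looked up from `mapping`."""
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import functions as F

    # create_map takes alternating key/value columns: key1, value1, key2, value2, ...
    mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
    return predictions.withColumn(out_col, mapping_expr[F.col(cluster_col)])


# Illustrative stand-in for rows collected from the predictions DataFrame.
# The "Species" column name and these values are assumptions, not real output.
sample_pairs = [
    (0, "Iris-setosa"), (0, "Iris-setosa"),
    (1, "Iris-versicolor"), (1, "Iris-virginica"), (1, "Iris-versicolor"),
    (2, "Iris-virginica"), (2, "Iris-virginica"), (2, "Iris-versicolor"),
]
cluster_to_species = majority_label_per_cluster(sample_pairs)
print(cluster_to_species)
```

In the notebook, the pairs would come from something like `[(r["prediction"], r["Species"]) for r in predictions.select("prediction", "Species").collect()]`, and the result could be displayed with `z.show(add_cluster_names(predictions, cluster_to_species))`. Note that a majority vote can assign the same species to two clusters if the clustering does not separate the species cleanly, so it is worth inspecting the mapping before relying on it.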