Question

pyspark==2.4.0

Here is the code that raises the exception:

LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)

root
 |-- ID: string (nullable = true)
 |-- topic_vector_fix_dim: array (nullable = true)
 |    |-- element: double (containsNull = true)

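IllegalArgumentException: 'requirement failed: Column topic_vector_fix_dim must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'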

I'm confused: it rejects my array<double>, yet lists that same type as an accepted input.
Each entry of topic_vector_fix_dim is a one-dimensional array of floats.

Answer 1

containsNull of the features column should be set to False:

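from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
# Identity UDF whose declared return type forbids null elements
new_schema = ArrayType(DoubleType(), containsNull=False)
udf_foo = udf(lambda x: x, new_schema)
LDA = LDA.withColumn("topic_vector_fix_dim", udf_foo("topic_vector_fix_dim"))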

After that, everything works.

Answer 2

The containsNull answer did not work for me, but this did:

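from pyspark.ml.feature import VectorAssembler
# Assemble the scalar columns into a single vector column
vectorAssembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
df = vectorAssembler.transform(df)
df = df.select(["features", "Y"])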

Pyspark KMeans clustering features column IllegalArgumentException
