pyspark==2.4.0
Here is the code that throws the exception:
LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)
root
 |-- ID: string (nullable = true)
 |-- topic_vector_fix_dim: array (nullable = true)
 |    |-- element: double (containsNull = true)
IllegalArgumentException:
'requirement failed: Column topic_vector_fix_dim must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array
I'm confused: it rejects my array<double>, yet the message seems to say that type is an acceptable input. Each entry of topic_vector_fix_dim is a one-dimensional array of doubles.
Answer 0 (score: 1)
containsNull should be set to False:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Re-declare the array type with containsNull=False via an identity UDF.
new_schema = ArrayType(DoubleType(), containsNull=False)
udf_foo = udf(lambda x: x, new_schema)
LDA = LDA.withColumn("topic_vector_fix_dim", udf_foo("topic_vector_fix_dim"))
After that, everything worked fine.
Answer 1 (score: 1)
The containsNull answer didn't work for me, but this did:
from pyspark.ml.feature import VectorAssembler

# Assemble the scalar feature columns into a single ML vector column.
vectorAssembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
df = vectorAssembler.transform(df)
df = df.select(["features", "Y"])