I am trying to use PySpark's PCA to visualize word2vec word vectors, but I get an unhelpful error message. It says the features column is of the wrong type, but as far as I can tell it isn't. (Full message below.)
Environment:
spark-2.4.0-bin-hadoop2.7
Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Python 3.6.5 | Anaconda, Inc.
Ubuntu 16.04
# Imports reconstructed here for completeness (my original snippet omitted them);
# sc and sqlContext come from the PySpark shell.
import numpy as np
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors  # assumed; see the note at the end of the post

maxWordsVis = 15
Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')   # word2vec embedding matrix
words = np.load('Gab_ai_posts_WordList.npy')   # corresponding word list

# To an RDD; with big matrices, avoid this by reading them directly from HDFS
Feat = sc.parallelize(Feat)
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))

# To a DataFrame with a single "features" column
dfFeat = sqlContext.createDataFrame(Feat, ["features"])
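As a sanity check (these two lines are additions for this post, not part of my original script), the column type can be inspected right after the DataFrame is created:

dfFeat.printSchema()   # prints the schema tree of the DataFrame
print(dfFeat.dtypes)   # (column name, type string) pairs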
dfFeat.head()
Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))
numComponents = 3
pca = PCA(k=numComponents, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(dfFeat)  # the fit call that fails; implied by the traceback but missing from my original snippet
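For context, this is what I intend to do once fit succeeds; it is only a sketch and is never reached because of the error:

dfPCA = model.transform(dfFeat)  # adds a 'pcaFeatures' column with 3 components per word
# Collect the projected coordinates for plotting the first maxWordsVis words
coords = np.array(dfPCA.select("pcaFeatures").rdd.map(lambda row: row[0].toArray()).collect())
coords = coords[:maxWordsVis]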
Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed:
Column features must be of type
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:224)
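Note added for anyone hitting the same thing: my best guess (not confirmed) is that this comes from mixing Spark's two linear algebra packages. pyspark.ml.feature.PCA expects vectors from pyspark.ml.linalg, while vectors built with pyspark.mllib.linalg fail its type check with exactly this message, even though both types print as the same struct. A minimal self-contained sketch of the contrast (assumes a running SparkSession named spark):

import numpy as np
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors as MLVectors        # new-style vectors (pyspark.ml)
from pyspark.mllib.linalg import Vectors as MLlibVectors  # old-style vectors (pyspark.mllib)

data = np.random.rand(10, 5)
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")

# ml vectors: fit succeeds
df_ml = spark.createDataFrame([(MLVectors.dense(row),) for row in data], ["features"])
pca.fit(df_ml)

# mllib vectors: fit raises the IllegalArgumentException shown above
df_mllib = spark.createDataFrame([(MLlibVectors.dense(row),) for row in data], ["features"])
# pca.fit(df_mllib)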