Calling pyspark.ml.feature

Time: 2018-12-07 01:15:29

Tags: pyspark pca py4j

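I am trying to use pyspark's PCA to visualize word2vec word vectors, but I get an unhelpful error message. It says the features column has the wrong type, yet the expected and actual types it reports are identical. (Full message below.)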

Background

  

spark-2.4.0-bin-hadoop2.7

Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)

Python 3.6.5 | Anaconda, Inc.

Ubuntu 16.04

My code

maxWordsVis = 15

Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')  
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat) 
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])

$ dfFeat.head()

  

Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))

numComponents = 3
pca = PCA(k = numComponents, inputCol = "features", outputCol = "pcaFeatures")

Error message

Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed: 
Column features must be of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.  
     at scala.Predef$.require(Predef.scala:224)
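
The confusing part of this message is that the expected and actual types print identically. One way that can happen, offered here as an assumption because the post does not show its imports, is that two distinct vector types share the same printed schema. In Spark 2.x, the vector type behind pyspark.mllib.linalg and the one behind pyspark.ml.linalg both print as the struct shown above, while pyspark.ml.feature.PCA accepts only the latter. The sketch below uses no Spark at all. Its classes and functions (OldVectorType, NewVectorType, check_column_type, try_check) are hypothetical stand-ins that only illustrate how a type check can fail while both sides of the message look the same.

```python
# Toy stand-ins (hypothetical, not Spark classes): two distinct types
# whose printed schema is identical.
SCHEMA = "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"


class OldVectorType:
    """Stands in for the vector type of the older RDD-based API."""

    def __str__(self):
        return SCHEMA


class NewVectorType:
    """Stands in for the vector type the DataFrame-based API expects."""

    def __str__(self):
        return SCHEMA


def check_column_type(col_name, actual, expected_cls):
    """Mimic a 'requirement failed' check that compares types, not strings."""
    if not isinstance(actual, expected_cls):
        raise ValueError(
            f"requirement failed: Column {col_name} must be of type "
            f"{expected_cls()} but was actually {actual}."
        )


def try_check(actual):
    """Return None if the check passes, otherwise the error text."""
    try:
        check_column_type("features", actual, NewVectorType)
    except ValueError as err:
        return str(err)
    return None


# The check fails, yet both types in the message read the same.
message = try_check(OldVectorType())
print(message)
```

If this is what is happening here, the first thing to try would be building the vectors with Vectors from pyspark.ml.linalg rather than pyspark.mllib.linalg, or converting an existing DataFrame with pyspark.mllib.util.MLUtils.convertVectorColumnsToML.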

0 answers:

No answers