Spark: Running LDA on sparse vectors

Time: 2018-08-10 08:57:50

Tags: scala apache-spark apache-zeppelin lda

I am trying to fit an LDA model on TF-IDF vectors (about 100k rows):

tfidfDf.show()
+-------+--------------------+
|     id|          average_sv|
+-------+--------------------+
|4860362|(8388608,[11385,1...|
|4860360|(8388608,[117559,...|
|4860355|(8388608,[26941,3...|
               .
               .
               .

The TF-IDF vectors are of type org.apache.spark.ml.linalg.SparseVector. The model:

import org.apache.spark.ml.clustering.LDA
val lda = new LDA()
            .setK(5)
            .setMaxIter(10)
            .setFeaturesCol("average_sv")

Now, when I try to fit the model, I get the following error:

val modelLda = lda.fit(tfidfDf)

org.apache.thrift.transport.TTransportException
 at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
 at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
 at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
 at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
 at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
 at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:373)
 at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
 at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:406)
 at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
 at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Why does this happen?

I am using zeppelin-0.7.3 and Spark 2.1.0.

The LDA example in the Documentation runs fine.
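For comparison, here is a minimal self-contained sketch of the same setup on toy data. It is not the original pipeline: the object name LdaSparseSketch, the helper fitToyLda, the local[*] master, the 10-term vocabulary, and the toy values are all assumptions made for illustration. Only the column name average_sv and the LDA calls mirror the code above.

```scala
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: fits LDA on a handful of small sparse vectors.
object LdaSparseSketch {

  // Builds a toy DataFrame with the same column names as tfidfDf
  // and fits an LDA model on it. Values are made up for illustration.
  def fitToyLda(spark: SparkSession): LDAModel = {
    import spark.implicits._

    // Each vector has size 10 (toy vocabulary), given as (indices, values).
    val toyDf = Seq(
      (1L, Vectors.sparse(10, Array(0, 3, 7), Array(1.0, 2.0, 0.5))),
      (2L, Vectors.sparse(10, Array(1, 3, 8), Array(0.7, 1.2, 2.1))),
      (3L, Vectors.sparse(10, Array(2, 5, 9), Array(1.5, 0.3, 1.1))),
      (4L, Vectors.sparse(10, Array(0, 4, 6), Array(0.9, 1.8, 0.4)))
    ).toDF("id", "average_sv")

    val lda = new LDA()
      .setK(2)                       // assumed toy value; the question uses 5
      .setMaxIter(10)
      .setFeaturesCol("average_sv")

    lda.fit(toyDf)
  }

  def main(args: Array[String]): Unit = {
    // Assumes a local Spark installation; in Zeppelin the provided
    // `spark` session could be passed to fitToyLda instead.
    val spark = SparkSession.builder()
      .appName("lda-sparse-sketch")
      .master("local[*]")
      .getOrCreate()

    val model = fitToyLda(spark)
    // Show the top 3 term indices and weights per topic.
    model.describeTopics(3).show(false)

    spark.stop()
  }
}
```

If a sketch like this runs but the fit on tfidfDf does not, the difference would presumably lie in the data (for example the vector size of 8388608 shown above) or the environment rather than in the LDA calls themselves.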

0 answers:

No answers yet.