NullPointerException when using Word2VecModel with a UserDefinedFunction

Asked: 2018-04-26 22:15:05

Tags: scala apache-spark machine-learning nlp word2vec

I am trying to pass a word2vec model object to my Spark UDF. Basically, I have a test set with movie IDs, and I want to pass each ID together with the model object to get an array of recommended movies for each row.

def udfGetSynonyms(model: org.apache.spark.ml.feature.Word2VecModel) =
  udf((col: String) => {
    model.findSynonymsArray("20", 1)
  })

However, this gives me a null pointer exception. When I run model.findSynonymsArray("20", 1) outside the udf, I get the expected answer. For some reason the call fails inside the udf, but runs fine outside of it.

Note: I added "20" here just to get a fixed answer and see whether it would work at all. Later I will replace "20" with col.

Thanks for your help!

Stack trace:

SparkException: Job aborted due to stage failure: Task 0 in stage 23127.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23127.0 (TID 4646648, 10.56.243.178, executor 149): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$udfGetSynonyms1$1: (string) => array<struct<_1:string,_2:double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Word2VecModel.findSynonymsArray(Word2Vec.scala:273)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:7)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:4)
... 12 more

2 answers:

Answer 0 (score: 1)

The SQL and udf APIs are a bit limited, and I'm not sure whether there is a way to use custom types as columns or as inputs to udfs. Some googling didn't turn up anything useful.

Instead, you can use the Dataset or RDD API and just use a regular Scala function in place of the udf, for example:

val model: Word2VecModel = ...
val inputs: Dataset[String] = ...
inputs.map(movieId => model.findSynonymsArray(movieId, 10))

Alternatively, I guess you could serialize the model to a string and pass that around, but that seems even uglier.
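To make the fragment above concrete, here is a minimal sketch of the Dataset approach. The input column name ("movieId"), the DataFrame df, and the model path are assumptions for illustration; the point is simply that findSynonymsArray is called from a plain Scala closure rather than a udf.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Word2VecModel

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical path and input DataFrame for illustration.
val model: Word2VecModel = Word2VecModel.load("/path/to/word2vec-model")
val inputs = df.select("movieId").as[String]

// Regular Scala function in a Dataset.map instead of a udf;
// each row yields the movie ID plus its top-10 synonyms.
val recommendations =
  inputs.map(movieId => (movieId, model.findSynonymsArray(movieId, 10)))
```

Note that this sketch still ships the model object inside the closure, so depending on the Spark version it may hit the same serialization pitfall; the broadcast workaround in the other answer sidesteps that entirely.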

Answer 1 (score: 0)

I think this problem occurs because wordVectors is a transient field:

class Word2VecModel private[ml] (
    @Since("1.4.0") override val uid: String,
    @transient private val wordVectors: feature.Word2VecModel)
  extends Model[Word2VecModel] with Word2VecBase with MLWritable {

I worked around this by broadcasting w2vModel.getVectors and re-creating the Word2VecModel inside each partition.
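A sketch of that broadcast workaround, assuming the older org.apache.spark.mllib.feature.Word2VecModel, whose public constructor accepts a Map[String, Array[Float]]. The names spark, df, w2vModel, and the column "movieId" are assumptions not taken from the original post.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.feature.{Word2VecModel => MLlibWord2VecModel}

import spark.implicits._

// getVectors returns a DataFrame of (word, vector); collect it into a
// plain map that is small enough to broadcast to every executor.
val vectorMap: Map[String, Array[Float]] =
  w2vModel.getVectors.collect().map { row =>
    row.getString(0) -> row.getAs[Vector](1).toArray.map(_.toFloat)
  }.toMap

val vectorsBc = spark.sparkContext.broadcast(vectorMap)

val result = df.select("movieId").as[String].mapPartitions { ids =>
  // Rebuild a local model once per partition from the broadcast map,
  // so the @transient wordVectors field is never serialized at all.
  val localModel = new MLlibWord2VecModel(vectorsBc.value)
  ids.map(id => (id, localModel.findSynonyms(id, 10)))
}
```

Because the model is reconstructed from broadcast data on each executor, no task ever depends on the transient field surviving closure serialization, which is what triggered the NullPointerException in Word2Vec.scala:273.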