Spark 2.2.2: CountVectorizerModel Index 24691 out of bounds for vector of size 23262

Asked: 2018-01-10 19:14:48

Tags: java apache-spark apache-spark-mllib

Hello everyone, and have a nice day. I would like some help based on your experience. I am trying to convert a set of text documents into token-count vectors with a CountVectorizerModel built on a custom vocabulary array of size 24693.

Here is the simple code:

CountVectorizerModel cvm2 = new CountVectorizerModel(vocabulary)
                .setInputCol(NEXT)
                .setOutputCol(NEXT_RAW_FEATURES);
        cvm2.transform(dataset).show(false);

Here is my full exception:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$8: (array<string>) => vector)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: requirement failed: Index 24691 out of bounds for vector of size 23262
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.linalg.SparseVector.<init>(Vectors.scala:570)
    at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:212)
    at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$8.apply(CountVectorizer.scala:265)
    at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$8.apply(CountVectorizer.scala:248)
    ... 16 more

Why am I getting

Index 24691 out of bounds for vector of size 23262

and how can I fix it? Do I need to tune

 setMinTF()

with a specified size? I don't know what to do, so I'm stuck here. Basically, I can't understand why this is happening or how to solve it. I would appreciate it if anyone could help me.

1 Answer:

Answer 0 (score: 0)

Your vocab array contains duplicates. You need to remove the duplicates from the array before constructing the CountVectorizerModel: the model builds its term-to-index map from the distinct terms, so duplicates shrink the effective vocabulary size (here 23262) while indices are still assigned against the original array length, producing out-of-bounds indices.
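A minimal sketch of the deduplication step, using a small hypothetical vocabulary with one repeated term (the real array would be the asker's 24693-entry vocabulary):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;

public class DedupVocab {
    public static void main(String[] args) {
        // Hypothetical vocabulary containing a duplicate term ("spark")
        String[] vocabulary = {"spark", "count", "vector", "spark", "model"};

        // LinkedHashSet drops duplicates while preserving first-seen order,
        // so the remaining terms keep a stable, deterministic ordering.
        String[] deduped = new LinkedHashSet<>(Arrays.asList(vocabulary))
                .toArray(new String[0]);

        System.out.println(deduped.length);           // 4
        System.out.println(Arrays.toString(deduped)); // [spark, count, vector, model]
    }
}
```

The deduplicated array can then be passed to `new CountVectorizerModel(deduped)` in place of the original vocabulary, so the model's vector size matches the number of distinct terms.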