Apache Mahout SimilarityAnalysis for CCO throwing NegativeArraySizeException

Date: 2017-04-19 20:57:06

Tags: apache scala mahout mahout-recommender

When calling Apache Mahout's SimilarityAnalysis for CCO, I get a fatal NegativeArraySizeException.

The code I'm running looks like this:

val result = SimilarityAnalysis.cooccurrencesIDSs(myIndexedDataSet:Array[IndexedDataset],
      randomSeed = 1234,
      maxInterestingItemsPerThing = 3,
      maxNumInteractions = 4)

I see the following error and accompanying stack trace:

17/04/19 20:49:09 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 20)
java.lang.NegativeArraySizeException
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/04/19 20:49:09 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 21)
java.lang.NegativeArraySizeException
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/04/19 20:49:09 WARN TaskSetManager: Lost task 0.0 in stage 11.0 (TID 20, localhost): java.lang.NegativeArraySizeException
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I'm using Apache Mahout version 0.13.0.

2 Answers:

Answer 0 (score: 1)

This always means that one of the input matrices is empty. How many matrices are in the array, and how many rows and columns does each have? The companion object for IndexedDatasetSpark provides a constructor, called apply in Scala, that takes an RDD[(String, String)], so if you can get your data into an RDD, just construct the IndexedDatasetSpark from it. The pairs of strings here are user-id, item-id for some event such as a purchase.

See the companion object here: https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala#L75

A little searching will turn up code that converts a csv into an RDD[(String, String)] in one line. It looks something like this:

val rawPurchaseInteractions = sc.textFile("/path/in/hdfs").map { line =>
  (line.split(",")(0), line.split(",")(1))
}

Although this splits the line twice, it expects a text file of comma-delimited lines of the form user-id,item-id for some type of interaction, such as "purchase". If the file contains other fields, just split out the user-id and item-id. The line inside the map function returns a pair of strings, so the resulting RDD has the correct type, namely RDD[(String, String)]. Pass it to IndexedDatasetSpark:

val purchasesRdd = IndexedDatasetSpark(rawPurchaseInteractions)(sc)

where sc is your Spark context. This should give you a non-empty IndexedDatasetSpark, which you can check by looking at the sizes of the wrapped BiDictionarys or by calling methods on the wrapped Mahout DRM.
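
For that check, a minimal sketch might look like the following. This assumes the rowIDs/columnIDs accessors on IndexedDataset and the nrow/ncol methods on the wrapped DRM; verify them against your 0.13.0 sources.

// purchasesRdd is the IndexedDatasetSpark built above.
// The BiDictionarys map external string IDs to Mahout's internal ints;
// a size of 0 means an empty matrix, which will break SimilarityAnalysis.
println(s"users: ${purchasesRdd.rowIDs.size}, items: ${purchasesRdd.columnIDs.size}")

// The wrapped DRM reports the matrix dimensions directly.
println(s"matrix: ${purchasesRdd.matrix.nrow} x ${purchasesRdd.matrix.ncol}")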

By the way, this assumes the csv has no header and is plain delimited text rather than full-spec csv. Spark has other methods for reading real CSV files, but that's probably not needed here.
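
If your file does have a header, one way to drop it with plain RDD operations (no CSV reader needed) is sketched below; the path is just a placeholder.

val raw = sc.textFile("/path/in/hdfs")      // placeholder path
val header = raw.first()                    // first line of the file
val rawPurchaseInteractions = raw
  .filter(_ != header)                      // drop the header line
  .map { line =>
    val fields = line.split(",")
    (fields(0), fields(1))                  // (user-id, item-id)
  }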

Answer 1 (score: 0)

The problem actually had nothing to do with Mahout; it was in a line earlier in my code:

inputRDD.filter(_ (1) == primaryFilter).map(o => (o(0), o(2)))

The range was off: I was using indices 1 through 3 instead of 0 through 2. Given where the error was thrown, I was sure the problem was inside Mahout, but this turned out to be the real cause.
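
For illustration only, a reconstruction of the off-by-one (the broken indices are inferred from the description above, not copied from my original code):

// Broken: columns are 0-indexed, so these indices read the wrong fields
// and run past the end of a three-field row.
// inputRDD.filter(_(2) == primaryFilter).map(o => (o(1), o(3)))

// Fixed: indices 0 through 2 cover user-id, event, item-id.
inputRDD.filter(_(1) == primaryFilter).map(o => (o(0), o(2)))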