When calling Apache Mahout's SimilarityAnalysis for CCO, I get a fatal NegativeArraySizeException.
The code I'm running looks like this:
val result = SimilarityAnalysis.cooccurrencesIDSs(myIndexedDataSet:Array[IndexedDataset],
randomSeed = 1234,
maxInterestingItemsPerThing = 3,
maxNumInteractions = 4)
I see the following error and corresponding stack trace:
17/04/19 20:49:09 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 20)
java.lang.NegativeArraySizeException
at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/19 20:49:09 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 21)
java.lang.NegativeArraySizeException
at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/04/19 20:49:09 WARN TaskSetManager: Lost task 0.0 in stage 11.0 (TID 20, localhost): java.lang.NegativeArraySizeException
at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57)
at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73)
at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am using Apache Mahout version 0.13.0.
Answer 0 (score: 1)
This always means one of the input matrices is empty. How many matrices are in the array, and how many rows and columns does each one have?
The companion object for IndexedDatasetSpark provides a constructor, called apply in Scala, that takes an RDD[(String, String)], so if you can get your data into an RDD you can simply construct the IndexedDatasetSpark from it. The pair of strings is user-id, item-id for some event such as a purchase.
See the companion object here: https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala#L75
A little searching will turn up a one-liner for converting a csv into an RDD[(String, String)]. It looks something like this:
val rawPurchaseInteractions = sc.textFile("/path/in/hdfs").map { line =>
  // (user-id, item-id) pairs from comma-delimited lines
  (line.split(",")(0), line.split(",")(1))
}
Although this splits each line twice, it expects the text file to contain comma-delimited lines of user-id,item-id for some single type of interaction, such as "purchase". If the file has other fields, just split out the user-id and item-id. The line in the map function returns a pair of strings, so the resulting RDD is of the correct type, namely RDD[(String, String)]. Pass it to IndexedDatasetSpark with:
val purchasesRdd = IndexedDatasetSpark(rawPurchaseInteractions)(sc)
where sc is your Spark context. This should give you a non-empty IndexedDatasetSpark, which you can check by looking at the size of the wrapped BiDictionary or by calling methods on the wrapped Mahout DRM.
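For example, a quick sanity check might look like the sketch below. The rowIDs, columnIDs, and matrix members are what Mahout 0.13's IndexedDataset appears to expose; verify the exact names against your version. An empty dataset here is exactly the condition that later surfaces as the NegativeArraySizeException inside SimilarityAnalysis.

// Rough sanity check: both dictionaries and the wrapped DRM should be non-empty.
println(s"users: ${purchasesRdd.rowIDs.size}, items: ${purchasesRdd.columnIDs.size}")
println(s"drm dimensions: ${purchasesRdd.matrix.nrow} x ${purchasesRdd.matrix.ncol}")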
By the way, this assumes the csv has no header. It is text-delimited, not full-spec csv. Using other methods in Spark you can read real CSV files, but that may not be necessary here.
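If your file does have a header row or extra columns, a minimal sketch like the following handles both; the column layout (user-id,item-id,timestamp) and the path are assumptions for illustration, not part of the original question:

// Assumed input: a header line followed by "user-id,item-id,timestamp" rows.
val raw = sc.textFile("/path/in/hdfs")
val header = raw.first()
val interactions = raw
  .filter(_ != header)                     // drop the header row
  .map(_.split(","))                       // split each line once
  .map(fields => (fields(0), fields(1)))   // keep only (user-id, item-id)
val purchases = IndexedDatasetSpark(interactions)(sc)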
Answer 1 (score: 0)
The problem actually had nothing to do with Mahout, but with an earlier line:
inputRDD.filter(_ (1) == primaryFilter).map(o => (o(0), o(2)))
The range was off; I had 1 through 3 instead of 0 through 2. Given where the error occurred, I was sure it was inside Mahout, but this turned out to be the real problem.
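To connect this back to answer 0: with zero-based indexing, a three-column line such as user-id,event,item-id is addressed as o(0), o(1), o(2); off-by-one indices either read the wrong columns or go out of bounds, and the end result is the empty input matrix described above. A hypothetical reconstruction of that step, with names and column layout assumed rather than taken from the original code, might look like this:

// Hypothetical reconstruction, not the original code: the names and the
// column layout (user-id, event, item-id) are assumptions for illustration.
val primaryFilter = "purchase"
val inputRootRDD = sc.textFile("/path/in/hdfs").map(_.split(","))
val inputRDD = inputRootRDD
  .filter(o => o(1) == primaryFilter)   // event type is column 1 (zero-based)
  .map(o => (o(0), o(2)))               // (user-id, item-id)
require(inputRDD.count() > 0, "an empty RDD here yields an empty IndexedDataset downstream")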