spark-scala: groupBy throws java.lang.ArrayIndexOutOfBoundsException (null)

Date: 2015-07-22 16:56:18

Tags: java scala apache-spark indexoutofboundsexception

I am starting with an RDD containing strings like the following:
((26468EDE20E38,239394055),(7665710590658745,-414963169),0,f1,1420276980302)
((26468EDE20E38,239394055),(8016905020647641,183812619),1,f4,1420347885727)
((26468EDE20E38,239394055),(6633110906332136,294201185),1,f2,1420398323110)
((26468EDE20E38,239394055),(6633110906332136,294201185),0,f1,1420451687525)
((26468EDE20E38,239394055),(7722056727387069,1396896294),0,f1,1420537469065)
((26468EDE20E38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
((26468EDE20E38,239394055),(8045651092287275,-4814845),0,f1,1420720722185)
((26468EDE20E38,239394055),(5170029699836178,-1332814297),0,f2,1420750531018)
((26468EDE20E38,239394055),(7722056727387069,1396896294),0,f1,1420807545137)
((26468EDE20E38,239394055),(4784119468604853,1287554938),0,f1,1421050087824)

Just to give a high-level description of the data: the first element of the outer tuple (itself a tuple) can be treated as the user ID, the second tuple as the product ID, and the third element is the user's preference for that product. (For future reference, I'll refer to the data above as val userData.)

Here is how I build userData:
val userData = data.map(x => {
  val userId = x._1.replace("_", "")                      // strip underscores from the user id
  val programId = x._2                                    // product id
  val timeFeature = someMethod(x._5, x._4)                // derive a time feature from the timestamp and the label
  val userHashTuple = (userId, userId.hashCode)           // (user id, its hash)
  val programHashTuple = (programId, programId.hashCode)  // (product id, its hash)
  (userHashTuple, programHashTuple, x._3, timeFeature, x._4)
})
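
For reference, each element of userData then appears to have the following shape (a sketch of the assumed types, inferred from the sample records at the top; UserRecord is just an illustrative name):

// Assumed element type of userData, inferred from the sample records above:
// ((userId, userIdHash), (programId, programIdHash), preference, featureLabel, timestamp)
type UserRecord = ((String, Int), (String, Int), Int, String, Long)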

My goal is that if a user has given both a positive (1) and a negative (0) preference for the same product, I keep only the positive record. For example, given:

((26468EDE20E38,239394055),(7722056727387069,1396896294),0,f1,1420537469065)
((26468EDE20E38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)

I only want to keep:

((26468EDE20E38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)

Following an answer provided earlier, I tried the following:

val grpData = userData.groupBy(x => (x._1, x._2)).mapValues(_.maxBy(_._3)).values

But I hit an error at the groupBy stage: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1], and the same error keeps repeating with an increasing duplicate count. The exception also does not point to anything in my own code, so I am having a hard time figuring out what is actually going on.

At the end I get the following:

 java.lang.ArrayIndexOutOfBoundsException

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Does this mean a key is null? But even if the key were empty, groupBy would still group those records under "", so I am a bit confused.
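
One thing that might help narrow this down is counting records whose grouping key would contain a null before running the groupBy (a diagnostic sketch only; it assumes the key fields can actually end up null after the map above, which may not be the case):

// Count records where either key component is null (diagnostic only, not a fix)
val badKeys = userData.filter(x => x == null || x._1 == null || x._2 == null)
println(s"records with a null key component: ${badKeys.count()}")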

After some digging I found the following: https://issues.apache.org/jira/browse/SPARK-6772

I don't think that issue is the same as mine, but it is all I have been able to find.
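
For what it's worth, the same "keep the record with the largest preference per (user, product) pair" selection can also be written with reduceByKey instead of groupBy (a sketch, using the same field positions as my code above; grpData2 is just an illustrative name). I don't know whether it avoids the underlying problem, but it does not materialize the grouped collections:

// On Spark versions before 1.3 the pair-RDD implicits need: import org.apache.spark.SparkContext._
val grpData2 = userData
  .map(x => ((x._1, x._2), x))                          // key by (userHashTuple, programHashTuple)
  .reduceByKey((a, b) => if (a._3 >= b._3) a else b)    // keep the record with the larger preference
  .values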

0 Answers:

No answers yet.