I'm working with an RDD of strings that contains records like the following:
((26468EDE20E38,239394055),(7665710590658745,-414963169),0,f1,1420276980302)
((26468EDE20E38,239394055),(8016905020647641,183812619),1,f4,1420347885727)
((26468EDE20E38,239394055),(6633110906332136,294201185),1,f2,1420398323110)
((26468EDE20E38,239394055),(6633110906332136,294201185),0,f1,1420451687525)
((26468EDE20E38,239394055),(7722056727387069,1396896294),0,f1,1420537469065)
((26468EDE20E38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
((26468EDE20E38,239394055),(8045651092287275,-4814845),0,f1,1420720722185)
((26468EDE20E38,239394055),(5170029699836178,-1332814297),0,f2,1420750531018)
((26468EDE20E38,239394055),(7722056727387069,1396896294),0,f1,1420807545137)
((26468EDE20E38,239394055),(4784119468604853,1287554938),0,f1,1421050087824)
This is just a high-level view of the data: you can treat the first element of the outer tuple (itself a tuple) as the user ID, the second tuple as the product ID, and the third element as the user's preference for that product. (For future reference, I'll call the data above val userData.)
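For reference, this is the record shape as I understand it from the printed sample above (the type alias below is just my own annotation for this post, not part of the actual job):

type UserRecord = ((String, Int), (String, Int), Int, String, Long)
// ((userId, userIdHash), (productId, productIdHash), preference, featureLabel, timestamp)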
Here is how I build userData:
val userData = data.map(x => {
  // strip underscores from the raw user ID
  val userId = x._1.replace("_", "")
  val programId = x._2
  // derive a time feature from the timestamp (x._5) and the feature label (x._4)
  val timeFeature = someMethod(x._5, x._4)
  // pair each ID with its hash code
  val userHashTuple = (userId, userId.hashCode)
  val programHashTuple = (programId, programId.hashCode)
  (userHashTuple, programHashTuple, x._3, timeFeature, x._4)
})
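As a quick sanity check (just a sketch), I can print a few mapped records to confirm they match the sample above:

userData.take(5).foreach(println)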
My goal is: if a user has registered both a positive (1) and a negative (0) preference for the same product, I want to keep only the positive record. For example, given:
((26468EDE20E38,239394055),(7722056727387069,1396896294),0,f1,1420537469065)
((26468EDE20E38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
I only want to keep:
((26468EDE20E38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
Following an answer provided earlier, I tried the following:
// group by the (user, product) key pair and keep the record with the highest preference flag
val grpData = userData.groupBy(x => (x._1, x._2)).mapValues(_.maxBy(_._3)).values
But I hit an error at the groupBy stage:
java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
and the same error keeps repeating as the duplicate count increases. Moreover, the exception doesn't point at my own code, so I'm having a hard time figuring out what is actually going on.
Eventually I got the following:
java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does this mean the key is null? But even if the key were null, groupBy would still just group those records under that key (e.g., ""). So I'm a bit confused.
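To test the null-key hypothesis, a diagnostic I could run (a sketch; it assumes the userData tuple shape above) is:

// count records whose key components (or the raw IDs inside them) are null
val badKeys = userData.filter(x =>
  x._1 == null || x._1._1 == null ||
  x._2 == null || x._2._1 == null
).count()

If badKeys comes back nonzero, the raw input probably has missing IDs.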
After digging around I found this: https://issues.apache.org/jira/browse/SPARK-6772
But I don't think my problem is the same issue; it's just all I've been able to find.
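In the meantime, one workaround I'm considering is to sidestep groupBy entirely: since I only ever need the single record with the highest preference per (user, product) key, reduceByKey can express the same logic without materializing whole groups (a sketch, assuming the tuple shape above):

// key each record by (user, product), then keep the record with the higher preference flag
val grpData = userData
  .map(x => ((x._1, x._2), x))
  .reduceByKey((a, b) => if (a._3 >= b._3) a else b)
  .values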