Deduping records with hiveContext in Spark using Scala

Date: 2015-12-16 22:03:32

Tags: scala apache-spark

I am trying to dedupe event records using hiveContext in Spark with Scala. Converting the DataFrame to an RDD fails to compile with "object Tuple23 is not a member of package scala". This is a known limitation: Scala tuples cannot have 23 or more elements. Is there another way to deduplicate the data?

val events = hiveContext.table("default.my_table")
val valid_events = events.select(
                              events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
                              events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
                              events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
                              events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
                              events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
                              events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
                              )
// dedupe events, keeping the record with the latest epoch time
// this map is where compilation fails: the value tuple below exceeds Scala's 22-element tuple limit
val valid_events_rdd = valid_events.rdd.map(t => {
                                                  ((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(28),t(29)))
                                              })

// reduce by key so we will only get one record for every primary key
val reducedRDD =  valid_events_rdd.reduceByKey((a,b) => if ((a._29).compareTo(b._29) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)

1 Answer:

Answer 0 (score: 1)

Off the top of my head:

  • use case classes, which no longer have a size limit (as of Scala 2.11); just keep in mind that case classes don't work correctly in the Spark REPL,
  • use Row objects directly and extract only the keys (a sketch of this approach follows the examples below),
  • operate directly on the DataFrame:

    import org.apache.spark.sql.functions.{col, max}
    
    // maximum epoch for every (key1, key2) pair
    val maxs = df
      .groupBy(col("key1"), col("key2"))
      .agg(max(col("epoch")).alias("epoch"))
      .as("maxs")
    
    // keep only the rows whose epoch equals the per-key maximum,
    // then drop the duplicated join columns
    df.as("df")
      .join(maxs,
        col("df.key1") === col("maxs.key1") &&
        col("df.key2") === col("maxs.key2") &&
        col("df.epoch") === col("maxs.epoch"))
      .drop(maxs("epoch"))
      .drop(maxs("key1"))
      .drop(maxs("key2"))
    
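    Note that this join keeps every row that ties for the maximum epoch within a key, so two records sharing the same key and epoch both survive; the window-function variant below keeps exactly one row per key.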

    or use window functions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber  // the $ syntax also needs hiveContext.implicits._
    
    // order by epoch descending so that row number 1 is the latest record per key
    val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc)
    
    df.withColumn("rn", rowNumber.over(w)).where($"rn" === 1).drop("rn")
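For the second bullet above, here is a minimal sketch of the Row-based approach, reusing valid_events from the question and assuming that epoch is the last selected column (index 29) and is stored as a Long (switch to the matching Row getter if it is not). Because every record stays a single Row, the 22-element tuple limit never comes into play:

    // pair each record with its composite key; the value remains a Row
    val keyed = valid_events.rdd.map(r => ((r(0), r(1)), r))
    
    // keep the record with the latest epoch for every (key1, key2)
    val deduped = keyed
      .reduceByKey((a, b) => if (a.getLong(29) >= b.getLong(29)) a else b)
      .values
    
    deduped.take(10).foreach(println)

Nothing in this version depends on the number of columns, so adding or removing columns from the select does not change the dedup logic.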