Apache Spark RDD: how to get the latest data per key in a pair RDD

Asked: 2016-08-29 12:09:18

Tags: scala apache-spark rdd

I am reading data from HDFS. There are multiple rows per user, and I have to pick the latest row for each user.

Example rows (RDD[Id: Int, DateTime: String, Name: String]):

1,2016-05-01 01:01:01,testa
2,2016-05-02 01:01:01,testb
1,2016-05-05 01:01:01,testa

In the example above there are two rows with Id = 1, but I want each id only once (only the latest row and its corresponding data). The output RDD should look like this:

2,2016-05-02 01:01:01,testb
1,2016-05-05 01:01:01,testa

My idea

I could collect this data into an array and run a for loop over it, keeping only the latest row for each user, to get the desired result.
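Roughly what I have in mind, as a hypothetical sketch (here `rows` stands for the RDD[(Int, String, String)] read from HDFS):

// Hypothetical sketch of the collect-based approach: pulls everything to the driver.
val all = rows.collect()
// "yyyy-MM-dd HH:mm:ss" sorts chronologically as plain text, so maxBy on the string works.
val latest = all.groupBy(_._1).values.map(_.maxBy(_._2))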

However, I have read that collect brings all the data to the master node. My data is 30 GB and the master has only 25 GB of RAM, so I do not want to try this.

Can you share your ideas and code for this task?

3 Answers:

Answer 0 (score: 1)

Convert the date string to a timestamp, then aggregate by id, picking the tuple with the latest timestamp:

import java.time.format.DateTimeFormatter
import java.time.{LocalDateTime, ZoneOffset}

import org.apache.spark.rdd.RDD

val yourRdd: RDD[(Int, String, String)] = sc.parallelize(List(
  (1, "2016-05-01 01:01:01", "testa"),
  (2, "2016-05-02 01:01:01", "testb"),
  (1, "2016-05-05 01:01:01", "testa")
))

val dateFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Zero value for the aggregation: any real row has a larger timestamp.
val zeroValue = (0, Long.MinValue, "", "")

// Key by id and attach an epoch-millisecond timestamp for easy comparison.
val rddWithTimestamp = yourRdd
  .map {
    case (id, datetimeStr, name) =>
      val timestamp: Long = LocalDateTime.parse(datetimeStr, dateFormatter)
        .toInstant(ZoneOffset.UTC).toEpochMilli()

      (id, (id, timestamp, datetimeStr, name))
  }

// Keep, per id, the tuple with the largest timestamp, both within and across partitions.
val yourRequiredRdd = rddWithTimestamp
  .aggregateByKey(zeroValue)(
    (t1, t2) => if (t1._2 > t2._2) t1 else t2,
    (t1, t2) => if (t1._2 > t2._2) t1 else t2
  )
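If the result should have the original (Id, DateTime, Name) shape again, a small follow-up sketch (the val name is illustrative) drops the duplicated id and the helper timestamp:

// Drop the outer key duplicate and the helper epoch-millisecond field.
val latestRows = yourRequiredRdd
  .map { case (_, (id, _, datetimeStr, name)) => (id, datetimeStr, name) }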

Answer 1 (score: 1)

You can use the DataFrame API:

import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  (1, "2016-05-01 01:01:01", "testA"),
  (2, "2016-05-02 01:01:01", "testB"),
  (1, "2016-05-05 01:01:01", "testA")))
  .toDF("id", "dateTime", "name")

df.withColumn("dateTime", unix_timestamp($"dateTime"))   // string -> epoch seconds
  .groupBy("id", "name")
  .max("dateTime")                                        // latest timestamp per (id, name)
  .withColumnRenamed("max(dateTime)", "dateTime")
  .withColumn("dateTime", from_unixtime($"dateTime"))     // epoch seconds -> string
  .show()

This requires a HiveContext as the SQLContext:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
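If the same id could appear with different name values, grouping by both columns above would keep one row per (id, name) pair rather than per id. A hedged alternative sketch using a window function over the same df (assumes Spark 1.6+ and, pre-2.0, the same HiveContext):

import org.apache.spark.sql.expressions.Window

// Rank rows within each id by dateTime descending and keep only the newest one.
// Ordering by the raw string works because "yyyy-MM-dd HH:mm:ss" sorts chronologically as text.
val w = Window.partitionBy($"id").orderBy($"dateTime".desc)

df.withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")
  .show()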

Answer 2 (score: 0)

This may help someone in need.

val yourRdd = sc.parallelize(List(
 (30, ("1122112211111".toLong, "testa", "testa", "testa")),
 (1, ("1122112211111".toLong, "testa", "testa", "testa")),
 (1, ("1122112211119".toLong, "testa", "testa", "testa")),
 (1, ("1122112211112".toLong, "testa", "testa", "testa")),
 (2, ("1122112211111".toLong, "testa", "testa", "testa")),
 (2, ("1122112211110".toLong, "testa", "testa", "testa"))
))

// Zero value for the aggregation: any real row has a larger timestamp than Long.MinValue.
val initialSet = (Long.MinValue, "", "", "")

// Per-partition merge: keep the tuple with the larger timestamp (first field).
val addToSet1 = (
  s: (Long, String, String, String),
  v: (Long, String, String, String)
) => if (s._1 > v._1) s else v

// Cross-partition merge: same comparison on the timestamp field.
val mergePartitionSets1 = (
  s: (Long, String, String, String),
  v: (Long, String, String, String)
) => if (s._1 > v._1) s else v

val ab1 = yourRdd
  .aggregateByKey(initialSet)(addToSet1, mergePartitionSets1)

ab1.take(10).foreach(println)
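For this sample input, each id keeps the tuple with its largest timestamp: 1122112211119 for id 1, and 1122112211111 for ids 2 and 30 (the print order of the three rows is not deterministic).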