I am reading data from HDFS. I have multiple rows per user, and I need to select the latest row for each user.
Example rows (RDD[(Id: Int, DateTime: String, Name: String)]):
1,2016-05-01 01:01:01,testa
2,2016-05-02 01:01:01,testb
1,2016-05-05 01:01:01,testa
In the example above there are two rows with Id = 1, but I want each id only once (only the latest row, with its corresponding data). I want the output RDD to look like this:
2,2016-05-02 01:01:01,testb
1,2016-05-05 01:01:01,testa
My idea
I could collect this data into an array and run a for loop over it, keeping only the latest row for each user.
However, I have read that collect brings all the data to the master node. My data is 30 GB and the master has 25 GB of RAM, so I do not want to try that.
Could you share your ideas and code for accomplishing this task?
Answer 0 (score: 1)
Convert the date string to a timestamp and aggregate by id, keeping the tuple with the latest timestamp.
import java.time.format.DateTimeFormatter
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.spark.rdd.RDD

val yourRdd: RDD[(Int, String, String)] = sc.parallelize(List(
  (1, "2016-05-01 01:01:01", "testa"),
  (2, "2016-05-02 01:01:01", "testb"),
  (1, "2016-05-05 01:01:01", "testa")
))

val dateFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Neutral zero value for aggregateByKey: any real row has a larger timestamp.
val zeroValue = (0, Long.MinValue, "", "")

// Key by id and attach an epoch-millisecond timestamp for comparison.
val rddWithTimestamp = yourRdd
  .map {
    case (id, datetimeStr, name) =>
      val timestamp: Long = LocalDateTime.parse(datetimeStr, dateFormatter)
        .toInstant(ZoneOffset.UTC).toEpochMilli
      (id, (id, timestamp, datetimeStr, name))
  }

// Within and across partitions, keep the tuple with the larger timestamp.
val yourRequiredRdd = rddWithTimestamp
  .aggregateByKey(zeroValue)(
    (t1, t2) => if (t1._2 > t2._2) t1 else t2,
    (t1, t2) => if (t1._2 > t2._2) t1 else t2
  )
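To sanity-check the result on the small sample, the aggregated values can be mapped back to the original (id, dateTime, name) shape (a quick sketch; collect is only acceptable here because the example data is tiny):

// Drop the helper timestamp and recover the original row shape.
yourRequiredRdd
  .map { case (_, (id, _, datetimeStr, name)) => (id, datetimeStr, name) }
  .collect()
  .foreach(println)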
Answer 1 (score: 1)
You can use the DataFrame API:
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  (1, "2016-05-01 01:01:01", "testA"),
  (2, "2016-05-02 01:01:01", "testB"),
  (1, "2016-05-05 01:01:01", "testA")))
  .toDF("id", "dateTime", "name")

df.withColumn("dateTime", unix_timestamp($"dateTime"))
  .groupBy("id", "name")
  .max("dateTime")
  .withColumnRenamed("max(dateTime)", "dateTime")
  .withColumn("dateTime", from_unixtime($"dateTime"))
  .show()
This requires using HiveContext as the SQLContext:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
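If the table had more columns than id and name, grouping by every column would not scale; a window function can keep whole rows instead. This is a sketch not taken from the original answer, assuming Spark 1.6+ where row_number is available in org.apache.spark.sql.functions (window functions in Spark 1.x also need the HiveContext):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows per id by dateTime descending; the "yyyy-MM-dd HH:mm:ss" strings
// already sort chronologically, so no timestamp conversion is needed here.
val w = Window.partitionBy($"id").orderBy($"dateTime".desc)

df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
  .show()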
Answer 2 (score: 0)
This might help someone in need.
// Values are keyed by an Int id; each value carries an epoch timestamp plus three columns.
val yourRdd = sc.parallelize(List(
  (30, ("1122112211111".toLong, "testa", "testa", "testa")),
  (1, ("1122112211111".toLong, "testa", "testa", "testa")),
  (1, ("1122112211119".toLong, "testa", "testa", "testa")),
  (1, ("1122112211112".toLong, "testa", "testa", "testa")),
  (2, ("1122112211111".toLong, "testa", "testa", "testa")),
  (2, ("1122112211110".toLong, "testa", "testa", "testa"))
))

// Zero value: a placeholder id and the smallest possible timestamp, so any real value wins.
// The real key is carried by aggregateByKey itself.
val initialSet = (0, (Long.MinValue, "", "", ""))

// Within a partition, keep whichever side has the larger timestamp.
val addToSet1 = (
  s: (Int, (Long, String, String, String)),
  v: (Long, String, String, String)
) => if (s._2._1 > v._1) s else (s._1, v)

// Across partitions, keep whichever side has the larger timestamp.
val mergePartitionSets1 = (
  s: (Int, (Long, String, String, String)),
  v: (Int, (Long, String, String, String))
) => if (s._2._1 > v._2._1) s else v

val ab1 = yourRdd
  .aggregateByKey(initialSet)(addToSet1, mergePartitionSets1)
ab1.take(10).foreach(println)
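If only the value tuple is needed downstream, the placeholder id inside the aggregated value can be dropped (a small follow-up sketch using the RDD above):

// Keep just (key, (timestamp, col1, col2, col3)).
val latestPerKey = ab1.mapValues(_._2)

// For the sample data: key 1 -> 1122112211119, keys 2 and 30 -> 1122112211111.
latestPerKey.collect().foreach(println)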