I have an RDD with this structure:
RDD[((String, String), List[(Int, Timestamp, String)])]
and this data:
((D2,Saad Arif),List((4,2011-10-05 00:00:00.0,C101), (5,2010-01-27 00:00:00.0,C101)))
((D3,Faran Abid),List((7,2016-10-05 00:00:00.0,C101)))
((D1,Atif Shahzad),List((1,2012-04-15 00:00:00.0,C101), (2,2011-10-05 00:00:00.0,C101), (3,2006-12-25 00:00:00.0,C101)))
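Viewed as a table,
'(D2,Saad Arif)'
acts as the key and
'List((4,2011-10-05 00:00:00.0,C101), (5,2010-01-27 00:00:00.0,C101))'
holds the rows for that key. Now I want to check every row: if the same key has a record (history) with code 'C101' dated two or more years earlier, set level to 2; otherwise set it to 1. The resulting RDD should look like this: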
((D2,Saad Arif),List((4,2011-10-05 00:00:00.0,C101, 1), (5,2010-01-27 00:00:00.0,C101, 1)))
((D3,Faran Abid),List((7,2016-10-05 00:00:00.0,C101, 1)))
((D1,Atif Shahzad),List((1,2012-04-15 00:00:00.0,C101, 2), (2,2011-10-05 00:00:00.0,C101, 2), (3,2006-12-25 00:00:00.0,C101, 1)))
Note the new level appended as the last field of each row. How can I do this using map or flatMap?
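To make the two-year rule concrete, here is a small sketch of the date arithmetic on the sample dates. It assumes "two years or more" means at least two whole calendar years as counted by `java.time.Period`; the object name `TwoYearCheck` and the helper `yearsBetween` are made up for illustration.

```scala
import java.time.{LocalDate, Period}

object TwoYearCheck {
  // Whole calendar years elapsed from `earlier` to `later`
  // (negative if `later` is actually before `earlier`).
  def yearsBetween(earlier: LocalDate, later: LocalDate): Int =
    Period.between(earlier, later).getYears

  def main(args: Array[String]): Unit = {
    // D2: 2010-01-27 to 2011-10-05 is 1 year and 8 months, so no row reaches level 2.
    println(yearsBetween(LocalDate.of(2010, 1, 27), LocalDate.of(2011, 10, 5))) // prints 1
    // D1: 2006-12-25 to 2011-10-05 is more than 4 years, so row 2 gets level 2.
    println(yearsBetween(LocalDate.of(2006, 12, 25), LocalDate.of(2011, 10, 5))) // prints 4
  }
}
```

Under this reading, the D2 rows stay at level 1 because their dates are less than two years apart, which matches the expected output above.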
Answer 0 (score: 1)
import java.time.{LocalDate, Period}

// Sentinel later than any real record; used as the fold's starting value.
val futureDate = LocalDate.of(2100, 1, 1)

val yourRequiredRdd = yourRdd
  .map { case (key, list) =>
    // The RDD holds java.sql.Timestamp values, so convert them to LocalDate
    // (LocalDate.parse only accepts text, not a Timestamp).
    val list1 = list.map { case (id, ts, code) =>
      (id, ts.toLocalDateTime.toLocalDate, code)
    }
    // Earliest date among this key's C101 records.
    val oldestDate = list1
      .filter { case (_, _, code) => code == "C101" }
      .map(_._2)
      .foldLeft(futureDate)((oldest, date) => if (date.isBefore(oldest)) date else oldest)
    // Level 2 if the row is two or more full years after the earliest C101 record.
    val newList = list1.map {
      case (id, date, "C101") =>
        val level = if (Period.between(oldestDate, date).getYears >= 2) 2 else 1
        (id, date, "C101", level)
      case (id, date, code) =>
        (id, date, code, 1)
    }
    (key, newList)
  }
  // Flattens to one record per row; drop this step to keep the grouped structure.
  .flatMap { case ((pid, name), list) =>
    list.map { case (id, date, code, level) => (id, name, code, pid, date, level) }
  }
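The same rule can also be sketched as a plain function over one key's list, which makes it easy to check against the sample data without a Spark cluster. This is only a sketch under some assumptions: the names `LevelSketch` and `addLevels` are hypothetical, the rows keep their original `java.sql.Timestamp` values, and "two years or more" is read as at least two whole years according to `java.time.Period`.

```scala
import java.sql.Timestamp
import java.time.Period

object LevelSketch {
  type Row = (Int, Timestamp, String)
  type LeveledRow = (Int, Timestamp, String, Int)

  // Hypothetical helper: appends a level to every row belonging to one key.
  def addLevels(rows: List[Row], code: String = "C101"): List[LeveledRow] = {
    // Dates of all records that carry the code of interest.
    val codeDates = rows.collect {
      case (_, ts, c) if c == code => ts.toLocalDateTime.toLocalDate
    }
    rows.map { case (id, ts, c) =>
      val date = ts.toLocalDateTime.toLocalDate
      // A row has history if some record with the code is at least two whole years older.
      val hasHistory =
        c == code && codeDates.exists(d => Period.between(d, date).getYears >= 2)
      (id, ts, c, if (hasHistory) 2 else 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val d1: List[Row] = List(
      (1, Timestamp.valueOf("2012-04-15 00:00:00"), "C101"),
      (2, Timestamp.valueOf("2011-10-05 00:00:00"), "C101"),
      (3, Timestamp.valueOf("2006-12-25 00:00:00"), "C101")
    )
    // Levels for the D1 sample rows come out as 2, 2, 1.
    addLevels(d1).foreach(println)
  }
}
```

On the RDD from the question, this could presumably be applied per key with `yourRdd.mapValues(rows => LevelSketch.addLevels(rows))`, which leaves the `(String, String)` key and the grouped list structure untouched.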