Spark: group data by date and zero-fill rows for ids missing on a date

Date: 2018-05-23 14:01:32

Tags: java apache-spark apache-spark-sql apache-spark-mllib

I have a Spark dataset, and I need to group the data by date and fill in a zero row whenever an id has no entry for a given date. I also need to extend the dataset over 30 days, because my end date falls on day 30 in the original dataset. Below is the sample data I am working with. What is the best way to do this transformation?

df1 <- structure(list(ID = c("P-1", "P-2", "P-2", "P-3", "P-3", "P-3"
), Final = c("A", "A", "B", "A", "B", "B"), Val1 = c("A", "0", 
"0", "A", "A", "A"), Val2 = c("0", "A", "A", "B", "B", "B"), 
    Val3 = c(NA, NA, NA, NA, NA, NA), Val4 = c("0", "B", "B", 
    "B", "B", "B"), Val5 = c("0", "", "", "B", "B", "B")), .Names = c("ID", 
"Final", "Val1", "Val2", "Val3", "Val4", "Val5"),
   class = "data.frame", row.names = c(NA, 
-6L))

Output

val genre = sc.parallelize(List(
    ("id1", "2016-05-01", "action", 0),
    ("id1", "2016-05-03", "horror", 1),
    ("id2", "2016-05-03", "art",    0),
    ("id2", "2016-05-04", "action", 0)
  )).toDF("id", "date", "genre", "score")

Desired output

+---+----------+------+-----+
| id|      date| genre|score|
+---+----------+------+-----+
|id1|2016-05-01|action|    0|
|id1|2016-05-03|horror|    1|
|id2|2016-05-03|   art|    0|
|id2|2016-05-04|action|    0|
+---+----------+------+-----+
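One common way to fill in the missing (id, date) combinations is to build the full grid of ids and dates with a cross join, left-join the observed rows onto it, and fill the gaps with zeros. A minimal sketch of that approach (the placeholder genre value `"none"` is an assumption, not from the question):

```scala
import org.apache.spark.sql.SparkSession

object ZeroFillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("zero-fill")
      .getOrCreate()
    import spark.implicits._

    val genre = Seq(
      ("id1", "2016-05-01", "action", 0),
      ("id1", "2016-05-03", "horror", 1),
      ("id2", "2016-05-03", "art",    0),
      ("id2", "2016-05-04", "action", 0)
    ).toDF("id", "date", "genre", "score")

    // Full (id, date) grid: every id paired with every observed date.
    val ids   = genre.select("id").distinct()
    val dates = genre.select("date").distinct()
    val grid  = ids.crossJoin(dates)

    // Left-join the observed rows onto the grid; rows with no match
    // come back as nulls, which we replace with zero / a placeholder.
    val filled = grid
      .join(genre, Seq("id", "date"), "left")
      .na.fill(0, Seq("score"))
      .na.fill("none", Seq("genre"))
      .orderBy("id", "date")

    filled.show()
  }
}
```

If the 30-day window must cover dates that never appear in the data at all, the `dates` DataFrame would instead be generated from the start date (for example with `sequence` and `explode` over a date range) rather than taken from the observed rows.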

0 Answers:

There are no answers yet.