Spark Pivot On Date

Asked: 2017-08-06 09:39:34

Tags: scala date apache-spark dataframe pivot

The original DataFrame looks like this:

+--------------------+--------------------+--------------------+
|             user_id|    measurement_date|            features|
+--------------------+--------------------+--------------------+
|b6d0bb3d-7a8e-4ac...|2016-06-28 02:00:...|[3492.68576170840...|
..
|048ffee9-a942-4d1...|2016-04-28 02:00:...|[1404.42230898422...|
|05101595-5a6f-4cd...|2016-07-10 02:00:...|[1898.50082132108...|
+--------------------+--------------------+--------------------+

My pivot attempt:

data = data.select(data.col("user_id"), data.col("features"),
                   data.col("measurement_date").cast(DateType).alias("date"))
           .filter(data.col("measurement_date").between("2016-01-01", "2016-01-07"))
data = data.select(data.col("user_id"), data.col("features"), data.col("date"))
           .groupBy("user_id", "features")
           .pivot("date")
           .min()

My output is:

+--------------------+--------------------+
|             user_id|            features|
+--------------------+--------------------+
|14cd26dc-200a-436...|[2281.34579074947...|
..
|d8ae1b5e-c1e0-4bf...|[2568.49641198251...|
|1cceb175-12b4-4c3...|[4436.36029554227...|
+--------------------+--------------------+

The columns 2016-01-01, .., 2016-01-07 that I expected are missing; nothing is pivoted at all. What am I doing wrong?

Edit:

This is what the DataFrame looks like after the first statement:

+--------------------+--------------------+----------+
|             user_id|            features|      date|
+--------------------+--------------------+----------+
|60f1cd63-0d5a-4f2...|[1553.35305181118...|2016-01-05|
|a56d1fef-5f17-4c9...|[1704.34897309186...|2016-01-02|
..
|992b6a34-803d-44b...|[1518.14292508305...|2016-01-05|
+--------------------+--------------------+----------+

Notably, (user_id, features) is not a complete time series; there are gaps in the data. Sometimes there is no measurement for a given date, and in that case I want null as the entry.
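Since pivot only creates columns for values that actually occur in the data, the seven date columns can be guaranteed by passing the expected values to pivot explicitly. A minimal sketch, not part of the original attempt, assuming Spark 1.6+ (where pivot accepts an optional values sequence) and the DateType date column from above:

import org.apache.spark.sql.functions.min

// Fixed list of the seven expected dates; java.sql.Date values
// compare correctly against a DateType column.
val dates = (1 to 7).map(d => java.sql.Date.valueOf(s"2016-01-0$d"))

// Passing the values explicitly forces one column per date,
// with null wherever a (user_id, features) pair has no measurement.
data.groupBy("user_id", "features")
    .pivot("date", dates)
    .agg(min("date"))

Dates that are absent for a given (user_id, features) pair then come out as null, matching the requirement above.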

1 Answer:

Answer 0 (score: 1)

You forgot the aggregation part. Called with no arguments, min() aggregates only numeric columns, and there are none left here, so the pivot produces no date columns at all. Your second line of code should be:

import org.apache.spark.sql.functions.min

data = data.select(data.col("user_id"), data.col("features"), data.col("date"))
           .groupBy("user_id", "features")
           .pivot("date")
           .agg(min("date"))
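For completeness, a self-contained sketch of the corrected pipeline, assuming Spark 2.x; the toy rows and the string-valued features column are made-up stand-ins for the real vectors:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min
import org.apache.spark.sql.types.DateType

val spark = SparkSession.builder().appName("pivot-on-date").getOrCreate()
import spark.implicits._

// Toy stand-in for the real data.
val raw = Seq(
  ("u1", "2016-01-02 02:00:00", "f1"),
  ("u1", "2016-01-05 02:00:00", "f1"),
  ("u2", "2016-01-03 02:00:00", "f2")
).toDF("user_id", "measurement_date", "features")

val data = raw
  .filter(raw.col("measurement_date").between("2016-01-01", "2016-01-07"))
  .select(raw.col("user_id"), raw.col("features"),
          raw.col("measurement_date").cast(DateType).alias("date"))

// One column per distinct date; each cell holds min("date"), i.e. the date
// itself, and null where that pair has no measurement on that date.
data.groupBy("user_id", "features").pivot("date").agg(min("date")).show()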