Crosstab by date and hour

Time: 2018-09-20 23:52:46

Tags: scala apache-spark

Sample DataFrame:

val someDF = Seq(
(1, "2017-12-02 03:04:00"),
(1, "2017-12-02 03:45:00"),
(1, "2017-12-02 04:04:00"),
(2, "2017-12-02 04:14:00"),
(2, "2017-12-02 04:54:00"),
(3, "2017-10-01 11:45:20"),
(4, "2017-10-01 02:45:20")
).toDF("number", "date")

Output:

+------+-------------------+
|number|               date|
+------+-------------------+
|     1|2017-12-02 03:04:00|
|     1|2017-12-02 03:45:00|
|     1|2017-12-02 04:04:00|
|     2|2017-12-02 04:14:00|
|     2|2017-12-02 04:54:00|
|     3|2017-10-01 11:45:20|
|     4|2017-10-01 02:45:20|
+------+-------------------+

When I try using crosstab:

val temp = someDF.stat.crosstab("date","number")
temp.show()

Output:

+-------------------+---+---+---+---+
|        date_number|  1|  2|  3|  4|
+-------------------+---+---+---+---+
|2017-10-01 11:45:20|  0|  0|  1|  0|
|2017-12-02 03:04:00|  1|  0|  0|  0|
|2017-12-02 04:54:00|  0|  1|  0|  0|
|2017-12-02 04:14:00|  0|  1|  0|  0|
|2017-12-02 03:45:00|  1|  0|  0|  0|
|2017-12-02 04:04:00|  1|  0|  0|  0|
|2017-10-01 02:45:20|  0|  0|  0|  1|
+-------------------+---+---+---+---+

I want to apply the same crosstab, but grouped by date and hour only, for example: 2017-12-02 03

Expected output:

+-------------------+---+---+---+---+
|   date_Hour_number|  1|  2|  3|  4|
+-------------------+---+---+---+---+
|      2017-10-01 11|  0|  0|  1|  0|
|      2017-12-02 03|  1|  0|  0|  0|
|      2017-12-02 04|  0|  2|  0|  0|
+-------------------+---+---+---+---+

Any suggestions would be helpful.

1 answer:

Answer 0: (score: 1)

Since your date column is of String type, you can simply use substring to trim it down to a datehour value before applying crosstab:

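import org.apache.spark.sql.functions.substring

// Keep only the first 13 characters ("yyyy-MM-dd HH"); Spark's substring is 1-based.
someDF.
  withColumn("datehour", substring($"date", 1, 13)).
  stat.crosstab("datehour", "number").
  show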
// +---------------+---+---+---+---+
// |datehour_number|  1|  2|  3|  4|
// +---------------+---+---+---+---+
// |  2017-10-01 02|  0|  0|  0|  1|
// |  2017-10-01 11|  0|  0|  1|  0|
// |  2017-12-02 04|  1|  2|  0|  0|
// |  2017-12-02 03|  2|  0|  0|  0|
// +---------------+---+---+---+---+
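
If the date column were an actual timestamp rather than a string, or if you would rather not depend on character positions, one possible variation is to parse the value and format it back to the hour. The following is only a sketch, assuming Spark 2.2 or later (for to_timestamp) and a local SparkSession. The names byHour and the app name are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{date_format, to_timestamp}

// Assumption: a local session; in spark-shell an equivalent `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("crosstab-by-hour") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val someDF = Seq(
  (1, "2017-12-02 03:04:00"),
  (1, "2017-12-02 03:45:00"),
  (1, "2017-12-02 04:04:00"),
  (2, "2017-12-02 04:14:00"),
  (2, "2017-12-02 04:54:00"),
  (3, "2017-10-01 11:45:20"),
  (4, "2017-10-01 02:45:20")
).toDF("number", "date")

// Parse the string into a timestamp, then format it back down to "date hour".
val byHour = someDF
  .withColumn(
    "datehour",
    date_format(to_timestamp($"date", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH"))
  .stat.crosstab("datehour", "number")

// crosstab names its first column "<col1>_<col2>", here "datehour_number".
byHour.orderBy("datehour_number").show()
```

For the sample data this should produce the same four datehour rows as the substring version. The substring approach remains simpler when the column is known to be a consistently formatted string.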