How to compute the maximum date per column of a DataFrame with PySpark

Asked: 2017-04-13 06:59:04

Tags: apache-spark pyspark

I have a DataFrame A like this:

+---+---+---+---+----------+
|key| c1| c2| c3|      date|
+---+---+---+---+----------+
| k1| -1|  0| -1|2015-04-28|
| k1|  1| -1|  1|2015-07-28|
| k1|  1|  1|  1|2015-10-28|
| k2| -1|  0|  1|2015-04-28|
| k2| -1|  1| -1|2015-07-28|
| k2|  1| -1|  0|2015-10-28|
+---+---+---+---+----------+

Code to create A:
data = [('k1', '-1', '0', '-1', '2015-04-28'),
        ('k1', '1', '-1', '1', '2015-07-28'),
        ('k1', '1', '1', '1', '2015-10-28'),
        ('k2', '-1', '0', '1', '2015-04-28'),
        ('k2', '-1', '1', '-1', '2015-07-28'),
        ('k2', '1', '-1', '0', '2015-10-28')]
A = spark.createDataFrame(data, ['key', 'c1', 'c2', 'c3', 'date'])
A = A.withColumn('date', A.date.cast('date'))

For each of the value columns (c1 through c3 in this example), I want the latest date on which the column equals 1 and the latest date on which it equals -1. The expected result B:

+---+----------+----------+----------+----------+----------+----------+
|key|      c1_1|      c2_1|      c3_1|     c1_-1|     c2_-1|     c3_-1|
+---+----------+----------+----------+----------+----------+----------+
| k1|2015-10-28|2015-10-28|2015-10-28|2015-04-28|2015-07-28|2015-04-28|
| k2|2015-10-28|2015-07-28|2015-04-28|2015-07-28|2015-10-28|2015-07-28|
+---+----------+----------+----------+----------+----------+----------+

My previous solution was to pivot each of these columns separately and then join the resulting DataFrames, roughly as sketched below. However, my real data has many more columns, and the joins caused performance problems, so I am looking for an alternative that avoids joining DataFrames.
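For reference, here is a minimal sketch of that pivot-then-join approach, assuming the example DataFrame A above; the names value_cols, per_column, and B are illustrative only:

from functools import reduce
from pyspark.sql.functions import col, max as max_

value_cols = ['c1', 'c2', 'c3']

# One small DataFrame per column: the latest date per key for values 1 and -1
per_column = []
for c in value_cols:
    pivoted = (A.where(col(c) != 0)
                .groupBy('key')
                .pivot(c, ['1', '-1'])
                .agg(max_('date'))
                .withColumnRenamed('1', '{}_1'.format(c))
                .withColumnRenamed('-1', '{}_-1'.format(c)))
    per_column.append(pivoted)

# Join the per-column DataFrames back together on key; each join adds cost,
# which is what becomes expensive with many columns
B = reduce(lambda left, right: left.join(right, 'key'), per_column)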

1 Answer:

Answer 0 (score: 2):

First, melt the DataFrame:

value_vars = ["c1", "c2", "c3"]
a_long = melt(A, id_vars=["key", "date"], value_vars=value_vars)
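Note that melt is not a built-in DataFrame method in the Spark versions this question targets, so the answer assumes a helper. A minimal sketch of such a helper, using the common array-of-structs-plus-explode pattern (the column names variable and value are chosen to match the code below):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # Build an array with one struct per melted column: (column name, column value)
    var_value = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Explode the array so every (id columns, melted column) pair becomes a row
    exploded = df.withColumn("_var_value", explode(var_value))
    return exploded.select(
        *id_vars,
        col("_var_value." + var_name).alias(var_name),
        col("_var_value." + value_name).alias(value_name))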

Drop the zeros:

from pyspark.sql.functions import col

without_zeros = a_long.where(col("value") != 0)

Combine the variable name and value into a single column:

from pyspark.sql.functions import concat_ws

combined = without_zeros.withColumn(
    "cs", concat_ws("_", col("variable"), col("value")))

Finally, pivot:

from pyspark.sql.functions import max

(combined
    .groupBy("key")
    .pivot("cs", ["{}_{}".format(c, i) for c in value_vars for i in [-1, 1]])
    .agg(max("date")))
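Passing the explicit list of pivot values, as done above, spares Spark an extra pass over the data to discover the distinct cs values. As a brief usage note, assuming the pivoted expression above is assigned to a DataFrame B, the columns can then be reordered to the layout requested in the question:

# Reorder to the c1_1 .. c3_1, c1_-1 .. c3_-1 layout shown in the question
ordered_cols = (['key']
                + ['{}_1'.format(c) for c in value_vars]
                + ['{}_-1'.format(c) for c in value_vars])
B.select(ordered_cols).show()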

The result is:

+---+----------+----------+----------+----------+----------+----------+
|key|     c1_-1|      c1_1|     c2_-1|      c2_1|     c3_-1|      c3_1|
+---+----------+----------+----------+----------+----------+----------+
| k2|2015-07-28|2015-10-28|2015-10-28|2015-07-28|2015-07-28|2015-04-28|
| k1|2015-04-28|2015-10-28|2015-07-28|2015-10-28|2015-04-28|2015-10-28|
+---+----------+----------+----------+----------+----------+----------+
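As a side note, on Spark 3.4+ the melt helper is no longer needed, because DataFrame.unpivot (also exposed as melt) is built in. A hedged sketch of the same pipeline on a newer version:

from pyspark.sql.functions import col, concat_ws, max as max_

# Spark 3.4+ only: unpivot/melt is part of the DataFrame API
a_long = A.unpivot(
    ids=["key", "date"],
    values=["c1", "c2", "c3"],
    variableColumnName="variable",
    valueColumnName="value")

result = (a_long
          .where(col("value") != 0)
          .withColumn("cs", concat_ws("_", col("variable"), col("value")))
          .groupBy("key")
          .pivot("cs", ["{}_{}".format(c, i)
                        for c in ["c1", "c2", "c3"] for i in [-1, 1]])
          .agg(max_("date")))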