I have a DataFrame A like this:
+---+---+---+---+----------+
|key| c1| c2| c3| date|
+---+---+---+---+----------+
| k1| -1| 0| -1|2015-04-28|
| k1| 1| -1| 1|2015-07-28|
| k1| 1| 1| 1|2015-10-28|
| k2| -1| 0| 1|2015-04-28|
| k2| -1| 1| -1|2015-07-28|
| k2| 1| -1| 0|2015-10-28|
+---+---+---+---+----------+
The code that creates A:

data = [('k1', '-1', '0', '-1', '2015-04-28'),
        ('k1', '1', '-1', '1', '2015-07-28'),
        ('k1', '1', '1', '1', '2015-10-28'),
        ('k2', '-1', '0', '1', '2015-04-28'),
        ('k2', '-1', '1', '-1', '2015-07-28'),
        ('k2', '1', '-1', '0', '2015-10-28')]
A = spark.createDataFrame(data, ['key', 'c1', 'c2', 'c3', 'date'])
A = A.withColumn('date', A.date.cast('date'))
For each of the columns c1 through c3 (my real data has many more such columns), I want the maximum date on which the value equals 1 and the maximum date on which it equals -1. Expected result B:
+---+----------+----------+----------+----------+----------+----------+
|key| c1_1| c2_1| c3_1| c1_-1| c2_-1| c3_-1|
+---+----------+----------+----------+----------+----------+----------+
| k1|2015-10-28|2015-10-28|2015-10-28|2015-04-28|2015-07-28|2015-04-28|
| k2|2015-10-28|2015-07-28|2015-04-28|2015-07-28|2015-10-28|2015-07-28|
+---+----------+----------+----------+----------+----------+----------+
My previous solution was to use a pivot operation to compute each of the columns c1 to c3 separately and then join the newly created DataFrames (a rough sketch of that approach is shown below). However, in my case there are far too many columns and I ran into performance problems, so I am hoping for a solution that avoids joining DataFrames.
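For context, a minimal sketch of what such a pivot-per-column, join-based approach could look like (this exact code is not in the question; the variable names and the outer-join strategy are my own assumptions):

from functools import reduce
from pyspark.sql.functions import max as max_

value_vars = ['c1', 'c2', 'c3']

# One pivot per column: for each cX, the latest date per (key, value) pair.
parts = []
for c in value_vars:
    part = (A.where(A[c] != '0')
             .groupBy('key')
             .pivot(c, ['1', '-1'])
             .agg(max_('date'))
             .withColumnRenamed('1', c + '_1')
             .withColumnRenamed('-1', c + '_-1'))
    parts.append(part)

# Join the per-column results back together on `key` -- this chain of joins
# is what becomes expensive once the number of columns grows.
B = reduce(lambda left, right: left.join(right, 'key', 'outer'), parts)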
Answer (score: 2):
First, melt the DataFrame from wide to long format:

value_vars = ["c1", "c2", "c3"]
a_long = melt(A, id_vars=["key", "date"], value_vars=value_vars)
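Note that melt is not a PySpark built-in here (newer Spark versions, 3.4+, ship an equivalent DataFrame.unpivot / DataFrame.melt). A minimal sketch of one common explode-based implementation, assuming the frame has no existing column named "_pair":

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One struct per value column, holding the column name and its value.
    pairs = array(*[
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars])
    # Explode so that each original row yields one row per value column.
    exploded = df.withColumn("_pair", explode(pairs))
    return exploded.select(
        *id_vars,
        col("_pair." + var_name).alias(var_name),
        col("_pair." + value_name).alias(value_name))

After this step a_long has the columns key, date, variable and value, with one row per original (row, cX) combination.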
Drop the zeros:
from pyspark.sql.functions import col

without_zeros = a_long.where(col("value") != 0)
Combine the variable name and the value into a single column:
from pyspark.sql.functions import concat_ws
combined = without_zeros.withColumn(
    "cs", concat_ws("_", col("variable"), col("value")))
Finally, pivot:
from pyspark.sql.functions import max
(combined
    .groupBy("key")
    .pivot("cs", ["{}_{}".format(c, i) for c in value_vars for i in [-1, 1]])
    .agg(max("date")))
The result is:
+---+----------+----------+----------+----------+----------+----------+
|key| c1_-1| c1_1| c2_-1| c2_1| c3_-1| c3_1|
+---+----------+----------+----------+----------+----------+----------+
| k2|2015-07-28|2015-10-28|2015-10-28|2015-07-28|2015-07-28|2015-04-28|
| k1|2015-04-28|2015-10-28|2015-07-28|2015-10-28|2015-04-28|2015-10-28|
+---+----------+----------+----------+----------+----------+----------+
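A note on the pivot call above: passing the list of pivot values explicitly spares Spark from first computing the distinct values of cs, which would be an extra pass over the data. The list comprehension simply expands to the six column names seen in the result (value_vars as defined above):

value_vars = ["c1", "c2", "c3"]
pivot_values = ["{}_{}".format(c, i) for c in value_vars for i in [-1, 1]]
print(pivot_values)
# ['c1_-1', 'c1_1', 'c2_-1', 'c2_1', 'c3_-1', 'c3_1']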