Question

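I am currently trying to alias the columns obtained after pivoting on a value in a Pyspark dataframe. The problem is that the column names I pass to the alias call are not set correctly.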

A concrete example:

Starting with this dataframe:

import pyspark.sql.functions as func

df = sc.parallelize([
    (217498, 100000001, 'A'), (217498, 100000025, 'A'), (217498, 100000124, 'A'),
    (217498, 100000152, 'B'), (217498, 100000165, 'C'), (217498, 100000177, 'C'),
    (217498, 100000182, 'A'), (217498, 100000197, 'B'), (217498, 100000210, 'B'),
    (854123, 100000005, 'A'), (854123, 100000007, 'A')
]).toDF(["user_id", "timestamp", "actions"])

which gives

+-------+--------------------+------------+
|user_id|     timestamp      |  actions   |
+-------+--------------------+------------+
| 217498|           100000001|    'A'     |
| 217498|           100000025|    'A'     |
| 217498|           100000124|    'A'     |
| 217498|           100000152|    'B'     |
| 217498|           100000165|    'C'     |
| 217498|           100000177|    'C'     |
| 217498|           100000182|    'A'     |
| 217498|           100000197|    'B'     |
| 217498|           100000210|    'B'     |
| 854123|           100000005|    'A'     |
| 854123|           100000007|    'A'     |
+-------+--------------------+------------+

The problem is that calling

df = df.groupby('user_id')\
       .pivot('actions')\
       .agg(func.count('timestamp').alias('ts_count'),
            func.mean('timestamp').alias('ts_mean'))

gives the column names

df.columns

['user_id',
 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
 'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
 'B_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
 'B_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
 'C_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
 'C_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5']

which is completely impractical.

I can clean up my column names using the methods shown here (regex) or here (use of withColumnRenamed()). However, these are workarounds that can easily break after an update.
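For reference, the regex workaround amounts to something like the sketch below. The pattern is only my guess at the shape of the generated names, and clean_name is a hypothetical helper, so this breaks as soon as Spark changes how it formats those names.

```python
import re

# Assumed shape of a generated name:
#   "<pivot value>_(<aggregate expression>) AS <alias>#<id>"
GENERATED = re.compile(r'^(.+?)_\(.*\) AS (\w+)#\d+L?$')

def clean_name(name):
    """Return '<pivot value>_<alias>' for generated names, else the name unchanged."""
    match = GENERATED.match(name)
    if match is None:
        return name  # e.g. the grouping column 'user_id'
    return '{}_{}'.format(match.group(1), match.group(2))

columns = [
    'user_id',
    'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
    'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
]
cleaned = [clean_name(c) for c in columns]
# With a live DataFrame, the names could then be applied with df.toDF(*cleaned).
```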

To sum up: how can I use the columns generated by pivot without having to parse them (e.g. generated names such as 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L')?

Any help would be appreciated! Thanks

Answer 1

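This is happening because the column you are pivoting on does not have distinct values. When the pivot happens, this results in duplicate column names, so Spark assigns these generated names to make them distinct. You need to group by the pivot column before pivoting so that the values in the pivot column (actions) are distinct.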
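Separately from the cause, one way to avoid parsing the generated names at all is to rename the columns by position. The sketch below is only an illustration: pivot_column_names is a hypothetical helper, and it assumes the pivot output is ordered as in the df.columns listing in the question (grouping columns first, then every aggregate for each pivot value in turn).

```python
from itertools import product

def pivot_column_names(group_cols, pivot_values, agg_aliases):
    """Build clean names in the assumed pivot output order:
    grouping columns, then one column per (pivot value, aggregate) pair."""
    names = list(group_cols)
    for value, alias in product(pivot_values, agg_aliases):
        names.append('{}_{}'.format(value, alias))
    return names

new_names = pivot_column_names(
    ['user_id'],              # columns passed to groupby
    ['A', 'B', 'C'],          # values of the pivot column
    ['ts_count', 'ts_mean'],  # aliases, in the order given to agg
)
# With a live DataFrame: df = df.toDF(*new_names)
```

Passing the pivot values explicitly, as in pivot('actions', ['A', 'B', 'C']), should keep the column order predictable, so the positional rename does not depend on how Spark orders the distinct values.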

Let me know if you need more help!

@hyperc54

Pyspark 1.6 - Aliasing columns after pivoting with multiple aggregations
