Question

I have data in the following format:

 +------+-------------+-----------------+--------------------+
 |Serial|       respID|     VariableName|            Response|
 +------+-------------+-----------------+--------------------+
 |    11|1_10001070394|Respondent.Serial|                  11|
 |    11|1_10001070394|Respondent.Origin|Interviewer Serve...|
 |    11|1_10001070394|              AGE|                  48|
 |    11|1_10001070394|              SEX|                Male|
 |    11|1_10001070394|             Eth1|                  No|
 +------+-------------+-----------------+--------------------+

I need to convert it to the following format:

+------+-------------+-----------------+--------------------+---------+---------+-------+
|Serial|       respID|Respondent.Serial|   Respondent.Origin|      AGE|      SEX|   Eth1|
+------+-------------+-----------------+--------------------+---------+---------+-------+
|    11|1_10001070394|               11|Interviewer Serve...|       48|     Male|     No|
+------+-------------+-----------------+--------------------+---------+---------+-------+

For smaller datasets I can do this in Python (pandas) with the following code:

df.groupby(['respID', 'Serial']).apply(
    lambda x: x.pivot(columns='VariableName', values='Response')
).reset_index().groupby(['respID', 'Serial']).first()

But when I try the same thing with PySpark 2.4 (on Databricks), it looks like the GroupedData object does not support extracting the first non-null value.

I tried the following:

df.groupBy(['respID','Serial']).pivot('VariableName',['Response'])

This creates a GroupedData object, but I cannot find a way to convert it to a PySpark DataFrame.

Answer 1

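from pyspark.sql.functions import expr

# File is the DataFrame holding the long-format data shown in the question.
# An aggregation after pivot() turns the GroupedData back into a DataFrame.
x = File.groupBy("respID", "Serial").pivot("VariableName").agg(expr("coalesce(first(Response), '')"))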

