Create new columns from a group by using PySpark

Asked: 2019-08-29 15:41:23

Tags: python python-3.x apache-spark pyspark

I have a scenario where I need to take the result of a group by and create new columns from it.

For example, say I have this data:

| Tool         | Category   | Price |
|--------------|------------|-------|
| Hammer       | Hand Tool  | 25.00 |
| Drill        | Power Tool | 56.33 |
| Screw Driver | Hand Tool  | 4.99  |

My output should look like this:

| Tool         | Hand Tool | Power Tool |
|--------------|-----------|------------|
| Hammer       | 25.00     | NULL       |
| Drill        | NULL      | 56.33      |
| Screw Driver | 4.99      | NULL       |

I'm not sure how to get this output. I'm trying the snippet below, but it blows up with `column is not iterable`.

def get_tool_info():
    return tool_table.groupBy('Category').pivot('Price', 'Category')

What is the best way to dynamically generate these new columns and assign the price values?

1 Answer:

Answer 0 (score: 4):

Try this:

from pyspark.sql.types import StructType, StructField, StringType, FloatType
import pyspark.sql.functions as F

schema = StructType([
    StructField("Tool", StringType()),
    StructField("Category", StringType()),
    StructField("Price", FloatType()),
])
data = [["Hammer", "Hand Tool", 25.00], ["Drill", "Power Tool", 56.33], ["Screw Driver", "Hand Tool", 4.99]]
# assumes an active SparkSession bound to `spark`, as in the pyspark shell
df = spark.createDataFrame(data, schema)

df.groupby("Tool").pivot("Category").agg(F.first("Price")).show()

Output:

+------------+---------+----------+
|        Tool|Hand Tool|Power Tool|
+------------+---------+----------+
|       Drill|     null|     56.33|
|Screw Driver|     4.99|      null|
|      Hammer|     25.0|      null|
+------------+---------+----------+