I have a scenario where I need to take the result of a group-by and create new columns from it.
For example, say I have this data:
| Tool | Category | Price |
|------|----------|-------|
| Hammer | Hand Tool | 25.00 |
| Drill | Power Tool | 56.33 |
| Screw Driver | Hand Tool | 4.99 |
My output should look like this:
| Tool | Hand Tool | Power Tool |
|------|-----------|------------|
| Hammer | 25.00 | NULL |
| Drill | NULL | 56.33 |
| Screw Driver | 4.99 | NULL |
I'm not sure how to get this output. I'm trying the snippet below, but it blows up with `column is not iterable`.
```python
def get_tool_info():
    return tool_table.groupBy('Category').pivot('Price', 'Category')
```
What is the best way to dynamically generate these new columns and assign the price values?
Answer 0 (score: 4)
Try this:
```python
from pyspark.sql.types import StructType, StructField, StringType, FloatType
import pyspark.sql.functions as F

schema = StructType([
    StructField("Tool", StringType()),
    StructField("Category", StringType()),
    StructField("Price", FloatType()),
])
data = [
    ["Hammer", "Hand Tool", 25.00],
    ["Drill", "Power Tool", 56.33],
    ["Screw Driver", "Hand Tool", 4.99],
]
df = spark.createDataFrame(data, schema)

# Group by Tool, turn each distinct Category into a column,
# and fill it with the first Price seen for that (Tool, Category) pair.
df.groupby("Tool").pivot("Category").agg(F.first("Price")).show()
```
Output:

```
+------------+---------+----------+
|        Tool|Hand Tool|Power Tool|
+------------+---------+----------+
|       Drill|     null|     56.33|
|Screw Driver|     4.99|      null|
|      Hammer|     25.0|      null|
+------------+---------+----------+
```
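For intuition, the reshaping that `pivot` performs can be sketched in plain Python with no Spark session at all. This is only an illustration of the concept, not how Spark implements it; the `rows` list simply mirrors the sample data from the question:

```python
from collections import defaultdict

rows = [
    ("Hammer", "Hand Tool", 25.00),
    ("Drill", "Power Tool", 56.33),
    ("Screw Driver", "Hand Tool", 4.99),
]

# The distinct Category values become the new columns.
categories = sorted({cat for _, cat, _ in rows})

# One output row per Tool; None stands in for the NULLs where a
# (tool, category) combination never occurred in the input.
pivoted = defaultdict(lambda: dict.fromkeys(categories))
for tool, cat, price in rows:
    pivoted[tool][cat] = price

for tool, prices in pivoted.items():
    print(tool, prices)
```

Note that in Spark, passing the list of expected categories explicitly, e.g. `pivot("Category", ["Hand Tool", "Power Tool"])`, skips the extra pass over the data that would otherwise be needed to discover the distinct values.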