I have data similar to the sample data below. I am trying to use PySpark to create a new column that holds the category of each customer's first event, based on the timestamp, as in the sample output below.
Below is an example of how this can be done with a window function in SQL.
I am new to PySpark. I understand that you can run SQL in PySpark, and I would like to know whether my code below is the correct way to run a SQL window function in PySpark, i.e. whether I can paste the SQL code inside spark.sql as shown below.
Input:
eventid  customerid  category  timestamp
1        3           a         1/1/12
2        3           b         2/3/14
4        2           c         4/1/12
Output:
eventid  customerid  category  timestamp  first_event
1        3           a         1/1/12     a
2        3           b         2/3/14     a
4        2           c         4/1/12     c
Window function example:
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table
PySpark:
# Implementing the window function example above with PySpark.
# Note: assume df is a DataFrame with the structure of the table above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Operations").getOrCreate()

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("table")

sql_results = spark.sql("""
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table
""")

# display results
sql_results.show()
Answer 0 (score: 1)
Yes, that works once the DataFrame is registered as a temporary view; you can also use window functions directly in PySpark:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
| 1| 3| a| 1/1/12|
| 2| 3| b| 2/3/14|
| 4| 2| c| 4/1/12|
+-------+----------+--------+---------+
>>> # order by timestamp so first() returns each customer's earliest category
>>> window = Window.partitionBy('customerid').orderBy('timestamp')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>>
>>> df.show()
+-------+----------+--------+---------+-----------+
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
| 1| 3| a| 1/1/12| a|
| 2| 3| b| 2/3/14| a|
| 4| 2| c| 4/1/12| c|
+-------+----------+--------+---------+-----------+
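
One caveat: in the sample data, timestamp is a string, so ordering it lexicographically can misplace values once two-digit months appear (e.g. '10/1/12' sorts before '2/1/12'). A minimal sketch of parsing it into a proper date before applying the window; the 'M/d/yy' pattern is an assumption based on the sample values:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Parse the string timestamp into a date column for reliable ordering.
# 'M/d/yy' is assumed from the sample values (e.g. 1/1/12, 2/3/14).
df = df.withColumn('event_date', F.to_date('timestamp', 'M/d/yy'))

# Same window as above, but ordered by the parsed date.
window = Window.partitionBy('customerid').orderBy('event_date')
df = df.withColumn('first_event', F.first('category').over(window))

The same applies to the SQL version: order by to_date(timestamp, 'M/d/yy') instead of the raw string.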