I have data similar to the sample data below. I am trying to use PySpark to create a new column that holds the category of each customer's first event, based on the timestamp, as in the sample output below.
Below is an example of how this can be done with a window function in SQL.
I am new to PySpark. I understand that you can run SQL in PySpark, and I would like to know whether my code below is the correct way to run a SQL window function in PySpark, i.e. whether I can paste the SQL code inside spark.sql as shown below.
Input:
eventid  customerid  category  timestamp
1        3           a         1/1/12
2        3           b         2/3/14
4        2           c         4/1/12
Output:
eventid  customerid  category  timestamp  first_event
1        3           a         1/1/12     a
2        3           b         2/3/14     a
4        2           c         4/1/12     c
Window function example:
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table
PySpark:
# Implementing the window function example above with PySpark.
# Note: assume df is a DataFrame with the structure of the table above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Operations").getOrCreate()

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("table")

sql_results = spark.sql("""
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table
""")

# display results
sql_results.show()
Answer 0 (score: 1)
Yes, that works once the DataFrame is registered as a temporary view; you can also use window functions directly in PySpark:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
| 1| 3| a| 1/1/12|
| 2| 3| b| 2/3/14|
| 4| 2| c| 4/1/12|
+-------+----------+--------+---------+
>>> # order by timestamp so first() returns each customer's earliest category
>>> window = Window.partitionBy('customerid').orderBy('timestamp')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>>
>>> df.show()
+-------+----------+--------+---------+-----------+
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
| 1| 3| a| 1/1/12| a|
| 2| 3| b| 2/3/14| a|
| 4| 2| c| 4/1/12| c|
+-------+----------+--------+---------+-----------+
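
One caveat: in the sample data, timestamp is a string, so ordering it lexicographically can misplace values once two-digit months appear (e.g. '10/1/12' sorts before '2/1/12'). A minimal sketch of parsing it into a proper date before applying the window; the 'M/d/yy' pattern is an assumption based on the sample values:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Parse the string timestamp into a date column for reliable ordering.
# 'M/d/yy' is assumed from the sample values (e.g. 1/1/12, 2/3/14).
df = df.withColumn('event_date', F.to_date('timestamp', 'M/d/yy'))

# Same window as above, but ordered by the parsed date.
window = Window.partitionBy('customerid').orderBy('event_date')
df = df.withColumn('first_event', F.first('category').over(window))

The same applies to the SQL version: order by to_date(timestamp, 'M/d/yy') instead of the raw string.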