PySpark: How to create a flag the first time a variable is seen

Time: 2018-06-11 21:04:18

Tags: python apache-spark pyspark

I have a set of variables, namely a timestamp and a session. How can I create a new session indicator that is 1 the first time a session is seen, and 0 for every subsequent instance of that session? For example...

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
        ("a", "44", "2018-01-08 09:01:01.085193"),
        ("a", "44", "2018-01-08 09:01:01.086280"),
        ("a", "44", "2018-01-08 09:01:01.087681"),
        ("a", "95", "2018-01-15 12:01:01.544710"),
        ("a", "95", "2018-01-15 13:01:01.545991"),
], ["person_id", "session_id", "timestamp"])

df = df.withColumn('unix_ts',F.unix_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss'))
df = df.withColumn("DayOfWeek",F.date_format(df.timestamp, 'EEEE'))
df.show()

which yields:

+---------+----------+--------------------+----------+---------+
|person_id|session_id|           timestamp|   unix_ts|DayOfWeek|
+---------+----------+--------------------+----------+---------+
|        a|        44|2018-01-08 09:01:...|1515423661|   Monday|
|        a|        44|2018-01-08 09:01:...|1515423661|   Monday|
|        a|        44|2018-01-08 09:01:...|1515423661|   Monday|
|        a|        95|2018-01-15 12:01:...|1516039261|   Monday|
|        a|        95|2018-01-15 13:01:...|1516042861|   Monday|
+---------+----------+--------------------+----------+---------+

I want to add a column that produces this output:

+---------+----------+--------------------+----------+---------+----------+
|person_id|session_id|           timestamp|   unix_ts|DayOfWeek| FirstInd |   
+---------+----------+--------------------+----------+---------+----------+
|        a|        44|2018-01-08 09:01:...|1515423661|   Monday|     1    |
|        a|        44|2018-01-08 09:01:...|1515423661|   Monday|     0    |
|        a|        44|2018-01-08 09:01:...|1515423661|   Monday|     0    |
|        a|        95|2018-01-15 12:01:...|1516039261|   Monday|     1    |
|        a|        95|2018-01-15 13:01:...|1516042861|   Monday|     0    |
+---------+----------+--------------------+----------+---------+----------+

2 Answers:

Answer 0 (score: 0)

The following worked for me. While it is not technically a flag, you do know which row is row 1.

from pyspark.sql import Window

df = df.withColumn("rowNum", F.row_number().over(Window.partitionBy('person_id', 'session_id').orderBy("unix_ts")))
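If you need an explicit 0/1 flag on top of that rowNum column, one option (my own sketch, not part of the original answer) is to compare it to 1 and cast the boolean to an integer:

from pyspark.sql import functions as F

# assumes the rowNum column created above; 1 for the first row in each group, 0 otherwise
df = df.withColumn("FirstInd", (F.col("rowNum") == 1).cast("int"))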

Answer 1 (score: 0)

You can try this.

from pyspark.sql import Window

# number rows within each (person_id, session_id) group by unix_ts; the earliest row gets FirstInd = 1
window = Window.partitionBy('person_id', 'session_id').orderBy("unix_ts")
df = df.withColumn("FirstInd", F.when(F.row_number().over(window) == 1, 1).otherwise(0))