PySpark: compute session duration grouped by user ID

Time: 2017-07-27 17:53:01

Tags: session hadoop apache-spark pyspark

I am trying to compute the session duration for each user ID with PySpark. Each row of my data has the columns userid, platform, previousTime, currentTime and timeDifference.

  1. I want to group by user ID and platform.
  2. Then, within each group, I want to set currentTime equal to previousTime when timeDifference > 2000 or timeDifference is null. I tried the following:

    from pyspark.sql import SQLContext, functions

    # currentTime falls back to previousTime when the gap exceeds 2000
    df_session.select(df_session.userid, df_session.platform,
        functions.when(df_session.timeDifference > 2000, df_session.previousTime).otherwise(df_session.currentTime))

    # previousTime falls back to currentTime when the gap is null
    df_session.select(df_session.userid, df_session.platform,
        functions.when(df_session.timeDifference.isNull(), df_session.currentTime).otherwise(df_session.previousTime))

  3. Then I want to add up all the timeDifference values that are less than 2000 and add the total to currentTime. The result would look like this:

    |userid|platform|previousTime           |currentTime            |timeDifference|
    |1234  |13      |2017-07-20 10:49:30.027|2017-07-20 10:49:30.027|0             |
    |1234  |13      |2017-07-20 10:04:23.1  |2017-07-20 10:04:23.1  |0             |
    |1234  |13      |2017-07-20 10:04:23.1  |2017-07-20 10:06:23.897|120           |
    |1234  |13      |2017-07-20 10:04:23.897|2017-07-20 10:04:23.897|0             |
    |1234  |13      |2017-07-20 10:40:29.472|2017-07-20 10:51:17.427|658           |

  4. The last part is very tricky and I don't know where to start. Thanks.
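
The rules in steps 2 and 3 can be tried out in plain Python on a few rows before moving them to Spark. This is only an illustrative sketch under some assumptions: `clean_row` and `session_totals` are hypothetical helper names, rows are `(userid, platform, previousTime, currentTime, timeDifference)` tuples with `None` for a missing difference, and the total is taken over the whole (userid, platform) group, which is one possible reading of step 3.

```python
from collections import defaultdict

THRESHOLD = 2000  # threshold taken from the question


def clean_row(row):
    """Step 2: collapse rows whose gap is missing or above the threshold."""
    userid, platform, prev, cur, diff = row
    if diff is None:
        # no usable previousTime: use currentTime for both columns
        return (userid, platform, cur, cur, 0)
    if diff > THRESHOLD:
        # gap too long: use previousTime for both columns
        return (userid, platform, prev, prev, 0)
    return row


def session_totals(rows):
    """Step 3 (one reading): sum the remaining gaps per (userid, platform)."""
    totals = defaultdict(int)
    for userid, platform, _prev, _cur, diff in map(clean_row, rows):
        if diff:
            totals[(userid, platform)] += diff
    return dict(totals)


rows = [
    ("1234", "13", None, "2017-07-20 10:49:30.027", None),
    ("1234", "13", "2017-07-20 10:04:23.100", "2017-07-20 10:06:23.897", 120),
    ("1234", "13", "2017-07-20 10:04:23.897", "2017-07-20 10:40:29.472", 2166),
    ("1234", "13", "2017-07-20 10:40:29.472", "2017-07-20 10:40:50.347", 11),
]
print(session_totals(rows))
# {('1234', '13'): 131}
```

The 2166-second row contributes nothing to the total because it is collapsed to a zero-length row first, which mirrors the rule in step 2.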

1 answer:

Answer 0 (score: 1)

Hope this helps!

import pyspark.sql.functions as func
from datetime import datetime, timedelta
from pyspark.sql.types import StringType

# assumes an existing SparkContext named sc (for example, the one provided by the pyspark shell)
df = sc.parallelize([('1234','13','','2017-07-20 10:49:30.027',''),
                    ('1234','13','','2017-07-20 10:04:23.100',''),
                    ('1234','13','2017-07-20 10:04:23.100','2017-07-20 10:06:23.897',120),
                    ('1234','13','2017-07-20 10:04:23.897','2017-07-20 10:40:29.472',2166),
                    ('1234','13','2017-07-20 10:40:29.472','2017-07-20 10:40:50.347',11),
                    ('1234','13','2017-07-20 10:40:30.347','2017-07-20 10:51:16.458',646),
                    ('1234','13','2017-07-20 10:51:16.458','2017-07-20 10:51:17.427',1),
                    ('7777','44','2017-07-20 10:31:16.458','2017-07-20 10:47:16.458',1000),
                    ('7777','44','2017-07-20 11:11:16.458','2017-07-20 11:36:16.458',1500),
                    ('678','56','2017-07-20 10:51:16.458','2017-07-20 10:51:36.458',20),
                    ('678','56','2017-07-20 10:51:16.458','2017-07-20 10:51:26.458',10)
                    ]).\
    toDF(['userid','platform','previousTime','currentTime','timeDifference'])
df.show()

# missing value & outlier treatment
# alias() gives each derived column its final name
df1 = df.select("userid", "platform",
                func.when(df.timeDifference == '', df.currentTime).otherwise(df.previousTime).alias("previousTime"),
                func.when(df.timeDifference > 2000, df.previousTime).otherwise(df.currentTime).alias("currentTime"),
                func.when(df.timeDifference == '', 0).when(df.timeDifference > 2000, 0).otherwise(df.timeDifference).alias("timeDifference"))
df1.show()

# first part of result i.e. records where timeDifference = 0
df_final_part0 = df1.where("timeDifference = 0")

# identify records where sum(timeDifference) < 2000
df2 = df1.where("timeDifference <> 0")
df3 = df2.groupby("userid","platform").agg(func.sum("timeDifference")).\
    withColumnRenamed("sum(timeDifference)", "sum_timeDifference").where("sum_timeDifference < 2000")

# second part of result i.e. records where sum(timeDifference) is >= 2000
df_final_part1 = df2.join(df3, ["userid","platform"],"leftanti")

# third part of result
df_final_part2 = df2.join(df3,on=['userid','platform']).select('userid','platform',"previousTime","sum_timeDifference").\
    groupBy('userid','platform',"sum_timeDifference").agg(func.min("previousTime")).\
    withColumnRenamed("min(previousTime)", "previousTime").withColumnRenamed("sum_timeDifference", "timeDifference")

# UDF: shift a timestamp string forward by time_in_sec seconds
def processdate(x, time_in_sec):
    x = datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')
    x += timedelta(milliseconds=time_in_sec * 1e3)
    return x.strftime('%Y-%m-%d %H:%M:%S.%f')

f1 = func.udf(processdate, StringType())
df_final_part2 = df_final_part2.withColumn("currentTime",f1(df_final_part2.previousTime,df_final_part2.timeDifference)).\
    select('userid','platform',"previousTime","currentTime","timeDifference")

# combine all three parts to get the final result
result = df_final_part0.unionAll(df_final_part1).unionAll(df_final_part2)
result.show()
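
The timestamp arithmetic inside `processdate` can be checked on its own, without Spark. The sketch below repeats the function from the answer and applies it to one timestamp from the sample data; the 658-second offset is just an illustrative input (11 + 646 + 1 from the sample rows).

```python
from datetime import datetime, timedelta


def processdate(x, time_in_sec):
    # parse the timestamp string, shift it by the given number of seconds,
    # and format it back with microsecond precision
    x = datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')
    x += timedelta(milliseconds=time_in_sec * 1e3)
    return x.strftime('%Y-%m-%d %H:%M:%S.%f')


print(processdate('2017-07-20 10:40:29.472', 658))
# 2017-07-20 10:51:27.472000
```

Note that `%f` always formats six fractional digits, so the output will not match the three-digit strings in the input unless it is trimmed.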


Don't forget to let us know if it solved your problem :)