I am trying to compute the session duration for each user ID with PySpark. Each row of df_session has the columns userid, platform, previousTime, currentTime and timeDifference (in seconds).
First, within each group, I want to set currentTime equal to previousTime whenever timeDifference > 2000 or timeDifference is null. This is what I tried:
from pyspark.sql import SQLContext, functions
df_session.select(df_session.userid, df_session.platform, functions.when(df_session.timeDifference > 2000, df_session.previousTime).otherwise(df_session.currentTime))
df_session.select(df_session.userid, df_session.platform, functions.when(df_session.timeDifference.isNull(), df_session.currentTime).otherwise(df_session.previousTime))
Then I want to add up all the timeDifference values when the total is less than 2000, and add that total to get currentTime. The result should look like this:
|userid|platform|previousTime           |currentTime            |timeDifference|
|1234  |13      |2017-07-20 10:49:30.027|2017-07-20 10:49:30.027|0             |
|1234  |13      |2017-07-20 10:04:23.1  |2017-07-20 10:04:23.1  |0             |
|1234  |13      |2017-07-20 10:04:23.1  |2017-07-20 10:06:23.897|120           |
|1234  |13      |2017-07-20 10:04:23.897|2017-07-20 10:04:23.897|0             |
|1234  |13      |2017-07-20 10:40:29.472|2017-07-20 10:51:17.427|658           |
The last part is very tricky and I don't know where to start. Thanks.
Answer 0: (score: 1)
Hope this helps!
import pyspark.sql.functions as func
from datetime import datetime, timedelta
from pyspark.sql.types import StringType
# sample data; assumes an existing SparkContext named sc (for example, in the pyspark shell)
# an empty string marks a missing previousTime or timeDifference
df = sc.parallelize([('1234','13','','2017-07-20 10:49:30.027',''),
                     ('1234','13','','2017-07-20 10:04:23.100',''),
                     ('1234','13','2017-07-20 10:04:23.100','2017-07-20 10:06:23.897',120),
                     ('1234','13','2017-07-20 10:04:23.897','2017-07-20 10:40:29.472',2166),
                     ('1234','13','2017-07-20 10:40:29.472','2017-07-20 10:40:50.347',11),
                     ('1234','13','2017-07-20 10:40:30.347','2017-07-20 10:51:16.458',646),
                     ('1234','13','2017-07-20 10:51:16.458','2017-07-20 10:51:17.427',1),
                     ('7777','44','2017-07-20 10:31:16.458','2017-07-20 10:47:16.458',1000),
                     ('7777','44','2017-07-20 11:11:16.458','2017-07-20 11:36:16.458',1500),
                     ('678','56','2017-07-20 10:51:16.458','2017-07-20 10:51:36.458',20),
                     ('678','56','2017-07-20 10:51:16.458','2017-07-20 10:51:26.458',10)
                    ]).\
    toDF(['userid','platform','previousTime','currentTime','timeDifference'])
df.show()
# missing value & outlier treatment
df1 = df.select(
    "userid", "platform",
    # missing timeDifference: previousTime takes the value of currentTime
    func.when(df.timeDifference == '', df.currentTime).otherwise(df.previousTime).alias("previousTime"),
    # timeDifference over 2000: currentTime takes the value of previousTime
    func.when(df.timeDifference > 2000, df.previousTime).otherwise(df.currentTime).alias("currentTime"),
    # in both cases timeDifference is reset to 0
    func.when(df.timeDifference == '', 0).when(df.timeDifference > 2000, 0).otherwise(df.timeDifference).alias("timeDifference"))
df1.show()
# first part of result i.e. records where timeDifference = 0
df_final_part0 = df1.where("timeDifference = 0")
# identify groups where sum(timeDifference) < 2000
df2 = df1.where("timeDifference <> 0")
df3 = df2.groupby("userid","platform").agg(func.sum("timeDifference")).\
    withColumnRenamed("sum(timeDifference)", "sum_timeDifference").where("sum_timeDifference < 2000")
# second part of result i.e. records where sum(timeDifference) is >= 2000
df_final_part1 = df2.join(df3, ["userid","platform"], "leftanti")
# third part of result i.e. one row per group where sum(timeDifference) < 2000
df_final_part2 = df2.join(df3, on=['userid','platform']).select('userid','platform',"previousTime","sum_timeDifference").\
    groupBy('userid','platform',"sum_timeDifference").agg(func.min("previousTime")).\
    withColumnRenamed("min(previousTime)", "previousTime").withColumnRenamed("sum_timeDifference", "timeDifference")
def processdate(x, time_in_sec):
    # add time_in_sec seconds to a timestamp string and return the result as a string
    x = datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')
    x += timedelta(milliseconds=time_in_sec * 1e3)
    return x.strftime('%Y-%m-%d %H:%M:%S.%f')
f1 = func.udf(processdate, StringType())
df_final_part2 = df_final_part2.withColumn("currentTime", f1(df_final_part2.previousTime, df_final_part2.timeDifference)).\
    select('userid','platform',"previousTime","currentTime","timeDifference")
# combine all three parts to get the final result
result = df_final_part0.union(df_final_part1).union(df_final_part2)
result.show()
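For readers who want to trace the three parts above without a Spark session, here is a minimal plain-Python sketch of the same steps. It is not part of the original answer: the names LIMIT, clean, add_seconds and summarize are hypothetical, None stands in for the empty strings, and only a few of the sample rows are used.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"
LIMIT = 2000  # seconds; the threshold used in the question

# (userid, platform, previousTime, currentTime, timeDifference); None marks a missing value
rows = [
    ("1234", "13", None, "2017-07-20 10:49:30.027", None),
    ("1234", "13", None, "2017-07-20 10:04:23.100", None),
    ("1234", "13", "2017-07-20 10:04:23.100", "2017-07-20 10:06:23.897", 120),
    ("1234", "13", "2017-07-20 10:04:23.897", "2017-07-20 10:40:29.472", 2166),
    ("1234", "13", "2017-07-20 10:40:29.472", "2017-07-20 10:40:50.347", 11),
    ("7777", "44", "2017-07-20 10:31:16.458", "2017-07-20 10:47:16.458", 1000),
    ("7777", "44", "2017-07-20 11:11:16.458", "2017-07-20 11:36:16.458", 1500),
]

def clean(row):
    """Missing or over-limit differences become 0 and both timestamps are made equal."""
    userid, platform, prev, cur, diff = row
    if diff is None:
        return (userid, platform, cur, cur, 0)
    if diff > LIMIT:
        return (userid, platform, prev, prev, 0)
    return row

def add_seconds(ts, seconds):
    """Add seconds to a timestamp string (output carries six fractional digits)."""
    return (datetime.strptime(ts, FMT) + timedelta(seconds=seconds)).strftime(FMT)

def summarize(rows):
    cleaned = [clean(r) for r in rows]
    # part 0: rows whose difference is 0 after cleaning
    result = [r for r in cleaned if r[4] == 0]
    # group the remaining rows by (userid, platform)
    groups = defaultdict(list)
    for r in cleaned:
        if r[4] != 0:
            groups[(r[0], r[1])].append(r)
    for (userid, platform), members in groups.items():
        total = sum(r[4] for r in members)
        if total < LIMIT:
            # part 2: collapse the group into one row starting at the earliest previousTime
            # (string min works here because all timestamps share one fixed-width format)
            start = min(r[2] for r in members)
            result.append((userid, platform, start, add_seconds(start, total), total))
        else:
            # part 1: groups whose total reaches the limit are kept row by row
            result.extend(members)
    return result

result = summarize(rows)
for r in result:
    print(r)
```

In this sketch the two remaining rows of user 1234 (120 and 11 seconds) collapse into a single row of 131 seconds, while the rows of user 7777 stay separate because their total is 2500.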
If this solves your problem, don't forget to let us know :)