How do I filter records for a fixed daily time window in PySpark?

Asked: 2018-07-09 08:48:38

Tags: python dataframe pyspark

I'm interested in capturing user behavior during a specific time window each day. Suppose I have a dataframe with the columns

+-------+-----+----------+
| start | end | activity |
+-------+-----+----------+

start and end are both Unix timestamps. Is there any way in PySpark to filter on a fixed daily interval, say 10 am to 11 am every day? Note that start may fall before 10 and end may fall after 11; I want to find every record whose interval overlaps the window.
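For concreteness, "overlaps" means the usual interval test: a record [start, end] overlaps a window [w1, w2] iff it begins before the window ends and ends after the window begins. A minimal plain-Python sketch of that test (the helper name and sample hours are illustrative):

def overlaps(start_h, end_h, win_start=10, win_end=11):
    # Two intervals overlap iff each begins before the other ends.
    return start_h < win_end and end_h > win_start

print(overlaps(9.5, 10.25))   # True:  spills into the window
print(overlaps(10.2, 10.8))   # True:  wholly inside the window
print(overlaps(11.5, 12.0))   # False: after the window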

2 answers:

Answer 0 (score: 0)

You can compute the hour of day from the timestamp.

from pyspark.sql import functions as F

day_timestamps = 24*60*60    # seconds in a day
hour_timestamps = 60*60      # seconds in an hour
hour = 10

# Fractional hour of day for a Unix-timestamp column.
tmstmp2hour = lambda tm: (tm % day_timestamps) / hour_timestamps

# Keep records whose interval spans the target hour.
df.filter(
    (tmstmp2hour(F.col('start')) < hour)
    &
    (tmstmp2hour(F.col('end')) > hour)
)
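Two caveats on this approach: the filter as written only keeps records that straddle the 10:00 point itself (an activity wholly inside 10-11 am, say 10:15-10:40, would be dropped), and the modulo arithmetic pins the day boundary to UTC. A sketch of a full-window variant using Spark's built-in hour(), which honours the session timezone; the 10-11 bounds are taken from the question:

from pyspark.sql import functions as F

# hour() of the timestamp follows the Spark session timezone,
# unlike raw modulo arithmetic, which is pinned to UTC.
df.filter(
    (F.hour(F.from_unixtime(F.col('start'))) < 11)
    & (F.hour(F.from_unixtime(F.col('end'))) >= 10)
)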

Answer 1 (score: 0)

The solution below addresses your problem.

Assuming you have Unix timestamps, create a DataFrame:

# Sample data; the last record starts before midnight and ends after it.
l1 = [(1541585700, 1541585750, 'playing'),
      (1531305900, 1541589300, 'fishing'),
      (1541589400, 1541589500, 'working'),
      (1530919800, 1530923400, 'across-night')]
df = sqlContext.createDataFrame(l1, ['start', 'end', 'activity'])
df.show()

Initialize the window bounds in 24-hour format; you could also make them part of the dataframe to speed things up:

capStart = 23   # 11 pm
capEnd = 1      # 1 am

# Offset of each timestamp within its (UTC) day, in seconds.
df = df.withColumn('startSec', df.start % (24*60*60))
df = df.withColumn('endSec', df.end % (24*60*60))
# Flag records that start after capStart and end before capEnd.
df = df.withColumn('match', (df.startSec >= capStart*60*60) & (df.endSec <= capEnd*60*60))
df.show()
+----------+----------+------------+--------+------+-----+
|     start|       end|    activity|startSec|endSec|match|
+----------+----------+------------+--------+------+-----+
|1541585700|1541585750|     playing|   36900| 36950|false|
|1531305900|1541589300|     fishing|   38700| 40500|false|
|1541589400|1541589500|     working|   40600| 40700|false|
|1530919800|1530923400|across-night|   84600|  1800| true|
+----------+----------+------------+--------+------+-----+
# Filter the result
df = df.filter(df.match)
df.show()
+----------+----------+------------+--------+------+-----+
|     start|       end|    activity|startSec|endSec|match|
+----------+----------+------------+--------+------+-----+
|1530919800|1530923400|across-night|   84600|  1800| true|
+----------+----------+------------+--------+------+-----+
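A caveat on the match test: it only flags records that both start after 11 pm and end before 1 am, so an activity lying wholly on one side of midnight (say 23:10-23:20) slips through. A sketch of a more general overlap test for a window that wraps midnight, reusing the startSec/endSec columns above and assuming no single activity lasts 24 hours or more:

winStart = 23*60*60   # 11 pm, in seconds of day
winEnd = 1*60*60      # 1 am, in seconds of day

# A record wraps midnight iff its end offset is smaller than its
# start offset. Any such record overlaps a window that also wraps
# midnight; otherwise the record must reach past winStart in the
# evening or begin before winEnd in the morning.
wraps = df.endSec < df.startSec
df = df.withColumn('match', wraps | (df.endSec > winStart) | (df.startSec < winEnd))
df = df.filter(df.match)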