I'm interested in capturing user activity that falls within a specific window of each day. Suppose I have a DataFrame with the columns
+-------+-----+----------+
| start | end | activity |
+-------+-----+----------+
where start and end are both Unix timestamps. Is there a way in PySpark to filter for a particular daily interval, for example 10 am to 11 am? Note that start may begin before 10 and end may finish after 11; I want to find all overlapping periods.
Answer 0 (score: 0)
You can get the hour of day from the timestamp.
from pyspark.sql import functions as F

day_seconds = 24 * 60 * 60   # seconds in one day
hour_seconds = 60 * 60       # seconds in one hour
window_start = 10            # 10 am
window_end = 11              # 11 am

# hour of day (as a fraction) for a Unix timestamp
tmstmp2hour = lambda tm: (tm % day_seconds) / hour_seconds

# an interval overlaps [10, 11) iff it starts before 11 and ends after 10
df.filter(
    (tmstmp2hour(F.col('start')) < window_end)
    &
    (tmstmp2hour(F.col('end')) > window_start)
)
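As a quick sanity check, the same modulo arithmetic can be exercised in plain Python before running it on a cluster. This is a minimal sketch; the function names and sample timestamps here are illustrative, not part of the answer's code, and it assumes the timestamps are interpreted in UTC:

```python
DAY = 24 * 60 * 60
HOUR = 60 * 60

def seconds_into_day(tm):
    """Seconds elapsed since midnight (UTC) for a Unix timestamp."""
    return tm % DAY

def overlaps_window(start, end, win_start_h, win_end_h):
    """True if [start, end] overlaps the daily window [win_start_h, win_end_h)."""
    s = seconds_into_day(start)
    e = seconds_into_day(end)
    return s < win_end_h * HOUR and e > win_start_h * HOUR

# 2018-11-07 09:35:00 UTC -> 10:15:00 UTC crosses into the 10-11 window
print(overlaps_window(1541583300, 1541585700, 10, 11))  # True
# 12:00 -> 13:00 the same day does not
print(overlaps_window(1541592000, 1541595600, 10, 11))  # False
```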
Answer 1 (score: 0)
The following solution should solve your problem.
Assuming you have Unix timestamps, create a DataFrame:
l1 = [(1541585700,1541585750,'playing'), (1531305900,1541589300, 'fishing'), (1541589400,1541589500,'working'),(1530919800, 1530923400, 'across-night')]
df = sqlContext.createDataFrame(l1, ['start','end','activity'])
df.show()
Initialize the window bounds on a 24-hour clock (you could also make these part of the DataFrame to speed things up):
capStart = 23  # 11 pm
capEnd = 1     # 1 am
df = df.withColumn('startSec', df.start%(24*60*60))
df = df.withColumn('endSec', df.end%(24*60*60))
df = df.withColumn('match', (df.startSec >= capStart*60*60) & (df.endSec <= capEnd*60*60))
df.show()
+----------+----------+------------+--------+------+-----+
|     start|       end|    activity|startSec|endSec|match|
+----------+----------+------------+--------+------+-----+
|1541585700|1541585750|     playing|   36900| 36950|false|
|1531305900|1541589300|     fishing|   38700| 40500|false|
|1541589400|1541589500|     working|   40600| 40700|false|
|1530919800|1530923400|across-night|   84600|  1800| true|
+----------+----------+------------+--------+------+-----+
# Filter the result
df = df.filter(df.match == True)
df.show()
+----------+----------+------------+--------+------+-----+
|     start|       end|    activity|startSec|endSec|match|
+----------+----------+------------+--------+------+-----+
|1530919800|1530923400|across-night|   84600|  1800| true|
+----------+----------+------------+--------+------+-----+
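One caveat worth noting: the modulo trick breaks for intervals lasting a day or more (the fishing row spans months, so it overlaps every daily window, yet its match column is false), and the >=/<= pair misses activities that sit entirely on one side of midnight inside a wrapping window (an activity from 23:10 to 23:40 never reaches 1 am, yet clearly falls in the 11 pm–1 am window). The following is a plain-Python sketch of a more complete check, under the assumption that timestamps are UTC; the function name is illustrative:

```python
DAY = 24 * 60 * 60
HOUR = 60 * 60

def overlaps_daily_window(start, end, win_start_h, win_end_h):
    """True if [start, end] overlaps the daily window [win_start_h, win_end_h),
    handling windows and activities that wrap past midnight."""
    # Any interval lasting a full day or more overlaps every daily window.
    if end - start >= DAY:
        return True
    s, e = start % DAY, end % DAY
    ws, we = win_start_h * HOUR, win_end_h * HOUR
    if ws <= we:
        windows = [(ws, we)]
    else:
        # Window wraps midnight: split into [ws, 24h) and [0, we).
        windows = [(ws, DAY), (0, we)]
    if s <= e:
        intervals = [(s, e)]
    else:
        # Activity wraps midnight too.
        intervals = [(s, DAY), (0, e)]
    return any(a < hi and b > lo
               for a, b in intervals
               for lo, hi in windows)

# The across-night row (23:30 -> 00:30) still matches the 11pm-1am window:
print(overlaps_daily_window(1530919800, 1530923400, 23, 1))  # True
# 23:10 -> 23:40 also matches, even though it never reaches 1 am:
print(overlaps_daily_window(1530918600, 1530920400, 23, 1))  # True
```

The same conditions translate directly into a PySpark filter expression once you are satisfied they match your definition of "overlap".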