I have a CSV file with the following structure:
USER_ID location timestamp
1 1001 19:11:39 5-2-2010
1 6022 17:51:19 6-6-2010
1 1041 11:11:39 5-2-2010
2 9483 10:51:23 3-2-2012
2 4532 11:11:11 4-5-2012
3 4374 03:21:23 6-9-2013
3 4334 04:53:13 4-5-2013
Basically, I want to compute the timestamp difference between different locations that share the same user_id, using PySpark or just plain Python. An example of the expected result:
USER_ID location timestamp difference
1 1001-1041 08:00:00
Any idea how to solve this?
Answer 0 (score: 1)
Assuming you want every possible pair of locations per user, just self-join on USER_ID and subtract the timestamp columns. The trick here is to use unix_timestamp to parse the datetime strings into integers (seconds since the epoch), which support subtraction.
Sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp, col

spark = SparkSession.builder.getOrCreate()
data = [
(1, 1001, '19:11:39 5-2-2010'),
(1, 6022, '17:51:19 6-6-2010'),
(1, 1041, '11:11:39 5-2-2010'),
(2, 9483, '10:51:23 3-2-2012'),
(2, 4532, '11:11:11 4-5-2012'),
(3, 4374, '03:21:23 6-9-2013'),
(3, 4334, '04:53:13 4-5-2013')
]
df = spark.createDataFrame(data, ['USER_ID', 'location', 'timestamp'])
# Single-letter d/M patterns accept single-digit days and months like '5-2-2010'
df = df.withColumn('timestamp', unix_timestamp('timestamp', 'HH:mm:ss d-M-yyyy'))
# Renaming columns to avoid conflicts after join
df2 = df.selectExpr('USER_ID as USER_ID2', 'location as location2', 'timestamp as timestamp2')
cartesian = df.join(df2, col("USER_ID") == col("USER_ID2"), "inner")
# Filter to get rid of reversed duplicates, and rows where location is same on both sides
pairs = cartesian.filter("location < location2") \
    .drop("USER_ID2") \
    .withColumn("diff", col("timestamp2") - col("timestamp"))  # difference in seconds
pairs.show()
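Since the question also allows plain Python, here is a minimal sketch of the same idea without Spark, assuming the rows are already loaded as tuples (the `pair_differences` helper and its output layout are my own illustration, not from the answer). It groups rows by user, pairs the locations with itertools.combinations, and formats the absolute difference as HH:MM:SS:

```python
from datetime import datetime
from itertools import combinations
from collections import defaultdict

rows = [
    (1, 1001, '19:11:39 5-2-2010'),
    (1, 6022, '17:51:19 6-6-2010'),
    (1, 1041, '11:11:39 5-2-2010'),
    (2, 9483, '10:51:23 3-2-2012'),
    (2, 4532, '11:11:11 4-5-2012'),
    (3, 4374, '03:21:23 6-9-2013'),
    (3, 4334, '04:53:13 4-5-2013'),
]

# Group parsed timestamps by user; strptime accepts single-digit day/month
by_user = defaultdict(list)
for user_id, location, ts in rows:
    by_user[user_id].append((location, datetime.strptime(ts, '%H:%M:%S %d-%m-%Y')))

def pair_differences(by_user):
    # For each user, pair every two locations and take the absolute
    # time difference, formatted as HH:MM:SS
    result = []
    for user_id, entries in by_user.items():
        for (loc_a, t_a), (loc_b, t_b) in combinations(entries, 2):
            secs = int(abs((t_b - t_a).total_seconds()))
            h, rem = divmod(secs, 3600)
            m, s = divmod(rem, 60)
            result.append((user_id, f'{loc_a}-{loc_b}', f'{h:02d}:{m:02d}:{s:02d}'))
    return result

for row in pair_differences(by_user):
    print(row)
```

For user 1, the pair (1001, 1041) yields '08:00:00', matching the expected result in the question. Note that pairs spanning more than a day come out as hour counts above 24 rather than including a day component.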