I have a very simple question. I have a long list of IDs and timestamps, and I want to count, for each ID, how many timestamps fall within certain time windows.
Here is some sample data:
+---------------+-------------------+
| id| Occurrence|
+---------------+-------------------+
|533ladk203ldpwk|2018-03-28 17:52:04|
|516dlksw9823adp|2018-03-26 12:58:04|
|516dlksw9823adp|2018-01-24 07:52:16|
|533ladk203ldpwk|2018-03-18 03:23:11|
|533ladk203ldpwk|2018-03-14 08:30:13|
+---------------+-------------------+
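For reference, the DataFrame above can be rebuilt with something like the following (a minimal sketch, assuming a SparkSession named spark and that Occurrence should be a timestamp column):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Rebuild the sample data; Occurrence starts out as a string and is cast to timestamp
df = spark.createDataFrame(
    [
        ("533ladk203ldpwk", "2018-03-28 17:52:04"),
        ("516dlksw9823adp", "2018-03-26 12:58:04"),
        ("516dlksw9823adp", "2018-01-24 07:52:16"),
        ("533ladk203ldpwk", "2018-03-18 03:23:11"),
        ("533ladk203ldpwk", "2018-03-14 08:30:13"),
    ],
    ["id", "Occurrence"],
).withColumn("Occurrence", f.col("Occurrence").cast("timestamp"))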
Is there an easy way to do this with PySpark or SQL?
Answer 0 (score: 1)
You can use pyspark.sql.functions.current_timestamp() to get the current timestamp, and pyspark.sql.functions.datediff() to compute the number of days between it and the values in "Occurrence". For example:
import pyspark.sql.functions as f

# Days elapsed between the current timestamp and each Occurrence
df.withColumn('days_since_today', f.datediff(f.current_timestamp(), f.col("Occurrence")))\
    .show()
#+---------------+-------------------+----------------+
#| id| Occurrence|days_since_today|
#+---------------+-------------------+----------------+
#|533ladk203ldpwk|2018-03-28 17:52:04| 5|
#|516dlksw9823adp|2018-03-26 12:58:04| 7|
#|516dlksw9823adp|2018-01-24 07:52:16| 68|
#|533ladk203ldpwk|2018-03-18 03:23:11| 15|
#|533ladk203ldpwk|2018-03-14 08:30:13| 19|
#+---------------+-------------------+----------------+
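One caveat: current_timestamp() makes the result depend on when the code runs; the numbers shown above imply a run date of 2018-04-02. Note also that datediff() compares calendar dates, so the time-of-day part of Occurrence is ignored. For a deterministic test you could swap in a fixed reference date (a sketch; the literal date is chosen only to match the output above):

import pyspark.sql.functions as f

# Fixed reference date instead of current_timestamp(), so results are reproducible
df.withColumn(
    'days_since_today',
    f.datediff(f.to_date(f.lit("2018-04-02")), f.col("Occurrence"))
).show()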
You can then filter for the rows where "days_since_today" is less than or equal to 30, group by "id", and count:
# Keep only the last 30 days, then count occurrences per id
df.withColumn('days_since_today', f.datediff(f.current_timestamp(), f.col("Occurrence")))\
    .where("days_since_today <= 30")\
    .groupBy('id')\
    .agg(f.count("Occurrence").alias("Last30daysOccurrenceCount"))\
    .show()
#+---------------+-------------------------+
#| id|Last30daysOccurrenceCount|
#+---------------+-------------------------+
#|533ladk203ldpwk| 3|
#|516dlksw9823adp| 1|
#+---------------+-------------------------+
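Since the question also asks about SQL: the same aggregation can be written as a query, assuming the DataFrame is registered as a temporary view (the view name events is made up for this sketch):

# Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("events")

spark.sql("""
    SELECT id,
           COUNT(Occurrence) AS Last30daysOccurrenceCount
    FROM events
    WHERE DATEDIFF(CURRENT_TIMESTAMP(), Occurrence) <= 30
    GROUP BY id
""").show()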
Or, equivalently in PySpark, without the intermediate column:
df.groupBy('id')\
    .agg(
        # Add 1 for every row whose Occurrence falls within the last 30 days
        f.sum(
            f.when(
                f.datediff(f.current_timestamp(), f.col("Occurrence")) <= 30,
                1
            ).otherwise(0)
        ).alias("Last30daysOccurrenceCount")
    )\
    .show()
#+---------------+-------------------------+
#| id|Last30daysOccurrenceCount|
#+---------------+-------------------------+
#|533ladk203ldpwk| 3|
#|516dlksw9823adp| 1|
#+---------------+-------------------------+
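The conditional-sum pattern also generalizes to several windows in one pass. Here is a sketch counting occurrences in the last 30, 60, and 90 days (the window lengths are arbitrary examples):

import pyspark.sql.functions as f

days = f.datediff(f.current_timestamp(), f.col("Occurrence"))

# One conditional count per window length, all computed in a single aggregation
df.groupBy('id')\
    .agg(*[
        f.sum(f.when(days <= n, 1).otherwise(0)).alias("Last{}daysOccurrenceCount".format(n))
        for n in (30, 60, 90)
    ])\
    .show()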