Question

I have a very simple question. I have a long list of IDs and timestamps, and for each ID I want to count the timestamps that fall within a certain time window, such as the last N days. Here is the sample data:

+---------------+-------------------+
|             id|         Occurrence|
+---------------+-------------------+
|533ladk203ldpwk|2018-03-28 17:52:04|
|516dlksw9823adp|2018-03-26 12:58:04|
|516dlksw9823adp|2018-01-24 07:52:16|
|533ladk203ldpwk|2018-03-18 03:23:11|
|533ladk203ldpwk|2018-03-14 08:30:13|
+---------------+-------------------+

Is there a simple way to do this with PySpark or SQL?

Answer 1

You can use pyspark.sql.functions.current_timestamp() to get the current timestamp, and pyspark.sql.functions.datediff() to compute the difference in days between it and the value in "Occurrence".

For example:

import pyspark.sql.functions as f
df.withColumn('days_since_today', f.datediff(f.current_timestamp(), f.col("Occurrence")))\
    .show()
#+---------------+-------------------+----------------+
#|             id|         Occurrence|days_since_today|
#+---------------+-------------------+----------------+
#|533ladk203ldpwk|2018-03-28 17:52:04|               5|
#|516dlksw9823adp|2018-03-26 12:58:04|               7|
#|516dlksw9823adp|2018-01-24 07:52:16|              68|
#|533ladk203ldpwk|2018-03-18 03:23:11|              15|
#|533ladk203ldpwk|2018-03-14 08:30:13|              19|
#+---------------+-------------------+----------------+
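
To make it clearer what datediff() is counting here, the following is a plain-Python sketch (no Spark involved) that reproduces the days_since_today column. It assumes the snippet above was run on 2018-04-02; that date is inferred from the output and is not stated in the answer. The helper name days_since is made up for illustration. The point it shows is that the difference is taken between calendar dates, so the time-of-day part of "Occurrence" does not affect the result.

```python
from datetime import date, datetime

# Sample rows from the question: (id, Occurrence).
rows = [
    ("533ladk203ldpwk", "2018-03-28 17:52:04"),
    ("516dlksw9823adp", "2018-03-26 12:58:04"),
    ("516dlksw9823adp", "2018-01-24 07:52:16"),
    ("533ladk203ldpwk", "2018-03-18 03:23:11"),
    ("533ladk203ldpwk", "2018-03-14 08:30:13"),
]

# Assumption: the output shown in the answer was produced on this date.
ASSUMED_TODAY = date(2018, 4, 2)


def days_since(occurrence: str, today: date) -> int:
    """Calendar days from the occurrence's date to `today`, ignoring time of day."""
    occurred_on = datetime.strptime(occurrence, "%Y-%m-%d %H:%M:%S").date()
    return (today - occurred_on).days


days_since_today = [days_since(ts, ASSUMED_TODAY) for _, ts in rows]
print(days_since_today)  # [5, 7, 68, 15, 19]
```

Because only the date part matters, an occurrence at 23:59 yesterday counts as 1 day ago even if it happened minutes before the query ran.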

Then you can keep only the rows where "days_since_today" is less than or equal to 30, group by "id", and count.

df.withColumn('days_since_today', f.datediff(f.current_timestamp(), f.col("Occurrence")))\
    .where("days_since_today <= 30")\
    .groupBy('id')\
    .agg(f.count("Occurrence").alias("Last30daysOccurrenceCount"))\
    .show()
#+---------------+-------------------------+
#|             id|Last30daysOccurrenceCount|
#+---------------+-------------------------+
#|533ladk203ldpwk|                        3|
#|516dlksw9823adp|                        1|
#+---------------+-------------------------+

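Or, without the intermediate column (unlike the filtered version, this one also keeps any id that has no rows in the window, with a count of 0):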

df.groupBy('id')\
    .agg(
        f.sum(
            f.when(
                f.datediff(f.current_timestamp(), f.col("Occurrence")) <= 30,
                1
            ).otherwise(0)
        ).alias("Last30daysOccurrenceCount")
    )\
    .show()
#+---------------+-------------------------+
#|             id|Last30daysOccurrenceCount|
#+---------------+-------------------------+
#|533ladk203ldpwk|                        3|
#|516dlksw9823adp|                        1|
#+---------------+-------------------------+
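
The two aggregations give the same result on the sample data, but they can differ when an id has no rows inside the window. The plain-Python sketch below (no Spark involved) mirrors both approaches to show that difference. It assumes a reference date of 2018-04-02, inferred from the output above, and adds one made-up row ("hypothetical_old_id") that is not in the question's data. The function names are illustrative only.

```python
from collections import Counter
from datetime import date, datetime

# Sample rows from the question, plus one HYPOTHETICAL row that is
# older than 30 days and is the only row for its id.
rows = [
    ("533ladk203ldpwk", "2018-03-28 17:52:04"),
    ("516dlksw9823adp", "2018-03-26 12:58:04"),
    ("516dlksw9823adp", "2018-01-24 07:52:16"),
    ("533ladk203ldpwk", "2018-03-18 03:23:11"),
    ("533ladk203ldpwk", "2018-03-14 08:30:13"),
    ("hypothetical_old_id", "2017-12-01 00:00:00"),  # not in the original data
]

# Assumption: reference date inferred from the answer's output.
ASSUMED_TODAY = date(2018, 4, 2)


def days_since(occurrence: str, today: date) -> int:
    """Calendar days from the occurrence's date to `today`."""
    occurred_on = datetime.strptime(occurrence, "%Y-%m-%d %H:%M:%S").date()
    return (today - occurred_on).days


def count_filter_then_group(rows, today, window_days=30):
    """Mirrors where(...).groupBy('id').agg(count(...)): filter first, then count."""
    recent_ids = (id_ for id_, ts in rows if days_since(ts, today) <= window_days)
    return dict(Counter(recent_ids))


def count_conditional_sum(rows, today, window_days=30):
    """Mirrors groupBy('id').agg(sum(when(..., 1).otherwise(0))): every id is kept."""
    counts = {}
    for id_, ts in rows:
        in_window = 1 if days_since(ts, today) <= window_days else 0
        counts[id_] = counts.get(id_, 0) + in_window
    return counts


filtered = count_filter_then_group(rows, ASSUMED_TODAY)
conditional = count_conditional_sum(rows, ASSUMED_TODAY)
print(filtered)     # the hypothetical id is absent
print(conditional)  # the hypothetical id is present with a count of 0
```

Which behaviour is preferable depends on whether ids with no recent occurrences should appear in the result at all.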

Count the number of occurrences within N days of the current time - pyspark
