Group data by time interval in PySpark

Asked: 2017-03-23 12:34:21

Tags: apache-spark pyspark apache-spark-sql spark-dataframe

I am trying to group and aggregate some data. I have already grouped it by date and other fields, since that part is straightforward. Now I also want to group it by a time interval based on [Server_Time]:

EventID AccessReason    Source  Server_Date Server_Time
847495004   Granted ORSB_GND_GYM_IN 10/1/2016   7:25:52 AM
847506432   Granted ORSB_GND_GYM_IN 10/1/2016   8:53:38 AM
847512725   Granted ORSB_GND_GYM_IN 10/1/2016   10:18:50 AM
847512768   Granted ORSB_GND_GYM_IN 10/1/2016   10:19:32 AM
847513357   Granted ORSB_GND_GYM_OUT 10/1/2016  10:25:36 AM
847513614   Granted ORSB_GND_GYM_IN 10/1/2016   10:28:08 AM
847515838   Granted ORSB_GND_GYM_OUT 10/1/2016  10:57:41 AM
847522522   Granted ORSB_GND_GYM_IN 10/1/2016   11:57:10 AM

For example, I need to aggregate the number of events per hour. From the data above, for the 10-11 hour interval the count for Source 'ORSB_GND_GYM_IN' is 3 and for 'ORSB_GND_GYM_OUT' it is 2. How can I do this in PySpark?

2 Answers:

Answer 0 (score: 2):

You can use a UDF to convert the time into an interval string and then group by it:

from pyspark.sql.functions import udf
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

def getInterval(time):
    # "10:18:50 AM" -> "10-11 AM": take the hour before the first ":" and
    # keep the AM/PM suffix that follows the space.
    start = int(time.split(":")[0])
    return str(start) + "-" + str(start + 1) + " " + time.split(" ")[1]

getIntervalUdf = udf(getInterval, StringType())

spark = SparkSession.builder.appName("appName").getOrCreate()
df = spark.read.csv("emp", sep=",", header=True)
df.show()

# Add the hourly interval column, then count events per date/interval/source.
df = df.withColumn("Interval", getIntervalUdf("Server_Time"))
df.show()
df = df.groupby("Server_Date", "Interval", "Source").count()
df.show()

Output:

+-----------+--------------+------------------+-------------+-------------+
|  EventID  | AccessReason |      Source      | Server_Date | Server_Time |
+-----------+--------------+------------------+-------------+-------------+
| 847495004 | Granted      | ORSB_GND_GYM_IN  | 10/1/2016   | 7:25:52 AM  |
| 847506432 | Granted      | ORSB_GND_GYM_IN  | 10/1/2016   | 8:53:38 AM  |
| 847512725 | Granted      | ORSB_GND_GYM_IN  | 10/1/2016   | 10:18:50 AM |
| 847512768 | Granted      | ORSB_GND_GYM_IN  | 10/1/2016   | 10:19:32 AM |
| 847513357 | Granted      | ORSB_GND_GYM_OUT | 10/1/2016   | 10:25:36 AM |
| 847513614 | Granted      | ORSB_GND_GYM_IN  | 10/1/2016   | 10:28:08 AM |
| 847515838 | Granted      | ORSB_GND_GYM_OUT | 10/1/2016   | 10:57:41 AM |
| 847522522 | Granted      | ORSB_GND_GYM_IN  | 10/1/2016   | 11:57:10 AM |
+-----------+--------------+------------------+-------------+-------------+

+---------+------------+----------------+-----------+-----------+--------+
|  EventID|AccessReason|          Source|Server_Date|Server_Time|Interval|
+---------+------------+----------------+-----------+-----------+--------+
|847495004|     Granted| ORSB_GND_GYM_IN|  10/1/2016| 7:25:52 AM|  7-8 AM|
|847506432|     Granted| ORSB_GND_GYM_IN|  10/1/2016| 8:53:38 AM|  8-9 AM|
|847512725|     Granted| ORSB_GND_GYM_IN|  10/1/2016|10:18:50 AM|10-11 AM|
|847512768|     Granted| ORSB_GND_GYM_IN|  10/1/2016|10:19:32 AM|10-11 AM|
|847513357|     Granted|ORSB_GND_GYM_OUT|  10/1/2016|10:25:36 AM|10-11 AM|
|847513614|     Granted| ORSB_GND_GYM_IN|  10/1/2016|10:28:08 AM|10-11 AM|
|847515838|     Granted|ORSB_GND_GYM_OUT|  10/1/2016|10:57:41 AM|10-11 AM|
|847522522|     Granted| ORSB_GND_GYM_IN|  10/1/2016|11:57:10 AM|11-12 AM|
+---------+------------+----------------+-----------+-----------+--------+

+-----------+--------+----------------+-----+
|Server_Date|Interval|          Source|count|
+-----------+--------+----------------+-----+
|  10/1/2016|10-11 AM| ORSB_GND_GYM_IN|    3|
|  10/1/2016|  8-9 AM| ORSB_GND_GYM_IN|    1|
|  10/1/2016|10-11 AM|ORSB_GND_GYM_OUT|    2|
|  10/1/2016|11-12 AM| ORSB_GND_GYM_IN|    1|
|  10/1/2016|  7-8 AM| ORSB_GND_GYM_IN|    1|
+-----------+--------+----------------+-----+
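
For reference, a minimal alternative sketch (not part of the original answer) that avoids the Python UDF by parsing the time with Spark's built-in functions; it assumes Spark 2.2+ (for to_timestamp) and the same "h:mm:ss AM/PM" time format as the sample data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("appName").getOrCreate()
df = spark.read.csv("emp", sep=",", header=True)

# Parse the 12-hour time string and extract the hour of day (0-23),
# so 10:18:50 AM falls into hour 10.
df = df.withColumn("Hour", F.hour(F.to_timestamp("Server_Time", "h:mm:ss a")))
df.groupBy("Server_Date", "Hour", "Source").count().show()

Built-in functions are generally preferable to a Python UDF here because they avoid serializing every row to the Python worker.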

Answer 1 (score: 0):

To generate counts per day, per hour, and per 10-minute interval:

from pyspark.sql.functions import udf
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType


def getHrInterval(time):
    # "10:25:36 AM" -> "10-11 AM": hourly bucket plus the AM/PM suffix.
    start = int(time.split(":")[0])
    return str(start) + "-" + str(start + 1) + " " + time.split(" ")[1]


def getMinInterval(time):
    # "10:25:36 AM" -> "10:20-10:30 AM": 10-minute bucket within the hour.
    # Integer division (//) keeps this correct on both Python 2 and 3; note
    # that the last bucket of each hour is labelled ":50-:60".
    hr_start = int(time.split(":")[0])
    min_start = (int(time.split(":")[1]) // 10) * 10
    return str(hr_start) + ":" + str(min_start) + "-" + str(hr_start) + ":" + str(min_start + 10) + " " + time.split(" ")[1]


spark = SparkSession.builder.appName("appName").getOrCreate()

path = '/media/sf_VM_Shared/part-00000'
df = spark.read \
    .option("header", "true") \
    .csv(path)

getHrIntervalUdf = udf(getHrInterval, StringType())
getMinIntervalUdf = udf(getMinInterval, StringType())

# Add both interval columns, then count events per date/hour bucket/10-minute bucket/source.
df = df.withColumn("HourInterval", getHrIntervalUdf("Server_Time")) \
       .withColumn("MinInterval", getMinIntervalUdf("Server_Time"))
df = df.groupby("Server_Date", "HourInterval", "MinInterval", "Source").count()
df.show()
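
As a quick sanity check (illustrative only, using a time value from the question's sample data), the helpers above map a single time string like this:

print(getHrInterval("10:25:36 AM"))   # "10-11 AM"
print(getMinInterval("10:25:36 AM"))  # "10:20-10:30 AM"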