I have a dataset that looks like this:
id email Date_of_purchase time_of_purchase
1 abc@gmail.com 11/10/18 12:10 PM
2 abc@gmail.com 11/10/18 02:11 PM
3 abc@gmail.com 11/10/18 03:14 PM
4 abc@gmail.com 11/11/18 06:16 AM
5 abc@gmail.com 11/11/18 09:10 AM
6 def@gmail.com 11/10/18 12:17 PM
7 def@gmail.com 11/10/18 03:24 PM
8 def@gmail.com 11/10/18 08:16 PM
9 def@gmail.com 11/10/18 09:13 PM
10 def@gmail.com 11/11/18 12:01 AM
I want to count the number of transactions made by each email ID within a 4-hour window. For example, email ID abc@gmail.com made 3 transactions from 11/10/18 12.10 PM to 11/10/18 4.10 PM, and 2 transactions from 11/11/18 6.16 AM to 11/11/18 10.16 AM. Email ID def@gmail.com made 2 transactions from 11/10/18 12.17 PM to 11/10/18 4.17 PM, and 3 transactions from 11/10/18 8.16 PM to 11/11/18 12.16 AM.
The output I want is:
email hour_interval purchase_in_4_hours
abc@gmail.com [11/10/18 12.10 PM to 11/10/18 4.10 PM] 3
abc@gmail.com [11/11/18 6.16 AM to 11/11/18 10.16 AM] 2
def@gmail.com [11/10/18 12.17 PM to 11/10/18 4.17 PM] 2
def@gmail.com [11/10/18 8.16 PM to 11/11/18 12.16 AM] 3
My dataset has 1000k rows. I am new to Spark. Any help would be greatly appreciated. P.S. The time interval may need to change from 4 hours to 1 hour, 6 hours, 1 day, etc.
TIA。
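For reference, a rough sketch that builds the sample rows above as a Spark DataFrame for testing (the column names follow the table; the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("purchases").getOrCreate()

# sample rows from the table above: (id, email, Date_of_purchase, time_of_purchase)
data = [
    ("1", "abc@gmail.com", "11/10/18", "12:10 PM"),
    ("2", "abc@gmail.com", "11/10/18", "02:11 PM"),
    ("3", "abc@gmail.com", "11/10/18", "03:14 PM"),
    ("4", "abc@gmail.com", "11/11/18", "06:16 AM"),
    ("5", "abc@gmail.com", "11/11/18", "09:10 AM"),
    ("6", "def@gmail.com", "11/10/18", "12:17 PM"),
    ("7", "def@gmail.com", "11/10/18", "03:24 PM"),
    ("8", "def@gmail.com", "11/10/18", "08:16 PM"),
    ("9", "def@gmail.com", "11/10/18", "09:13 PM"),
    ("10", "def@gmail.com", "11/11/18", "12:01 AM"),
]
df = spark.createDataFrame(data, ["id", "email", "Date_of_purchase", "time_of_purchase"])
df.show()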
Answer 0 (score: 5)
The idea is to partition the data by email, sort each partition by date and time, and then map each partition to the desired output. This approach works as long as the data of each partition (i.e. the data of one email address) fits into the memory of a single Spark executor.
The actual Spark logic follows these steps:
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from datetime import datetime, timedelta
spark = SparkSession.builder.appName("test").getOrCreate()
df = spark.read.option("header", "true").csv(<path>) #or any other data source
df = df.withColumn("date_time", to_timestamp(concat(col("Date_of_purchase"), lit(" "), col("time_of_purchase")), "MM/dd/yy hh:mm aa")) \
.drop("Date_of_purchase", "time_of_purchase") \
.repartition(col("email")) \
.sortWithinPartitions(col("email"), col("date_time"))
def process_partition(df_chunk):
    # turn the rows of one partition into one output row per 4-hour window
    row_list = list(df_chunk)
    if len(row_list) == 0:
        return
    # open the first window at the first purchase of the partition
    email = row_list[0]['email']
    start = row_list[0]['date_time']
    end = start + timedelta(hours=4)
    count = 0
    for row in row_list:
        if email == row['email'] and end > row['date_time']:
            # purchase falls into the current window
            count = count + 1
        else:
            # emit the finished window and open a new one at the current purchase
            yield Row(email, start, end, count)
            email = row['email']
            start = row['date_time']
            end = start + timedelta(hours=4)
            count = 1
    # emit the last open window
    yield Row(email, start, end, count)
result = df.rdd.mapPartitions(process_partition).toDF(["email", "from", "to", "count"])
result.show()
Output:

+-------------+-------------------+-------------------+-----+
|        email|               from|                 to|count|
+-------------+-------------------+-------------------+-----+
|def@gmail.com|2018-11-10 12:17:00|2018-11-10 16:17:00|    2|
|def@gmail.com|2018-11-10 20:16:00|2018-11-11 00:16:00|    3|
|abc@gmail.com|2018-11-10 12:10:00|2018-11-10 16:10:00|    3|
|abc@gmail.com|2018-11-11 06:16:00|2018-11-11 10:16:00|    2|
+-------------+-------------------+-------------------+-----+

To change the length of the time period, the timedelta(hours=4) used in the code can be set to any value.
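If the window length changes regularly (1 hour, 6 hours, 1 day, ...), the partition logic above can be wrapped in a small factory so the interval becomes a parameter. This is only a sketch built on the code above; the helper name make_window_processor and its window_hours argument are illustrative and not part of the original code:

from datetime import timedelta
from pyspark.sql.types import Row

def make_window_processor(window_hours):
    # returns a mapPartitions function that uses a configurable window length
    def process_partition(df_chunk):
        row_list = list(df_chunk)
        if len(row_list) == 0:
            return
        email = row_list[0]['email']
        start = row_list[0]['date_time']
        end = start + timedelta(hours=window_hours)
        count = 0
        for row in row_list:
            if email == row['email'] and end > row['date_time']:
                count = count + 1
            else:
                yield Row(email, start, end, count)
                email = row['email']
                start = row['date_time']
                end = start + timedelta(hours=window_hours)
                count = 1
        yield Row(email, start, end, count)
    return process_partition

# e.g. 6-hour windows instead of 4-hour windows
result_6h = df.rdd.mapPartitions(make_window_processor(6)).toDF(["email", "from", "to", "count"])
result_6h.show()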