Finding consecutive declines of total_sale in Spark SQL

Time: 2019-10-31 05:39:24

Tags: scala apache-spark apache-spark-sql

I have data in a table/dataframe.

table/dataframe: temptable/temp_df 
StoreId,Total_Sales,Date
S1,10000,01-Jan-18
S1,20000,02-Jan-18
S1,25000,03-Jan-18
S1,30000,04-Jan-18
S1,29000,05-Jan-18 --> total sales declined from the previous value (04-Jan-18)
S1,28500,06-Jan-18 --> total sales declined from the previous value (05-Jan-18)
S1,25500,07-Jan-18 --> total sales declined from the previous value (06-Jan-18) (output row)
S1,25500,08-Jan-18 --> total sales unchanged from the previous value (07-Jan-18)
S1,30000,09-Jan-18
S1,29000,10-Jan-18 --> same (declining)
S1,28000,11-Jan-18 --> same (declining)
S1,25000,12-Jan-18 --> same (declining) (output row)
S1,25000,13-Jan-18
S1,30000,14-Jan-18
S1,29000,15-Jan-18 
S1,28000,16-Jan-18 

So I want the records from the dataframe/table where total sales has declined 3 times consecutively. If a row has the same total_sale as the previous row, it is treated as neither declining nor increasing.

The expected output is:

StoreId,Total_Sales,Date
S1,25500,07-Jan-18
S1,25000,12-Jan-18
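For reference, the rule above can be expressed as a small pure-Python function (my own sketch, not part of the question or answer), which is handy for sanity-checking any Spark implementation against the sample data. One assumption is made explicit here: an unchanged value is treated as breaking the current run of declines, which reproduces the expected output, though the question does not say whether it should merely pause the run instead.

```python
from collections import defaultdict

def rows_with_third_decline(rows):
    """rows: iterable of (store_id, total_sales, date) tuples, already
    ordered by date within each store. Returns the rows on which the
    third consecutive strict decline in total_sales ends.

    Assumption: an equal value resets the run of declines (it is
    neither a decline nor an increase, per the question)."""
    streak = defaultdict(int)   # store_id -> current run of strict declines
    prev = {}                   # store_id -> previous total_sales
    out = []
    for store, sales, date in rows:
        if store in prev and sales < prev[store]:
            streak[store] += 1
        else:
            streak[store] = 0   # increase or unchanged: run is broken
        if streak[store] == 3:
            out.append((store, sales, date))
        prev[store] = sales
    return out

data = [
    ("S1", 10000, "01-Jan-18"), ("S1", 20000, "02-Jan-18"),
    ("S1", 25000, "03-Jan-18"), ("S1", 30000, "04-Jan-18"),
    ("S1", 29000, "05-Jan-18"), ("S1", 28500, "06-Jan-18"),
    ("S1", 25500, "07-Jan-18"), ("S1", 25500, "08-Jan-18"),
    ("S1", 30000, "09-Jan-18"), ("S1", 29000, "10-Jan-18"),
    ("S1", 28000, "11-Jan-18"), ("S1", 25000, "12-Jan-18"),
    ("S1", 25000, "13-Jan-18"), ("S1", 30000, "14-Jan-18"),
    ("S1", 29000, "15-Jan-18"), ("S1", 28000, "16-Jan-18"),
]
print(rows_with_third_decline(data))
# → [('S1', 25500, '07-Jan-18'), ('S1', 25000, '12-Jan-18')]
```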

1 Answer:

Answer 0: (score: 0)

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window

sc = SparkSession.builder.appName("example").\
config("spark.driver.memory","1g").\
config("spark.executor.cores",2).\
config("spark.cores.max",4).getOrCreate()



df = sc.read.format("csv").option("header","true").option("delimiter",",").load("storesales.csv")

w = Window.partitionBy("StoreID").orderBy("Date")
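One caveat about this window (my observation, not part of the original answer): `orderBy("Date")` sorts the string column lexically. That happens to work for the sample data because the days are zero-padded and every row falls in the same month, but across months a string like "02-Feb-18" would sort before "15-Jan-18". The plain-Python sketch below illustrates the pitfall and the fix via a parsed sort key; in Spark, something like `f.to_date(df["Date"], "dd-MMM-yy")` would play the same role (pattern assumed, check it against your Spark version's datetime patterns).

```python
from datetime import datetime

dates = ["15-Jan-18", "02-Feb-18", "01-Jan-18"]

# Lexical sort compares the strings character by character,
# so February lands before 15 January.
lexical = sorted(dates)  # ['01-Jan-18', '02-Feb-18', '15-Jan-18']

# Parsing with a day-month-year pattern gives the true chronology.
chronological = sorted(dates, key=lambda d: datetime.strptime(d, "%d-%b-%y"))
print(chronological)  # → ['01-Jan-18', '15-Jan-18', '02-Feb-18']
```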


df = df.withColumn("oneprev",f.lag("Total_Sales",1).over(w)).withColumn("twoprev",f.lag("Total_Sales",2).over(w))



df = df.withColumn("isdeclining",f.when((df["Total_Sales"].cast("double") < df["oneprev"].cast("double")) & (df["oneprev"].cast("double") < df["twoprev"].cast("double")) ,"declining").otherwise("notdeclining"))
df = df.withColumn("oneprev_isdeclining",f.lag("isdeclining",1).over(w)).withColumn("twoprev_isdeclining",f.lag("isdeclining",2).over(w))

df = df.filter((df["isdeclining"] == "declining") & (df["oneprev_isdeclining"] != "declining") & (df["twoprev_isdeclining"] != "declining")).select(["StoreID","Date","Total_Sales"])

df.show()

You could combine several of these lines into one, but ideally the Spark SQL optimizer should take care of that.

Sample input:

+-------+-----------+---------+
|StoreId|Total_Sales|     Date|
+-------+-----------+---------+
|     S1|      10000|01-Jan-18|
|     S1|      20000|02-Jan-18|
|     S1|      25000|03-Jan-18|
|     S1|      30000|04-Jan-18|
|     S1|      29000|05-Jan-18|
|     S1|      28500|06-Jan-18|
|     S1|      25500|07-Jan-18|
|     S1|      25500|08-Jan-18|
|     S1|      30000|09-Jan-18|
|     S1|      29000|10-Jan-18|
|     S1|      28000|11-Jan-18|
|     S1|      25000|12-Jan-18|
|     S1|      25000|13-Jan-18|
|     S1|      30000|14-Jan-18|
|     S1|      29000|15-Jan-18|
+-------+-----------+---------+    

Desired Output :

+-------+---------+-----------+
|StoreID|     Date|Total_Sales|
+-------+---------+-----------+
|     S1|06-Jan-18|      28500|
|     S1|11-Jan-18|      28000|
+-------+---------+-----------+