I have data in a table/dataframe.
table/dataframe: temptable/temp_df
StoreId,Total_Sales,Date
S1,10000,01-Jan-18
S1,20000,02-Jan-18
S1,25000,03-Jan-18
S1,30000,04-Jan-18
S1,29000,05-Jan-18 --> total sales declines from the previous value (04-Jan-18)
S1,28500,06-Jan-18 --> total sales declines from the previous value (05-Jan-18)
S1,25500,07-Jan-18 --> total sales declines from the previous value (06-Jan-18) (output row)
S1,25500,08-Jan-18 --> total sales is unchanged from the previous value (07-Jan-18)
S1,30000,09-Jan-18
S1,29000,10-Jan-18 --> same pattern of declines starts again
S1,28000,11-Jan-18 --> same
S1,25000,12-Jan-18 --> same (output row)
S1,25000,13-Jan-18
S1,30000,14-Jan-18
S1,29000,15-Jan-18
S1,28000,16-Jan-18
So I want the records from the dataframe/table where the total sales value has declined 3 times in a row. If a row has the same total_sales as the previous one, it is treated as neither a decline nor an increase.
The expected output is:
StoreId,Total_Sales,Date
S1,25500,07-Jan-18
S1,25000,12-Jan-18
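To make the rule concrete, here is a tiny plain-Python illustration of how I am counting declines on the 04-Jan to 08-Jan values (illustration only; the real data lives in the table/dataframe above):

sales = [30000, 29000, 28500, 25500, 25500]   # 04-Jan-18 .. 08-Jan-18
drops = [sales[i] < sales[i - 1] for i in range(1, len(sales))]
# drops == [True, True, True, False]
# 07-Jan-18 is the first date with three consecutive drops ending at it, so it is an output row;
# 08-Jan-18 equals 07-Jan-18, so it counts as neither a decline nor an increase.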
Answer:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window

sc = SparkSession.builder.appName("example").\
    config("spark.driver.memory","1g").\
    config("spark.executor.cores",2).\
    config("spark.cores.max",4).getOrCreate()

# All columns are read as strings, which is why the sales values are cast to double below.
df = sc.read.format("csv").option("header","true").option("delimiter",",").load("storesales.csv")

# Order each store's rows chronologically; Date is a string, so parse it for the ordering.
w = Window.partitionBy("StoreID").orderBy(f.to_date("Date","dd-MMM-yy"))

# Pull the previous two Total_Sales values alongside each row.
df = df.withColumn("oneprev",f.lag("Total_Sales",1).over(w)).withColumn("twoprev",f.lag("Total_Sales",2).over(w))

# Mark a row as "declining" when it is lower than the previous value and that value is lower than the one before it.
df = df.withColumn("isdeclining",f.when((df["Total_Sales"].cast("double") < df["oneprev"].cast("double")) & (df["oneprev"].cast("double") < df["twoprev"].cast("double")),"declining").otherwise("notdeclining"))

# Keep only the first "declining" row of each run.
df = df.withColumn("oneprev_isdeclining",f.lag("isdeclining",1).over(w)).withColumn("twoprev_isdeclining",f.lag("isdeclining",2).over(w))
df = df.filter((df["isdeclining"] == "declining") & (df["oneprev_isdeclining"] != "declining") & (df["twoprev_isdeclining"] != "declining")).select(["StoreID","Date","Total_Sales"])
df.show()
You can combine several of these lines into one, but ideally the Spark SQL optimizer should take care of that anyway.
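For example, a sketch of the same steps written as one chained expression (assuming df is the dataframe as freshly loaded from the CSV, and w and f are the window and functions module defined above; result is just a new name used here):

result = (df.withColumn("oneprev", f.lag("Total_Sales", 1).over(w))
            .withColumn("twoprev", f.lag("Total_Sales", 2).over(w))
            .withColumn("isdeclining", f.when((f.col("Total_Sales").cast("double") < f.col("oneprev").cast("double"))
                                              & (f.col("oneprev").cast("double") < f.col("twoprev").cast("double")),
                                              "declining").otherwise("notdeclining"))
            .withColumn("oneprev_isdeclining", f.lag("isdeclining", 1).over(w))
            .withColumn("twoprev_isdeclining", f.lag("isdeclining", 2).over(w))
            .filter((f.col("isdeclining") == "declining")
                    & (f.col("oneprev_isdeclining") != "declining")
                    & (f.col("twoprev_isdeclining") != "declining"))
            .select("StoreID", "Date", "Total_Sales"))
result.show()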
Sample input:
+-------+-----------+---------+
|StoreId|Total_Sales| Date|
+-------+-----------+---------+
| S1| 10000|01-Jan-18|
| S1| 20000|02-Jan-18|
| S1| 25000|03-Jan-18|
| S1| 30000|04-Jan-18|
| S1| 29000|05-Jan-18|
| S1| 28500|06-Jan-18|
| S1| 25500|07-Jan-18|
| S1| 25500|08-Jan-18|
| S1| 30000|09-Jan-18|
| S1| 29000|10-Jan-18|
| S1| 28000|11-Jan-18|
| S1| 25000|12-Jan-18|
| S1| 25000|13-Jan-18|
| S1| 30000|14-Jan-18|
| S1| 29000|15-Jan-18|
+-------+-----------+---------+
Output of the above code:
+-------+---------+-----------+
|StoreID| Date|Total_Sales|
+-------+---------+-----------+
| S1|06-Jan-18| 28500|
| S1|11-Jan-18| 28000|
+-------+---------+-----------+
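Note that this flags the first row of each two-drop run, which is why it returns 06-Jan and 11-Jan rather than the 07-Jan and 12-Jan rows in the question's expected output (those require three consecutive drops). A sketch of that variant, assuming the same f and w as above and df as freshly loaded from the CSV (threeprev is a new column name introduced here):

df3 = (df.withColumn("oneprev", f.lag("Total_Sales", 1).over(w))
         .withColumn("twoprev", f.lag("Total_Sales", 2).over(w))
         .withColumn("threeprev", f.lag("Total_Sales", 3).over(w)))
# Keep rows where the last three day-over-day changes are all drops: current < prev < prev2 < prev3.
df3 = df3.filter((f.col("Total_Sales").cast("double") < f.col("oneprev").cast("double"))
                 & (f.col("oneprev").cast("double") < f.col("twoprev").cast("double"))
                 & (f.col("twoprev").cast("double") < f.col("threeprev").cast("double")))
df3.select("StoreID", "Date", "Total_Sales").show()

For the sample data this should print the 07-Jan-18 and 12-Jan-18 rows.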