How to create decade-wide windows with a Spark DataFrame?

Time: 2018-03-29 03:57:34

Tags: apache-spark apache-spark-sql spark-dataframe

Sample data set:

1990;111;Tie Me Up! Tie Me Down!;Comedy;Banderas, Antonio;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1991;113;High Heels;Comedy;Bosé, Miguel;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No;NicholasCage.png
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No;seanConnery.png
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No;NicholasCage.png
1984;101;Target Eagle;Action;Connors, Chuck;Adams, Maud;Loma, José Antonio de la;14;No;NicholasCage.png
1989;99;American Angels: Baptism of Blood, The;Drama;Bergen, Robert D.;Adams, Trudy;Sebastian, Beverly;28;No;NicholasCage.png

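Question: The first column here is the year. Using this column, I want to create decade windows such as 1990-2000, 2000-2010, and so on. I know there is a window function available for DataFrames, but I don't know how to make each 10-year span (a decade) its own bucket.

Window function reference: http://blog.madhukaraphatak.com/introduction-to-spark-two-part-5/

Note: I am looking for a Scala-based solution.

2 Answers:

Answer 0: (score: 1)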

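This link may help you:

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

You can create a WindowSpec object and set its frame with the rangeBetween/rowsBetween functions.

I have a demo, though it is for a different example. Here it is: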

transactions.withColumn("column", transactions.col("cardNumber").over(Window.rowsBetween(x, y)))
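
To make the WindowSpec idea a bit more concrete for the movie data, here is a minimal sketch. It assumes a local SparkSession and a tiny hand-built DataFrame (the names movies, trailingDecade and moviesInTrailingDecade are made up for illustration). Note that rangeBetween gives a rolling frame relative to each row's year, not fixed 1990-2000 style buckets, so this only shows how a frame is attached to a WindowSpec.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

// Assumption: a local session, only for trying the snippet out.
val spark = SparkSession.builder().master("local[*]").appName("range-window").getOrCreate()
import spark.implicits._

// Four (year, title) rows taken from the sample data set in the question.
val movies = Seq(
  (1978, "Days of Heaven"),
  (1979, "Cuba"),
  (1983, "Octopussy"),
  (1990, "Tie Me Up! Tie Me Down!")
).toDF("year", "title")

// Frame: every row whose year lies in [current year - 9, current year].
val trailingDecade = Window.orderBy(col("year")).rangeBetween(-9, 0)

// A window frame needs an aggregate (or window function) applied over it.
val withCounts = movies.withColumn("moviesInTrailingDecade", count(lit(1)).over(trailingDecade))
withCounts.show(false)
```

Without a partitionBy, Spark moves all rows to a single partition for this window, which is fine for a small example but worth keeping in mind on a large data set.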

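Answer 1: (score: 0)

After digging deeper, I was able to split the data set into decade-wise buckets with the help of the following transformations: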

// Assumes an existing SparkSession named `spark` and a DataFrame `SortedByYear` with an integer "year" column
import org.apache.spark.sql.functions.{format_string, to_date, window}
import spark.implicits._

// Derive a new string column "Date" (January 1st of each year) from the "year" column; needed to bucket the data set by decade
val MovieDFwithDate = SortedByYear.withColumn("Date", format_string("01-01-%d", $"year"))

// Cast the string version of the date to a standard DATE value
val MovieDFwithDateFormat = MovieDFwithDate.withColumn("Date", to_date($"Date", "MM-dd-yyyy"))

// Window the data set into decade buckets: 365 days * 10 years = 3650 days (leap days are ignored)
val windowedDF = MovieDFwithDateFormat.select($"*", window($"Date", "3650 days", "3650 days"))
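
One caveat with the 3650-day approach: as far as I know, window() aligns its buckets to 1970-01-01 UTC by default, and a real decade has 3652 or 3653 days, so the bucket edges slowly drift away from exact decade boundaries. If all you need is a decade label per row, plain integer arithmetic on the year column may be simpler. The following is only a sketch under that assumption; the local SparkSession, the small hand-built DataFrame and the names movies, withDecade and perDecade are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, floor, lit}

// Assumption: a local session, only for trying the snippet out.
val spark = SparkSession.builder().master("local[*]").appName("decade-buckets").getOrCreate()
import spark.implicits._

// (year, title) pairs taken from the sample data set in the question.
val movies = Seq(
  (1990, "Tie Me Up! Tie Me Down!"),
  (1991, "High Heels"),
  (1983, "Dead Zone, The"),
  (1979, "Cuba"),
  (1978, "Days of Heaven"),
  (1983, "Octopussy"),
  (1984, "Target Eagle"),
  (1989, "American Angels: Baptism of Blood, The")
).toDF("year", "title")

// Round each year down to the start of its decade: 1983 -> 1980, 1990 -> 1990.
val withDecade = movies.withColumn("decade", (floor(col("year") / 10) * 10).cast("int"))

// Use the decade as an ordinary grouping key, e.g. to count movies per decade.
val perDecade = withDecade
  .groupBy("decade")
  .agg(count(lit(1)).as("movies"))
  .orderBy("decade")

perDecade.show()
```

For the eight sample rows this gives 2 movies for 1970, 4 for 1980 and 2 for 1990. The decade column can also be used in Window.partitionBy("decade") if per-decade window functions such as ranking are needed.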