Filter a Spark dataframe with a greater than and a less than of a list of dates

Time: 2019-06-05 12:37:20

Tags: scala apache-spark dataframe apache-spark-sql

I have a dataframe with the fields from_date and to_date:

(2017-01-10     2017-01-14)
(2017-01-03     2017-01-13)

and a list of dates:

2017-01-05,
2017-01-12,
2017-01-13,
2017-01-15

The idea is to retrieve from the table all rows for which the dates in this list fall between from_date and to_date.

Expected output:

The same dataframe, but keeping only the rows whose from_date and to_date fall within the range of values in the date list (<= or >=). So far I have tried Nikk's suggestion:

Filter a spark dataframe with a greater than and a less than of list of dates

But I need to compare against the whole list of dates, something like this:


spark.sql("select * from dataframe_table where from_date  >= (select  date from date_list) AND  to_date  <= (select date from date_list)")

3 Answers:

Answer 0 (score: 0)

If you want to compare multiple rows of one table with multiple rows of another (let's treat the date list as a table with a single column), you can use a join between the two tables. Usually you would test for equality between the tables. In this case your test is more specific, because you are comparing two columns of the first table against one column of the second. You can use datediff for this:

scala> val df1 = Seq(("2017-01-10", "2017-01-14")).toDF("start_date","end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))

df1: org.apache.spark.sql.DataFrame = [start_date: date, end_date: date]

scala> val df2 = Seq("2017-01-05", "2017-01-12","2017-01-13", "2017-01-15").toDF("from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [from_date: date]

scala> df2.join(df1, datediff('from_date,'start_date) > 0 && datediff('from_date,'end_date) < 0).show()
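
With the sample data above, only the list dates that fall strictly inside the range survive this join, so the result should look roughly like:

+----------+----------+----------+
| from_date|start_date|  end_date|
+----------+----------+----------+
|2017-01-12|2017-01-10|2017-01-14|
|2017-01-13|2017-01-10|2017-01-14|
+----------+----------+----------+

Note that datediff(...) > 0 / < 0 is a strict comparison; use >= 0 and <= 0 if the boundary dates should be kept as well.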

Answer 1 (score: 0)

Your question is a bit confusing to me, so I am providing code for two scenarios.

1) If you want to filter dates that lie within the range spanned by the provided list (e.g. from 2017-01-05 to 2017-01-15), the code snippet below covers this case.

//Create temp views for both the dataframe and the date list
    Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date",'from_date.cast("date")).withColumn("to_date",'to_date.cast("date")).createOrReplaceTempView("date_table")


    List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")

   spark.sql("select * from date_table where (from_date BETWEEN (select min(cast(list as date)) from date_list) and (select max(cast(list as date)) from date_list)) and (to_date between (select min(cast(list as date)) from date_list) and (select max(cast(list as date)) from date_list))").show()
    +----------+----------+
    | from_date|   to_date|
    +----------+----------+
    |2017-01-10|2017-01-14|
    +----------+----------+

2) Or, if you want to filter the dataframe for rows whose from_date and to_date both appear in the provided list of dates. With the data sample you provided, no row has both of its dates in the list, so the result is empty. For this case the code below will work.

//Create temp views for both the dataframe and the date list
       Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date",'from_date.cast("date")).withColumn("to_date",'to_date.cast("date")).createOrReplaceTempView("date_table")


        List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")

        spark.sql("select * from date_table where from_date in (select cast(list as date) from date_list) and to_date in (select cast(list as date) from date_list)").show() 

    +---------+-------+
    |from_date|to_date|
    +---------+-------+
    +---------+-------+
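
The same membership check can also be written with the DataFrame API; a minimal sketch assuming the same sample data in a spark-shell session (the names rangesDf and listDates are only illustrative):

    import java.sql.Date
    import spark.implicits._

    val rangesDf = Seq(("2017-01-10", "2017-01-14"), ("2017-01-03", "2017-01-13"))
      .toDF("from_date", "to_date")
      .withColumn("from_date", 'from_date.cast("date"))
      .withColumn("to_date", 'to_date.cast("date"))

    val listDates = List("2017-01-05", "2017-01-12", "2017-01-13", "2017-01-15").map(Date.valueOf)

    // Keep rows where both endpoints are members of the list; empty for this sample data
    rangesDf.filter('from_date.isin(listDates: _*) && 'to_date.isin(listDates: _*)).show()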

Please let me know if I have missed anything.

Answer 2 (score: 0)

Please check the following:

//Creating a DataFrame with columns from_date and to_date; you can skip this step if you already have the dataframe

scala> val df =  Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date", col("from_date").cast("date")).withColumn("to_date",col("to_date").cast("date"))
df: org.apache.spark.sql.DataFrame = [from_date: date, to_date: date]

scala> df.show()
+----------+----------+
| from_date|   to_date|
+----------+----------+
|2017-01-10|2017-01-14|
|2017-01-03|2017-01-13|
+----------+----------+

//Creating a temporary view for dataframe "df" so that we can use it in Spark SQL.
scala> df.createOrReplaceTempView("dataframe_table")

//Converting List into Temp view
List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")


//Query to retrieve all rows from the dataframe where from_date and to_date are within the range of the list.

scala> val output =  spark.sql("select * from dataframe_table where from_date  >= (select min(cast(list as date)) from date_list) AND  to_date  <= (select max(cast(list as date)) from date_list)")
output: org.apache.spark.sql.DataFrame = [from_date: date, to_date: date]

scala> output.show()
+----------+----------+                                                         
| from_date|   to_date|
+----------+----------+
|2017-01-10|2017-01-14|
+----------+----------+
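
For completeness, the same bound check can be expressed without temp views, by aggregating the list's min and max and broadcasting the one-row result; a minimal sketch assuming the same spark-shell session and the df defined above (the names listDf, bounds and output2 are only illustrative):

import org.apache.spark.sql.functions._

val listDf = List("2017-01-05", "2017-01-12", "2017-01-13", "2017-01-15").toDF("list")

// One-row dataframe holding the bounds of the list
val bounds = listDf.agg(min('list.cast("date")).as("min_d"),
                        max('list.cast("date")).as("max_d"))

// Broadcast the bounds to every row of df, filter on the range, then drop the helper columns
val output2 = df.crossJoin(broadcast(bounds))
  .filter('from_date >= 'min_d && 'to_date <= 'max_d)
  .drop("min_d", "max_d")
output2.show()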