I have a dataframe with the fields from_date and to_date:
(2017-01-10 2017-01-14)
(2017-01-03 2017-01-13)
and a list of dates:
2017-01-05,
2017-01-12,
2017-01-13,
2017-01-15
The idea is to retrieve from the table all rows where the dates in that list fall between from_date and to_date.
Expected output:
The same dataframe, but keeping only the rows whose from_date and to_date lie within the range of values in the date list (<= or >=). So far I have tried Nikk's suggestion:
Filter a spark dataframe with a greater than and a less than of list of dates
But I need to compare against the whole list of dates, something like this:
spark.sql("select * from dataframe_table where from_date >= (select date from date_list) AND to_date <= (select date from date_list)")
Answer 0 (score: 0)
If you want to compare multiple rows of one table with multiple rows of another (treating the date list as a table with a single column), you can use a join between the two tables. Usually you would test for equality; here the test is a bit more specific, because you are comparing two columns of the first table against one column of the second. You can use datediff for that:
scala> val df1 = Seq(("2017-01-10", "2017-01-14")).toDF("start_date","end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [start_date: date, end_date: date]
scala> val df2 = Seq("2017-01-05", "2017-01-12","2017-01-13", "2017-01-15").toDF("from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [from_date: date]
scala> df2.join(df1, datediff('from_date,'start_date) > 0 && datediff('from_date,'end_date) < 0).show()
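With the parentheses fixed as above, the join keeps only the list dates that fall strictly between start_date and end_date; for the sample data the output should look roughly like this:
+----------+----------+----------+
| from_date|start_date|  end_date|
+----------+----------+----------+
|2017-01-12|2017-01-10|2017-01-14|
|2017-01-13|2017-01-10|2017-01-14|
+----------+----------+----------+
Note that datediff(...) > 0 and datediff(...) < 0 exclude the boundary dates; if the comparison should be inclusive (<= / >=), as the question suggests, Column.between can be used instead, e.g. df2.join(df1, 'from_date.between('start_date, 'end_date)).show().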
Answer 1 (score: 0)
Your question is a bit confusing to me, so I am providing code for two scenarios.
1) If you want to filter rows whose dates lie between the range of the provided list (i.e. from 2017-01-05 to 2017-01-15), the snippet below covers that case.
//Create temp views for both the date table and the date list
Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date",'from_date.cast("date")).withColumn("to_date",'to_date.cast("date")).createOrReplaceTempView("date_table")
List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")
spark.sql("select * from date_table where (from_date BETWEEN (select min(cast(list as date)) from date_list) and (select max(cast(list as date)) from date_list)) and (to_date between (select min(cast(list as date)) from date_list) and (select max(cast(list as date)) from date_list))").show()
+----------+----------+
| from_date|   to_date|
+----------+----------+
|2017-01-10|2017-01-14|
+----------+----------+
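For reference, here is a DataFrame-API sketch of the same min/max filter. The names dateDf and listDf are placeholders for DataFrames holding the same data as the date_table and date_list views above:
import org.apache.spark.sql.functions.{col, min, max, to_date}
// Compute the lower and upper bound of the date list once.
val bounds = listDf.agg(min(to_date(col("list"))).as("lo"), max(to_date(col("list"))).as("hi"))
// Keep only rows whose whole (from_date, to_date) interval sits inside [lo, hi].
dateDf.crossJoin(bounds)
  .filter(col("from_date") >= col("lo") && col("to_date") <= col("hi"))
  .drop("lo", "hi")
  .show()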
2) Or, if you want to filter rows of the dataframe whose from_date and to_date both appear in the provided date list. With the sample data you provided, no rows satisfy this, so the result is empty. For that scenario, the code below works.
//Create temp views for both the date table and the date list
Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date",'from_date.cast("date")).withColumn("to_date",'to_date.cast("date")).createOrReplaceTempView("date_table")
List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")
spark.sql("select * from date_table where from_date in (select cast(list as date) from date_list) and to_date in (select cast(list as date) from date_list)").show()
+----------+----------+
| from_date|   to_date|
+----------+----------+
+----------+----------+
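The same membership check can also be written with the DataFrame API; as above, dateDf and listDf are placeholder names for the two DataFrames behind the temp views:
import org.apache.spark.sql.functions.{col, to_date}
// Collect the (small) date list to the driver as java.sql.Date values.
val dates = listDf.select(to_date(col("list"))).collect().map(_.getDate(0))
// Keep rows only when both endpoints are literally present in the list.
dateDf.filter(col("from_date").isin(dates: _*) && col("to_date").isin(dates: _*)).show()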
Please let me know if I missed anything.
Answer 2 (score: 0)
Please check the following:
//Create a DataFrame with columns from_date and to_date; you can skip this step if you already have the dataframe
scala> val df = Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date", col("from_date").cast("date")).withColumn("to_date",col("to_date").cast("date"))
df: org.apache.spark.sql.DataFrame = [from_date: date, to_date: date]
scala> df.show()
+----------+----------+
| from_date|   to_date|
+----------+----------+
|2017-01-10|2017-01-14|
|2017-01-03|2017-01-13|
+----------+----------+
//Creating a temporary view for dataframe "df" so that we can use it in Spark SQL.
scala> df.createOrReplaceTempView("dataframe_table")
//Converting the list into a temp view
List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")
//Query to retrieve all rows from the dataframe where from_date and to_date fall within the range of the list.
scala> val output = spark.sql("select * from dataframe_table where from_date >= (select min(cast(list as date)) from date_list) AND to_date <= (select max(cast(list as date)) from date_list)")
output: org.apache.spark.sql.DataFrame = [from_date: date, to_date: date]
scala> output.show()
+----------+----------+
| from_date|   to_date|
+----------+----------+
|2017-01-10|2017-01-14|
+----------+----------+
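An alternative sketch that avoids the scalar subqueries, assuming the same df and date_list view as above: collect the bounds of the list once and filter with literals.
import org.apache.spark.sql.functions.{col, lit}
// Pull the min/max of the (tiny) date list to the driver once.
val bounds = spark.sql("select min(cast(list as date)) as lo, max(cast(list as date)) as hi from date_list").head
val lo = bounds.getDate(0)
val hi = bounds.getDate(1)
// Same filter as the SQL above, expressed with literal bounds.
df.filter(col("from_date") >= lit(lo) && col("to_date") <= lit(hi)).show()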