Reading values from a table and applying conditions in Spark

Date: 2017-09-17 19:51:41

Tags: scala apache-spark apache-spark-sql

I have a dataframe, df1:

+------+--------+--------+--------+
| Name | value1 | value2 | value3 |
+------+--------+--------+--------+
| A    | 100    | null   |    200 |
| B    | 10000  | 300    |     10 |
| c    | null   | 10     |    100 |
+------+--------+--------+--------+

And a second dataframe, df2:

+------+------+
| Col1 | col2 |
+------+------+
| X    | 1000 |
| Y    | 2002 |
| Z    | 3000 |
+------+------+

I want to read values such as value1, value2, and value3 from df1, and apply conditions to df2 by adding new columns:

cond1: when name = A and col2 > value1, flag Y, otherwise N

cond2: when name = B and col2 > value2, flag Y, otherwise N

cond3: when name = c and col2 > value1 and col2 > value3, flag Y, otherwise N

My code:

df2.withColumn("cond1", when($"col2" > value1, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond2", when($"col2" > value2, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond3", when($"col2" > value1 && $"col2" > value3, lit("Y")).otherwise(lit("N")))

Expected output:

+------+------+-------+-------+-------+
| Col1 | col2 | cond1 | cond2 | cond3 |
+------+------+-------+-------+-------+
| X    | 1000 | Y     | Y     | Y     |
| Y    | 2002 | N     | Y     | Y     |
| Z    | 3000 | Y     | Y     | Y     |
+------+------+-------+-------+-------+

2 Answers:

Answer 0 (score: 1)

If I understand your question correctly, you can join the two dataframes and create the condition columns as shown below. A couple of notes:

1) Per the described conditions, the nulls in df1 are replaced with Int.MinValue to simplify the integer comparisons

2) Since df1 is small, a broadcast join is used to minimize sorting/shuffling for better performance

val df1 = Seq(
  ("A", 100, Int.MinValue, 200),
  ("B", 10000, 300, 10),
  ("C", Int.MinValue, 10, 100)
).toDF("Name", "value1", "value2", "value3")

val df2 = Seq(
  ("A", 1000),
  ("B", 2002),
  ("C", 3000),
  ("A", 5000),
  ("A", 150),
  ("B", 250),
  ("B", 12000),
  ("C", 50)
).toDF("Col1", "col2")

val df3 = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select(
  df2("Col1"),
  df2("col2"),
  when(df2("col2") > df1("value1"), "Y").otherwise("N").as("cond1"),
  when(df2("col2") > df1("value2"), "Y").otherwise("N").as("cond2"),
  when(df2("col2") > df1("value1") && df2("col2") > df1("value3"), "Y").otherwise("N").as("cond3")
)

df3.show
+----+-----+-----+-----+-----+
|Col1| col2|cond1|cond2|cond3|
+----+-----+-----+-----+-----+
|   A| 1000|    Y|    Y|    Y|
|   B| 2002|    N|    Y|    N|
|   C| 3000|    Y|    Y|    Y|
|   A| 5000|    Y|    Y|    Y|
|   A|  150|    Y|    Y|    N|
|   B|  250|    N|    N|    N|
|   B|12000|    Y|    Y|    Y|
|   C|   50|    Y|    Y|    N|
+----+-----+-----+-----+-----+
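The join-and-flag logic above can be sketched in plain Scala collections (no Spark required) to make the conditions concrete. This is only an illustrative mock of the answer's approach; the case class names are my own, and it assumes the nulls have already been replaced by Int.MinValue as described in note 1:

```scala
// Plain-Scala sketch of the join-and-flag logic from the answer above.
case class Df1Row(name: String, value1: Int, value2: Int, value3: Int)
case class Df2Row(col1: String, col2: Int)

val df1 = Seq(
  Df1Row("A", 100, Int.MinValue, 200),
  Df1Row("B", 10000, 300, 10),
  Df1Row("C", Int.MinValue, 10, 100)
)
val df2 = Seq(Df2Row("A", 1000), Df2Row("B", 2002), Df2Row("C", 3000))

// Inner join on Col1 === Name, then compute the Y/N flags per joined row,
// mirroring the three when(...).otherwise(...) expressions.
val flagged = for {
  r2 <- df2
  r1 <- df1 if r1.name == r2.col1
} yield (
  r2.col1,
  r2.col2,
  if (r2.col2 > r1.value1) "Y" else "N",
  if (r2.col2 > r1.value2) "Y" else "N",
  if (r2.col2 > r1.value1 && r2.col2 > r1.value3) "Y" else "N"
)

flagged.foreach(println)
// (A,1000,Y,Y,Y)
// (B,2002,N,Y,N)
// (C,3000,Y,Y,Y)
```

These match the first three rows of the Spark output above; the broadcast join in the real code produces the same pairs, just without shuffling df1 across the cluster.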

Answer 1 (score: 0)

You can create a rowNo column in both dataframes as below

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf1 = df1.withColumn("rowNo", row_number().over(Window.orderBy("Name")))
val tempdf2 = df2.withColumn("rowNo", row_number().over(Window.orderBy("Col1")))

Then you can join them on the created column

val joinedDF = tempdf2.join(tempdf1, Seq("rowNo"), "left")

Finally, you can use the select and when functions to get the final dataframe

joinedDF.select($"Col1",
  $"col2",
  when($"col2">$"value1" || $"value1".isNull, "Y").otherwise("N").as("cond1"),
  when($"col2">$"value2" || $"value2".isNull, "Y").otherwise("N").as("cond2"),
  when(($"col2">$"value1" && $"col2">$"value3") || $"value3".isNull, "Y").otherwise("N").as("cond3"))

which should give you the desired dataframe:

+----+----+-----+-----+-----+
|Col1|col2|cond1|cond2|cond3|
+----+----+-----+-----+-----+
|X   |1000|Y    |Y    |Y    |
|Y   |2002|N    |Y    |Y    |
|Z   |3000|Y    |Y    |Y    |
+----+----+-----+-----+-----+
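The isNull handling in the select above (a null comparison value defaults the flag to Y) can be mirrored in plain Scala with Option. This is only a sketch of the logic for a single condition, not Spark code, and the function name `flag` is my own:

```scala
// Sketch of the null handling: a missing (None) comparison value
// makes the condition succeed, matching the isNull branches above.
def flag(col2: Int, value: Option[Int]): String =
  if (value.forall(col2 > _)) "Y" else "N"  // None => "Y"

// Row c's value1 is null, so cond1 is "Y" regardless of col2.
println(flag(3000, None))         // prints Y
// Row B's value1 is 10000, so col2 = 2002 fails the comparison.
println(flag(2002, Some(10000)))  // prints N
```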

I hope the answer is helpful.