Question

I need to implement NVL functionality in Spark when joining two DataFrames.

Input DataFrames:

ds1.show()
---------------
|key  | Code  |
---------------
|2    | DST   |
|3    | CPT   |
|null | DTS   |
|5    | KTP   |
---------------

ds2.show()
-----------------
|key  | PremAmt |
-----------------
|2    | 300     |
|-1   | -99     |
|5    | 567     |
-----------------

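I need to implement "LEFT JOIN NVL(DS1.key, -1) = DS2.key". This is what I wrote, but it has no NVL or coalesce function, so the row with a null key finds no match and the join returns the wrong values.

How do I apply "NVL" in a Spark DataFrame join?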

// nvl function is missing, so wrong output
ds1.join(ds2,Seq("key"),"left_outer")

-------------------------
|key  | Code  |PremAmt  |
-------------------------
|2    | DST   |300      |
|3    | CPT   |null     |
|null | DTS   |null     |
|5    | KTP   |567      |
-------------------------

Expected result:

-------------------------
|key  | Code  |PremAmt  |
-------------------------
|2    | DST   |300      |
|3    | CPT   |null     |
|null | DTS   |-99      |
|5    | KTP   |567      |
-------------------------

Answer 1

I know one somewhat complicated way:

 val df = df1.join(df2, coalesce(df1("key"), lit(-1)) === df2("key"), "left_outer")

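Because both DataFrames have a column named "key", the joined result contains two "key" columns. You should rename the "key" column in one of the DataFrames before the join and drop that column afterwards.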

Answer 2

The answer is to use NVL. This code runs in Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("CommonMethods").getOrCreate()

Note: the SparkSession is built in a "chained" style, i.e. three methods are applied on the same line.

Read the CSV file:

df = spark.read.csv('C:\\tableausuperstore1_all.csv',inferSchema='true',header='true')

df.createOrReplaceTempView("ViewSuperstore")

ViewSuperstore can now be used in SQL:

print("*trace1-nvl")

df = spark.sql("select nvl(state,'a') testString, nvl(quantity,0) testInt  from ViewSuperstore where state='Florida' and OrderDate>current_date() ")

df.show()

print("*trace2-FINAL")

Answer 3

Implementing nvl in Scala:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

// Returns ColIn where it is non-null, otherwise ReplaceVal as a literal.
def nvl(ColIn: Column, ReplaceVal: Any): Column =
  when(ColIn.isNull, lit(ReplaceVal)).otherwise(ColIn)

Now you can use nvl like any other function in DataFrame operations, for example:

val NewDf = DF.withColumn("MyColNullsReplaced", nvl($"MyCol", "<null>"))

Obviously, ReplaceVal must be of the correct type. The example above assumes $"MyCol" is of type string.

Answer 4

This worked for me:

intermediateDF.select(col("event_start_timestamp"),
        col("cobrand_id"),
        col("rule_name"),
        col("table_name"),
        coalesce(col("dimension_field1"),lit(0)),
        coalesce(col("dimension_field2"),lit(0)),
        coalesce(col("dimension_field3"),lit(0)),
        coalesce(col("dimension_field4"),lit(0)),
        coalesce(col("dimension_field5"),lit(0))
      )
