Using withColumn with an external function

Asked: 2017-08-28 22:09:01

Tags: scala apache-spark apache-spark-sql user-defined-functions

I have data in a DataFrame with the following columns:

  1. The file format is CSV.
  2. All of the following columns have the String data type:

    employeeid, pexpense, cexpense

  3. I now need to create a new DataFrame that contains a new column named expense, computed from the columns pexpense and cexpense.

    The tricky part is that the calculation algorithm is not a UDF I created; it is an external function that has to be imported from a Java library. It takes primitive types as parameters - in this case the values of pexpense and cexpense - and computes the value required for the new column.

    The function signature from the external Java jar:

    public class MyJava {
        public Double calculateExpense(Double pexpense, Double cexpense) {
            // calculation
        }
    }
    

    So how can I call that external function to create the new computed column? Can I register the external function as a UDF in my Spark application?

3 Answers:

Answer 0 (score: 1)

You can create a UDF from the external method along the lines of the following (illustrated using the Scala REPL):

// From a Linux shell prompt:

vi MyJava.java
public class MyJava {
    public Double calculateExpense(Double pexpense, Double cexpense) {
        return pexpense + cexpense;
    }
}
:wq

javac MyJava.java
jar -cvf MyJava.jar MyJava.class

spark-shell --jars /path/to/jar/MyJava.jar

// From within the Spark shell

val df = Seq(
  ("1", "1.0", "2.0"), ("2", "3.0", "4.0")
).toDF("employeeid", "pexpense", "cexpense")

val myJava = new MyJava

val myJavaUdf = udf(
  myJava.calculateExpense _
)

val df2 = df.withColumn("totalexpense", myJavaUdf($"pexpense", $"cexpense") )

df2.show
+----------+--------+--------+------------+
|employeeid|pexpense|cexpense|totalexpense|
+----------+--------+--------+------------+
|         1|     1.0|     2.0|         3.0|
|         2|     3.0|     4.0|         7.0|
+----------+--------+--------+------------+
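
If you also want to call the function from Spark SQL by name, the wrapped method can be registered on the session. A minimal sketch, assuming the Spark 2.x spark.udf.register API (the name calcExpense is just an illustration):

// Register the wrapped method under a SQL-callable name
spark.udf.register("calcExpense", myJava.calculateExpense _)

df.createOrReplaceTempView("expenses")

spark.sql("SELECT employeeid, calcExpense(pexpense, cexpense) AS totalexpense FROM expenses").show

Note that the myJava instance is captured in the UDF closure, so on a real cluster the MyJava class may need to implement java.io.Serializable.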

Answer 1 (score: 0)

You can simply "wrap" the given method in a UDF by passing it as a parameter to the udf function in org.apache.spark.sql.functions:

import org.apache.spark.sql.functions._
import spark.implicits._

val myUdf = udf(calculateExpense _)

val newDF = df.withColumn("expense", myUdf($"pexpense", $"cexpense"))

This assumes the pexpense and cexpense columns are both Doubles.
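
Since the question states the CSV columns arrive as Strings, an explicit cast to Double keeps that assumption true. A minimal sketch (malformed values become null after the cast):

import org.apache.spark.sql.types.DoubleType

// Cast the String columns to Double before applying the UDF
val typedDF = df
  .withColumn("pexpense", $"pexpense".cast(DoubleType))
  .withColumn("cexpense", $"cexpense".cast(DoubleType))

val expenseDF = typedDF.withColumn("expense", myUdf($"pexpense", $"cexpense"))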

Answer 2 (score: -1)

Below is an example that sums two columns:

import org.apache.spark.sql.functions.{col, udf}

// A simple UDF that adds two Int values
val somme = udf((a: Int, b: Int) => a + b)

val df_new = df.select(
  col("employeeid"),
  col("pexpense"),
  col("cexpense"),
  somme(col("pexpense"), col("cexpense")) as "expense"
)
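
The same result can be written with withColumn, matching the approach asked about in the question:

val df_new2 = df.withColumn("expense", somme(col("pexpense"), col("cexpense")))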