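I have data in a DataFrame with the following columns, all of data type String:
employee, pexpense, cexpense
Now I need to create a new DataFrame containing a new column named expense, computed from the columns pexpense and cexpense.
The tricky part is that the calculation is not a UDF I wrote myself. It is an external function that has to be imported from a Java library, and it takes primitive-style values as arguments (here pexpense and cexpense) to compute the value of the new column.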
The function signature from the external Java jar:
public class MyJava
{
    public Double calculateExpense(Double pexpense, Double cexpense) {
        // calculation
    }
}
So how can I call that external function to create the new computed column? Can I register the external function as a UDF in my Spark application?
Answer 0 (score: 1)
You can create a UDF from the external method along the following lines (illustrated with the Scala REPL):
// From a Linux shell prompt:
vi MyJava.java
public class MyJava {
public Double calculateExpense(Double pexpense, Double cexpense) {
return pexpense + cexpense;
}
}
:wq
javac MyJava.java
jar -cvf MyJava.jar MyJava.class
spark-shell --jars /path/to/jar/MyJava.jar
// From within the Spark shell
val df = Seq(
("1", "1.0", "2.0"), ("2", "3.0", "4.0")
).toDF("employeeid", "pexpense", "cexpense")
val myJava = new MyJava
val myJavaUdf = udf(
myJava.calculateExpense _
)
val df2 = df.withColumn("totalexpense", myJavaUdf($"pexpense", $"cexpense") )
df2.show
+----------+--------+--------+------------+
|employeeid|pexpense|cexpense|totalexpense|
+----------+--------+--------+------------+
|         1|     1.0|     2.0|         3.0|
|         2|     3.0|     4.0|         7.0|
+----------+--------+--------+------------+
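The question states that the columns are Strings, while calculateExpense takes Double arguments. One possible way to make that conversion explicit is a thin adapter around the external method. The following is only a sketch under stated assumptions: ExpenseAdapter, parseOrNull and fromStrings are hypothetical names that are not part of the external library, and the MyJava body is the placeholder sum from this answer rather than the real algorithm.

```java
// Stand-in for the external class. The real implementation lives in the jar;
// the simple sum below is the placeholder used in this answer.
class MyJava {
    public Double calculateExpense(Double pexpense, Double cexpense) {
        return pexpense + cexpense;
    }
}

// Hypothetical adapter, not part of the external library.
final class ExpenseAdapter {
    private ExpenseAdapter() {}

    // Parses a String cell; returns null for missing or non-numeric input.
    static Double parseOrNull(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Double.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Converts both String inputs and delegates to the external method.
    // Returns null if either input cannot be parsed.
    static Double fromStrings(String pexpense, String cexpense) {
        Double p = parseOrNull(pexpense);
        Double c = parseOrNull(cexpense);
        if (p == null || c == null) {
            return null;
        }
        return new MyJava().calculateExpense(p, c);
    }
}
```

If such an adapter were compiled into the jar as a public class, its static method could presumably be passed to udf in the same way as the instance method above. Because the adapter creates the MyJava instance inside the method, the UDF would not need to capture an instance, which may matter if the external class is not Serializable.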
Answer 1 (score: 0)
You can simply "wrap" the given method in a UDF by passing it as an argument to the udf function in org.apache.spark.sql.functions:
import org.apache.spark.sql.functions._
import spark.implicits._
val myUdf = udf(calculateExpense _)
val newDF = df.withColumn("expense", myUdf($"pexpense", $"cexpense"))
This assumes the pexpense and cexpense columns are both Doubles.
Answer 2 (score: -1)
Below is an example that sums two columns:
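import org.apache.spark.sql.functions.{col, udf}
// Sum two values; Double is used here because expenses may be fractional
val somme = udf((a: Double, b: Double) => a + b)
val df_new = df.select(
  col("employeeid"),
  col("pexpense"),
  col("cexpense"),
  // The columns are Strings, so cast them explicitly before calling the UDF
  somme(col("pexpense").cast("double"), col("cexpense").cast("double")) as "expense")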