How to override a native Spark/Hive UDF using SparkSession in Spark 2. This is needed to change the default behaviour that Spark/Hive ships with, which in turn lets us support legacy code bases. The example uses 'trunc'.
Observation: I was able to override the native function in both Hive and Spark, but I could not achieve method overloading in Spark 2 using SparkSession.
Table description:
hive> desc sample_table;
id int
name string
time_stamp timestamp
View the records:
hive> select * from sample_table;
1 Pratap Chandra Dhan NULL
2 Dyuti Ranjan Nayak 2016-01-01 00:00:00
3 Rajesh NULL
Custom Hive UDF
package com.spark2.udf;

import java.sql.Timestamp;

import org.apache.hadoop.hive.ql.exec.UDF;

public class Trunc extends UDF {

    // trunc(timestamp): day of week via the deprecated Date.getDay()
    // (0 = Sunday ... 6 = Saturday), hence the 5 for 2016-01-01, a Friday.
    public Integer evaluate(Timestamp input) {
        if (input == null) {
            return null;
        } else {
            return input.getDay();
        }
    }

    // trunc(bigint): scale the input by 1000.
    public Long evaluate(Long input) {
        if (input == null) {
            return null;
        } else {
            return input * 1000;
        }
    }

    // trunc(timestamp, string): "<zero-based month>_<str>" via the
    // deprecated Date.getMonth() (0 = January).
    public String evaluate(Timestamp input, String str) {
        if (input == null) {
            return null;
        } else {
            return input.getMonth() + "_" + str;
        }
    }
}
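Before wiring this into Hive, it can help to sanity-check the three overloads directly from the Scala REPL. A minimal sketch, assuming the built jar (spar2-udf-0.0.1-SNAPSHOT.jar) is on the classpath; the expected values follow from the deprecated java.util.Date accessors the class uses:

import java.sql.Timestamp
import com.spark2.udf.Trunc

val trunc = new Trunc()
trunc.evaluate(Timestamp.valueOf("2016-01-01 00:00:00"))          // 5: getDay() is day-of-week, 2016-01-01 was a Friday
trunc.evaluate(java.lang.Long.valueOf(2L))                        // 2000
trunc.evaluate(Timestamp.valueOf("2016-01-01 00:00:00"), "Nayak") // "0_Nayak": getMonth() is zero-based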
Trying to run the native 'trunc' function with a single argument fails, because the built-in version requires two:
hive> select trunc(time_stamp) from sample_table;
FAILED: SemanticException [Error 10015]: Line 1:7 Arguments length mismatch 'time_stamp': trunc() requires 2 argument, got 1
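For reference, the built-in trunc is a date-truncation function whose second argument is the unit to truncate to, which is why the one-argument call is rejected. A hedged illustration from the Spark shell (not part of the original post):

// Built-in trunc(date, format): truncate a date to the given unit.
spark.sql("select trunc('2016-08-28', 'MM')").show()  // expected: 2016-08-01, the first day of the month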
Now let's override it using the class above; for that we add the jar and register the trunc function:
hive> list jars;
/*add jars*/
hive> add jar spar2-udf-0.0.1-SNAPSHOT.jar;
Added [spar2-udf-0.0.1-SNAPSHOT.jar] to class path
Added resources: [spar2-udf-0.0.1-SNAPSHOT.jar]
/*register trunc function*/
hive> CREATE TEMPORARY FUNCTION trunc AS "com.spark2.udf.Trunc";
OK
Time taken: 0.25 seconds
/*test trunc function that takes timestamp*/
hive> select trunc(time_stamp) from sample_table;
OK
NULL
5
NULL
Time taken: 0.287 seconds, Fetched: 3 row(s)
/*test all function behaviour*/
hive> select trunc(id),trunc(time_stamp),trunc(time_stamp,name) from sample_table;
OK
1000 NULL NULL
2000 5 0_Dyuti Ranjan Nayak
3000 NULL NULL
Time taken: 0.054 seconds, Fetched: 3 row(s)
Approach I: Using SparkSession in Spark 2 [does not work]:
scala> spark.sql("list jars").show;
+-------+
|Results|
+-------+
+-------+
scala> spark.sql("add jar spar2-udf-0.0.1-SNAPSHOT.jar").show;
+------+
|result|
+------+
| 0|
+------+
scala> spark.sql("list jars").collect.foreach(println)
[spark://10.113.57.185:47278/jars/spar2-udf-0.0.1-SNAPSHOT.jar]
scala> spark.sql("CREATE TEMPORARY FUNCTION trunc AS 'com.spark2.udf.Trunc'").collect.foreach(println)
org.apache.spark.sql.AnalysisException: Function trunc already exists;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.registerFunction(SessionCatalog.scala:1083)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:63)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
... 48 elided
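The registration fails because the name trunc already belongs to Spark's built-in function registry, and SessionCatalog.registerFunction (the top frame of the trace) refuses to shadow it via CREATE TEMPORARY FUNCTION. A quick way to see the collision, assuming Spark 2.1+ where Catalog.functionExists is available:

// The built-in registry already owns the name before we register anything.
spark.catalog.functionExists("trunc")  // expected: true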
Approach II: Using HiveContext in Spark 2 [does not work]:
This approach worked up to Spark 1.6, but HiveContext is deprecated in Spark 2.x and the registration fails in the same way.
scala> import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@e4b9a64
scala> hiveContext.sql("CREATE TEMPORARY FUNCTION trunc AS 'com.spark2.udf.Trunc'").collect.foreach(println)
org.apache.spark.sql.AnalysisException: Function trunc already exists;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.registerFunction(SessionCatalog.scala:1083)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:63)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
... 48 elided
Approach III: Using SparkSession in Spark 2 [works]:
Registering the function directly with spark.udf.register does work:
scala> spark.udf.register("trunc", (input: java.sql.Timestamp, str: String) => input.getMonth() + "_" + str)
...
scala> spark.sql("select trunc(time_stamp,name) from sample_table where id=2").collect.foreach(println);
[0_Dyuti Ranjan Nayak]
So spark.udf.register does override the native trunc for the session, which is why the two-argument query above succeeds. Method overloading, however, still does not work: spark.udf.register binds exactly one function to a name, so one-argument calls such as trunc(time_stamp) would now fail with an arity error.
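A possible workaround, sketched here with hypothetical names rather than taken from the original article, is to register each signature under its own name, mirroring the three overloads of the Hive class:

import java.sql.Timestamp

// One name per signature; Option(...) keeps the null handling of the Hive UDF.
spark.udf.register("trunc_ts", (ts: Timestamp) => Option(ts).map(_.getDay))
spark.udf.register("trunc_id", (id: Long) => id * 1000L)
spark.udf.register("trunc_ts_str", (ts: Timestamp, s: String) => Option(ts).map(_.getMonth + "_" + s))

A call like spark.sql("select trunc_ts_str(time_stamp, name) from sample_table where id=2") would then behave like the two-argument overload shown above.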