How to override a native spark/hive UDF with SparkSession in Spark 2

Date: 2018-03-27 17:04:42

Tags: scala apache-spark hive apache-spark-sql spark-dataframe

How do you override a native spark/hive UDF with SparkSession in Spark 2? This is needed to change the default behaviour that spark/hive provides, which in turn lets us keep supporting a legacy code base. The example uses 'trunc'.

Observation: I was able to override the native function in both hive and spark, but I could not achieve method overloading in Spark 2 using SparkSession.
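For context, the default behaviour being overridden: both hive's and Spark 2's built-in trunc take a date plus a format string and truncate the date to that unit. A quick check (a sketch, assuming a plain spark-shell):

scala> spark.sql("select trunc('2016-02-15', 'MM')").collect
// expected: Array([2016-02-01]) -- the date truncated to the first of its month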

Table description:


hive> desc sample_table;
id                      int
name                    string
time_stamp              timestamp

View records:

hive> select * from sample_table;
1       Pratap Chandra Dhan     NULL
2       Dyuti Ranjan Nayak      2016-01-01 00:00:00
3       Rajesh  NULL

Trying to run the 'trunc' function with one argument fails, because the native function expects two:

hive> select trunc(time_stamp)  from sample_table;
FAILED: SemanticException [Error 10015]: Line 1:7 Arguments length mismatch 'time_stamp': trunc() requires 2 argument, got 1

Custom Hive UDF:

package com.spark2.udf;

import java.sql.Timestamp;
import org.apache.hadoop.hive.ql.exec.UDF;

public class Trunc extends UDF {

    // trunc(timestamp): returns the day of the week (0 = Sunday), via java.util.Date.getDay()
    public Integer evaluate(Timestamp input) {
        if (input == null) {
            return null;
        } else {
            return input.getDay();
        }
    }

    // trunc(bigint): returns the input multiplied by 1000
    public Long evaluate(Long input) {
        if (input == null) {
            return null;
        } else {
            return input * 1000;
        }
    }

    // trunc(timestamp, string): returns the zero-based month joined with str
    public String evaluate(Timestamp input, String str) {
        if (input == null) {
            return null;
        } else {
            return input.getMonth() + "_" + str;
        }
    }
}
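To sanity-check the three overloads outside hive (a minimal sketch; it assumes the UDF jar is on the spark-shell classpath, e.g. via --jars):

scala> val t = new com.spark2.udf.Trunc()
scala> t.evaluate(java.sql.Timestamp.valueOf("2016-01-01 00:00:00"))
// 5 -- getDay() is the day of the week; 2016-01-01 was a Friday
scala> t.evaluate(java.lang.Long.valueOf(2L))
// 2000
scala> t.evaluate(java.sql.Timestamp.valueOf("2016-01-01 00:00:00"), "Nayak")
// 0_Nayak -- getMonth() is zero-based, so January is 0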

Now let's override it using the class above; for that we need to add the jar and register the trunc function in hive:

hive> list jars;

/*add jars*/
hive> add jar spar2-udf-0.0.1-SNAPSHOT.jar;
Added [spar2-udf-0.0.1-SNAPSHOT.jar] to class path
Added resources: [spar2-udf-0.0.1-SNAPSHOT.jar]

/*register trunc function*/
hive> CREATE TEMPORARY FUNCTION trunc AS "com.spark2.udf.Trunc";
OK
Time taken: 0.25 seconds

/*test trunc function that takes timestamp*/
hive> select trunc(time_stamp)  from sample_table;
OK
NULL
5
NULL
Time taken: 0.287 seconds, Fetched: 3 row(s)

/*test all function behaviour*/
hive> select trunc(id),trunc(time_stamp),trunc(time_stamp,name)  from sample_table;
OK
1000    NULL    NULL
2000    5       0_Dyuti Ranjan Nayak
3000    NULL    NULL
Time taken: 0.054 seconds, Fetched: 3 row(s)

Approach I: Using SparkSession in Spark 2 [does not work]:

 scala> spark.sql("list jars").show;
 +-------+
 |Results|
 +-------+
 +-------+

 scala> spark.sql("add jar spar2-udf-0.0.1-SNAPSHOT.jar").show;
 +------+
 |result|
 +------+
 |     0|
 +------+

 scala> spark.sql("list jars").collect.foreach(println)
 [spark://10.113.57.185:47278/jars/spar2-udf-0.0.1-SNAPSHOT.jar]


scala> spark.sql("CREATE TEMPORARY FUNCTION trunc AS 'com.spark2.udf.Trunc'").collect.foreach(println)
org.apache.spark.sql.AnalysisException: Function trunc already exists;     
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.registerFunction(SessionCatalog.scala:1083)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:63)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
... 48 elided
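The failure is expected: trunc already exists as a built-in in the session's FunctionRegistry, and CREATE TEMPORARY FUNCTION will not overwrite an existing name. One way to see what is registered (a sketch, assuming a running spark-shell):

scala> spark.catalog.listFunctions().filter("name = 'trunc'").show(false)
// shows the existing built-in entry for trunc, with its className and isTemporary columns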

Approach II: Using HiveContext in Spark 2 [does not work]:

This approach worked up to Spark 1.6, but HiveContext is deprecated in Spark 2.x, so it fails in the same way:

scala> import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.hive.HiveContext

scala> val hiveContext = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@e4b9a64

scala> hiveContext.sql("CREATE TEMPORARY FUNCTION trunc AS 'com.spark2.udf.Trunc'").collect.foreach(println)
org.apache.spark.sql.AnalysisException: Function trunc already exists;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.registerFunction(SessionCatalog.scala:1083)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:63)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:636)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
... 48 elided

Approach III: Using SparkSession in Spark 2 [works]:

scala> spark.udf.register("trunc", (input: java.sql.Timestamp, str: String) => input.getMonth() + "_" + str)
...

scala> spark.sql("select trunc(time_stamp,name) from sample_table where id=2").collect.foreach(println)
[0_Dyuti Ranjan Nayak]

But method overloading does not work; queries that use the other signatures will fail.
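For example (an illustrative sketch, not from the original post), a one-argument call no longer resolves, because spark.udf.register bound only the (Timestamp, String) => String signature:

scala> spark.sql("select trunc(time_stamp) from sample_table").show
// expected to fail with an AnalysisException about the number of arguments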

0 Answers:

There are no answers.