Registering a Hive custom UDF with Spark (Spark SQL) 2.0.0

Date: 2016-11-01 21:47:34

Tags: apache-spark apache-spark-sql udf

I am working with Spark 2.0.0, and my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in my SQL context, in one of the queries. On my Hive query cluster I use it as a temporary function simply by defining: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite straightforward.

I tried to register it through the SparkSession as below, but got an error:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")

Error:

CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
    at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.delayedEndpoint$com$mediamath$spark$attribution$sparkjob$SparkVideoCidJoin$1(SparkVideoCidJoin.scala:75)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$delayedInit$body.apply(SparkVideoCidJoin.scala:22)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.main(SparkVideoCidJoin.scala:22)
    at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin.main(SparkVideoCidJoin.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)

Does anyone know how to register it in Spark, i.e. with the register API of SparkSession and SQLContext:

 sqlContext.udf.register(...)

3 Answers:

Answer 0 (score: 3)

In Spark 2.0,

sparkSession.udf.register(...)

allows you to register a Java or Scala UDF (a function of type Long => Long, for example), but not a Hive GenericUDF, which handles LongWritable instead of Long and can take a variable number of arguments.

To register a Hive UDF, your first approach was correct:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")

However, you must first enable Hive support:

SparkSession.builder().enableHiveSupport()

and make sure the "spark-hive" dependency is present on your classpath.
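
Put together, a minimal sketch of this approach could look like the following. It assumes the Facebook UDF jar has been shipped with the application (for example via --jars) and that spark-hive is on the classpath; the application name, table, and column names are placeholders:

import org.apache.spark.sql.SparkSession

object HiveUdfExample {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() swaps the SessionCatalog for a HiveSessionCatalog,
    // which is what makes CREATE TEMPORARY FUNCTION work.
    val spark = SparkSession.builder()
      .appName("hive-udf-example")
      .enableHiveSupport()
      .getOrCreate()

    // Register the Hive UDF through SQL, as in the original attempt.
    spark.sql("CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'")

    // Use it like any other SQL function; table/column names are placeholders.
    spark.sql("SELECT myFunc(some_column) FROM some_table").show()

    spark.stop()
  }
}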

Explanation:

Your error message

java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.

comes from the class SessionCatalog.

By calling SparkSession.builder().enableHiveSupport(), Spark replaces the SessionCatalog with a HiveSessionCatalog, in which the method makeFunctionBuilder is implemented.

Finally:

The UDF you want to use, 'com.facebook.hive.udf.UDFNumberRows', was written at a time when window functions did not exist in Hive. I recommend you use those instead. You can check the Hive reference, this Spark-SQL intro, or this one if you want to stick to the Scala syntax.
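
As an illustration of that suggestion, the built-in row_number() window function covers the usual row-numbering use case of UDFNumberRows. This is only a sketch; the table and column names are placeholders:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// SQL flavour: number rows within each key, ordered by a value column.
spark.sql("""
  SELECT some_key, some_value,
         row_number() OVER (PARTITION BY some_key ORDER BY some_value) AS row_num
  FROM some_table
""").show()

// Equivalent Scala (DataFrame) syntax.
val w = Window.partitionBy("some_key").orderBy("some_value")
spark.table("some_table").withColumn("row_num", row_number().over(w)).show()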

Answer 1 (score: 1)

You can register a UDF directly with the SparkSession, as in sparkSession.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1). See the detailed documentation here.
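
For completeness, a tiny self-contained sketch of that call (the function body and the query are just illustrations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-register-example").getOrCreate()

// Register a plain Scala function as a UDF; no Hive support is needed for this path.
spark.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1)

// The registered UDF is then usable from SQL; table/column names are placeholders.
spark.sql("SELECT myUDF(id, name) FROM people").show()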

Answer 2 (score: 1)

The problem you are facing is that Spark is not loading the jar library onto its classpath.

On our team, we load external libraries with the --jars option.

/usr/bin/spark-submit  --jars external_library.jar our_program.py --our_params 

You can check whether your external libraries were loaded in the Spark History - Environment tab (spark.yarn.secondary.jars).

Then you can register your UDF as you described, once you have enabled Hive support, as FurryMachine said:

sparkSession.sql("""
    CREATE TEMPORARY FUNCTION myFunc AS  
    'com.facebook.hive.udf.UDFNumberRows'
""")

You can find more information in spark-submit --help:

hadoop:~/projects/neocortex/src$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]   
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.