I am working with Spark 2.0.0, and my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in one of the queries in my sql context. In my cluster with Hive queries, I use it as a temporary function simply by defining: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite simple.
I tried registering it with sparkSession as below, but got an error:
sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
Error:
CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.delayedEndpoint$com$mediamath$spark$attribution$sparkjob$SparkVideoCidJoin$1(SparkVideoCidJoin.scala:75)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$delayedInit$body.apply(SparkVideoCidJoin.scala:22)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin$.main(SparkVideoCidJoin.scala:22)
at com.mediamath.spark.attribution.sparkjob.SparkVideoCidJoin.main(SparkVideoCidJoin.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Does anyone know how to register it the way Spark is asking, i.e. with the register API on sparkSession and SQLContext:
sqlContext.udf.register(...)
Answer 0 (score: 3)
In Spark 2.0,

sparkSession.udf.register(...)

allows you to register a Java or Scala UDF (a function of type Long => Long, for example), but not a Hive GenericUDF, which handles LongWritable instead of Long and which can take a variable number of arguments.
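For contrast, this is the kind of plain Scala function that sparkSession.udf.register does accept. A minimal sketch; the function name "addOne" and the test view "t" are just illustrations:

// A plain Scala UDF (Long => Long) can be registered directly.
sparkSession.udf.register("addOne", (x: Long) => x + 1)
sparkSession.range(5).createOrReplaceTempView("t") // hypothetical test table
sparkSession.sql("SELECT addOne(id) FROM t").show()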
To register a Hive UDF, your first approach was actually correct:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")

However, you must enable Hive support first:

SparkSession.builder().enableHiveSupport()

and make sure the "spark-hive" dependency is present in your classpath.

Explanation:

Your error message

java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.

comes from the class SessionCatalog. By calling SparkSession.builder().enableHiveSupport(), Spark replaces the SessionCatalog with a HiveSessionCatalog, in which the method makeFunctionBuilder is implemented.
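Put together, a minimal sketch of the whole flow under these assumptions (the app name is arbitrary, and the jar containing the UDF must already be on the classpath):

import org.apache.spark.sql.SparkSession

// Hive support swaps SessionCatalog for HiveSessionCatalog,
// which implements makeFunctionBuilder.
val sparkSession = SparkSession.builder()
  .appName("hive-udf-example") // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// With Hive support enabled, this no longer throws UnsupportedOperationException.
sparkSession.sql("CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'")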
Lastly:

The UDF you want to use, 'com.facebook.hive.udf.UDFNumberRows', was written at a time when windowing functions were not yet available in Hive. I suggest you use those instead. You can check the Hive Reference, this Spark-SQL intro, or this one if you want to stick to the Scala syntax.
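For example, a row-numbering query with the built-in window function might look like this (the table events and the columns user_id and ts are hypothetical):

sparkSession.sql("""
  SELECT user_id, ts,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS row_num
  FROM events
""")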
Answer 1 (score: 1)
You can register a UDF directly with the SparkSession, as in sparkSession.udf.register("myUDF", (arg1: Int, arg2: String) => arg2 + arg1). See the detailed documentation here.
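Once registered this way, the function is callable from SQL. A minimal sketch, where the data and the temp view "people" are hypothetical:

import sparkSession.implicits._

// Hypothetical data to call the registered UDF on.
Seq((30, "Alice"), (25, "Bob")).toDF("age", "name").createOrReplaceTempView("people")
sparkSession.sql("SELECT myUDF(age, name) FROM people").show()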
Answer 2 (score: 1)
The problem you are having is that Spark is not loading the jar library in its classpath.
On our team, we load external libraries with the --jars option:
/usr/bin/spark-submit --jars external_library.jar our_program.py --our_params
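For this question specifically, assuming the Facebook Hive UDFs are packaged in a jar named facebook-hive-udfs.jar (a hypothetical name), the call would look something like:

/usr/bin/spark-submit --jars facebook-hive-udfs.jar our_program.py --our_params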
You can check whether your external libraries were loaded in the Spark History - Environment tab (spark.yarn.secondary.jars).
Then you can register your udf as described above, once you have enabled Hive support as FurryMachine said:
sparkSession.sql("""
CREATE TEMPORARY FUNCTION myFunc AS
'com.facebook.hive.udf.UDFNumberRows'
""")
You can find more information in spark-submit --help:

hadoop:~/projects/neocortex/src$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.