I wrote a Spark SQL UDF in Java, but something seems to be wrong

Asked: 2018-01-10 15:29:46

Tags: java apache-spark apache-spark-sql user-defined-functions

The full dependencies of my project are as follows:

<dependencies>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.1.2</version>
    </dependency>

</dependencies>

I want to use a UDF to compute the time interval between two input date strings in the format 'yyyy-mm-ss HH:mm:ss.SSS' (for example, '2017-12-26 00:00:02.044'). The result should be a Double with millisecond precision; for example, passing '2017-12-26 00:00:02.044' and '2017-12-26 00:00:03.045' should return 1.001 seconds. Here is the Java code:

import org.apache.commons.lang.StringUtils;
import org.apache.spark.sql.api.java.UDF2;

import java.text.SimpleDateFormat;
import java.util.Date;

public class DateDistance implements UDF2<String, String, Double> {

    @Override
    public Double call(String s, String s2) throws Exception {
        Double result = 0D;
        if (StringUtils.isNotBlank(s) && StringUtils.isNotBlank(s2)) {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-mm-ss HH:mm:ss.SSS");
            Date parse = sdf.parse(s);
            Date parse2 = sdf.parse(s2);
            Long millisecond1 = parse.getTime();
            Long millisecond2 = parse2.getTime();
            Long abs = Math.abs(millisecond1 - millisecond2);
            result = abs.doubleValue() / 1000D;
        }
        return result;
    }
}
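As a sanity check, the interval logic by itself can be exercised in plain Java, independent of Spark. Below is a minimal sketch using java.time instead of SimpleDateFormat; note it assumes the pattern was meant to be yyyy-MM-dd (MM = month, dd = day of month), which is what the example strings actually contain. The class and method names are illustrative only.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateDistanceCheck {

    // Assumed-intended pattern: yyyy-MM-dd, not the yyyy-mm-ss from the UDF above.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Same computation as the UDF: absolute difference in seconds, millisecond precision.
    static double distanceSeconds(String a, String b) {
        Duration d = Duration.between(
                LocalDateTime.parse(a, FMT),
                LocalDateTime.parse(b, FMT));
        return Math.abs(d.toMillis()) / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(
                distanceSeconds("2017-12-26 00:00:02.044",
                                "2017-12-26 00:00:03.045")); // prints 1.001
    }
}
```

DateTimeFormatter is also strict by default, so a malformed input fails loudly instead of being silently reinterpreted the way a lenient SimpleDateFormat would.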

The steps I used to register and call the UDF are as follows:

  1. add jar /home/hulk/learning/datedistance-1.0-SNAPSHOT.jar
  2. create temporary function tmp_date_distance as 'com.test.datedistance.DateDistance'
  3. Test the UDF with SQL:

    Select tmp_date_distance('2017-12-26 00:00:02.044','2017-12-26
    00:00:03.045') from stg.car_fact_order where dt='2018-01-09' limit 1;
    

    After that, I got the following error:

    Error in query: No handler for Hive UDF 'com.sqyc.datedistance.DateDistance'; line 1 pos 7
    

    Could you give me some advice?

1 Answer:

Answer 1 (score: 0)

The second step is incorrect:

    create temporary function tmp_date_distance as 'com.test.datedistance.DateDistance'

Spark UDFs are not Hive-compatible, so the function cannot be registered that way. It should be registered with

sqlContext.udf().register(name, object, type);

or (2.0 or later):

spark.udf().register(name, object, type);
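Spelled out for this question, the registration might look like the sketch below. It is illustrative only: the class name UdfRegisterExample, the local[*] master, and the use of an inline UDF2 lambda (rather than the DateDistance class from the question) are all choices made for the example; the key call is spark.udf().register(name, udf, returnType), where the DataType must match the UDF2's Double result.

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class UdfRegisterExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("udf-register-sketch")
                .master("local[*]")   // local master, for illustration only
                .getOrCreate();

        // Register under the name the query expects. DataTypes.DoubleType is the
        // Spark SQL return type matching UDF2<String, String, Double>.
        spark.udf().register("tmp_date_distance",
                (UDF2<String, String, Double>) (a, b) -> {
                    // Created inside the lambda: DateTimeFormatter is not Serializable.
                    DateTimeFormatter fmt =
                            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
                    Duration d = Duration.between(
                            LocalDateTime.parse(a, fmt),
                            LocalDateTime.parse(b, fmt));
                    return Math.abs(d.toMillis()) / 1000.0;
                },
                DataTypes.DoubleType);

        spark.sql("SELECT tmp_date_distance('2017-12-26 00:00:02.044', "
                + "'2017-12-26 00:00:03.045') AS diff").show();

        spark.stop();
    }
}
```

After registration the function is visible to spark.sql() queries by the registered name, just like a built-in.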

But you don't need a UDF here at all:

SELECT ABS(
       CAST(CAST('2017-12-26 00:00:02.044' AS TIMESTAMP) AS DOUBLE) - 
       CAST(CAST('2017-12-26 00:00:03.045' AS TIMESTAMP) AS DOUBLE) ) AS diff

+-----------------+
|             diff|
+-----------------+
|1.001000165939331|
+-----------------+

or, rounded:

SELECT ROUND(ABS(
       CAST(CAST('2017-12-26 00:00:02.044' AS TIMESTAMP) AS DOUBLE) - 
       CAST(CAST('2017-12-26 00:00:03.045' AS TIMESTAMP) AS DOUBLE)), 3) AS diff

+-----+
| diff|
+-----+
|1.001|
+-----+