How to implement auto-increment in Spark SQL (PySpark)

Date: 2016-10-25 04:20:44

Tags: apache-spark hive apache-spark-sql pyspark-sql

I need to implement an auto-increment column in a Spark SQL table. How can I do that? Please guide me. I am using PySpark 2.0.

Thanks, Kalyan

1 Answer:

Answer 0 (score: 1):

I would write (or reuse) a stateful Hive UDF and register it with pySpark, since Spark SQL has good support for Hive.

Note the line @UDFType(deterministic = false, stateful = true) in the code below; it is what marks the UDF as stateful.

package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  // Holds the running sequence value; reused across evaluate() calls within a task
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  // Called once per input row: increment the counter and return it
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}

// End UDFRowSequence.java

Now build the jar and pass its location when launching pyspark.
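The compile step is not shown above; a minimal sketch, assuming the source file is saved under its package path, $HIVE_HOME points at a Hive installation, and hadoop is on the PATH (the jar name is chosen to match the launch command below):

$ javac -cp "$HIVE_HOME/lib/*:$(hadoop classpath)" \
    org/apache/hadoop/hive/contrib/udf/UDFRowSequence.java
$ jar cf your_jar_name.jar org/apache/hadoop/hive/contrib/udf/UDFRowSequence.class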

$ pyspark --jars your_jar_name.jar

Then register it via sqlContext:

sqlContext.sql("CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'")

Now use row_seq() in your select query:

sqlContext.sql("SELECT row_seq(), col1, col2 FROM table_name")

See also: Project to use Hive UDFs in pySpark