Question

I need to append a column generated by the method 'strToInt', but the result is a "Task not serializable" error.

def strToInt(colVal : String) : Int = {
  var str = new Array[String](3)
  str(0) = "icmp"; str(1) = "tcp"; str(2) = "udp"
  var i = 0
  for (i <- 0 to str.length-1) {
    if (str(i) == colVal) { return i }
  }
  throw new IllegalStateException("This never happens")
}
val strtoint = udf(strToInt(_:String)).apply(col("Atr 1"))
val newDF = df.withColumn("newCol", strtoint)

I have tried putting the function in a helper object like this:

object Helper extends Serializable {
  def strToInt ...
}

but it did not help.

Answer 1

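The key to understanding what is happening here is that although Scala is a functional programming language, it runs on the JVM, and the JVM does not support function types. At runtime, any val that is assigned an "anonymous" or "lambda" function is actually an instance of an anonymous class with an apply method. So, let's say you have the following: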

object helper {
  val isNegative: (Int => Boolean) = (n: Int) => n < 0
}

This is the same as the following:

object helper {
  // The lambda written out by hand as an anonymous Function1 instance
  val isNegative: Function1[Int, Boolean] = new Function1[Int, Boolean] {
    def apply(n: Int): Boolean = n < 0
  }
}

isNegative is actually an instance of an anonymous class that extends the trait Function1. In contrast, when you do this:

object helper {
  def isNegative(n: Int): Boolean = n < 0
}

isNegative is now a method of the object helper. Coming to Spark, if you were to do something like this:

// ds is a Dataset[Int]
ds.filter(isNegative)

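In the first case, Spark has to serialize the anonymous class assigned to isNegative, and it fails because that class is not serializable. In the second case, it has to serialize helper, which works, because an object is serializable if all of its state is serializable.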
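The val/def distinction described above can be seen without Spark. The following is only a minimal sketch in plain Scala; the names FunctionVsMethod, isNegativeFn and isNegativeMethod are made up for illustration and do not come from the question.

```scala
object FunctionVsMethod {
  // A val holding a function value: at runtime this is an object
  // that implements Function1[Int, Boolean].
  val isNegativeFn: Int => Boolean = (n: Int) => n < 0

  // A method: it belongs to the enclosing object and is not a value by itself.
  def isNegativeMethod(n: Int): Boolean = n < 0

  val xs: List[Int] = List(-2, 0, 3)

  // The function value is passed as-is.
  val viaVal: List[Int] = xs.filter(isNegativeFn)

  // The method is converted to a function value at the call site
  // (eta-expansion), because filter expects an Int => Boolean.
  val viaDef: List[Int] = xs.filter(isNegativeMethod)
}
```

Both calls keep only the negative element; the difference is in what exists at runtime before the call (a Function1 instance stored in a val versus a method on the enclosing object).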

To apply this to your problem, when you do this:

val strtoint = udf(strToInt(_:String)).apply(col("Atr 1"))

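At runtime, strtoint is an instance of an anonymous class with the trait Function1[String, UserDefinedFunction], that is, a function that produces a UserDefinedFunction when it is called. Filling in the underscore, it is the same as this: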

val strtoint: Function1[String, UserDefinedFunction] = new Function1[String, UserDefinedFunction] {
  def apply(t1: String) = udf(strToInt(t1: String)).apply(col("Atr 1"))
}

To change your code minimally, you can simply change the val to a def:

def sti = udf(strToInt(_:String)).apply(col("Atr 1"))

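Now sti is a member function of its enclosing class, and if that class is serializable, you should be fine as far as Spark is concerned. Another thing to remember is that strToInt also needs to be part of a serializable class or object.

Another way to fix this would be to change val strtoint so that it is a UserDefinedFunction, which is a case class and therefore serializable, but you would still need to make sure that strToInt is a member of a serializable class or object.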
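To check whether a particular function value or helper object can be serialized at all, a quick test outside Spark can help. The sketch below assumes that Java serialization is what matters (as far as I know, Spark uses it for closures by default); SerializationCheck, isJavaSerializable and Resource are hypothetical names introduced only for this example.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // Hypothetical helper: true if Java serialization accepts the value.
  def isJavaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Stand-in for a class you cannot change, deliberately NOT Serializable.
  class Resource(val factor: Int) {
    def scale(n: Int): Int = n * factor
  }

  // A function that captures nothing from its surroundings.
  def selfContained: Int => Boolean = (n: Int) => n < 0

  // A function that captures a local, non-serializable Resource.
  def capturing: Int => Int = {
    val r = new Resource(3)
    (n: Int) => r.scale(n)
  }
}
```

Calling SerializationCheck.isJavaSerializable on selfContained and on capturing shows the difference: a function is only as serializable as everything it drags along with it, which is why the enclosing class or object matters.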

Answer 2

Change the code so that the function is applied at the withColumn level (and not at the point where the UDF is defined).

// define a UDF
val strtoint = udf(strToInt _)
// use it (aka execute)
val newDF = df.withColumn("newCol", strtoint(col("Atr 1")))

This seemingly small change alters what you create and how it is executed afterwards.

As you may have noticed, udf creates a user-defined function that Spark SQL understands (and can execute):

udf[RT, A1](f: (A1) ⇒ RT): UserDefinedFunction  Defines a function of 1 argument as a user-defined function (UDF).

(I removed the implicit parameters to make it easier to read.)

Quoting the scaladoc of UserDefinedFunction:

A user-defined function. To create one, use the udf functions in functions.

I do not entirely agree, but the "protocol" is to register a UDF first, before executing it in your queries, say in the withColumn or select operators.

I would also change strToInt to be more Scala-idiomatic (and hopefully easier to understand).

def strToInt(colVal : String) : Int = {
  val strs = Array("icmp", "tcp", "udp")
  strs.indexOf(colVal)
}
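One thing to keep in mind with the indexOf version above: it returns -1 for a value that is not in the array, whereas the version in the question throws an exception. If unknown values need explicit handling, something along these lines might work. This is only a sketch; ProtocolCodes, strToIntOpt and strToIntStrict are hypothetical names, not part of the question or of Spark.

```scala
object ProtocolCodes {
  // The same three values as in the question, mapped to their positions.
  private val codes: Map[String, Int] =
    Seq("icmp", "tcp", "udp").zipWithIndex.toMap

  // Returns None for an unknown value instead of -1.
  def strToIntOpt(colVal: String): Option[Int] = codes.get(colVal)

  // Fails loudly on an unknown value, like the version in the question.
  def strToIntStrict(colVal: String): Int =
    codes.getOrElse(
      colVal,
      throw new IllegalArgumentException(s"Unknown protocol: $colVal")
    )
}
```

Either method can be wrapped with udf in the same way as strToInt. As far as I know, a Scala UDF that returns Option[Int] produces a null in the new column for None, but check that against your Spark version.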

Answer 3

This problem looks similar to one I ran into (in Java). My udf function was using a cipher library to encrypt some content, and the exception thrown was:

Caused by: java.io.NotSerializableException: javax.crypto.Cipher Serialization stack: - object not serializable (class: javax.crypto.Cipher, value: javax.crypto.Cipher@625d02ce)

I could not add "implements Serializable" to the Cipher class, because it is a library provided by Java.

I used the following solution, taken from this link: spark-how-to-call-udf-over-dataset-in-java

private static UDF1<String, String> toUpper = new UDF1<String, String>() {
    @Override
    public String call(final String str) throws Exception {
        return str.toUpperCase();
    }
};

Register the UDF, and then you can use the callUDF function.

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

sqlContext.udf().register("toUpper", toUpper, DataTypes.StringType);
peopleDF.select(col("name"),callUDF("toUpper", col("name"))).show();

In my case, instead of calling str.toUpperCase(), I call my Cipher instance.

How do I apply a user-defined function to a column (getting "Task not serializable" when adding the column)?
