How to fix: Spark UDF function

Time: 2019-04-29 16:17:08

Tags: scala apache-spark

I have the following DataFrame, call it DF1:

root
 |-- VARIANTS: string (nullable = true)
 |-- VARIANT_ID: long (nullable = false)
 |-- CASE_ID: string (nullable = true)
 |-- APP_ID: integer (nullable = false)

Each VARIANTS value is a comma-separated string that looks like this:

Activity_1,Activity_2,Activity_2,Activity_3,Activity_5 ...

I am trying to add a new column, Variants_stats, that holds the count of each activity (per row):

Activity_1:1,Activity_2:2,Activity_3:1,Activity_5:1

The approach I have taken so far is: 1) Create a UDF:

val countActivityFrequences = udf((value: String) => value.split(",").map(_.trim).groupBy(identity).mapValues(_.length).map{case (k, v) => k + ":" + v}.mkString(","))
val dfNew = df1.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))

This seems fine (at least Spark does not complain) until I run any SQL or call dfNew.show(false), which always gives me back:

java.lang.StringIndexOutOfBoundsException: String index out of range: -84
    at java.lang.String.substring(String.java:1931)
    at java.lang.Class.getSimpleBinaryName(Class.java:1448)
    at java.lang.Class.getSimpleName(Class.java:1309)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1055)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1054)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1006)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
    at scala.Option.getOrElse(Option.scala:121)

I have no idea what is going wrong here.

I am using Spark 2.1+.

To reproduce:

val items = List(
    "A_001,A_002,A_010,A_0200,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_074",
    "A_001,A_002,A_010,A_0201,A_0201,A_0201,A_0202,A_0206,A_0207,A_0208,A_0208,A_0209,A_070,A_071,A_072,A_073,A_073,A_073")
val df = sc.parallelize(items).toDF("VARIANTS")
df.show(false)
df.printSchema

// create UDF function
val countActivityFrequences = udf((value: String) => value.split(",").map(_.trim).groupBy(identity).mapValues(_.length).map{case (k, v) => k + ":" + v}.mkString(","))
// Apply UDF against our little DF
var dfNew = df.withColumn("Variants_stats", countActivityFrequences($"VARIANTS"))
dfNew.printSchema
// Error thrown: either "Malformed class name" or java.lang.StringIndexOutOfBoundsException
dfNew.show(false)
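To rule out the lambda itself, the counting logic can be checked in plain Scala without Spark. The following is a minimal sketch, not part of the original reproduction: the object name ActivityCounts and the method name countActivityFrequencies are hypothetical, and the null handling and sorting are my own assumptions (VARIANTS is nullable, and groupBy gives no ordering guarantee).

```scala
// Hypothetical helper for checking the counting logic outside Spark.
object ActivityCounts {
  def countActivityFrequencies(value: String): String =
    Option(value)                                     // assumption: treat a null input as empty
      .map(_.split(",").map(_.trim).filter(_.nonEmpty))
      .getOrElse(Array.empty[String])
      .groupBy(identity)                              // activity -> all of its occurrences
      .toSeq
      .map { case (activity, hits) => (activity, hits.length) }
      .sortBy(_._1)                                   // assumption: sort by name for stable output
      .map { case (activity, n) => s"$activity:$n" }
      .mkString(",")

  def main(args: Array[String]): Unit = {
    // prints Activity_1:1,Activity_2:2,Activity_3:1,Activity_5:1
    println(countActivityFrequencies("Activity_1,Activity_2,Activity_2,Activity_3,Activity_5"))
  }
}
```

If this behaves as expected, the same method could presumably be wrapped with udf(ActivityCounts.countActivityFrequencies _). This only checks the string logic; it does not by itself explain the exception above.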

Update

The problem only occurs in Zeppelin on AWS EMR. After restarting the interpreter, everything works as expected.

0 answers:

No answers yet