Spark UnaryTransformer implementation fails with scala.MatchError

Date: 2017-07-21 12:34:39

Tags: scala apache-spark apache-spark-mllib

I implemented a UnaryTransformer in Spark 1.6.2, using this interface:

class myUT(override val uid: String) extends UnaryTransformer[Seq[String], Seq[String], myUT] {
...
  // Append "s" to every string in the input sequence.
  override protected def createTransformFunc: Seq[String] => Seq[String] = {
    seq => seq.map(x => x + "s")
  }
}

It compiles fine but fails at runtime with this error:

17/07/21 22:29:33 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, myhost.com.au): scala.MatchError: ArrayBuffer(<contents of my array>) (of class scala.collection.mutable.ArrayBuffer)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)

The next thing I did was replace

seq => seq.map(x => x + "s")

with

seq => seq

So, in theory, this should not change the data at all! But the error I get is:

17/07/21 22:11:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, myhost.com.au): scala.MatchError: WrappedArray(<contains of my array>) (of class scala.collection.mutable.WrappedArray$ofRef)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)

So it looks like the type of the outgoing data gets changed no matter what. How can I avoid this?
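Both traces end in CatalystTypeConverters$StringConverter, which is the converter Catalyst picks for a column declared as StringType. That hints that the declared output type, rather than the body of the function, may be the thing to check. A minimal sketch of the failure mode follows; StringOnlyConverter is a hypothetical stand-in written for illustration and is not Spark's actual code.

```scala
// Hypothetical stand-in for a converter that only knows how to handle strings.
// This is a model of the failure mode, NOT Spark's source.
object StringOnlyConverter {
  def convert(value: Any): String = value match {
    case s: String => s
    // There is deliberately no case for Seq, ArrayBuffer, WrappedArray, ...
  }
}

object MatchErrorDemo {
  def main(args: Array[String]): Unit = {
    // The function returns a collection, but the converter expects a String.
    val transformed: Any = Seq("cat", "dog").map(_ + "s")
    try {
      StringOnlyConverter.convert(transformed)
    } catch {
      case e: MatchError => println(s"MatchError: ${e.getMessage}")
    }
  }
}
```

In this model the error is thrown for any collection, whatever its concrete class, which would be consistent with seeing it for both ArrayBuffer and WrappedArray.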

Update: next I tried adding .toArray after the map. Now the error is this:

[error] /sparkprj/src/main/scala/sp_txt.scala:43: polymorphic expression cannot be instantiated to expected type;
[error]  found   : [B >: String]Array[B]
[error]  required: Seq[String]
[error]                                           ).toArray

This may add some detail, but it does not add to my understanding. After reviewing some mllib UnaryTransformer examples, I am inclined to think this is a bug in Catalyst.
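The compile error above comes from Scala's type inference rather than from Spark: toArray takes a type parameter B >: A, and an expected type of Seq[String] gives the compiler nothing to fix B with. A small Spark-free sketch, assuming the Scala 2.10/2.11 compilers that Spark 1.6 is built against (the object and value names are made up for illustration):

```scala
object ToArrayDemo {
  val words: Seq[String] = Seq("cat", "dog")

  // Rejected when a Seq[String] is expected, as in the error above:
  //   val bad: Seq[String] = words.map(_ + "s").toArray

  // Naming the element type explicitly yields an Array[String].
  val asArray: Array[String] = words.map(_ + "s").toArray[String]

  // Converting back gives a Seq[String], the type createTransformFunc must return.
  val asSeq: Seq[String] = asArray.toSeq
}
```

This only addresses the compile error. The function still returns a collection either way, so it does not by itself explain the runtime MatchError.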

1 Answer:

Answer 0: (score: 1)

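The outputDataType line in the myUT class definition was incorrect. It should be:

override protected def outputDataType: DataType = new ArrayType(StringType, true)

I had copied this class definition from a String -> String transformer and left the DataType defined as StringType. My bad.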
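For reference, a minimal sketch of a Seq[String] -> Seq[String] transformer whose declared output type matches its function might look like the following. It targets the Spark 1.6 ML API; the class name PluralizeTransformer, the UID prefix, and the validateInputType check are illustrative assumptions and not part of the original post.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical class name; mirrors the myUT class from the question.
class PluralizeTransformer(override val uid: String)
    extends UnaryTransformer[Seq[String], Seq[String], PluralizeTransformer] {

  // Convenience constructor with a generated UID (prefix chosen for illustration).
  def this() = this(Identifiable.randomUID("pluralize"))

  // Append "s" to every string in the input sequence.
  override protected def createTransformFunc: Seq[String] => Seq[String] =
    seq => seq.map(x => x + "s")

  // Must describe what createTransformFunc returns: an array of strings,
  // not a single StringType value.
  override protected def outputDataType: DataType = ArrayType(StringType, true)

  // Optional guard (an addition for illustration): reject non-array-of-string input early.
  override protected def validateInputType(inputType: DataType): Unit =
    inputType match {
      case ArrayType(StringType, _) => ()
      case other =>
        throw new IllegalArgumentException(
          s"Input type must be an array of strings but got $other.")
    }
}
```

The transformer would then be configured like any other UnaryTransformer, for example with setInputCol("words") and setOutputCol("plurals") before calling transform, where the column names are placeholders.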