Unable to create a DataFrame from an RDD

Asked: 2017-09-05 11:40:38

Tags: scala apache-spark spark-dataframe rdd

I am trying to create a DataFrame with a dynamically generated schema. Here is the code snippet:

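import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// Assumes a SparkSession named `spark` is in scope.
def mapMetricList(row: Row): Seq[Metric] = ???

// Names of the non-metric columns to carry over to the output.
val fields = Seq("Field1", "Field2")

case class Metric(name: String, count: Long)
def convertMetricList(df: DataFrame): DataFrame = {
  val outputFields = df.schema.fieldNames.filter(f => fields.contains(f))

  val rdd = df.rdd.map(row => {
    val schema = row.schema
    val metrics = mapMetricList(row)
    val s = outputFields.map(name => row.get(schema.fieldIndex(name)))
    // The Seq[Metric] is appended as-is as the last column value.
    Row.fromSeq(s ++ Seq(metrics))
  })

  // Output schema: the selected input fields plus an array-of-struct column "total".
  val nonMetricsSchema = outputFields.map(f => df.schema.apply(f))
  val metricField = StructField("total", ArrayType(ScalaReflection.schemaFor[Metric].dataType.asInstanceOf[StructType]), nullable = true)
  val schema = StructType(nonMetricsSchema ++ Seq(metricField))
  schema.printTreeString()
  val dff = spark.createDataFrame(rdd, schema)
  dff
}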

But I keep getting this exception at runtime:

Caused by: java.lang.RuntimeException: Metric is not a valid external type for schema of struct<name:string,count:bigint>
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfCondExpr3$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr4$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)

I am using Spark 2.1.0.

1 answer:

Answer 0 (score: 0)

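This works fine on my machine with Spark 1.6; I printed the result of the "convertMetricList" function. The problem may be the type of the "count" field inside "metricField". The trace you posted shows "bigint", whereas in my environment the type is "LongType":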

StructField(total,ArrayType(
    StructType(
        StructField(name,StringType,true),
        StructField(count,LongType,false)
    ),true),true)

You can check the type of "metricField" in your own environment. If it differs, a workaround is to hard-code the metric struct.
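A minimal sketch of the hard-coded struct, mirroring the field types printed above. Two caveats, offered as observations rather than a confirmed diagnosis: "bigint" is the name Spark prints for LongType in the struct<...> notation, so the two type names may not actually differ; and in Spark 2.x the values stored in a struct column are expected to be Row objects, which is what the "not a valid external type" message seems to point at. The object name MetricSchema and the helper metricsToRows are hypothetical, introduced here only for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

object MetricSchema {
  case class Metric(name: String, count: Long)

  // Hard-coded equivalent of ScalaReflection.schemaFor[Metric].dataType,
  // using the field types shown in the printed schema above.
  val metricStruct: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", LongType, nullable = false)
  ))

  // The "total" column: an array of metric structs.
  val metricField: StructField =
    StructField("total", ArrayType(metricStruct), nullable = true)

  // Hypothetical helper: turn case-class instances into Rows so that
  // each array element matches the struct's external representation.
  def metricsToRows(metrics: Seq[Metric]): Seq[Row] =
    metrics.map(m => Row(m.name, m.count))
}
```

Under these assumptions, the last line of the map in "convertMetricList" would become Row.fromSeq(s ++ Seq(MetricSchema.metricsToRows(metrics))), with MetricSchema.metricField used when building the schema.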