Scala 2.11 & Spark 2.0.0: creating a dynamic case class to encode a Dataset

Date: 2016-10-05 10:19:10

Tags: scala apache-spark

I'm trying to update my application from Spark 1.6.2 to 2.0.0, and my problem is creating a Dataset from a DataFrame (a Parquet file I have read).

I know I can use a case class or a tuple to type the DataFrame and obtain a Dataset, but I don't know before runtime which data the user will load, and therefore the type and number of the columns.
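For reference, a minimal sketch of that static approach when the schema is known at compile time (the case class Record and the file path are hypothetical):

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Known ahead of time: one field per column, with matching names and types.
    case class Record(number: Int, word: String)

    val spark = SparkSession.builder
      .appName("static")
      .config("spark.master", "local")
      .getOrCreate()
    import spark.implicits._

    // as[Record] turns the untyped DataFrame into a typed Dataset[Record].
    val ds: Dataset[Record] = spark.read.parquet("data.parquet").as[Record]

This is exactly what is not possible here, since the columns are only known at runtime.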

To load the data, I use the SparkSession to read from the Parquet file, simply:

spark.read.schema(schemaOfData).parquet(dataPath)

schemaOfData is a StructType instantiated from a List[Map[String, String]], which holds the name and type of each column (the types are either String or Double).
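For context, a minimal sketch of how such a StructType can be built from that list (the map keys "name" and "type" are assumptions about the input format):

    import org.apache.spark.sql.types._

    // Each map is assumed to look like Map("name" -> "price", "type" -> "Double").
    def buildSchema(cols: List[Map[String, String]]): StructType =
      StructType(cols.map { col =>
        val dataType = col("type") match {
          case "Double" => DoubleType
          case _        => StringType   // everything else is treated as String
        }
        StructField(col("name"), dataType, nullable = true)
      })

    val schemaOfData = buildSchema(List(
      Map("name" -> "price", "type" -> "Double"),
      Map("name" -> "label", "type" -> "String")
    ))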

I found the following on StackOverflow, but I have trouble understanding it, and I suspect there may be an easier way to solve my problem: Dynamically compiling scala class files at runtime in Scala 2.11

Thanks

1 answer:

Answer 0 (score: 0)

Create an implicit conversion from Spark data types to Scala native data types.

Then map that conversion over the StructFields of the Spark DataFrame's schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession
      .builder
      .appName("Movies Reviews")
      .config("spark.master", "local")
      .getOrCreate()

    import spark.implicits._

    // A small example DataFrame whose schema we turn into case-class source.
    val someDF = Seq(
      (8, "bat"),
      (64, "mouse"),
      (-27, "horse")
    ).toDF("number", "word")

    someDF.printSchema()

    // Generates the source code of a case class that matches the given schema.
    // Nullable columns become Option fields.
    def schemaCaseClass(schema: StructType, className: String)
                       (implicit sparkTypeToScala: DataType => String): String = {
      def structField(col: StructField): String = {
        val scalaType = sparkTypeToScala(col.dataType)
        if (col.nullable) s"${col.name}: Option[$scalaType]"
        else s"${col.name}: $scalaType"
      }

      val fields = schema.map(structField).mkString(",\n  ")
      s"""|case class $className (
          |  $fields
          |)""".stripMargin
    }

    // Maps each Spark SQL DataType to the Scala type Spark uses to represent it.
    // Anything unrecognized falls back to String.
    implicit val scalaTypes: DataType => String = {
      case _: ByteType      => "Byte"
      case _: ShortType     => "Short"
      case _: IntegerType   => "Int"
      case _: LongType      => "Long"
      case _: FloatType     => "Float"
      case _: DoubleType    => "Double"
      case _: DecimalType   => "java.math.BigDecimal"
      case _: StringType    => "String"
      case _: BinaryType    => "Array[Byte]"
      case _: BooleanType   => "Boolean"
      case _: TimestampType => "java.sql.Timestamp"
      case _: DateType      => "java.sql.Date"
      case _: ArrayType     => "scala.collection.Seq"
      case _: MapType       => "scala.collection.Map"
      case _: StructType    => "org.apache.spark.sql.Row"
      case _                => "String"
    }

    println(schemaCaseClass(someDF.schema, "someDF"))