I am trying to migrate my application from Spark 1.6.2 to 2.0.0, and my problem is creating a Dataset from a DataFrame (a parquet file that I read).
I know I can type the DataFrame with a case class or a tuple and then get a Dataset, but before runtime I don't know which data the user will load, so I don't know the type or the number of columns.
To load the data I read the parquet file through a SparkSession, simply like this:
spark.read.schema(schema).parquet(dataPath)
Here schemaOfData (the schema passed to read above) is a StructType built from a List[Map[String, String]] holding the name and type of each column (every column is a Double unless it is a String).
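For reference, a minimal sketch of how such a StructType might be built from that list; the "name"/"type" keys and the sample columns are only illustrative assumptions, since the question just says the list holds each column's name and type:

import org.apache.spark.sql.types._

// Hypothetical runtime description of the columns: one Map per column.
val columnSpecs: List[Map[String, String]] = List(
  Map("name" -> "product", "type" -> "String"),
  Map("name" -> "price",   "type" -> "Double")
)

// Every column is a Double unless it is declared as a String.
val schemaOfData = StructType(columnSpecs.map { col =>
  val dt = if (col("type") == "String") StringType else DoubleType
  StructField(col("name"), dt, nullable = true)
})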
I found this on StackOverflow, but I find it hard to understand, and I wonder whether there is a simpler way to solve my problem: Dynamically compiling scala class files at runtime in Scala 2.11
Thanks.
Answer 0 (score: 0)
Create an implicit conversion from Spark data types to Scala native data types, and then map it over the StructFields of the DataFrame's schema to build the case class source:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder
  .appName("Movies Reviews")
  .config("spark.master", "local")
  .getOrCreate()

import spark.implicits._

// Small sample DataFrame used to demonstrate the generation below.
val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.printSchema()
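// For this sample, printSchema() should show roughly:
// root
//  |-- number: integer (nullable = false)
//  |-- word: string (nullable = true)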
// Builds the source of a case class whose fields mirror the given schema;
// nullable columns are wrapped in Option.
def schemaCaseClass(schema: StructType, className: String)
                   (implicit sparkTypeScala: DataType => String): String = {

  def structField(col: StructField): String = {
    val sparkTypes = sparkTypeScala(col.dataType)
    col match {
      case x if x.nullable => s" ${col.name}:Option[$sparkTypes]"
      case _               => s" ${col.name}:$sparkTypes"
    }
  }

  val fieldsName = schema.map(structField).mkString(",\n ")
  s"""
     |case class $className (
     | $fieldsName
     |)
  """.stripMargin
}
// Implicit mapping from Spark SQL DataTypes to Scala type names;
// anything unhandled falls back to String.
implicit val scalaTypes: DataType => String = {
  case _: ByteType      => "Byte"
  case _: ShortType     => "Short"
  case _: IntegerType   => "Int"
  case _: LongType      => "Long"
  case _: FloatType     => "Float"
  case _: DoubleType    => "Double"
  case _: DecimalType   => "java.math.BigDecimal"
  case _: StringType    => "String"
  case _: BinaryType    => "Array[Byte]"
  case _: BooleanType   => "Boolean"
  case _: TimestampType => "java.sql.Timestamp"
  case _: DateType      => "java.sql.Date"
  case _: ArrayType     => "scala.collection.Seq"
  case _: MapType       => "scala.collection.Map"
  case _: StructType    => "org.apache.spark.sql.Row"
  case _                => "String"
}
println(schemaCaseClass(someDF.schema, "someDF"))
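On the sample DataFrame above, that last println should output roughly the following (assuming the usual toDF behaviour, where the primitive Int column comes out non-nullable and the String column nullable):

case class someDF (
  number:Int,
  word:Option[String]
)

Note that this only gives you the case class as source text; to actually get a typed Dataset from it you would still need to compile that source at runtime (for example with the toolbox approach from the question linked above), since Spark derives the Encoder for a case class at compile time.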