Question

I have a CSV file with 10 columns. Half are strings and half are integers.

What is the Scala code to:

Create (infer) the schema
Save that schema to a file

So far I have this:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

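What is the best file format for saving the schema? Is it JSON?

The goal: I want to create the schema only once, and next time load it from a file instead of re-inferring it on the fly.

Thanks.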

Answer 1

The DataType API provides all the required utilities, so JSON is a natural choice:

import org.apache.spark.sql.types._
import scala.util.Try

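// toDF needs the SQL implicits (e.g. import sqlContext.implicits._); spark-shell imports them automatically
val df = Seq((1L, "foo", 3.0)).toDF("id", "x1", "x2")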
val serializedSchema: String = df.schema.json


def loadSchema(s: String): Option[StructType] =
  Try(DataType.fromJson(s)).toOption.flatMap {
    case st: StructType => Some(st)  // Only a struct is a valid DataFrame schema
    case _ => None
  }

loadSchema(serializedSchema)

Depending on your requirements, you can use standard Scala methods to write this to a file, or hack a Spark RDD:

val schemaPath: String = ???

sc.parallelize(Seq(serializedSchema), 1).saveAsTextFile(schemaPath)
val loadedSchema: Option[StructType] = sc.textFile(schemaPath)
  .map(loadSchema)  // Load
  .collect.headOption.flatten  // Make sure we don't fail if there is no data
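
For the "standard Scala methods" route mentioned above, a minimal sketch could look like the following. It assumes the schema file lives on the driver's local filesystem; `writeSchemaJson` and `readSchemaJson` are hypothetical helper names, not part of Spark.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical helper: write the schema's JSON string to a local file (overwrites an existing file).
def writeSchemaJson(path: String, json: String): Unit = {
  Files.write(Paths.get(path), json.getBytes(StandardCharsets.UTF_8))
  ()
}

// Hypothetical helper: read the JSON string back from the local file.
def readSchemaJson(path: String): String =
  new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
```

With these, `writeSchemaJson(schemaPath, df.schema.json)` would save the schema and `loadSchema(readSchemaJson(schemaPath))` would restore it. Unlike `saveAsTextFile`, this does not work against HDFS or other distributed storage.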

For the Python equivalent, see "Config file to define JSON Schema Struture in PySpark".
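
To close the loop with the question's goal (infer once, reuse afterwards), here is a hedged sketch of reading the CSV with a previously saved schema instead of `inferSchema`. It assumes the same `com.databricks.spark.csv` reader used in the question; `schemaFromJson` and `readCsvWithSchema` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper: parse saved schema JSON, failing loudly if it is not a struct.
def schemaFromJson(json: String): StructType =
  DataType.fromJson(json) match {
    case st: StructType => st
    case other =>
      throw new IllegalArgumentException(s"Expected a StructType, got ${other.typeName}")
  }

// Hypothetical helper: read the CSV with an explicit schema, so no inference pass is needed.
def readCsvWithSchema(sqlContext: SQLContext, schema: StructType, path: String): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // First line is still a header and must be skipped
    .schema(schema)           // Use the saved schema instead of inferSchema
    .load(path)
```

Supplying the schema explicitly also skips the extra pass over the data that inference requires.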

How to create a schema from a CSV file and persist/save that schema to a file?