Question

I am trying to read JSON data with Apache Spark. This is the code I have tried so far:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("ExplodeDemo")
  .setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.json("file location")
df.printSchema()

This works fine when I pass a file name as the argument to sqlContext.read.json, but my requirement is to pass the JSON string directly.

To do that, I tried this:

val rdd = sc.parallelize(Seq(r))
val df = sqlContext.read.json(rdd)
df.printSchema()

Here r is my JSON string. This code compiles without errors, but when I call df.printSchema() it shows the following, and I cannot retrieve the data.

root
 |-- _corrupt_record: string (nullable = true)

Answer 1

Well, you also need to provide the schema.

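A DataFrame is just an RDD with a schema. When you use the Data Source API, Spark infers the schema by reading the file. Since you are not using the Data Source API to infer the schema automatically, you need to pass the schema explicitly.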

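import org.apache.spark.sql.types._

val YOURSCHEMA = StructType(Array(
  StructField("Attribute1", LongType, true),
  StructField("Attribute2", IntegerType, true)))
// `spark` is a SparkSession (Spark 2.x+); with the question's setup, call sqlContext.read instead
val df = spark.read.schema(YOURSCHEMA).json(rdd)
df.printSchema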
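
For anyone who wants to try this end to end, here is a minimal sketch that combines the question's setup with an explicit schema. The sample JSON string, the object name JsonStringDemo and the helper readJsonString are made up for illustration. The attribute names and types are only the placeholders from the snippet above, so replace them with the fields of your real data. It assumes the same SQLContext-based API as the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

object JsonStringDemo {

  // Placeholder schema (assumption): adjust names and types to your own JSON.
  val schema: StructType = StructType(Array(
    StructField("Attribute1", LongType, nullable = true),
    StructField("Attribute2", IntegerType, nullable = true)))

  // Hypothetical helper: wraps one JSON string in an RDD and reads it
  // with the explicit schema instead of relying on inference.
  def readJsonString(sqlContext: SQLContext, json: String): DataFrame = {
    val rdd = sqlContext.sparkContext.parallelize(Seq(json))
    sqlContext.read.schema(schema).json(rdd)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonStringDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Made-up sample input: one complete JSON object in a single string.
    val r = """{"Attribute1": 1234567890123, "Attribute2": 42}"""

    val df = readJsonString(sqlContext, r)
    df.printSchema()
    df.show() // check the values too, not only the schema

    sc.stop()
  }
}
```

Note that _corrupt_record is the column Spark uses for records it could not parse, so an explicit schema changes what printSchema reports, but a string that is not valid JSON may still come back as null values. That is why the sketch calls df.show() as well.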

Unable to read JSON data using Spark
