I have to create a custom org.apache.spark.sql.types.StructType schema object using the information in a JSON file. The JSON file can be anything, so I have parameterized it in a properties file.
This is what the properties file looks like:
# path to the schema of the output file (by default the schema of the target Parquet is inferred). If present, the schema must be in JSON format, applicable to a DataFrame (see StructType.fromJson)
schema.parquet=/Users/XXXX/Desktop/generated_schema.json
writing.mode=overwrite
separator=;
header=false
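For context, here is a minimal sketch of how these values could be read with a plain java.util.Properties loader (my actual loading code is not shown here, and the file path below is only illustrative):
import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
props.load(new FileInputStream("/Users/XXXX/Desktop/parametrizacion.properties"))
val mra_schema_parquet = props.getProperty("schema.parquet")
val saveMode = props.getProperty("writing.mode")
val separator = props.getProperty("separator")
val header = props.getProperty("header")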
The file generated_schema.json looks like this:
{"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
So, this is the code I thought would solve the problem:
import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.types.{DataType, StructType}

val path: Path = new Path(mra_schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
val inputStream: FSDataInputStream = fileSystem.open(path)
val schema_json = Stream.cons(inputStream.readLine(), Stream.continually(inputStream.readLine()))
System.out.println("schema_json looks like " + schema_json.head)
val mySchemaStructType: DataType = DataType.fromJson(schema_json.head)
/*
After this line, mySchemaStructType has the four StructField objects inside it, the same ones that appear in schema_json.
*/
logger.info(mySchemaStructType)
val myStructType = new StructType()
myStructType.add("mySchemaStructType", mySchemaStructType)
/*
After this line, myStructType has zero StructFields! The bug must be here: myStructType should have the four StructFields that represent the loaded JSON schema! But how can I construct the necessary StructType object?
*/
myDF = loadCSV(sqlContext, path_input_csv, separator, myStructType, header)
System.out.println("myDF.schema.json looks like " + myDF.schema.json)
inputStream.close()
myDF.write
  .format("com.databricks.spark.csv")
  .option("header", header)
  .option("delimiter", separator)
  .option("nullValue", "")
  .option("treatEmptyValuesAsNulls", "true")
  .mode(saveMode)
  .parquet(pathParquet)
When the code runs the last line, .parquet(pathParquet), this exception is thrown:
parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root {
}
The output of this code is as follows:
16/11/11 13:57:04 INFO AnotherCSVtoParquet$: The job started using this propertie file: /Users/aisidoro/Desktop/mra-csv-converter/parametrizacion.properties
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_input_csv is /Users/aisidoro/Desktop/mra-csv-converter/cds_glcs.csv
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_output_parquet is /Users/aisidoro/Desktop/output900000
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: mra_schema_parquet is /Users/aisidoro/Desktop/mra-csv-converter/generated_schema.json
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: writting_mode is overwrite
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: separator is ;
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: header is false
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: ATTENTION! aplying mra_schema_parquet /Users/aisidoro/Desktop/mra-csv-converter/generated_schema.json
schema_json looks like {"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
16/11/11 13:57:12 INFO AnotherCSVtoParquet$: StructType(StructField(codigo,StringType,true), StructField(otro,StringType,true), StructField(vacio,StringType,true), StructField(final,StringType,true))
16/11/11 13:57:13 INFO AnotherCSVtoParquet$: loadCSV. header is false, inferSchema is false pathCSV is /Users/aisidoro/Desktop/mra-csv-converter/cds_glcs.csv separator is ;
myDF.schema.json looks like {"type":"struct","fields":[]}
The schema_json object and the myDF.schema.json object should have the same content, shouldn't they? But that is not happening. I think this is where the bug must start.
Finally, the job crashes with this exception:
parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root {
}
The thing is, if I do not provide any JSON schema file, the job works fine, but with this schema...
Can anybody help me? I just want to create some Parquet files starting from a CSV file and a JSON schema file.
Thanks.
The dependencies are:
<spark.version>1.5.0-cdh5.5.2</spark.version>
<databricks.version>1.5.0</databricks.version>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${spark.version}</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>${databricks.version}</version>
</dependency>
UPDATE
I can see that there is an open issue,
Answer 0 (score: 2)
Since you said Custom Schema, you can do it like this:
val schema = (new StructType).add("field1", StringType).add("field2", StringType)
sqlContext.read.schema(schema).json("/json/file/path").show
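Note that this chained style works because StructType is immutable: every .add call returns a new StructType, and it is the returned value that carries the fields. That is also why the code in the question ends up with an empty schema: the return value of add is discarded there. A minimal sketch of the difference:
import org.apache.spark.sql.types.{StringType, StructType}

val s0 = new StructType()
val s1 = s0.add("field1", StringType)
println(s0.fields.length) // 0 — s0 was never modified
println(s1.fields.length) // 1 — the fields live in the returned copy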
You can create a nested JSON schema like the one below.
For example:
{
  "field1": {
    "field2": {
      "field3": "create",
      "field4": 1452121277
    }
  }
}
val schema = (new StructType)
  .add("field1", (new StructType)
    .add("field2", (new StructType)
      .add("field3", StringType)
      .add("field4", LongType)
    )
  )
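As a usage sketch (reusing the illustrative /json/file/path from above), the nested fields can then be addressed with dot notation:
val df = sqlContext.read.schema(schema).json("/json/file/path")
df.select("field1.field2.field3", "field1.field2.field4").show()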
Answer 1 (score: 0)
Finally I found the problem.
The problem was in the following lines:
val myStructType = new StructType()
myStructType.add("mySchemaStructType",mySchemaStructType)
I had to use this line instead:
val mySchemaStructType = DataType.fromJson(schema_json.head).asInstanceOf[StructType]
I had to cast from DataType to StructType to make things work.
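Putting it together, a minimal sketch of the corrected schema-loading path (assuming the same names as in the question, e.g. the loadCSV helper and the parameterized paths):
import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.types.{DataType, StructType}

val path: Path = new Path(mra_schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
val inputStream: FSDataInputStream = fileSystem.open(path)
// the schema file is a single JSON line, so one readLine() is enough
val schemaJson = inputStream.readLine()
inputStream.close()

// DataType.fromJson returns a DataType; the cast recovers the StructType
val mySchemaStructType = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val myDF = loadCSV(sqlContext, path_input_csv, separator, mySchemaStructType, header)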