Spark Scala - DataFrame - Convert a text file with a schema (input) to a complex JSON schema (output)

Date: 2018-01-23 16:56:24

Tags: arrays json scala apache-spark spark-dataframe

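I have the input and output schemas shown below. I am new to Spark and Scala. Could someone help me transform the DataFrame loaded from the text file and finally write it out as a JSON file?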

INPUT

    val OTBCleanFile: Array[StructField] = Array(
          StructField("varbl_1_txt", StringType, true),
          StructField("varbl_2_txt", StringType, true),
          StructField("varbl_3_txt", StringType, true),
          StructField("varbl_4_txt", StringType, true),
          StructField("varbl_5_txt", StringType, true),
          StructField("varbl_6_txt", StringType, true),
          StructField("varbl_7_txt", StringType, true),
          StructField("varbl_8_txt", StringType, true),
          StructField("varbl_9_txt", StringType, true),
          StructField("varbl_10_txt", StringType, true),
          StructField("varbl_11_txt", StringType, true),
          StructField("varbl_12_txt", StringType, true),
          StructField("varbl_13_txt", StringType, true),
          StructField("varbl_14_txt", StringType, true),
          StructField("varbl_15_txt", StringType, true),
          StructField("email", StringType, true))

OUTPUT

    val JsonFileScma = (new StructType)
      .add("col1", (new StructType)
        .add("col2", StringType)
        .add("col3", StringType)
        .add("col4", StringType)
        .add("col5", StringType)
        .add("col6", StringType)
        .add("col7", StringType))
      .add("email", (new StructType)
        .add("type", StringType)
        .add("value", StringType))
      .add("templateId", StringType)

The mapping can be one-to-one, and a few fields from the input file/schema can be left out.

Thanks in advance. Regards, Dattu

1 Answer:

Answer 0: (score: 0)

**Working code**

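    // Initialize
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val conf = new SparkConf().setAppName("process_text_to_json").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)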

    // Read the ISO-8859-1 encoded input file

    val delimiter = "\u00C7" // 'Ç', byte 0xC7 (octal 307) in ISO-8859-1
    val OTBInputDF = sqlc.read
      .format("com.databricks.spark.csv")
      .option("header", "false") // treat the first line as data, not as a header
      .option("delimiter", delimiter)
      .option("charset", "iso-8859-1")
      .schema(StructType(OTBCleanFile))
      .load("cln.dat")

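    // Convert to JSON
    import org.apache.spark.sql.functions.{col, struct, to_json}

    // Rename the input columns (varbl_1_txt is reused for col1..col5, as in the original answer)
    val data = OTBInputDF.selectExpr(
      "varbl_1_txt as col1", "varbl_1_txt as col2", "varbl_1_txt as col3",
      "varbl_1_txt as col4", "varbl_1_txt as col5", "email as col6")

    // Pack col1..col5 into a single JSON string column; col6 stays a plain column
    val data2 = data.select(
      to_json(struct(col("col1"), col("col2"), col("col3"), col("col4"), col("col5"))) as "clientTag",
      col("col6"))
    data2.show()
    data2.printSchema()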
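
The code above produces a JSON string column named clientTag, not the nested layout of JsonFileScma from the question. A minimal sketch of one way to get that nested shape is to build the structs directly with `struct` and let the JSON writer serialize them. Everything specific here is an assumption and not from the original post: the object name `NestedJsonSketch`, the one-to-one mapping of varbl_1_txt..varbl_6_txt to col2..col7, the literal "email" used for the type field, and the "template-1" value for templateId.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, struct}

object NestedJsonSketch {
  // Reshape the flat input columns into the nested layout of JsonFileScma.
  // The column mapping and both literal values are hypothetical placeholders;
  // substitute the real mapping for your data.
  def toNested(flat: DataFrame): DataFrame =
    flat.select(
      // col1 is a struct holding col2..col7
      struct(
        col("varbl_1_txt").as("col2"),
        col("varbl_2_txt").as("col3"),
        col("varbl_3_txt").as("col4"),
        col("varbl_4_txt").as("col5"),
        col("varbl_5_txt").as("col6"),
        col("varbl_6_txt").as("col7")
      ).as("col1"),
      // email is a struct with a constant type and the input email as value
      struct(
        lit("email").as("type"),
        col("email").as("value")
      ).as("email"),
      // templateId is a constant string in this sketch
      lit("template-1").as("templateId")
    )
}
```

With the DataFrame from the answer, this could be used as `NestedJsonSketch.toNested(OTBInputDF).write.json("out_json")`, where "out_json" is a made-up output directory. Spark writes one JSON object per line and creates a directory of part files, not a single file.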