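I have the input and output schemas shown below. I am new to Spark and Scala. Can someone help me transform a DataFrame loaded from a text file and then write it out as a JSON file?
Input: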
val OTBCleanFile: Array[StructField] = Array(
  StructField("varbl_1_txt", StringType, true),
  StructField("varbl_2_txt", StringType, true),
  StructField("varbl_3_txt", StringType, true),
  StructField("varbl_4_txt", StringType, true),
  StructField("varbl_5_txt", StringType, true),
  StructField("varbl_6_txt", StringType, true),
  StructField("varbl_7_txt", StringType, true),
  StructField("varbl_8_txt", StringType, true),
  StructField("varbl_9_txt", StringType, true),
  StructField("varbl_10_txt", StringType, true),
  StructField("varbl_11_txt", StringType, true),
  StructField("varbl_12_txt", StringType, true),
  StructField("varbl_13_txt", StringType, true),
  StructField("varbl_14_txt", StringType, true),
  StructField("varbl_15_txt", StringType, true),
  StructField("email", StringType, true))
Output:
val JsonFileScma = (new StructType)
  .add("col1", (new StructType)
    .add("col2", StringType)
    .add("col3", StringType)
    .add("col4", StringType)
    .add("col5", StringType)
    .add("col6", StringType)
    .add("col7", StringType))
  .add("email", (new StructType)
    .add("type", StringType)
    .add("value", StringType))
  .add("templateId", StringType)
The mapping can be one-to-one, and a few fields from the input file/schema can be left out.
Thanks in advance. Regards, Dattu
Answer 0 (score: 0)
**Working code**
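// Imports (to_json requires Spark 2.1 or later)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// Initialize
val conf = new SparkConf().setAppName("process_text_to_json").setMaster("local")
val sc = new SparkContext(conf)
val sqlc = new org.apache.spark.sql.SQLContext(sc)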
// Read the ISO-8859-1 encoded file; fields are separated by 'Ç' (octal \307, U+00C7).
// The Unicode escape is used because octal escapes are deprecated in newer Scala versions.
val delimiter = "\u00C7"
val OTBInputDF = sqlc.read
  .format("com.databricks.spark.csv")
  .option("header", "false") // The file has no header row
  .option("delimiter", delimiter)
  .option("charset", "iso-8859-1")
  .schema(StructType(OTBCleanFile))
  .load("cln.dat")
// Convert to JSON: rename the input columns one-to-one, then pack col1..col5 into a single JSON string column
val data = OTBInputDF.selectExpr("varbl_1_txt as col1", "varbl_2_txt as col2", "varbl_3_txt as col3", "varbl_4_txt as col4", "varbl_5_txt as col5", "email as col6")
val data2 = data.select(to_json(struct(col("col1"), col("col2"), col("col3"), col("col4"), col("col5"))) as "clientTag", col("col6"))
data2.show()
data2.printSchema()
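The code above produces a JSON string column, but the question asks for the nested layout of JsonFileScma written to a JSON file. The following is a minimal sketch of one way that could be done, assuming Spark 2.x or later with SparkSession. The object name NestedJsonSketch, the choice of which varbl_*_txt fields feed col2 to col7, the literal values for email.type and templateId, and the output path out_json are all hypothetical placeholders, since the question does not specify them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}

object NestedJsonSketch {

  // Reshape flat input columns into the nested layout of JsonFileScma.
  // Which input fields feed col2..col7 is an assumption (one-to-one, first six).
  def toNested(flat: DataFrame): DataFrame =
    flat.select(
      struct(
        col("varbl_1_txt").as("col2"),
        col("varbl_2_txt").as("col3"),
        col("varbl_3_txt").as("col4"),
        col("varbl_4_txt").as("col5"),
        col("varbl_5_txt").as("col6"),
        col("varbl_6_txt").as("col7")
      ).as("col1"),
      struct(
        lit("to").as("type"),            // hypothetical constant, not from the question
        col("email").as("value")
      ).as("email"),
      lit("template-001").as("templateId") // hypothetical constant, not from the question
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested_json_sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory stand-in for the DataFrame loaded from cln.dat.
    val flat = Seq(("a", "b", "c", "d", "e", "f", "x@example.com"))
      .toDF("varbl_1_txt", "varbl_2_txt", "varbl_3_txt",
            "varbl_4_txt", "varbl_5_txt", "varbl_6_txt", "email")

    val nested = toNested(flat)
    nested.printSchema()

    // Writes a directory of part files, one JSON object per line.
    nested.write.mode("overwrite").json("out_json")

    spark.stop()
  }
}
```

With the real data, OTBInputDF would be passed to toNested in place of the in-memory sample. Using struct directly, rather than to_json, keeps the columns as real nested fields, so the JSON writer emits nested objects instead of an escaped string.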