I have a text file with the following structure.
(employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, Description, Duration, Role}]})
For example:
(123456, Employee1, {"ProjectDetails": [
{ "ProjectName": "Web Develoement", "Description": "Online Sales website", "Duration": "6 Months", "Role": "Developer" },
{ "ProjectName": "Spark Develoement", "Description": "Online Sales Analysis", "Duration": "6 Months", "Role": "Data Engineer" },
{ "ProjectName": "Scala Training", "Description": "Training", "Duration": "1 Month" }
]
})
Can someone help me parse this? How can I flatten the records into the following DataFrame using Scala?
employeeID, Name, ProjectName, Description, Duration, Role
123456, Employee1, Web Develoement, Online Sales website, 6 Months, Developer
123456, Employee1, Spark Develoement, Online Sales Analysis, 6 Months, Data Engineer
123456, Employee1, Scala Training, Training, 1 Month, null
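Answer 0 (score: 0)
You can try the following. Note that I modified the input structure slightly, because the first two columns are not in JSON format.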
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.sql.SQLContext
scala> import org.apache.spark.sql._
scala> val sqlSC = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlSC.implicits._
// jsonFile is the Spark 1.x API and expects one JSON object per line; from Spark 1.4 on, sqlSC.read.json(path) replaces it.
scala> val emp_DF = sqlSC.jsonFile("file:///C:/Users/ABCD/Desktop/Examples/Spark/Mailing List/Employee_Nested_Projects.json")
scala> case class ProjectInfo(ProjectName: String, Description: String, Duration: String, Role: String)
// Read the fields by name (Row.getAs, Spark 1.4+): the inferred JSON schema sorts struct fields alphabetically,
// so positional access such as x(0) would return Description rather than ProjectName.
scala> val emp_projects_DF = emp_DF.explode(emp_DF("ProjectDetails")) {
  case Row(projects: Seq[Row]) => projects.map(p => ProjectInfo(p.getAs[String]("ProjectName"), p.getAs[String]("Description"), p.getAs[String]("Duration"), p.getAs[String]("Role")))
}
scala> emp_projects_DF.select($"employeeID", $"Name", $"ProjectName", $"Description", $"Duration", $"Role").show()
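On Spark 2.x and later, where jsonFile and DataFrame.explode are no longer the recommended route, the same flattening can probably be written with the explode column function. The snippet below is only a sketch under stated assumptions: Spark 2.2 or newer, a local SparkSession created just for illustration, and an input already reshaped so that employeeID and Name sit inside the JSON object (one object per line). The names spark, lines, empDF and flat are made up for this example and do not come from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: a local session for illustration; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("FlattenProjects").getOrCreate()
import spark.implicits._

// Assumed reshaped input: one JSON object per record, employeeID and Name moved inside it.
// The literal is split into pieces only for readability; it is a single JSON document.
val lines = Seq(
  """{"employeeID": 123456, "Name": "Employee1", "ProjectDetails": [""" +
  """{"ProjectName": "Web Develoement", "Description": "Online Sales website", "Duration": "6 Months", "Role": "Developer"},""" +
  """{"ProjectName": "Spark Develoement", "Description": "Online Sales Analysis", "Duration": "6 Months", "Role": "Data Engineer"},""" +
  """{"ProjectName": "Scala Training", "Description": "Training", "Duration": "1 Month"}]}"""
).toDS()

// DataFrameReader.json(Dataset[String]) is available from Spark 2.2.
val empDF = spark.read.json(lines)

// explode emits one row per array element; the struct fields are then selected by name,
// so the alphabetical ordering of the inferred schema does not matter.
val flat = empDF
  .select($"employeeID", $"Name", explode($"ProjectDetails").as("p"))
  .select($"employeeID", $"Name", $"p.ProjectName", $"p.Description", $"p.Duration", $"p.Role")

flat.show(false)
```

Because the schema is inferred across all array elements, the "Scala Training" project, which has no Role key, should come out with a null Role, matching the expected output in the question. For a real file, spark.read.json(path) would take the place of the in-memory lines.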