How to read multi-nested JSON data in Spark

Date: 2018-02-07 12:08:24

Tags: apache-spark apache-spark-sql

How do I read multi-nested JSON data in Spark? I have a JSON file whose schema is shown here: Json Schema

I need to flatten this schema into TherapeuticArea line items with the following columns:

trialTherapeuticAreas_ID,trialTherapeuticAreas_name,trialDiseases_id,trialDiseases_name,trialPatientSegments_id,trialPatientSegments_name

1 Answer:

Answer 0 (score: 0)

You need to explode the arrays level by level and select the struct elements into separate columns. For that you need the explode built-in function, the select API, and aliasing.
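Both snippets below assume the nested JSON is already loaded into a DataFrame named df. The question's schema link is not reproduced here, so the following is only a sketch of how such a file might be read; the file path and the exact nesting are assumptions inferred from the requested output columns:

    // Sketch only: the path and the nested layout are assumptions, not the asker's actual file.
    // multiLine is needed when each JSON record spans several lines.
    val df = spark.read
      .option("multiLine", "true")
      .json("/path/to/trials.json")

    // Assumed nesting, roughly:
    // root
    //  |-- trialTherapeuticAreas: array<struct<id, name,
    //  |       trialDiseases: array<struct<id, name,
    //  |           trialPatientSegments: array<struct<id, name>>>>>>
    df.printSchema()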

Code for this would be:

    import org.apache.spark.sql.functions._

    val finalDF = df
      // explode the outer array: one row per therapeutic area
      .withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
      // pull out the area fields and explode the nested trialDiseases array
      .select(
        col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"),
        col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"),
        explode(col("trialTherapeuticAreas.trialDiseases")).as("trialDiseases"))
      // pull out the disease fields and explode the nested trialPatientSegments array
      .select(
        col("trialTherapeuticAreas_ID"),
        col("trialTherapeuticAreas_name"),
        col("trialDiseases.id").as("trialDiseases_id"),
        col("trialDiseases.name").as("trialDiseases_name"),
        explode(col("trialDiseases.trialPatientSegments")).as("trialPatientSegments"))
      // finally pull out the patient-segment fields
      .select(
        col("trialTherapeuticAreas_ID"),
        col("trialTherapeuticAreas_name"),
        col("trialDiseases_id"),
        col("trialDiseases_name"),
        col("trialPatientSegments.id").as("trialPatientSegments_id"),
        col("trialPatientSegments.name").as("trialPatientSegments_name"))

which should meet your requirement.
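To sanity-check the flattened result, one might inspect or persist it like this (hypothetical usage; the output path is made up):

    // Inspect the flat, line-item schema and a few rows
    finalDF.printSchema()
    finalDF.show(false)

    // Or persist the line items, e.g. as CSV with a header row
    finalDF.write.option("header", "true").csv("/tmp/therapeutic_area_line_items")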

You can also perform the above transformation with three withColumn calls and one select statement:

    import org.apache.spark.sql.functions._

    val finalDF = df
      // explode each level of nesting into its own column
      .withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
      .withColumn("trialDiseases", explode(col("trialTherapeuticAreas.trialDiseases")))
      .withColumn("trialPatientSegments", explode(col("trialDiseases.trialPatientSegments")))
      // then project the struct fields into flat columns
      .select(
        col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"),
        col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"),
        col("trialDiseases.id").as("trialDiseases_id"),
        col("trialDiseases.name").as("trialDiseases_name"),
        col("trialPatientSegments.id").as("trialPatientSegments_id"),
        col("trialPatientSegments.name").as("trialPatientSegments_name"))

Successive use of withColumn like this is not recommended on large datasets, as it may give random output. The reason is that withColumn is executed in a distributed fashion, and the order of execution is not guaranteed to be sequential.
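If one prefers to avoid the chained withColumn pattern altogether, the same flattening can also be written as Spark SQL with LATERAL VIEW explode. A sketch (the temporary view name trials is an assumption):

    // Register the nested DataFrame as a temp view and flatten it with LATERAL VIEW explode
    df.createOrReplaceTempView("trials")

    val sqlDF = spark.sql("""
      SELECT ta.id   AS trialTherapeuticAreas_ID,
             ta.name AS trialTherapeuticAreas_name,
             d.id    AS trialDiseases_id,
             d.name  AS trialDiseases_name,
             ps.id   AS trialPatientSegments_id,
             ps.name AS trialPatientSegments_name
      FROM trials
      LATERAL VIEW explode(trialTherapeuticAreas) areas AS ta
      LATERAL VIEW explode(ta.trialDiseases) diseases AS d
      LATERAL VIEW explode(d.trialPatientSegments) segments AS ps
    """)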