Spark DataFrames with complex and nested data

Asked: 2019-04-18 21:54:04

Tags: scala apache-spark apache-spark-sql azure-databricks

I currently have three DataFrames; call them dfA, dfB, and dfC.

dfA has three columns:


| Id | Name | Age |

dfB has, say, five columns. The second column is an FK reference back to a dfA record:


| Id | AId | Street | City | Zip |

dfC has three columns and likewise references dfA:


| Id | AId | SomeField |

Using Spark SQL, I can JOIN across the three:

%sql

SELECT * FROM dfA
INNER JOIN dfB ON dfA.Id = dfB.AId
INNER JOIN dfC ON dfA.Id = dfC.AId

I get my result set, but it comes back "flattened" into a single tabular result, as SQL does.

I would like to load it into a complex schema like this:

val destinationSchema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("age", StringType)
  .add("b",
       new StructType()
        .add("street", StringType, true)
        .add("city", StringType, true)
        .add("zip", StringType, true)
      )
  .add("c",
       new StructType()
        .add("somefield", StringType, true)
      )

Any ideas on how to take the result of the SELECT and save it to a DataFrame with that schema?

I ultimately want to save the complex StructType, or JSON, and load it into MongoDB using the Mongo Spark connector.

Alternatively, is there a better way to accomplish this starting from the three separate DataFrames (which were originally three separate CSV files that were read in)?

2 answers:

Answer 0 (score: 1)

Given the three DataFrames loaded from the CSV files, you can do the following:

import org.apache.spark.sql.functions._

val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .select($"id", $"name", $"age", struct($"street", $"city", $"zip") as "b", struct($"somefield") as "c")

val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))

This outputs:

row
{"id":100,"name":"John","age":"43","b":{"street":"Dark Road","city":"Washington","zip":"98002"},"c":{"somefield":"appples"}}
{"id":101,"name":"Sally","age":"34","b":{"street":"Light Ave","city":"Los Angeles","zip":"90210"},"c":{"somefield":"bananas"}}
{"id":102,"name":"Damian","age":"23","b":{"street":"Short Street","city":"New York","zip":"70701"},"c":{"somefield":"pears"}}
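Since the question's end goal is MongoDB, here is a minimal sketch of writing the nested result with the MongoDB Spark connector. It assumes the mongo-spark-connector 2.x package is attached to the cluster and an output URI is set in the Spark config; the database and collection names are illustrative, not from the post:

```scala
// Hedged sketch: write destDF (with its nested "b" and "c" structs) to MongoDB.
// Assumes spark.mongodb.output.uri is configured for the cluster.
destDF.write
  .format("mongo")                  // short name registered by the connector
  .mode("append")
  .option("database", "people")     // illustrative database name
  .option("collection", "profiles") // illustrative collection name
  .save()
```

Because destDF already carries the nested structs, the connector maps them to embedded BSON documents directly; there is no need to round-trip through to_json for this path.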

Answer 1 (score: 1)

The previous answer works if all the records have a 1:1 relationship.

Here is how to handle 1:M (hint: use collect_set to group the rows):

import org.apache.spark.sql.functions._

val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .groupBy($"id",$"name",$"age")
  .agg(collect_set(struct($"street",$"city",$"zip")) as "b",collect_set(struct($"somefield")) as "c")

val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))

display(jsonDestDF)

This gives you the following output:

row
{"id":102,"name":"Damian","age":"23","b":[{"street":"Short Street","city":"New York","zip":"70701"}],"c":[{"somefield":"pears"},{"somefield":"pineapples"}]}
{"id":100,"name":"John","age":"43","b":[{"street":"Dark Road","city":"Washington","zip":"98002"}],"c":[{"somefield":"appples"}]}
{"id":101,"name":"Sally","age":"34","b":[{"street":"Light Ave","city":"Los Angeles","zip":"90210"}],"c":[{"somefield":"grapes"},{"somefield":"peaches"},{"somefield":"bananas"}]}
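One caveat worth noting (an editorial addition, not part of the original answer): collect_set de-duplicates the structs and guarantees no ordering, while collect_list keeps every child row. If duplicate child rows are meaningful, the same pipeline can be sketched with collect_list instead:

```scala
import org.apache.spark.sql.functions._

// Same joins and grouping as above, but collect_list preserves duplicates
// (ordering within the resulting arrays is still not guaranteed).
val destListDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .groupBy($"id", $"name", $"age")
  .agg(collect_list(struct($"street", $"city", $"zip")) as "b",
       collect_list(struct($"somefield")) as "c")
```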

Here is the sample data I used, in case anyone wants to play along:

atable.csv:

100,"John",43
101,"Sally",34
102,"Damian",23
104,"Rita",14
105,"Mohit",23

btable.csv:

100,"Dark Road","Washington",98002
101,"Light Ave","Los Angeles",90210
102,"Short Street","New York",70701
104,"Long Drive","Buffalo",80345
105,"Circular Quay","Orlando",65403

ctable.csv:

100,"appples"
101,"bananas"
102,"pears"
101,"grapes"
102,"pineapples"
101,"peaches"
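The answer code above assumes atableDF, btableDF, and ctableDF already exist with named columns. A hedged sketch of reading the three headerless CSV files with explicit schemas (the paths are illustrative, and the FK column is named "id" here to match the join conditions used in the answers, even though the question calls it AId):

```scala
import org.apache.spark.sql.types._

val aSchema = new StructType()
  .add("id", IntegerType).add("name", StringType).add("age", StringType)
val bSchema = new StructType()
  .add("id", IntegerType).add("street", StringType)
  .add("city", StringType).add("zip", StringType)
val cSchema = new StructType()
  .add("id", IntegerType).add("somefield", StringType)

// No .option("header", "true"): the sample files carry no header row.
val atableDF = spark.read.schema(aSchema).csv("/data/atable.csv")
val btableDF = spark.read.schema(bSchema).csv("/data/btable.csv")
val ctableDF = spark.read.schema(cSchema).csv("/data/ctable.csv")
```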