How can I explode (flatten) structs in a DataFrame without hard-coding the column names?

Asked: 2017-11-03 20:17:01

Tags: scala apache-spark dataframe apache-spark-sql

Take one of my datasets as an example. This is the output of df.printSchema():

 |-- member: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- streetAddress: string (nullable = true)
 |    |    |-- zipCode: string (nullable = true)
 |    |-- birthDate: string (nullable = true)
 |    |-- groupIdentification: string (nullable = true)
 |    |-- memberCode: string (nullable = true)
 |    |-- patientName: struct (nullable = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- memberContractCode: string (nullable = true)
 |-- memberContractType: string (nullable = true)
 |-- memberProductCode: string (nullable = true)

This data is read from JSON. I want to flatten it so that every field sits at the same level and the DataFrame contains only primitive types, like this:

member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...

I know this can be done by specifying the column names manually, like this:

df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...

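However, the program has to handle new datasets dynamically, without any change to the code, so I cannot hard-code the column names for every dataset as above. I would like to write a generic method that can explode any struct, given that the data is already in a DataFrame and its schema is known (though it is a subset of the full schema). Is this possible in Spark 1.6? If so, how?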

1 answer:

Answer 0 (score: 4)

This should do it: you need to iterate over the schema and "flatten" it, handling fields of type StructType separately from "simple" fields:
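
A minimal sketch of this approach, written for pasting into spark-shell. The helper names flattenPaths and flattenDataFrame are hypothetical, not part of Spark. It assumes that field names contain no dots themselves and that only struct fields need unnesting (arrays and maps of structs are left as they are):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper: walk the schema recursively and collect the dotted
// path of every leaf (non-struct) field, e.g. "member.address.city".
def flattenPaths(schema: StructType, prefix: Option[String] = None): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    val path = prefix.map(p => s"$p.${field.name}").getOrElse(field.name)
    field.dataType match {
      case nested: StructType => flattenPaths(nested, Some(path)) // descend into the struct
      case _                  => Seq(path)                        // "simple" field: keep its path
    }
  }

// Hypothetical helper: select every leaf field as a top-level column,
// aliased with its full dotted path.
def flattenDataFrame(df: DataFrame): DataFrame = {
  val columns: Seq[Column] = flattenPaths(df.schema).map(path => col(path).as(path))
  df.select(columns: _*)
}
```

Calling flattenDataFrame(df) should then give a DataFrame whose columns are all at the top level. Because the resulting column names contain dots, later references need backticks, for example flat.select("`member.address.city`"). If that is inconvenient, aliasing with path.replace(".", "_") instead is one alternative.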