Below I use one of my datasets as an example. This is the output of df.printSchema():

member: struct (nullable = true)
| address: struct (nullable = true)
| | city: string (nullable = true)
| | state: string (nullable = true)
| | streetAddress: string (nullable = true)
| | zipCode: string (nullable = true)
| birthDate: string (nullable = true)
| groupIdentification: string (nullable = true)
| memberCode: string (nullable = true)
| patientName: struct (nullable = true)
| | first: string (nullable = true)
| | last: string (nullable = true)
memberContractCode: string (nullable = true)
memberContractType: string (nullable = true)
memberProductCode: string (nullable = true)
This data was read from JSON, and I want to flatten it so that everything is at the same level and my DataFrame contains only primitive types, like this:
member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...
I know this can be done by specifying the column names manually, like this:
df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...
However, since the program needs to handle new datasets dynamically without any change to the actual code, I cannot hard-code the column names of all the datasets as above. I would like to write a generic method that can flatten a struct of any shape, given that it is already in a DataFrame and its schema is known (but is a subset of the full schema). Is this possible in Spark 1.6? If so, how?
Answer 0 (score: 4):
This should do it - you need to iterate over the schema and "flatten" it, handling fields of type StructType separately from "simple" fields:
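A sketch of that approach (Spark 1.6 Scala API assumed; `df` is the DataFrame from the question, and `flattenSchema` is a helper name chosen here, not part of Spark):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively walk the schema and collect one Column per leaf field,
// aliased with its fully-qualified dotted name (e.g. "member.address.city").
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else s"$prefix.${f.name}"
    f.dataType match {
      // Nested struct: recurse with the extended prefix
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: select it and keep the dotted path as the column name
      case _              => Array(col(colName).as(colName))
    }
  }
}

// Select all leaf columns at once, producing a fully flat DataFrame
val flatDf = df.select(flattenSchema(df.schema): _*)
flatDf.printSchema()
```

Note that the resulting column names contain literal dots, so any later reference to them needs backticks, e.g. ``flatDf.select("`member.address.city`")``; alternatively, replace the dots with underscores in the `as(...)` alias to avoid that.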