How do I flatten a nested struct inside an array?

Time: 2017-06-26 17:21:01

Tags: java apache-spark apache-spark-sql

This is my current schema:

 |-- _id: string (nullable = true)
 |-- person: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- adr1: struct (nullable = true)
 |    |    |    |-- resid: string (nullable = true)

This is what I want to obtain:

 |-- _id: string (nullable = true)
 |-- person: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- resid: string (nullable = true)

I am using the Java API.

2 answers:

Answer 0 (score: 3)

You can use a map transformation:

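import java.util.stream.Collectors;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

Encoder<PeopleFlatten> peopleFlattenEncoder = Encoders.bean(PeopleFlatten.class);

// `people` is a typed Dataset; `People` is an assumed name for its bean class.
// The MapFunction cast avoids an ambiguous-overload error on Scala 2.12+ builds of Spark.
Dataset<PeopleFlatten> flattened = people
  .map((MapFunction<People, PeopleFlatten>) person -> new PeopleFlatten(
      person.get_id(),
      // Rebuild each array element with resid lifted out of adr1.
      person.getPerson().stream().map(p ->
        new PersonFlatten(
          p.getName(),
          p.getAdr1().getResid()
        )
      ).collect(Collectors.toList())
    ),
    peopleFlattenEncoder
  );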

where PeopleFlatten and PersonFlatten are POJOs that correspond to the expected schema.

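import java.io.Serializable;
import java.util.List;

public class PeopleFlatten implements Serializable {
   private String _id;
   private List<PersonFlatten> person;
   // no-arg constructor (required for a Java bean), the two-argument
   // constructor used in the map call, getters and setters
}

public class PersonFlatten implements Serializable {
   private String name;
   private String resid;
   // no-arg constructor, two-argument constructor, getters and setters
}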
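
The per-element flattening performed inside the map call can be tried without Spark. The following is a minimal sketch under stated assumptions: FlattenSketch, Address and Person are hypothetical names standing in for the input beans (the answer does not show them), and fields are read directly instead of through getters to keep it short. Unlike the lambda in the answer, it tolerates a null adr1, a null element, or a null array, since the schema marks all three as nullable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FlattenSketch {

    // Hypothetical stand-ins for the nested input beans (names assumed).
    static final class Address {
        final String resid;
        Address(String resid) { this.resid = resid; }
    }

    static final class Person {
        final String name;
        final Address adr1; // may be null: the schema marks adr1 as nullable
        Person(String name, Address adr1) { this.name = name; this.adr1 = adr1; }
    }

    // Flattened shape: name and resid side by side.
    static final class PersonFlatten {
        final String name;
        final String resid;
        PersonFlatten(String name, String resid) { this.name = name; this.resid = resid; }
    }

    // Lifts adr1.resid up one level; a null adr1 yields a null resid
    // instead of a NullPointerException, and a null element stays null.
    static PersonFlatten flatten(Person p) {
        if (p == null) {
            return null;
        }
        String resid = (p.adr1 == null) ? null : p.adr1.resid;
        return new PersonFlatten(p.name, resid);
    }

    // A null array (person is nullable in the schema) is passed through as null.
    static List<PersonFlatten> flattenAll(List<Person> persons) {
        if (persons == null) {
            return null;
        }
        return persons.stream().map(FlattenSketch::flatten).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> input = Arrays.asList(
            new Person("hello", new Address("1")),
            new Person("world", null));
        // Prints "hello -> 1" and then "world -> null".
        for (PersonFlatten p : flattenAll(input)) {
            System.out.println(p.name + " -> " + p.resid);
        }
    }
}
```

Whether a missing adr1 should become a null resid or cause the element to be dropped depends on the data; the sketch keeps the element so the array length is unchanged.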

Answer 1 (score: 2)

In Scala I would do the following; since the OP asked about Java, I offer it only as guidance.

Solution 1 - Memory-Heavy

case class Address(resid: String)
case class Person(name: String, adr1: Address)

val people = Seq(
  ("one", Array(Person("hello", Address("1")), Person("world", Address("2"))))
).toDF("_id", "persons")

// Requires spark.implicits._ in scope (automatic in spark-shell) for toDF and as[...].
people.as[(String, Array[Person])].map { case (_id, arr) =>
  (_id, arr.map { case Person(name, Address(resid)) => (name, resid) })
}

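However, this approach is quite memory-intensive: the internal binary rows are copied into JVM objects, which exposes the environment to OutOfMemoryErrors.

Solution 2 - Expensive but Language-Independent

Another query, slower but with lower memory requirements, uses the explode operator to destructure the array first, which gives easy access to the inner struct.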

val solution = people.
  select($"_id", explode($"persons") as "exploded"). // <-- that's expensive
  select("_id", "exploded.*"). // <-- this is the trick to access struct's fields
  select($"_id", $"name", $"adr1.resid").
  select($"_id", struct("name", "resid") as "person").
  groupBy("_id"). // <-- that's expensive
  agg(collect_list("person") as "persons")

scala> solution.printSchema
root
 |-- _id: string (nullable = true)
 |-- persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- resid: string (nullable = true)

The advantage of this solution is that it is almost independent of Scala or Java, so you can use it right away in whichever language you choose.
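
Since the question asks for the Java API, the explode-based query might translate to Java roughly as follows. This is a sketch rather than a tested port: the class ExplodeFlatten and its flatten method are hypothetical names, and the input is assumed to be a Dataset<Row> with the columns _id and persons used in this answer (the question's schema names the array person). It needs Spark SQL on the classpath.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical helper class; not part of Spark or of the original answers.
public final class ExplodeFlatten {

    // Assumes `people` has the columns _id (string) and persons (array of struct).
    public static Dataset<Row> flatten(Dataset<Row> people) {
        return people
            // One output row per array element (the expensive step).
            .select(col("_id"), explode(col("persons")).as("exploded"))
            // Pull out the wanted fields, lifting adr1.resid up one level.
            .select(
                col("_id"),
                col("exploded.name").as("name"),
                col("exploded.adr1.resid").as("resid"))
            // Rebuild the flattened struct.
            .select(col("_id"), struct(col("name"), col("resid")).as("person"))
            // Collect the structs back into one array per _id (also expensive).
            .groupBy(col("_id"))
            .agg(collect_list(col("person")).as("persons"));
    }
}
```

Two caveats apply to the Scala and Java forms alike: explode emits no rows for a null or empty array, so those _id values are missing from the result, and rows that share an _id are merged into a single array by the groupBy.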