This is my current schema:
|-- _id: string (nullable = true)
|-- person: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- adr1: struct (nullable = true)
| | | |-- resid: string (nullable = true)
This is what I want to obtain:
|-- _id: string (nullable = true)
|-- person: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- resid: string (nullable = true)
I am using the Java API.
Answer 0 (score: 3)
You can use the map transformation:
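import java.util.stream.Collectors;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
Encoder<PeopleFlatten> peopleFlattenEncoder = Encoders.bean(PeopleFlatten.class);
// Assumption: people is a typed Dataset<People>; the input bean name People is not given in the question.
// The MapFunction cast avoids an ambiguous-overload error that the bare lambda can trigger on some Spark builds.
people
.map((MapFunction<People, PeopleFlatten>) person -> new PeopleFlatten(
person.get_id(),
// Lift adr1.resid up next to name for every array element.
person.getPerson().stream().map(p ->
new PersonFlatten(
p.getName(),
p.getAdr1().getResid()
)
).collect(Collectors.toList())
),
peopleFlattenEncoder
);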
where PeopleFlatten and PersonFlatten are POJOs that correspond to the expected schema.
import java.io.Serializable;
import java.util.List;
public class PeopleFlatten implements Serializable {
private String _id;
private List<PersonFlatten> person;
// constructors, getters and setters
}
public class PersonFlatten implements Serializable {
private String name;
private String resid;
// constructors, getters and setters
}
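Since Encoders.bean works from JavaBean conventions, the elided constructors, getters and setters matter; as far as I know it also expects a public no-argument constructor so it can rebuild objects when reading rows back. Below is a minimal sketch of the output beans and the per-row flattening, written against plain Java collections so it runs without Spark. The input classes Person and Address, the wrapper class FlattenSketch and the helper flatten are assumptions for illustration and do not come from the answer. The sketch also guards against a null adr1, because the schema marks it nullable.

```java
import java.io.Serializable;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: Address, Person, FlattenSketch and flatten(...) are hypothetical
// names standing in for the input beans that the answer does not show.
class FlattenSketch {

    // Input side: mirrors the nested source schema (name, adr1.resid).
    static class Address implements Serializable {
        private String resid;
        public Address() {}
        public Address(String resid) { this.resid = resid; }
        public String getResid() { return resid; }
        public void setResid(String resid) { this.resid = resid; }
    }

    static class Person implements Serializable {
        private String name;
        private Address adr1;
        public Person() {}
        public Person(String name, Address adr1) { this.name = name; this.adr1 = adr1; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Address getAdr1() { return adr1; }
        public void setAdr1(Address adr1) { this.adr1 = adr1; }
    }

    // Output side: resid sits directly next to name.
    static class PersonFlatten implements Serializable {
        private String name;
        private String resid;
        public PersonFlatten() {}
        public PersonFlatten(String name, String resid) { this.name = name; this.resid = resid; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getResid() { return resid; }
        public void setResid(String resid) { this.resid = resid; }
    }

    static class PeopleFlatten implements Serializable {
        private String _id;
        private List<PersonFlatten> person;
        public PeopleFlatten() {}
        public PeopleFlatten(String _id, List<PersonFlatten> person) { this._id = _id; this.person = person; }
        public String get_id() { return _id; }
        public void set_id(String _id) { this._id = _id; }
        public List<PersonFlatten> getPerson() { return person; }
        public void setPerson(List<PersonFlatten> person) { this.person = person; }
    }

    // Same logic as the lambda passed to map, with a null check on adr1.
    static PeopleFlatten flatten(String id, List<Person> persons) {
        List<PersonFlatten> flat = persons.stream()
            .map(p -> new PersonFlatten(
                p.getName(),
                p.getAdr1() == null ? null : p.getAdr1().getResid()))
            .collect(Collectors.toList());
        return new PeopleFlatten(id, flat);
    }
}
```

For use with Encoders.bean the bean classes would normally be public top-level classes; they are nested here only to keep the sketch in one file.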
Answer 1 (score: 2)
In Scala I would do the following; since the OP asked about Java, I offer it only as guidance.
case class Address(resid: String)
case class Person(name: String, adr1: Address)
val people = Seq(
("one", Array(Person("hello", Address("1")), Person("world", Address("2"))))
).toDF("_id", "persons")
import org.apache.spark.sql.Row
people.as[(String, Array[Person])].map { case (_id, arr) =>
(_id, arr.map { case Person(name, Address(resid)) => (name, resid) })
}
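However, this approach is very memory-hungry: the internal binary rows are copied into their JVM objects, which exposes the environment to OutOfMemoryErrors.
Another query, with worse performance but lower memory requirements, can use the explode operator to destructure the array first, so that the inner structs become easy to access.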
val solution = people.
select($"_id", explode($"persons") as "exploded"). // <-- that's expensive
select("_id", "exploded.*"). // <-- this is the trick to access struct's fields
select($"_id", $"name", $"adr1.resid").
select($"_id", struct("name", "resid") as "person").
groupBy("_id"). // <-- that's expensive
agg(collect_list("person") as "persons")
scala> solution.printSchema
root
|-- _id: string (nullable = true)
|-- persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- resid: string (nullable = true)
The nice thing about this solution is that it is almost independent of Scala or Java, so you can use it right away whichever language you choose.
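To make the shape of that query concrete, here is a rough in-memory analogy in plain Java, not Spark code. The class ExplodeAnalogy, its Row type and both helpers are hypothetical, and each array element is assumed to be a two-item array holding name and adr1.resid. The first step produces one flat row per array element, as explode does; the second gathers the rows back per _id, as groupBy with collect_list does. In Spark both steps are the expensive ones marked in the query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: an in-memory analogy, all names here are hypothetical.
class ExplodeAnalogy {

    // One flat row per (id, array element), like the output of explode + select.
    static final class Row {
        final String id;
        final String name;
        final String resid;
        Row(String id, String name, String resid) {
            this.id = id;
            this.name = name;
            this.resid = resid;
        }
    }

    // "explode": turn each id with N elements into N flat rows.
    // Each element is assumed to be {name, resid}.
    static List<Row> explode(Map<String, List<String[]>> input) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> e : input.entrySet()) {
            for (String[] p : e.getValue()) {
                rows.add(new Row(e.getKey(), p[0], p[1]));
            }
        }
        return rows;
    }

    // "groupBy(_id).agg(collect_list(...))": gather the flat rows back per id.
    // The collected value is rendered as "name:resid" to keep the sketch short.
    static Map<String, List<String>> regroup(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            r -> r.id,
            LinkedHashMap::new,
            Collectors.mapping(r -> r.name + ":" + r.resid, Collectors.toList())));
    }
}
```

The analogy keeps element order because everything runs in one thread over ordered collections; a distributed groupBy gives no such guarantee by itself.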