我有2列的流式数据框。表示为String的键列和对象列,对象列是包含一个对象元素的数组。我希望能够使用相同的键合并Dataframe中的记录或行,以使合并的记录形成对象数组。
Dataframe
----------------------------------------------------------------
|key | objects |
----------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}] |
|abc | [{"name": "image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
----------------------------------------------------------------
Merged Dataframe
-------------------------------------------------------------------------
|key | objects |
-------------------------------------------------------------------------
|abc | [{"name": "file", "type": "sample", "code": "123"}, {"name":
"image", "type": "sample", "code": "456"}] |
|xyz | [{"name": "doc", "type": "sample", "code": "707"}] |
--------------------------------------------------------------------------
执行此操作的一个选项是将其转换为PairedRDD并应用reduceByKey函数,但如果可能的话,我希望对Dataframe进行此操作,因为它会更优化。有什么办法可以在不影响性能的情况下使用数据框来做到这一点?
答案 0 :(得分:0)
假设列objects
是单个JSON字符串的数组,那么您可以通过objects
合并key
:
import org.apache.spark.sql.functions._
case class Obj(name: String, `type`: String, code: String)
val df = Seq(
("abc", Obj("file", "sample", "123")),
("abc", Obj("image", "sample", "456")),
("xyz", Obj("doc", "sample", "707"))
).
toDF("key", "object").
select($"key", array(to_json($"object")).as("objects"))
df.show(false)
// +---+-----------------------------------------------+
// |key|objects |
// +---+-----------------------------------------------+
// |abc|[{"name":"file","type":"sample","code":"123"}] |
// |abc|[{"name":"image","type":"sample","code":"456"}]|
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// +---+-----------------------------------------------+
df.groupBy($"key").agg(collect_list($"objects"(0)).as("objects")).
show(false)
// +---+---------------------------------------------------------------------------------------------+
// |key|objects |
// +---+---------------------------------------------------------------------------------------------+
// |xyz|[{"name":"doc","type":"sample","code":"707"}] |
// |abc|[{"name":"file","type":"sample","code":"123"}, {"name":"image","type":"sample","code":"456"}]|
// +---+---------------------------------------------------------------------------------------------+