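I'm looking for a middle ground between GROUP BY and ORDER BY.
I want to group rows by Key. However, I can't use an aggregate function such as ARRAY_AGG to collect each group into a single row, for a simple reason: some of the larger groups contain millions of records, which causes serious memory problems. The database isn't built for rows that large.
The alternative would be to sort the whole table by Key. However, I'm using Presto through AWS Athena, and Presto needs to gather all records on a single machine in order to sort them. The data is too large to fit in the memory of a single node, so that isn't an option either.
I need something in between: keep rows with the same key next to each other, without aggregating them and without strictly sorting by key. That way the grouping can be done in a distributed fashion, and the result doesn't suffer from oversized rows.
Example:
Sample data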
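One way the "in between" idea could be sketched is with hash bucketing: rows are distributed by a hash of the key, so every row with a given key lands in the same bucket file without any global sort. The following is only a sketch, assuming Athena's CTAS bucketing properties (`bucketed_by`, `bucket_count`) are available in your setup. The table name `authors_bucketed`, the S3 location, the bucket count, and the column aliases are all hypothetical; `main` and `authors` are the table and array column from the queries in this post.

```sql
-- Hypothetical table name, S3 location, and bucket count; adjust to your setup.
CREATE TABLE authors_bucketed
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/authors_bucketed/',
    bucketed_by = ARRAY['surname_key'],   -- rows are hashed on this column
    bucket_count = 64
) AS
SELECT
    LOWER(author.surname) AS surname_key, -- the grouping key
    _id                   AS paper_id,    -- aliased to avoid clashing with author._id
    title,
    author.name           AS author_name,
    author.firstnames     AS firstnames,
    author.surname        AS surname
FROM main
CROSS JOIN UNNEST(authors) AS t(author);
```

Note that this only guarantees that equal keys share a bucket, not that they are adjacent within it. Each bucket is much smaller than the full table, though, so it could then be sorted or post-processed on its own.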
Key | Data
-----------------------------------------------------------------------
London | ["N J M London",null,null,"N J M","London",["United States"],null]
Moreno De Vega | ["V Moreno De Vega",null,null,"V","Moreno De Vega",null,null]
Paspatis | ["G A Paspatis",null,null,"G A","Paspatis",null,null]
Macdonald | ["A J Macdonald",null,null,"A J","Macdonald",null,null]
Masterman | ["J Masterman",null,null,"J","Masterman",null,null]
Nørager | ["C B Nørager",null,null,"C B","Nørager",null,null]
Maggiori | ["L Maggiori",null,null,"L","Maggiori",null,null]
Nocito | ["A Nocito",null,null,"A","Nocito",null,null]
Díaz Nieto | ["R Díaz Nieto",null,null,"R","Díaz Nieto",null,null]
Christoforidis | ["D Christoforidis",null,null,"D","Christoforidis",null,null]
Sillen | ["Ulla Sillen",null,null,"Ulla","Sillen",null,null]
Riew | ["K Daniel Riew",null,null,"K Daniel","Riew",null,null]
Matsumine | ["Hajime Matsumine",null,null,"Hajime","Matsumine",["United States"],null]
Taylor | ["Fraser Taylor",null,null,"Fraser","Taylor",null,null]
Buser | ["Aalen Gerd Buser",null,null,"Aalen Gerd","Buser",null,["Klinische Monatsblätter für Augenheilkunde, Artemis Zentren, Dillenburg,
Sample grouped by key; this is what I want
Key | Data
-----------------------------------------------------------------------
London | ["N J M London",null,null,"N J M","London",["United States"],null]
London | ["John London",null,null,"John","London",["Austria"],null]
Moreno De Vega | ["V Moreno De Vega",null,null,"V","Moreno De Vega",null,null]
Moreno De Vega | ["Victoria Moreno De Vega",null,null,"Victoria","Moreno De Vega",null,null]
Moreno De Vega | ["V. Moreno De Vega",null,null,"V.","Moreno De Vega",null,null]
Paspatis | ["G A Paspatis",null,null,"G A","Paspatis",null,null]
Macdonald | ["A J Macdonald",null,null,"A J","Macdonald",null,null]
Masterman | ["J Masterman",null,null,"J","Masterman",null,null]
Masterman | ["James Masterman",null,null,"James","Masterman",null,null]
Edit:
My current attempts.
Using GROUP BY and ARRAY_AGG. This is too heavy:
SELECT
CAST(array_agg(_id) AS JSON) AS paper_id,
CAST(array_agg(date) AS JSON) AS date,
CAST(array_agg(title) AS JSON) AS title,
CAST(array_agg(abstract) AS JSON) AS abstract,
CAST(array_agg(keywords) AS JSON) AS keywords,
CAST(array_agg(authors) AS JSON) AS authors,
CAST(array_agg(author.name) AS JSON) AS name,
CAST(array_agg(author._id) AS JSON) AS _id,
CAST(array_agg(author.ids) AS JSON) AS ids,
CAST(array_agg(author.firstnames) AS JSON) AS firstnames,
CAST(array_agg(author.surname) AS JSON) AS surname,
CAST(array_agg(author.countries) AS JSON) AS countries
FROM main
CROSS JOIN UNNEST(authors) as t(author)
GROUP BY LOWER(author.surname)
Using ORDER BY. Again, too heavy:
SELECT
_id,
date,
title,
abstract,
keywords,
authors,
author.name,
author._id,
author.ids,
author.firstnames,
author.surname,
author.countries
FROM main
CROSS JOIN UNNEST(authors) as t(author)
ORDER BY LOWER(author.surname)
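Answer 0 (score: 2)
I don't have a good answer for how to accomplish what you want. You have correctly identified the limitation: Presto sorts all records on a single machine. It's worth mentioning that the Starburst team's work on implementing distributed sort in Presto is close to done. Once Athena merges and picks up this feature, you can try the approach you mentioned.
If you want to try it sooner by deploying Presto with the patch applied, here is the pull request: https://github.com/prestodb/presto/pull/9854
Disclaimer: I work for Starburst, but I'm answering this post as a Presto enthusiast.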