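SQL: ordering rows by group

Time: 2018-02-15 10:46:13

Tags: sql amazon-athena

I am looking for something in between GROUP BY and ORDER BY.

I want to group rows by Key. However, I cannot use an aggregate function such as ARRAY_AGG to collect each group into a single row, for a simple reason: some of the larger groups contain millions of records, which causes severe memory problems, and the database is not built for rows that large.

The alternative is to sort the whole dataset by Key. However, I am using Presto through AWS Athena, and Presto needs to gather all records on one machine in order to sort them. The data is too large to fit in a single node's memory, so that is not an option either.

I need something in between: keep rows with the same key next to each other, without aggregating them and without strictly sorting by key. That way the grouping can be done in a distributed fashion, and the result has no problem with huge rows.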
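
To make the requirement concrete, here is a minimal sketch in Python rather than Presto SQL. It only illustrates the idea and is not something Athena provides: rows are routed to buckets by a hash of the lowercased key, and each bucket is sorted independently, so rows with the same key end up adjacent without any global sort. The function name group_adjacent, the num_buckets parameter, and the use of zlib.crc32 as the hash are all assumptions made for this example.

```python
import zlib
from collections import defaultdict


def group_adjacent(rows, num_buckets=4):
    """Yield (key, data) rows so that rows sharing a key come out adjacent.

    Hypothetical helper for illustration only; not a Presto or Athena API.
    """
    buckets = defaultdict(list)

    # Step 1: route each row to a bucket by hashing its normalized key.
    # Rows with the same key always land in the same bucket.
    for key, data in rows:
        norm = key.lower()
        bucket_id = zlib.crc32(norm.encode("utf-8")) % num_buckets
        buckets[bucket_id].append((norm, key, data))

    # Step 2: sort each bucket on its own. No bucket needs to see the
    # rows of any other bucket, so this step could run on separate nodes.
    for bucket_id in sorted(buckets):
        for norm, key, data in sorted(buckets[bucket_id], key=lambda r: r[0]):
            yield key, data


if __name__ == "__main__":
    sample = [
        ("London", "N J M London"),
        ("Masterman", "J Masterman"),
        ("London", "John London"),
        ("Moreno De Vega", "V Moreno De Vega"),
        ("Masterman", "James Masterman"),
    ]
    for key, data in group_adjacent(sample):
        print(f"{key:<15}| {data}")
```

Across buckets the output is in no particular key order; only the adjacency of equal keys is guaranteed, which is all I need.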

Example:

Sample data

Key            | Data
-----------------------------------------------------------------------
London         | ["N J M London",null,null,"N J M","London",["United States"],null]
Moreno De Vega | ["V Moreno De Vega",null,null,"V","Moreno De Vega",null,null]
Paspatis       | ["G A Paspatis",null,null,"G A","Paspatis",null,null]
Macdonald      | ["A J Macdonald",null,null,"A J","Macdonald",null,null]
Masterman      | ["J Masterman",null,null,"J","Masterman",null,null]
Nørager        | ["C B Nørager",null,null,"C B","Nørager",null,null]
Maggiori       | ["L Maggiori",null,null,"L","Maggiori",null,null]
Nocito         | ["A Nocito",null,null,"A","Nocito",null,null]
Díaz Nieto     | ["R Díaz Nieto",null,null,"R","Díaz Nieto",null,null]
Christoforidis | ["D Christoforidis",null,null,"D","Christoforidis",null,null]
Sillen         | ["Ulla Sillen",null,null,"Ulla","Sillen",null,null]
Riew           | ["K Daniel Riew",null,null,"K Daniel","Riew",null,null]
Matsumine      | ["Hajime Matsumine",null,null,"Hajime","Matsumine",["United States"],null]
Taylor         | ["Fraser Taylor",null,null,"Fraser","Taylor",null,null]
Buser          | ["Aalen Gerd Buser",null,null,"Aalen Gerd","Buser",null,["Klinische Monatsblätter für Augenheilkunde, Artemis Zentren, Dillenburg, 

Sample grouped by key, which is what I want

Key            | Data
-----------------------------------------------------------------------
London         | ["N J M London",null,null,"N J M","London",["United States"],null]
London         | ["John London",null,null,"John","London",["Austria"],null]
Moreno De Vega | ["V Moreno De Vega",null,null,"V","Moreno De Vega",null,null]
Moreno De Vega | ["Victoria Moreno De Vega",null,null,"Victoria","Moreno De Vega",null,null]
Moreno De Vega | ["V. Moreno De Vega",null,null,"V.","Moreno De Vega",null,null]
Paspatis       | ["G A Paspatis",null,null,"G A","Paspatis",null,null]
Macdonald      | ["A J Macdonald",null,null,"A J","Macdonald",null,null]
Masterman      | ["J Masterman",null,null,"J","Masterman",null,null]
Masterman      | ["James Masterman",null,null,"James","Masterman",null,null]

Edit:

My current attempts.

Using GROUP BY and ARRAY_AGG. This is too heavy:

SELECT
  CAST(array_agg(_id) AS JSON) AS paper_id,
  CAST(array_agg(date) AS JSON) AS date,
  CAST(array_agg(title) AS JSON) AS title,
  CAST(array_agg(abstract) AS JSON) AS abstract,
  CAST(array_agg(keywords) AS JSON) AS keywords,
  CAST(array_agg(authors) AS JSON) AS authors,
  CAST(array_agg(author.name) AS JSON) AS name,
  CAST(array_agg(author._id) AS JSON) AS _id,
  CAST(array_agg(author.ids) AS JSON) AS ids,
  CAST(array_agg(author.firstnames) AS JSON) AS firstnames,
  CAST(array_agg(author.surname) AS JSON) AS surname,
  CAST(array_agg(author.countries) AS JSON) AS countries
FROM main
CROSS JOIN UNNEST(authors) as t(author)
GROUP BY LOWER(author.surname)

Using ORDER BY, which is again too heavy:

SELECT
  _id,
  date,
  title,
  abstract,
  keywords,
  authors,
  author.name,
  author._id,
  author.ids,
  author.firstnames,
  author.surname,
  author.countries
FROM main
CROSS JOIN UNNEST(authors) as t(author)
ORDER BY LOWER(author.surname)

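1 answer:

Answer 0 (score: 2)

I don't have a great answer for how to accomplish what you want. You correctly identified the limitation: Presto sorts all records on a single machine. It is worth mentioning that the Starburst team is near the end of its work on implementing distributed sort in Presto. Once that is merged and picked up by Athena, you could try the ORDER BY approach you mentioned.

If you want to try it sooner by deploying Presto with the patch applied, here is the pull request: https://github.com/prestodb/presto/pull/9854

Disclaimer: I work for Starburst, but I am answering this post as a Presto enthusiast.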