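I'm using AWS Athena and trying to merge all rows whose `text` values are within a levenshtein_distance of less than 5 of each other, summing their normalized percentages.
The table has the following structure: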
CREATE EXTERNAL TABLE `actions`(
`id` string COMMENT 'from deserializer',
`text` string COMMENT 'from deserializer',
`normalizedpercentage` float COMMENT 'from deserializer',
`timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://xxxxxx/db/actions'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1566991410')
This is what I would like to do:
WITH t AS
(SELECT id,
text,
normalizedPercentage
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT *,
SUM(normalizedPercentage)
OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t
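Unfortunately, the PARTITION BY clause only accepts column names.
I was thinking of defining a function and using it to iterate over all the rows, but that doesn't seem to be possible in Presto.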
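Answer 0 (score: 0)
You can compute a new column from your function inside the temporary table (the WITH clause), then use that column for partitioning in the main query: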
WITH t AS
(SELECT id,
text,
normalizedPercentage,
CASE WHEN levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5 THEN 'groupA' ELSE 'groupB' END AS classification
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT *,
SUM(normalizedPercentage)
OVER (PARTITION BY classification ) AS cumulative
FROM t
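The query above still leaves EVERY_OTHER_TEXT_COLUMN as a placeholder. One possible way to supply it is a self-join of the filtered rows, so that each row is compared against every other row's `text`. The sketch below is only an illustration under a few assumptions that are not stated in the question: that `id` is unique per row, that the aliases `a` and `b` are free to use, and that the filtered set is small enough for a pairwise comparison (the join is quadratic in the number of rows).

```sql
-- Sketch: per-row sum over all rows whose text is within distance < 5.
-- Assumes id is unique; aliases a and b are illustrative only.
WITH t AS
(SELECT id,
text,
normalizedPercentage
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT a.id,
a.text,
a.normalizedPercentage,
-- each row also matches itself (distance 0), so its own value is included
SUM(b.normalizedPercentage) AS cumulative
FROM t a
JOIN t b
ON levenshtein_distance(a.text, b.text) < 5
GROUP BY a.id, a.text, a.normalizedPercentage
```

Note that closeness under Levenshtein distance is not transitive: A can be near B and B near C while A is far from C. The rows therefore do not necessarily fall into disjoint groups, and this sketch gives each row the sum over its own neighbourhood rather than a single shared partition total.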