How to merge rows whose column values are similar by Levenshtein distance

Time: 2019-09-02 12:08:26

Tags: sql presto amazon-athena

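I am using AWS Athena, and I am trying to merge all rows whose value in a particular column has a levenshtein_distance of less than 5 from one another, and to sum their normalized percentages.

The table has the following structure: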

CREATE EXTERNAL TABLE `actions`(
  `id` string COMMENT 'from deserializer', 
  `text` string COMMENT 'from deserializer',
  `normalizedpercentage` float COMMENT 'from deserializer', 
  `timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://xxxxxx/db/actions'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1566991410')

This is what I would like to do:

WITH t AS 
    (SELECT id,
         text,
         normalizedPercentage
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59' )
SELECT *,
         SUM(normalizedPercentage)
    OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t

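Unfortunately, the PARTITION BY clause only accepts column names.

I was thinking of defining a function and using it to iterate over all the rows, but that does not seem to be possible in Presto.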

1 Answer:

Answer 0: (score: 0)

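You can compute a new column from your function inside the temporary table (here, the CTE t), and then partition by that column in the main query: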

WITH t AS 
    (SELECT id,
         text,
         normalizedPercentage,
         CASE
             WHEN levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5 THEN 'groupA'
             ELSE 'groupB'
         END AS classification
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59' )
SELECT *,
         SUM(normalizedPercentage)
    OVER (PARTITION BY classification) AS cumulative
FROM t
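
Note that EVERY_OTHER_TEXT_COLUMN is the placeholder from the question, not a real column, so the query above still needs some way to compare each row with every other row. One possible way to do that is a self-join on the distance condition. The following is only a minimal sketch, under the assumption that id uniquely identifies a row; the aliases a and b and the join shape are illustrative choices, not something stated in the question:

```sql
-- Sketch only: assumes `id` is unique per row; `a` and `b` are illustrative aliases.
WITH t AS (
    SELECT id,
           text,
           normalizedpercentage
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59'
)
SELECT a.id,
       a.text,
       a.normalizedpercentage,
       -- Sum over every row b whose text is within distance 5 of a.text.
       -- Each row also matches itself (distance 0), so its own value is included.
       SUM(b.normalizedpercentage) AS cumulative
FROM t a
JOIN t b
    ON levenshtein_distance(a.text, b.text) < 5
GROUP BY a.id, a.text, a.normalizedpercentage
```

Two caveats apply to this sketch. The join compares every pair of rows in t, so the cost grows quadratically with the number of rows in the selected time range. Also, "distance less than 5" is not transitive, so the result is a per-row sum over that row's neighbours rather than a partition of the rows into disjoint groups.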