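I have a list of unique strings (the original idea was the column names of a table). The task is to abbreviate the entries as much as possible while keeping the list distinct.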
For example, AAA, AB can be abbreviated to AA, AB (but not to A, AB, because A is a prefix of both AAA and AB).
AAAA, BAAAA can be shortened to A, B.
But A1, A2 cannot be abbreviated at all.
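To make the rule concrete, here is a small sketch (not part of the original question) that counts how many names each candidate abbreviation could stand for. The two-row names list and the candidates list are made up for illustration; a candidate is only usable when exactly one name starts with it.

```sql
-- Illustrative only: "names" and "candidates" are hypothetical inline lists.
with names(col) as (
  select 'AAA' from dual union all
  select 'AB'  from dual
)
, candidates(abbr) as (
  select 'A'  from dual union all
  select 'AA' from dual
)
select c.abbr
     , count(*) as matching_names   -- number of names that start with abbr
from candidates c
join names n on n.col like c.abbr || '%'
group by c.abbr
order by c.abbr;
```

Here A matches both names and is therefore rejected, while AA matches only AAA.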
Here is the sample data:
create table tab as
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;
The expected result is:
COL ABR_COL
------ ------------------------
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
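I managed a brute-force solution consisting of four subqueries. I am deliberately not posting it, because I hope a simpler solution exists and I do not want to distract from it.
There is a similar function in R called abbreviate, but I am looking for an SQL solution. Oracle is preferred, but solutions for other RDBMS are welcome.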
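Answer 0 (score: 3)
This is actually doable with a recursive CTE. I did not really get it any shorter than three subqueries (plus the final query), but at least it is not limited by the string length. The steps are roughly as follows: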
All potential abbreviations, i.e. every prefix of every column name:
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
How many columns share each abbreviation:
ABBR CONFLICT
---- --------
AA 3
AAA 2
AABA 1
...
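AAA conflicts with some other abbreviations, but it still has to be selected because it equals its own unabbreviated name. The conflict-free abbreviations (plus the full names), ranked per column: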
COL ABBR CONFLICT POS
-------------------------------
AAA AAA 2 1
AAAAAA AAAA 1 1
AAAAAA AAAAA 1 2
AAAAAA AAAAAA 1 3
AABA AAB 1 1
...
Only the first-ranked abbreviation per column:
COL ABBR POS
-------------------
AAA AAA 1
AAAAAA AAAA 1
AABA AAB 1
...
This results in the following SQL, with the steps above as CTEs:
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations
where length(abbr) > 1
)
, abbreviation_counts as (
select abbr
, count(*) as conflict
from potential_abbreviations
group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
p.col
, p.abbr
, conflict
, rank() over (partition by col order by p.abbr) as pos
from potential_abbreviations p
join abbreviation_counts c on p.abbr = c.abbr
where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
order by col, abbr
Result
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
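The rank() in all_unique_abbreviations orders by the abbreviation text, which relies on a prefix sorting before any longer string that starts with it. As a hedged sketch, the query below compares that ordering with an explicit length(abbr) ordering on three hypothetical prefixes of AAAAAA. Under the default binary sort I would expect the two rank columns to agree, in which case ordering by length is simply a more explicit way to say "shortest first".

```sql
-- Illustrative only: three prefixes of AAAAAA as an inline list.
select abbr
     , rank() over (order by abbr)         as pos_by_text    -- ordering used in the answer
     , rank() over (order by length(abbr)) as pos_by_length  -- explicit "shortest first"
from (
  select 'AAAA' as abbr from dual union all
  select 'AAAAAA'       from dual union all
  select 'AAAAA'        from dual
)
order by abbr;
```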
Answer 1 (score: 3)
I would do the filtering inside the recursive CTE:
with potential_abbreviations(col, abbr, lev) as (
select col, col as abbr, 1 as lev
from tab
union all
select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
from potential_abbreviations pa
where length(abbr) > 1 and
not exists (select 1
from tab
where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
tab.col <> pa.col
)
)
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
from potential_abbreviations pa
) pa
where seqnum = 1
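Here is a db<>fiddle.
lev is not strictly needed. You could order by length(abbr) instead (ascending, because the row with the highest lev is the one with the shortest abbreviation). However, when I use recursive CTEs I usually include a recursion counter, so it is a habit.
The extra comparison in the CTE may look more complicated, but it simplifies execution: the recursion stops at the correct value.
This was also tested on unique single-letter col values.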
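As a sketch of the variant without the counter: lev grows by one each time abbr loses a character, so the row with the highest lev is also the row with the shortest abbr. Picking the row with the smallest length(abbr) should therefore select the same abbreviation. This is an adaptation of the query above, not the answerer's posted code, and it assumes the same tab table.

```sql
with potential_abbreviations(col, abbr) as (
  select col, col as abbr
  from tab
  union all
  -- drop one trailing character, but only while no other name shares the shorter prefix
  select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr
  from potential_abbreviations pa
  where length(pa.abbr) > 1 and
        not exists (select 1
                    from tab
                    where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                          tab.col <> pa.col
                   )
)
select pa.col, pa.abbr
from (select pa.*
           , row_number() over (partition by pa.col
                                order by length(pa.abbr)) as seqnum  -- shortest first
      from potential_abbreviations pa
     ) pa
where seqnum = 1
order by pa.col;
```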
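Answer 2 (score: 1)
I found a second approach. I did not add it to my first answer because it is shorter and works differently. The steps are as follows: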
SQL
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
Result
col abbr
--- -------
AAA AAA
AAA AA
AAA A
...
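The min() aggregate is immaterial here: having count(*) = 1 leaves exactly one row per group, so min(col) simply returns that row's col.
SQL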
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
Result
ABBR CONFLICTS BEST_CANDIDATE
------- --------- ---------------
AAAA 1 AAAAAA
AAAAA 1 AAAAAA
AAAAAA 1 AAAAAA
AAB 1 AABA
AABA 1 AABA
...
SQL
select
p.col as col
, nvl(min(c.abbr), p.col) as abbr
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col
order by col, abbr
Complete query
with potential_abbreviations(col,abbr) as (
select
col
, col as abbr
from tab
union all
select
col
, substr(abbr, 1, length(abbr)-1 ) as abbr
from potential_abbreviations a
where length(abbr) > 1
)
, conflict_free as (
select
abbr
, count(*) as conflicts
, min(col) as best_candidate
from potential_abbreviations
group by abbr
having count(*) = 1
)
select
p.col as col
-- , c.best_candidate
, nvl(min(c.abbr), p.col) as abbr
-- , min(c.abbr) over (partition by c.best_candidate) shortest
from potential_abbreviations p
left join conflict_free c on p.col = c.best_candidate
where c.conflicts = 1 or p.abbr = p.col
group by p.col, c.best_candidate
order by col, abbr
Result
COL ABBR
------- ----
AAA AAA
AAAAAA AAAA
AABA AAB
AB AB
AC1 AC
AD AD
BAAAA BA
BBAA BB
COL1 COL1
COL21 COL2
Note: in PostgreSQL the recursive CTE must be introduced with with recursive, whereas Oracle does not accept the word recursive at all.
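For illustration, here is how the second approach might look in PostgreSQL. This is an adapted sketch rather than the answerer's tested code. Besides the recursive keyword, it makes three changes of my own: a values list replaces from dual, coalesce replaces Oracle's nvl, and the final query joins from tab directly instead of from potential_abbreviations.

```sql
-- Sample data from the question, written without Oracle's "dual".
create table tab as
select col
from (values ('AAA'), ('AABA'), ('COL1'), ('COL21'),
             ('AAAAAA'), ('BBAA'), ('BAAAA'), ('AB')) as v(col);

with recursive potential_abbreviations(col, abbr) as (
  select col, col as abbr
  from tab
  union all
  -- every shorter prefix of each name, one character at a time
  select col, substr(abbr, 1, length(abbr) - 1) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
, conflict_free as (
  -- prefixes that belong to exactly one name
  select abbr, min(col) as best_candidate
  from potential_abbreviations
  group by abbr
  having count(*) = 1
)
select t.col
     , coalesce(min(c.abbr), t.col) as abbr  -- fall back to the full name
from tab t
left join conflict_free c on t.col = c.best_candidate
group by t.col
order by t.col;
```

As in the original, min(c.abbr) relies on a prefix sorting before its longer extensions, which depends on the collation in use.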