Oracle SQL: repeated words in a string

Date: 2017-01-12 14:51:26

Tags: sql oracle

I need your advice/input on the following task. I have the following table:

ID      ID_NAME                             
------ ---------------------------------   
1       TOM HANKS TOM JR                    
2       PETER PATROL PETER JOHN PETER       
3       SAM LIVING                          
4       JOHNSON & JOHNSON INC               
5       DUHGT LLC                              
6       THE POST OF THE OFFICE              
7       TURNING REP WEST                    
8       GEORGE JOHN                         

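I need a SQL query that finds the repeated words for each ID and, if any exist, returns the repeat count. For example, in ID 2 the word PETER occurs 3 times, and in ID 1 the word TOM occurs twice. So I need output like this: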

ID      ID_NAME                             COUNT
------ ---------------------------------    --------
1       TOM HANKS TOM JR                    2
2       PETER PATROL PETER JOHN PETER       3
3       SAM LIVING                          0
4       JOHNSON & JOHNSON INC               2
5       DUHGT LLC                           0    
6       THE POST OF THE OFFICE              2
7       TURNING REP WEST                    0
8       GEORGE JOHN                         0

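Just an FYI: the table has 560K rows.

I tried the query below, but it did not work. It counts every word across the whole table instead of repeats within each row.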

SELECT RESULT, COUNT(*)
    FROM (SELECT
            REGEXP_SUBSTR(COL_NAME, '[^ ]+', 1, COLUMN_VALUE) RESULT
          FROM TABLE_NAME T ,
               TABLE(CAST(MULTISET(SELECT DISTINCT LEVEL
                                   FROM TABLE_NAME X
              CONNECT BY LEVEL <= LENGTH(X.COL_NAME) - LENGTH(REPLACE(X.COL_NAME, ' ', '')) + 1
                                  ) AS SYS.ODCINUMBERLIST)) T1
          )
   WHERE RESULT IS NOT NULL
   GROUP BY RESULT
   ORDER BY 1;

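Please let me know your thoughts.

2 answers:

Answer 0: (score: 1)

The query below counts repeated words and returns the highest count (if one word appears three times and another appears twice, the result is 3). It treats JOHN as different from John; if case should not matter, wrap the input string in UPPER(...). It treats only the space as a word separator; if other characters, such as the dash, should also count as separators, add them to the REGEXP search pattern. Make sure the dash goes at the end of the character list inside the square brackets, which is the usual "trick" for matching a literal dash in a character list. More generally, adapt as needed.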

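The query begins by splitting each input string into individual words and counting how many times each word appears. For the count, I only need the words ("tokens") in the GROUP BY clause; I don't need to actually SELECT them, which is why the innermost query may look odd if you are not forewarned. (Now you are!)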
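To see the splitting step in isolation, here is a minimal sketch that runs it on a single literal string. The alias s and the column names token and cnt are illustrative and do not appear in the original query. With a single row from dual, the PRIOR conditions used in the full query are not needed.

```sql
-- Split one string on spaces and count how often each token occurs.
-- The alias s and the names token/cnt are illustrative assumptions.
select regexp_substr(s, '[^ ]+', 1, level) as token,
       count(*)                            as cnt
from   (select 'PETER PATROL PETER JOHN PETER' as s from dual)
connect by level <= 1 + regexp_count(s, ' ')    -- one level per word
group by regexp_substr(s, '[^ ]+', 1, level)
order by cnt desc, token;
```

For this input it should return PETER with a count of 3, and JOHN and PATROL with a count of 1 each. The full query applies the same idea to every row and then keeps only the maximum count.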

You also seem to want null rather than 1 when there are no repeated words, so I wrote the query to accommodate that. (Not sure why 1 isn't OK.)

with
     test_data ( id, id_name ) as (
       select 1, 'TOM HANKS TOM JR'              from dual union all
       select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
       select 3, 'SAM LIVING'                    from dual union all
       select 4, 'JOHNSON & JOHNSON INC'         from dual union all
       select 5, 'DUHGT LLC'                     from dual union all
       select 6, 'THE POST OF THE OFFICE'        from dual union all
       select 7, 'TURNING REP WEST'              from dual union all
       select 8, 'GEORGE JOHN'                   from dual
     )
--  end of test data; SQL query begins below this line
select id, id_name, case when max(cnt) >= 2 then max(cnt) end as max_count
from (
       select id, id_name, count(*) as cnt
       from   test_data
       connect by level <= 1 + regexp_count(id_name, ' ')
              and prior id = id
              and prior sys_guid() is not null
              group by id, id_name, regexp_substr(id_name, '[^ ]+', 1, level)
     )
group by id, id_name
order by id        -- if needed
;

Output

ID ID_NAME                        MAX_COUNT
-- ----------------------------- ----------
 1 TOM HANKS TOM JR                       2
 2 PETER PATROL PETER JOHN PETER          3
 3 SAM LIVING
 4 JOHNSON & JOHNSON INC                  2
 5 DUHGT LLC
 6 THE POST OF THE OFFICE                 2
 7 TURNING REP WEST
 8 GEORGE JOHN

8 rows selected.
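The expected output in the question shows 0, not null, for rows without repeats. A minimal variant, assuming that is what is wanted and that case should be ignored, adds an ELSE branch and wraps the string in UPPER(...) as suggested above. The two-row test set is made up for illustration.

```sql
with
     test_data ( id, id_name ) as (
       select 1, 'Tom HANKS TOM JR' from dual union all   -- mixed case on purpose
       select 3, 'SAM LIVING'       from dual
     )
select id, id_name,
       case when max(cnt) >= 2 then max(cnt) else 0 end as max_count  -- 0 instead of null
from (
       select id, id_name, count(*) as cnt
       from   test_data
       connect by level <= 1 + regexp_count(id_name, ' ')
              and prior id = id
              and prior sys_guid() is not null
       -- upper(...) makes Tom and TOM fall into the same group
       group by id, id_name, regexp_substr(upper(id_name), '[^ ]+', 1, level)
     )
group by id, id_name
order by id;
```

This should report 2 for ID 1 and 0 for ID 3.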

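Edit

If you only need to return the rows where the string column has at least one repeated word, and you don't care what the highest "repeat count" is or how many words are repeated, the solution is simpler and much more efficient; you don't need to split the input strings into component words and count them.

(After a long conversation, it was stated in the comments that this would be sufficient.)

In the solution, the "match pattern" in regexp_like searches for a string of letters that is preceded by the beginning of the string, a space or a dash, and followed by a space, comma, period, question mark, exclamation mark or dash. The two "markers" for the beginning and end of a word can be modified as needed. Make sure the dash is the first or last character inside [...]; in any other position it has a special meaning.

Then it looks for a repetition of that word. That is what \2 does in the match pattern. It is 2 and not 1 because the "word" is in the second pair of parentheses; I needed the first pair for the alternation, namely start of string OR (space or dash).

Look at the first and last strings for special cases that this query covers correctly. Think about any other possible cases that the query may or may not cover.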

with
     test_data ( id, id_name ) as (
       select 1, 'TOM HANKS TOM-ALAN'            from dual union all
       select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
       select 3, 'SAM LIVING'                    from dual union all
       select 4, 'JOHNSON & JOHNSON INC'         from dual union all
       select 5, 'DUHGT LLC'                     from dual union all
       select 6, 'THE POST OF THE OFFICE'        from dual union all
       select 7, 'TURNING REP WEST'              from dual union all
       select 8, 'GEORGE JOHN-JOHN'              from dual
     )
--  end of test data; SQL query begins below this line
select id, id_name
from   test_data
where  regexp_like(id_name, '(^|[ -])([[:alpha:]]+)[ ,.?!-].*\2')
order by id   --   if needed
;

ID  ID_NAME
--  -----------------------------
 1  TOM HANKS TOM-ALAN
 2  PETER PATROL PETER JOHN PETER
 4  JOHNSON & JOHNSON INC
 6  THE POST OF THE OFFICE
 8  GEORGE JOHN-JOHN
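One case worth checking is a later word that merely contains the first one, because \2 in the pattern above is not tied to word boundaries. The sketch below compares the original pattern with a stricter variant that also requires a separator (or the end of the string) around the repetition. The stricter pattern, the samples data and the column names are assumptions for illustration, not part of the original answer.

```sql
with
     samples ( s ) as (
       select 'TOM HANKS TOM-ALAN' from dual union all
       select 'TOM HANKS TOMMY'    from dual union all   -- TOM only as a prefix of TOMMY
       select 'GEORGE JOHN-JOHN'   from dual
     )
select s,
       -- pattern from the query above: \2 may match inside a longer word
       case when regexp_like(s, '(^|[ -])([[:alpha:]]+)[ ,.?!-].*\2')
            then 'Y' else 'N' end as original_pattern,
       -- assumed stricter variant: the repetition must be delimited on both sides
       case when regexp_like(s, '(^|[ -])([[:alpha:]]+)[ ,.?!-](.*[ ,.?!-])?\2([ ,.?!-]|$)')
            then 'Y' else 'N' end as whole_word_pattern
from   samples;
```

Both patterns should flag the first and third strings; only the original pattern should flag 'TOM HANKS TOMMY'. Whether that row counts as a repeat depends on the requirement.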

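Answer 1: (score: 0)

The next solution finds the first repeated word and then, in a second step, counts how often it repeats. Just edited to fix extra sub-word results.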

with s (ID, ID_NAME) as (
select 1, 'TOM HANKS TOM JR' from dual union all    
select 10, 'TO TOM TOM TOM TOM TO TO TO STOM HANKS TOM TOMMY' from dual union all    
select 2, 'PETER PATROL PETER JOHN PETER' from dual union all
select 3,  'SAM LIVING' from dual union all
select 4,  'qwe JOHNSON & JOHNSON INC' from dual union all
select 5,  'DUHGT LLC' from dual union all
select 6,  'THE POST OF THE OFFICE ' from dual union all
select 7,  'TURNING REP WEST ' from dual union all
select 8,  'GEORGE JOHN ' from dual)
select id,
       case when r1 = 0 then 0
            else regexp_count(id_name, r3)
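               - regexp_count(id_name, r3||'\w+')  -- exclude words where r3 is followed by more word characters (tail)
               - regexp_count(id_name, '\w+'||r3)  -- exclude words where r3 is preceded by word characters (head)
               + regexp_count(id_name, '\w+'||r3||'\w+') -- add back words with both head and tail, which were subtracted twice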
       end as rep_count
       from (
select
s.*,
regexp_instr(s.id_name, '(^|\s)(\w+)(\s|$)(.*(\2))+') as r1 ,
regexp_replace(s.id_name, '.*?(^|\s)(\w+)(\s)(.*(\s)\2(\s|$))+.*$', '\2') as r3
from s);

The result is

    ID  REP_COUNT
---------- ----------
     1      2
    10      4
     2      3
     3      0
     4      2
     5      0
     6      2
     7      0
     8      0
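The arithmetic in the CASE expression is an inclusion-exclusion count. As a rough sketch, the four terms can be inspected separately for the ID 10 string, assuming its first repeated word is TO (which is what the regexp_replace pattern appears to extract). The column aliases are illustrative.

```sql
-- Break the rep_count formula into its four terms for one literal string.
select regexp_count(s, 'TO')       as all_hits,    -- every occurrence of TO, even inside other words
       regexp_count(s, 'TO\w+')    as with_tail,   -- TO followed by more word characters (TOM, TOMMY, ...)
       regexp_count(s, '\w+TO')    as with_head,   -- TO preceded by word characters (STO in STOM)
       regexp_count(s, '\w+TO\w+') as with_both,   -- subtracted twice above, so added back once
       regexp_count(s, 'TO')
         - regexp_count(s, 'TO\w+')
         - regexp_count(s, '\w+TO')
         + regexp_count(s, '\w+TO\w+') as rep_count
from   (select 'TO TOM TOM TOM TOM TO TO TO STOM HANKS TOM TOMMY' as s from dual);
```

The rep_count column should come to 4, which agrees with the value listed for ID 10 above.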