Matching words across all rows of the same column

Time: 2017-07-08 11:02:13

Tags: oracle plsql

The table 'mytable' has a column named 'Description'.

+----+-------------------------------+
| ID | Description                   |
+----+-------------------------------+
| 1  | My NAME is Sajid KHAN         |
| 2  | My Name is Ahmed Khan         |
| 3  | MY friend name is Salman Khan |
+----+-------------------------------+

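I need to write an Oracle SQL query, procedure, or function that lists the distinct words in that column together with the number of times each one occurs.

The output should be: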

+------------------+-------+
| Word             | Count |
+------------------+-------+
| MY               |     3 |
| NAME             |     3 |
| IS               |     3 |
| SAJID            |     1 |
| KHAN             |     3 |
| AHMED            |     1 |
| FRIEND           |     1 |
| SALMAN           |     1 |
+------------------+-------+

Word matching should be case-insensitive.

I am using Oracle 12.1.

2 answers:

Answer 0: (score: 1)

Let's assume we somehow manage to split each description into words. So instead of a single row with Id = 1 and Description = 'My NAME is Sajid KHAN', we would have 5 rows like this:

ID  | Description
--- | ------------
 1  | My 
 1  | NAME 
 1  | is 
 1  | Sajid 
 1  | KHAN

In this form the task is trivial, something like

select Description, count(*) from data_in_new_form group by Description
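
As a minimal sketch of that grouping step, the query below inlines a few hand-split rows under the placeholder name data_in_new_form used above. The rows are illustrative only (the first two words of each description), and UPPER is applied so that the match is case-insensitive, as the question requires.

```sql
-- Illustrative rows only: the first two words of each sample description,
-- already split by hand. data_in_new_form is a placeholder name.
with data_in_new_form as (
    select 1 as id, 'My' as description from dual union all
    select 1, 'NAME'   from dual union all
    select 2, 'My'     from dual union all
    select 2, 'Name'   from dual union all
    select 3, 'MY'     from dual union all
    select 3, 'friend' from dual
)
-- Fold case before grouping so 'My', 'MY' and 'my' count as one word.
select upper(description) as word, count(*) as cnt
from data_in_new_form
group by upper(description)
order by word;
-- Returns FRIEND 1, MY 3, NAME 2.
```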

So let's use a recursive query to do the split.

create table mytable
as
select 1 as ID, 'My NAME is Sajid KHAN' as Description from dual
union all 
select 2, 'My Name is Ahmed Khan' from dual
union all
select 3, 'MY friend name is Salman Khan' from dual
union all
select 4, 'test, punctuation! it is' from dual
;


with
rec (id, str, depth, element_value) as
(
    -- Anchor member.
    select id, upper(Description) as str, 1 as depth, REGEXP_SUBSTR( upper(Description), '(.*?)( |$)', 1, 1, NULL, 1 ) AS element_value
     from mytable
    UNION ALL
    -- Recursive member.
    select id, str, depth + 1, REGEXP_SUBSTR( str ,'(.*?)( |$)', 1, depth+1, NULL, 1 ) AS element_value
     from rec
    where depth < regexp_count(str, ' ')+1
)
, data as (
select * from rec
--order by id, depth
)
select element_value, count(*) from data
group by element_value
order by element_value
;

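Note that this version does nothing about punctuation; it assumes that words are delimited by spaces, so 'test,' and 'punctuation!' in row 4 are counted with their punctuation attached.

UPDATE: an alternative using a hierarchical query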
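
If punctuation should be ignored, one possible variation (a sketch, not part of the original answer) is to extract runs of alphanumeric characters instead of splitting on spaces. It assumes that a "word" is any run matching the POSIX class [[:alnum:]], so a word containing an apostrophe or hyphen would be split in two; the pattern would need adjusting if that matters.

```sql
with
rec (id, str, depth, element_value) as
(
    -- Anchor member: first alphanumeric run of each description.
    select id, upper(Description), 1,
           regexp_substr(upper(Description), '[[:alnum:]]+', 1, 1)
     from mytable
    union all
    -- Recursive member: next run, until all runs in the string are consumed.
    select id, str, depth + 1,
           regexp_substr(str, '[[:alnum:]]+', 1, depth + 1)
     from rec
    where depth < regexp_count(str, '[[:alnum:]]+')
)
select element_value, count(*) as cnt
from rec
-- A description with no alphanumeric characters yields a NULL from the anchor.
where element_value is not null
group by element_value
order by cnt desc, element_value
;
```

With the four-row mytable created above, row 4 then contributes TEST, PUNCTUATION, IT and IS rather than 'TEST,' and 'PUNCTUATION!'.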

with rec as
(
    SELECT id, LEVEL AS depth,
    REGEXP_SUBSTR( upper(description) ,'(.*?)( |$)', 1, LEVEL, NULL, 1 ) AS element_value
    FROM   mytable
    CONNECT BY LEVEL <= regexp_count(description, ' ')+1
    and prior id = id
    and prior SYS_GUID() is not null
)
, data as (
select * from rec
--order by id, depth
)
select element_value, count(*) from data
group by element_value
order by 2 desc
;
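
In the query above, the conditions prior id = id and prior SYS_GUID() is not null keep each row's hierarchy separate and stop Oracle from reporting a cycle. Since the question mentions Oracle 12.1, the same per-row generator could probably also be written with CROSS APPLY, which avoids those two conditions. This is a sketch under that assumption; it splits on spaces with the pattern '[^ ]+' rather than the pattern used above, and the alias names t and g are arbitrary.

```sql
-- Sketch only: CROSS APPLY requires Oracle 12.1 or later.
select element_value, count(*) as cnt
from (
    select upper(regexp_substr(t.description, '[^ ]+', 1, g.depth)) as element_value
    from   mytable t
    cross apply (
             -- Row generator: one row per word position in this description.
             select level as depth
             from   dual
             connect by level <= regexp_count(t.description, '[^ ]+')
           ) g
)
-- CONNECT BY always returns at least one row, so an empty description gives NULL.
where element_value is not null
group by element_value
order by cnt desc, element_value
;
```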

Answer 1: (score: 0)

This query will work. The order of the words may differ from yours, but, as in your listing, the most frequent words appear first.

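  -- LISTAGG joins all descriptions into one string, so the combined
  -- length must fit within the VARCHAR2 size limit.
  SELECT word,
         COUNT(*)
    FROM (SELECT TRIM(REGEXP_SUBSTR(Description, '[^ ]+', 1, LEVEL)) AS Word
            FROM (SELECT LISTAGG(UPPER(Description), ' ')
                           WITHIN GROUP (ORDER BY ROWNUM) AS Description
                    FROM mytable)
          -- LEVEL (rather than ROWNUM) picks the n-th word explicitly.
          CONNECT BY LEVEL <= REGEXP_COUNT(Description, '[^ ]+'))
   GROUP BY word
   ORDER BY 2 DESC;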