Question

I have a list of words:

"dog", "car", "house", "work", "cat"

I need to be able to match at least 3 of them in a text, for example:

"I always let my cat and dog at the animal nursery when I go to work by car"

Here I would like the regular expression to match, because the text contains at least 3 of the words (4 in this case):

"cat", "dog", "car" and "work"

Edit 1

I would like to use it with Oracle's regexp_like function.

Edit 2

I also need it to work with consecutive words.

Answer 1

Because Oracle's regexp_like does not support non-capturing groups or word boundaries, the following expression can be used:

^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$


Alternatively, a longer but arguably cleaner solution is:

^(.*? )?(dog|car|house|work|cat) .*?(dog|car|house|work|cat) .*?(dog|car|house|work|cat)( .*)?$


Note: both of these will also match when the same word is used more than once, e.g. "dog dog dog".
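As a rough sketch of how the first pattern might be plugged into regexp_like, the query below runs it over a few inline rows. The sample_text subquery and its rows are made up for illustration and are not part of the question. By the pattern's logic, the first and third rows should be returned: the first contains three listed words as whole, space-delimited words, and the third qualifies only because repeated words are counted.

```sql
-- Illustrative only: sample_text and its rows are assumptions, not from the question.
WITH sample_text ( txt ) AS (
  SELECT 'I always let my cat and dog at the animal nursery when I go to work by car' FROM DUAL UNION ALL
  SELECT 'my dog stays in the house' FROM DUAL UNION ALL   -- only two listed words
  SELECT 'dog dog dog' FROM DUAL                            -- same word three times
)
SELECT txt
FROM   sample_text
WHERE  REGEXP_LIKE( txt, '^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$' );
```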

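Edit: To handle punctuation, a small modification can be made. It is not perfect, but it should match 99% of the cases involving punctuation (it will not match, for example, !dog):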

^((.*? )?(dog|car|house|work|cat)([ ,.!?]|$)){3}.*$

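To see what the punctuation tweak changes, the two patterns can be compared side by side. This is only a sketch: the sample_text rows and the column aliases spaces_only and with_punctuation are invented for illustration. By the patterns' logic, the first row should come out as N / Y (the words are followed by commas and a full stop), and the second as N / N, since a word preceded by "!" is still not matched.

```sql
-- Illustrative only: sample_text, spaces_only and with_punctuation are assumed names.
WITH sample_text ( txt ) AS (
  SELECT 'My cat, my dog, and my car.' FROM DUAL UNION ALL
  SELECT '!dog !cat !car' FROM DUAL
)
SELECT txt,
       -- original pattern: a word must be followed by a space or end of string
       CASE WHEN REGEXP_LIKE( txt, '^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$' )
            THEN 'Y' ELSE 'N' END AS spaces_only,
       -- modified pattern: a word may also be followed by , . ! or ?
       CASE WHEN REGEXP_LIKE( txt, '^((.*? )?(dog|car|house|work|cat)([ ,.!?]|$)){3}.*$' )
            THEN 'Y' ELSE 'N' END AS with_punctuation
FROM   sample_text;
```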

Answer 2

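Here is a solution that does not use regular expressions, excludes duplicate words, and lets the words to match be passed in as a bind parameter in a collection: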

SQL Fiddle

Oracle 11g R2 Schema Setup:

Create a collection type to store the list of words:

CREATE TYPE StringList IS TABLE OF VARCHAR2(50)
/

Create a PL/SQL function to split a delimited string into the collection:

CREATE OR REPLACE FUNCTION split_String(
  i_str    IN  VARCHAR2,
  i_delim  IN  VARCHAR2 DEFAULT ','
) RETURN StringList DETERMINISTIC
AS
  p_result       StringList := StringList();
  p_start        NUMBER(5) := 1;
  p_end          NUMBER(5);
  c_len CONSTANT NUMBER(5) := LENGTH( i_str );
  c_ld  CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
  IF c_len > 0 THEN
    p_end := INSTR( i_str, i_delim, p_start );
    WHILE p_end > 0 LOOP
      p_result.EXTEND;
      p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, p_end - p_start );
      p_start := p_end + c_ld;
      p_end := INSTR( i_str, i_delim, p_start );
    END LOOP;
    IF p_start <= c_len + 1 THEN
      p_result.EXTEND;
      p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, c_len - p_start + 1 );
    END IF;
  END IF;
  RETURN p_result;
END;
/

Create some test data:

CREATE TABLE test_data ( value ) AS
SELECT 'I always let my cat and dog at the animal nursery when I go to work by car' FROM DUAL UNION ALL
SELECT 'dog dog foo bar dog' FROM DUAL
/

Query 1:

SELECT *
FROM   test_data
WHERE  CARDINALITY(
         split_string( value, ' ' )    -- Split the string into a collection
         MULTISET INTERSECT            -- Intersect it with the input words
         StringList( 'dog', 'car', 'house', 'work', 'cat' )
       ) >= 3                          -- Check that the size of the intersection
                                       -- is at least 3 items.

Answer 3

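Setting aside the issues I raised in the comments on the original post, here is a simple way to solve the problem with a join and aggregation (using a HAVING condition). Note that a word like doghouse in the input will match both dog and house (read the comments on the original post!).

In the query below, both the input phrase and the words to match are hard-coded in factored subqueries (the WITH clause). In a serious environment, both should come from base tables, be supplied as input variables, and so on.

I show how to do this with the standard string comparison operator LIKE. This can be changed to REGEXP_LIKE, but that is generally unnecessary (and indeed a bad idea). However, if you need to distinguish 'dog' from 'dogs' and 'dog's' (and 'dogwood'), or need case-insensitive comparison and so on, you can use REGEXP_LIKE. The point of this solution is that you do not have to worry about matching three different words; if you know how to match ONE (whether a whole-word match is required, whether case matters, etc.), then you can just as easily match three words under the same rules.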

with
  inputs ( input_phrase ) as (
    select
  'I always let my cat and dog at the animal nursery when I go to work by car'
    from   dual
  ),
  words ( word_to_match) as (
    select 'dog'   from dual union all
    select 'car'   from dual union all
    select 'house' from dual union all
    select 'work'  from dual union all
    select 'cat'   from dual
  )
select   input_phrase
from     inputs inner join words 
                on input_phrase like '%' || word_to_match || '%'
group by input_phrase
having   count(*) >= 3
;

INPUT_PHRASE                                                              
--------------------------------------------------------------------------
I always let my cat and dog at the animal nursery when I go to work by car
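For the REGEXP_LIKE variant mentioned above, one possible shape is sketched below. It assumes "whole word" means "not directly adjacent to a letter" and that matching should be case-insensitive. The input phrase is invented for illustration, and the words are concatenated into the pattern unescaped, which is only safe because they contain no regex metacharacters. With this input, dog, car and work should each match once, while house should not (it only occurs inside doghouse), so the phrase should be returned.

```sql
with
  inputs ( input_phrase ) as (
    -- made-up phrase for illustration
    select 'My DOG sleeps in the doghouse near the car. Work can wait.'
    from   dual
  ),
  words ( word_to_match ) as (
    select 'dog'   from dual union all
    select 'car'   from dual union all
    select 'house' from dual union all
    select 'work'  from dual union all
    select 'cat'   from dual
  )
select   input_phrase
from     inputs inner join words
                -- the word must sit between non-letters (or string start/end);
                -- 'i' makes the comparison case-insensitive
                on regexp_like( input_phrase,
                                '(^|[^[:alpha:]])' || word_to_match || '([^[:alpha:]]|$)',
                                'i' )
group by input_phrase
having   count(*) >= 3
;
```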

Answer 4

The following solution excludes duplicate matches, does not use regular expressions (though you can if you like), and does not use PL/SQL.

WITH match_list ( match_word ) AS (
    SELECT 'dog' AS match_word FROM dual
     UNION ALL
    SELECT 'work' FROM dual
     UNION ALL
    SELECT 'car' FROM dual
     UNION ALL
    SELECT 'house' FROM dual
     UNION ALL
    SELECT 'cat' FROM dual
)
SELECT phrase, COUNT(*) AS unique_match_cnt, SUM(match_cnt) AS total_match_cnt
     , LISTAGG(match_word, ',') WITHIN GROUP ( ORDER BY match_word ) AS unique_matches
  FROM (
    SELECT pt.phrase, ml.match_word, COUNT(*) AS match_cnt
      FROM phrase_table pt INNER JOIN match_list ml
        ON ' ' || LOWER(pt.phrase) || ' ' LIKE '%' || ml.match_word || '%'
     GROUP BY pt.phrase, ml.match_word
) GROUP BY phrase
HAVING COUNT(*) >= 3;

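The key is to put the words you want to match into a table or a common table expression/subquery. You can use REGEXP_LIKE() in place of LIKE if you like, but I think that would be more expensive. Skip LISTAGG() if you are not using Oracle 11g or later, or if you do not actually need to know which words matched, and skip LOWER() if you want a case-sensitive match.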
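If only whole words should count, one variation is to put spaces inside the LIKE pattern as well, so that the padding around the phrase has something to line up with. This is a sketch rather than the answer's own query: phrase_table is not defined in the answer, so it is supplied inline here with invented rows, and punctuation next to a word (for example "car.") would still prevent a match. Under these assumptions the first phrase should be returned (dog, cat, work) and the second should not, because carpet no longer counts as car.

```sql
WITH phrase_table ( phrase ) AS (
    -- made-up rows; the answer assumes this table already exists
    SELECT 'Dog and cat ride to work' FROM dual
     UNION ALL
    SELECT 'The cat sat by the carpet at work' FROM dual
), match_list ( match_word ) AS (
    SELECT 'dog' FROM dual
     UNION ALL
    SELECT 'work' FROM dual
     UNION ALL
    SELECT 'car' FROM dual
     UNION ALL
    SELECT 'house' FROM dual
     UNION ALL
    SELECT 'cat' FROM dual
)
SELECT pt.phrase, COUNT(*) AS unique_match_cnt
  FROM phrase_table pt INNER JOIN match_list ml
       -- spaces on both sides of the word force a whole-word match
    ON ' ' || LOWER(pt.phrase) || ' ' LIKE '% ' || ml.match_word || ' %'
 GROUP BY pt.phrase
HAVING COUNT(*) >= 3;
```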

Answer 5

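If you do not need the matched words to be distinct:

(?:\b(?:dog|car|house|work|cat)\b.*?){3}

I do not know whether this will work in your environment.

Edit: I did not see that another answer is almost identical to this one.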

Match at least 3 words in any order from 5 words
