How to select the substring that best matches another string

Date: 2018-11-25 09:44:39

Tags: sql oracle oracle12c string-matching

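Let's say the full string is

  The following example examines the string, looking for the first substring bounded by commas

and the subString is

  substing bounded

Is there any way, using SQL, to check whether the full string contains a 90% match of the subString,

like substing bounded and substring bounded in my example?

The subString can contain more than one word, so I cannot simply split the full string into words.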

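2 Answers:

Answer 0 (score: 2)

First, convert your text into a table of words. You will find many posts on this topic on SO.

You must adjust the list of delimiter characters so that only words are extracted.

Here is a sample query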

 with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual  ),
      t2 as (select  rownum colnum from dual connect by level < 16 /* (max) number of words */),
      t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum)))  col  from t1, t2 
      where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
 select * from t3;

COL      
----------
The        
following  
example    
examines
...
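The hard-coded word limit and the delimiter list are the two things most likely to need adjusting. A minimal sketch, assuming a single-row source and a wider delimiter set chosen only for illustration, lets REGEXP_COUNT decide how many words to generate:

```sql
-- Sketch only: the delimiter class '[^ ,.;:!?]+' is an assumed example, adjust it to your data.
with t1 as (select 'The following example examines the string, looking for the first substring bounded by comas' col from dual)
select level colnum,                                     -- position of the word in the text
       regexp_substr(col, '[^ ,.;:!?]+', 1, level) word  -- the level-th run of non-delimiter characters
from   t1
connect by level <= regexp_count(col, '[^ ,.;:!?]+');    -- one row per word, no fixed maximum
```

This row-generator pattern relies on t1 returning exactly one row; with several source rows the CONNECT BY would need an extra condition to keep the rows apart.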

In the next step, use the Levenshtein distance to find the closest word.

 with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual  ),
      t2 as (select  rownum colnum from dual connect by level < 16 /* (max) number of words */),
      t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum)))  col  from t1, t2 
      where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
 select col, str, UTL_MATCH.EDIT_DISTANCE(col, str)  distance
 from t3
 cross join (select 'commas' str from dual)
 order by 3;

COL        STR      DISTANCE
---------- ------ ----------
comas      commas          1 
for        commas          5 
examines   commas          6 
...

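Check the definition of the Levenshtein distance and define a threshold on the distance to get your candidate words.

To match independently of word boundaries, simply scan through your input and take every substring whose length equals that of your match string, adjusted for the expected difference, e.g. by adding 10%.

You may limit the candidates by keeping only those substrings that start on a word boundary. The rest of the calculation is the same.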
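One way to express such a threshold, as a sketch rather than a recommendation, is UTL_MATCH.EDIT_DISTANCE_SIMILARITY, which returns the edit distance normalised to an integer between 0 (no match) and 100 (identical). The inline word list and the cut-off of 80 below are assumptions made for illustration:

```sql
-- Hypothetical word list standing in for the output of the word-splitting query.
with words as (
  select 'comas' col from dual union all
  select 'for'       from dual union all
  select 'examines'  from dual
)
select col,
       utl_match.edit_distance(col, 'commas')            distance,
       utl_match.edit_distance_similarity(col, 'commas') similarity
from   words
where  utl_match.edit_distance_similarity(col, 'commas') >= 80  -- assumed threshold
order  by similarity desc;
```

A similarity cut-off is often easier to reason about than a raw distance, because a distance of 1 means much more for a three-letter word than for a fifteen-letter one.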

 with txt as (select  'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
      str as (select  'substing bounded' str from dual),
      /* one candidate window per start position in the text */
      t1 as (select  substr(txt, rownum, (select length(str) * 1.1 from str)) substr, /* add 10% length for the match */
                     (select str from str) str 
             /* "<= ... + 1" so that the last possible start position is included */
             from txt connect by level <= (select length(txt) from txt) - (select length(str) from str) + 1) 
 select SUBSTR, STR, 
        UTL_MATCH.EDIT_DISTANCE(SUBSTR, STR)  distance
 from t1
 order by 3;

SUBSTR               STR                DISTANCE
-------------------- ---------------- ----------
substring bounded    substing bounded          1 
ubstring bounded     substing bounded          3 
 substring bounde    substing bounded          3 
t substring bound    substing bounded          5 
...
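The word-boundary filter and the 90% requirement from the question can be combined in one query. The following is a sketch under two assumptions: that "90% match" is read as UTL_MATCH.EDIT_DISTANCE_SIMILARITY >= 90, and that a word boundary means position 1 or a position directly after a space. The names pos, p and candidate are introduced here for illustration only.

```sql
with txt as (select 'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
     str as (select 'substing bounded' str from dual),
     -- one row per character position in the text (txt has exactly one row)
     pos as (select level p from txt connect by level <= length(txt)),
     t1  as (select substr(txt.txt, pos.p, floor(length(str.str) * 1.1)) candidate, -- window 10% longer than the search string
                    str.str
             from   txt cross join str cross join pos
             -- keep only windows that start at position 1 or right after a space
             where  pos.p = 1 or substr(txt.txt, pos.p - 1, 1) = ' ')
select candidate, str,
       utl_match.edit_distance_similarity(candidate, str) similarity
from   t1
where  utl_match.edit_distance_similarity(candidate, str) >= 90  -- assumed reading of "90% match"
order  by similarity desc;
```

Because the window length is fixed, a candidate that is shorter or longer than the window pays for the padding in its score, so the 10% allowance and the 90 cut-off should be tuned together.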

Answer 1 (score: 0)

Experiment with the SOUNDEX function.

I have not tested this, but it might help you:

    WITH strings AS (
      -- '[^ ]+' matches each run of non-space characters, i.e. one word per level
      select regexp_substr('The following example examines the string, looking for the first substring bounded by comas','[^ ]+', 1, level) ss 
      from dual
      connect by regexp_substr('The following example examines the string, looking for the first substring bounded by comas', '[^ ]+', 1, level) is not null
    )
    SELECT ss 
    FROM strings
    WHERE SOUNDEX(ss) = SOUNDEX( 'commas' ) ;

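REGEXP_SUBSTR together with CONNECT BY splits the long string into words (by space); modify the delimiters as needed to include punctuation etc.

Here we rely on the built-in SOUNDEX behaving the way we expect.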
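Since SOUNDEX reduces each word to a four-character code (the first letter plus up to three digits), it compares whole words and cannot grade how close a match is. A quick way to check whether it meets your expectations, sketched here with the words from this thread, is to list the codes side by side before relying on the filter:

```sql
-- Compare the SOUNDEX codes directly to see which pairs the filter would treat as equal.
select 'comas' word, soundex('comas') code from dual union all
select 'commas',     soundex('commas')     from dual union all
select 'substing',   soundex('substing')   from dual union all
select 'substring',  soundex('substring')  from dual;
```

Words that share a code are considered a match by the query above, however different they look, so this approach suits single-word lookups better than the multi-word subString in the question.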