How do I write a query to identify names that sound similar?

Time: 2016-05-16 03:27:58

Tags: sql oracle oracle-sqldeveloper

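How can I write a query that identifies names that sound similar, possibly including non-English names? Soundex does not seem to handle non-English names well.

The code should be able to identify, for example, the following pairs (or most of them) as similar-sounding names: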


Helena - Elena 
Violet - Viola 
Beatrix - Beatrice 
Madeline - Madeleine (ma-duh-LINE vs ma-duh-LEN) 
Alice - Elise 
Madeline - Adeline 
Kristen - Kirsten 
Lily - Millie 
Charlotte - Scarlett 
Zara / Lara / Sara / Mara 
Elena - Alana 
Emily - Emmeline 
Amelia - Amalia 
Stella - Bella - Ella 
Isabel - Isabeau 
Holly - Hallie 
Laura - Lara 
Fiona - Finola 
Louise - Eloise 
Cara - Clara 
Susanna vs Susannah 
Nora vs Norah 
Talia vs Tahlia vs Thalia 
Catherine vs Katherine 
Cecilia vs Cecelia 
Lucy vs Lucie 
Vivian vs Vivien 
Lillian vs Lilian 
Gwendolen vs Gwendolyn 
Sofia vs Sophia 
Isabel vs Isobel vs Isabelle 
Seraphina vs Serafina 
Juliet vs Juliette 
Annabel vs Annabelle 
Emily vs Emilie 
Elisabeth vs Elizabeth 
...and non-English names too.

1 Answer:

Answer 0 (score: 3)

Would it help to use an algorithm such as Levenshtein distance, which measures the difference between two sequences?

https://en.wikipedia.org/wiki/Levenshtein_distance

In Oracle specifically, you can use the UTL_MATCH package.
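As a quick check of what the package returns for a single pair, here is a minimal sketch. It assumes your session can execute UTL_MATCH, and the name pair is simply taken from the question's list. EDIT_DISTANCE returns the raw Levenshtein distance, EDIT_DISTANCE_SIMILARITY returns a normalized score from 0 to 100, and JARO_WINKLER_SIMILARITY is a different metric from the same package that may be worth comparing on your own data.

```sql
--Sketch: compare one pair of names with three UTL_MATCH functions.
--The pair is illustrative; substitute your own columns or literals.
select utl_match.edit_distance('Helena', 'Elena')            as edit_distance
      ,utl_match.edit_distance_similarity('Helena', 'Elena') as edit_similarity
      ,utl_match.jaro_winkler_similarity('Helena', 'Elena')  as jaro_winkler_similarity
from dual;
```

These functions compare spellings and are case-sensitive, so it may help to apply UPPER or LOWER to both arguments if your data mixes cases.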

For example, the following query finds the closest matches for each name:

--Find closest names based on UTL_MATCH.EDIT_DISTANCE.
with names as
(
    --Names data.
    select column_value name
    from table(sys.odcivarchar2list('Adeline','Alana','Alice','Amalia','Amelia','Annabel',
    'Annabelle','Beatrice','Beatrix','Bella','Cara','Catherine','Cecelia','Cecilia',
    'Charlotte','Clara','Elena','Elisabeth','Elise','Elizabeth','Ella','Eloise','Emilie',
    'Emily','Emmeline','Finola','Fiona','Gwendolen','Gwendolyn','Hallie','Helena','Holly',
    'Isabeau','Isabel','Isabelle','Isobel','Juliet','Juliette','Katherine','Kirsten',
    'Kristen','Lara','Laura','Lilian','Lillian','Lily','Louise','Lucie','Lucy',
    'Madeleine','Madeline','Mara','Millie','Nora','Norah','Sara','Scarlett','Serafina',
    'Seraphina','Sofia','Sophia','Stella','Susanna','Susannah','Tahlia','Talia','Thalia',
    'Viola','Violet','Vivian','Vivien','Zara'))
)
--Name with the closest matches. 
select name1, edit_distance, listagg(name2, ',') within group (order by name2) names
from
(
    --Compare strings.
    select names1.name name1, names2.name name2
        ,utl_match.edit_distance(names1.name, names2.name) edit_distance
        ,min(utl_match.edit_distance(names1.name, names2.name))
            over (partition by names1.name) min_edit_distance
    from names names1
    cross join names names2
    --This cross join could get expensive.  It may help to add conditions here to
    --filter out obvious non-matches.  For example, maybe throw out rows where the
    --string length is vastly different?
    where names1.name <> names2.name
)
where edit_distance = min_edit_distance
group by name1, edit_distance
order by 1;

Results:

NAME1      EDIT_DISTANCE  NAMES
-----      -------------  -----
Adeline    2              Madeline
Alana      2              Clara,Elena
Alice      2              Elise
Amalia     1              Amelia
Amelia     1              Amalia
Annabel    2              Annabelle
Annabelle  2              Annabel
Beatrice   2              Beatrix
Beatrix    2              Beatrice
Bella      2              Ella,Stella
Cara       1              Clara,Lara,Mara,Sara,Zara
Catherine  1              Katherine
Cecelia    1              Cecilia
Cecilia    1              Cecelia
Charlotte  4              Scarlett
Clara      1              Cara
Elena      2              Alana,Ella,Helena
Elisabeth  1              Elizabeth
Elise      1              Eloise
Elizabeth  1              Elisabeth
Ella       2              Bella,Elena
Eloise     1              Elise
Emilie     2              Emily
Emily      2              Emilie,Lily
Emmeline   3              Adeline,Emilie,Madeline
Finola     2              Fiona,Viola
Fiona      2              Finola,Viola
Gwendolen  1              Gwendolyn
Gwendolyn  1              Gwendolen
Hallie     2              Millie
Helena     2              Elena
Holly      3              Bella,Ella,Emily,Hallie,Lily
Isabeau    2              Isabel
Isabel     1              Isobel
Isabelle   2              Isabel
Isobel     1              Isabel
Juliet     2              Juliette
Juliette   2              Juliet
Katherine  1              Catherine
Kirsten    2              Kristen
Kristen    2              Kirsten
Lara       1              Cara,Laura,Mara,Sara,Zara
Laura      1              Lara
Lilian     1              Lillian
Lillian    1              Lilian
Lily       2              Emily,Lucy
Louise     3              Elise,Eloise,Lucie
Lucie      2              Lucy
Lucy       2              Lily,Lucie
Madeleine  1              Madeline
Madeline   1              Madeleine
Mara       1              Cara,Lara,Sara,Zara
Millie     2              Hallie
Nora       1              Norah
Norah      1              Nora
Sara       1              Cara,Lara,Mara,Zara
Scarlett   4              Charlotte
Serafina   2              Seraphina
Seraphina  2              Serafina
Sofia      2              Sophia
Sophia     2              Sofia
Stella     2              Bella
Susanna    1              Susannah
Susannah   1              Susanna
Tahlia     1              Talia
Talia      1              Tahlia,Thalia
Thalia     1              Talia
Viola      2              Finola,Fiona,Violet
Violet     2              Viola
Vivian     1              Vivien
Vivien     1              Vivian
Zara       1              Cara,Lara,Mara,Sara
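The comments in the query suggest filtering out obvious non-matches before comparing strings. A minimal sketch of that idea follows. The cut-off of 2 is an assumption for illustration and not part of the original answer, as is listing each pair only once rather than in both directions. Because the edit distance between two strings is never smaller than the difference in their lengths, a length filter with the same threshold as the distance filter does not discard any pair the distance filter would keep.

```sql
--Sketch: skip pairs whose lengths differ too much before computing the distance.
--The threshold of 2 is an assumed cut-off; tune it for your data.
with names as
(
    --Small illustrative subset of the names from the question.
    select column_value name
    from table(sys.odcivarchar2list('Helena','Elena','Charlotte','Scarlett','Lily','Millie'))
)
select names1.name name1, names2.name name2
    ,utl_match.edit_distance(names1.name, names2.name) edit_distance
from names names1
join names names2
    --Each unordered pair once, and only pairs of similar length.
    on names1.name < names2.name
    and abs(length(names1.name) - length(names2.name)) <= 2
where utl_match.edit_distance(names1.name, names2.name) <= 2
order by edit_distance, name1, name2;
```

With a maximum distance of 2, a pair such as Charlotte and Scarlett (distance 4 in the results above) would not be returned, so the threshold trades recall for fewer false matches. Note also that Oracle does not promise to evaluate the cheap length predicate before the distance call, so check the execution plan if performance matters.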