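How can I write a query that identifies similar-sounding names, possibly including non-English names? Soundex does not seem to handle non-English names well.
The code should recognize that, for example, the following (or most of them) are similar-sounding names: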
Helena - Elena
Violet - Viola
Beatrix - Beatrice
Madeline - Madeleine (ma-duh-LINE vs ma-duh-LEN)
Alice - Elise
Madeline - Adeline
Kristen - Kirsten
Lily - Millie
Charlotte - Scarlett
Zara / Lara / Sara / Mara
Elena - Alana
Emily - Emmeline
Amelia - Amalia
Stella - Bella - Ella
Isabel - Isabeau
Holly - Hallie
Laura - Lara
Fiona - Finola
Louise - Eloise
Cara - Clara
Susanna vs Susannah
Nora vs Norah
Talia vs Tahlia vs Thalia
Catherine vs Katherine
Cecilia vs Cecelia
Lucy vs Lucie
Vivian vs Vivien
Lillian vs Lilian
Gwendolen vs Gwendolyn
Sofia vs Sophia
Isabel vs Isobel vs Isabelle
Seraphina vs Serafina
Juliet vs Juliette
Annabel vs Annabelle
Emily vs Emilie
Elisabeth vs Elizabeth
...and non-English names too.
Answer 0 (score: 3)
Would an algorithm such as Levenshtein distance, which measures how similar two sequences are, help?
https://en.wikipedia.org/wiki/Levenshtein_distance
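For reference, the usual recursive definition of the Levenshtein distance can be written as below. The symbols $a$, $b$, $i$ and $j$ are notation introduced here for illustration (two strings and prefix lengths); they are not from the original post.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1\\
\operatorname{lev}_{a,b}(i,j-1) + 1\\
\operatorname{lev}_{a,b}(i-1,j-1) + [a_i \neq b_j]
\end{cases} & \text{otherwise.}
\end{cases}
```

Here $[a_i \neq b_j]$ is 1 when the two characters differ and 0 otherwise, and the distance between the full strings is $\operatorname{lev}_{a,b}(|a|,|b|)$. For instance, Cara and Clara are at distance 1, because inserting a single "l" turns one into the other.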
In Oracle specifically, you can use the UTL_MATCH package.
For example:
--Find closest names based on UTL_MATCH.EDIT_DISTANCE.
with names as
(
    --Names data.
    select column_value name
    from table(sys.odcivarchar2list('Adeline','Alana','Alice','Amalia','Amelia','Annabel',
        'Annabelle','Beatrice','Beatrix','Bella','Cara','Catherine','Cecelia','Cecilia',
        'Charlotte','Clara','Elena','Elisabeth','Elise','Elizabeth','Ella','Eloise','Emilie',
        'Emily','Emmeline','Finola','Fiona','Gwendolen','Gwendolyn','Hallie','Helena','Holly',
        'Isabeau','Isabel','Isabelle','Isobel','Juliet','Juliette','Katherine','Kirsten',
        'Kristen','Lara','Laura','Lilian','Lillian','Lily','Louise','Lucie','Lucy',
        'Madeleine','Madeline','Mara','Millie','Nora','Norah','Sara','Scarlett','Serafina',
        'Seraphina','Sofia','Sophia','Stella','Susanna','Susannah','Tahlia','Talia','Thalia',
        'Viola','Violet','Vivian','Vivien','Zara'))
)
--Name with the closest matches.
select name1, edit_distance, listagg(name2, ',') within group (order by name2) names
from
(
    --Compare strings.
    select names1.name name1, names2.name name2
        ,utl_match.edit_distance(names1.name, names2.name) edit_distance
        ,min(utl_match.edit_distance(names1.name, names2.name))
            over (partition by names1.name) min_edit_distance
    from names names1
    cross join names names2
    --This cross join could get expensive. It may help to add conditions here to
    --filter out obvious non-matches. For example, maybe throw out rows where the
    --string length is vastly different?
    where names1.name <> names2.name
    order by 1, 3, 2
)
where edit_distance = min_edit_distance
group by name1, edit_distance
order by 1;
Results:
NAME1      EDIT_DISTANCE  NAMES
---------  -------------  ----------------------------
Adeline    2              Madeline
Alana      2              Clara,Elena
Alice      2              Elise
Amalia     1              Amelia
Amelia     1              Amalia
Annabel    2              Annabelle
Annabelle  2              Annabel
Beatrice   2              Beatrix
Beatrix    2              Beatrice
Bella      2              Ella,Stella
Cara       1              Clara,Lara,Mara,Sara,Zara
Catherine  1              Katherine
Cecelia    1              Cecilia
Cecilia    1              Cecelia
Charlotte  4              Scarlett
Clara      1              Cara
Elena      2              Alana,Ella,Helena
Elisabeth  1              Elizabeth
Elise      1              Eloise
Elizabeth  1              Elisabeth
Ella       2              Bella,Elena
Eloise     1              Elise
Emilie     2              Emily
Emily      2              Emilie,Lily
Emmeline   3              Adeline,Emilie,Madeline
Finola     2              Fiona,Viola
Fiona      2              Finola,Viola
Gwendolen  1              Gwendolyn
Gwendolyn  1              Gwendolen
Hallie     2              Millie
Helena     2              Elena
Holly      3              Bella,Ella,Emily,Hallie,Lily
Isabeau    2              Isabel
Isabel     1              Isobel
Isabelle   2              Isabel
Isobel     1              Isabel
Juliet     2              Juliette
Juliette   2              Juliet
Katherine  1              Catherine
Kirsten    2              Kristen
Kristen    2              Kirsten
Lara       1              Cara,Laura,Mara,Sara,Zara
Laura      1              Lara
Lilian     1              Lillian
Lillian    1              Lilian
Lily       2              Emily,Lucy
Louise     3              Elise,Eloise,Lucie
Lucie      2              Lucy
Lucy       2              Lily,Lucie
Madeleine  1              Madeline
Madeline   1              Madeleine
Mara       1              Cara,Lara,Sara,Zara
Millie     2              Hallie
Nora       1              Norah
Norah      1              Nora
Sara       1              Cara,Lara,Mara,Zara
Scarlett   4              Charlotte
Serafina   2              Seraphina
Seraphina  2              Serafina
Sofia      2              Sophia
Sophia     2              Sofia
Stella     2              Bella
Susanna    1              Susannah
Susannah   1              Susanna
Tahlia     1              Talia
Talia      1              Tahlia,Thalia
Thalia     1              Talia
Viola      2              Finola,Fiona,Violet
Violet     2              Viola
Vivian     1              Vivien
Vivien     1              Vivian
Zara       1              Cara,Lara,Mara,Sara
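
The comment in the query about the cross join getting expensive could be acted on roughly as follows. This is only a sketch under assumptions: the threshold of 2 and the eight-name sample are chosen here for illustration, and whether the pre-filter actually saves work depends on how the optimizer orders the predicates. The idea is that the edit distance between two strings can never be smaller than the difference in their lengths, so pairs whose lengths differ by more than the threshold can be skipped safely. Each unordered pair is also compared only once.

```sql
--Sketch: list each pair of names once, keeping only pairs within a fixed edit distance.
--The threshold (2) and the sample data are illustrative assumptions.
with names as
(
    --Small sample taken from the question; replace with a real table.
    select column_value name
    from table(sys.odcivarchar2list('Cara','Clara','Lara','Laura','Kirsten','Kristen',
        'Charlotte','Scarlett'))
)
select names1.name name1, names2.name name2
    ,utl_match.edit_distance(names1.name, names2.name) edit_distance
from names names1
join names names2
    --Compare each unordered pair only once.
    on names1.name < names2.name
    --Cheap pre-filter: the edit distance is at least the length difference,
    --so pairs failing this test cannot be within the threshold.
    and abs(length(names1.name) - length(names2.name)) <= 2
where utl_match.edit_distance(names1.name, names2.name) <= 2
order by edit_distance, name1, name2;
```

A fixed threshold is a trade-off: according to the results above, Charlotte and Scarlett are 4 edits apart, so a threshold of 2 drops that pair, while a threshold of 4 would let in many unrelated short names. UTL_MATCH also offers EDIT_DISTANCE_SIMILARITY and JARO_WINKLER_SIMILARITY, which return a normalized score from 0 to 100 and may be easier to threshold across names of different lengths.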