Question

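I am trying to use the OpenStreetMap API to populate the cities of a given state in my Django application. The database is already populated with some cities. I am running into duplicate data, because city names sometimes contain special characters.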

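For example, in Turkey, the state of Bursa has a city named Gursu. My database has a city object named Gürsu, while the OpenStreetMap API returns the city name Gürsü. I am trying to find a way to match existing cities against names that contain special characters and update them if they exist, so that I can avoid duplicates.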

Answer 1

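A Unicode-aware solution is to match text according to UTS #10, the Unicode Collation Algorithm. At collation level 1 (primary strength) only base letters are compared, so accents and case are ignored. You can do this in the database or in Python (possibly using PyICU). Here is a short demonstration in Perl: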

#!/usr/bin/env perl
use 5.010;
use utf8;
use open qw(:std :encoding(UTF-8));
use Unicode::Collate qw();

my $c = Unicode::Collate->new(normalization => undef, level => 1);
my @g = qw(Gursu Gürsu Gursü Gürsü);

for my $o (@g) {
    for my $i (@g) {
        say "$i matches $o" if -1 != $c->index($o, $i, 0);
    }
}

__END__
Gursu matches Gursu
Gürsu matches Gursu
Gursü matches Gursu
Gürsü matches Gursu
Gursu matches Gürsu
Gürsu matches Gürsu
Gursü matches Gürsu
Gürsü matches Gürsu
Gursu matches Gursü
Gürsu matches Gursü
Gursü matches Gursü
Gürsü matches Gursü
Gursu matches Gürsü
Gürsu matches Gürsü
Gursü matches Gürsü
Gürsü matches Gürsü
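
Since the question is about Django, here is a rough Python counterpart using only the standard library. It is not the Unicode Collation Algorithm itself, only an approximation of primary-strength matching: decompose each name with NFD, drop the combining marks, and case-fold. The helper name `primary_key` is made up for this sketch, and letters that have no decomposition (for example the Turkish dotless ı) are not folded, so PyICU remains the closer equivalent of the Perl code above.

```python
import unicodedata


def primary_key(name: str) -> str:
    """Return a rough accent- and case-insensitive key (hypothetical helper)."""
    # NFD splits "ü" into "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


names = ["Gursu", "Gürsu", "Gursü", "Gürsü"]
keys = {primary_key(n) for n in names}
print(keys)  # all four spellings collapse to one key
```

Two names can then be treated as the same city when their keys are equal.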

Answer 2

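First of all, they are not the same: ü and u have different Unicode code points. However, if you want ü to match u, you need to do some work and group the characters you consider similar into a list. This is a rough solution that you can modify as needed.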

import difflib

# Groups of characters to treat as equivalent; add more groups here.
similar_groups = [['ü', 'u']]
city = 'Gursu'
city_b = 'Gürsü'
# Collect the distinct characters that differ between the two names.
output_list = sorted({li[-1] for li in difflib.ndiff(city, city_b) if li[0] in '+-'})
print(output_list)
# Match if the names are identical or differ only by one group of similar characters.
match = not output_list or any(output_list == sorted(group) for group in similar_groups)

if match:
    print("Equal")
    # update or skip your record here
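
A possible variation on the same idea, in case several groups differ at once: map every variant to one canonical character and compare the results. This is only a sketch. The groups, the helper names `canonical` and `upsert`, and the `existing` dictionary (a stand-in for the city table) are all assumptions made for illustration.

```python
# Hypothetical groups: the first character of each group is the canonical form.
similar_groups = [["u", "ü"], ["o", "ö"], ["s", "ş"]]

# Map every variant to its group's canonical character.
table = str.maketrans(
    {variant: group[0] for group in similar_groups for variant in group[1:]}
)


def canonical(name: str) -> str:
    """Replace grouped variants and ignore case (illustrative helper)."""
    return name.translate(table).casefold()


# Stand-in for the existing city table, keyed by canonical name.
existing = {canonical(name): name for name in ["Gürsu", "Osmangazi"]}


def upsert(api_name: str) -> bool:
    """Update the stored name if the city exists, else add it.

    Returns True when the city was new.
    """
    key = canonical(api_name)
    is_new = key not in existing
    existing[key] = api_name
    return is_new


print(upsert("Gürsü"))   # existing city: updated, not duplicated
print(upsert("Kestel"))  # unknown city: added
```

In a Django model, the canonical key could be kept in an extra indexed field and looked up before creating a row, for instance with `update_or_create`; whether that fits depends on your schema.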

Is it possible to find or match two names that have different special characters (Django)?