Get all Unicode variations of a Latin character

Date: 2019-07-23 17:38:40

Tags: python python-3.x unicode unicode-normalization

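For example, for the character "a", I want to get a string (list of characters) like "aàáâãäåāăą" (I'm not sure whether that example list is complete...), i.e. basically all Unicode characters whose name matches "Latin Small Letter A with *".

Is there a generic way to get this?

I'm asking for Python, but a more general answer is fine too, although in any case I would appreciate a Python code snippet. Python >= 3.5 is fine. I guess you need access to the Unicode database, e.g. via the Python module unicodedata, which I would prefer over other external data sources.

I could imagine a solution like this: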

def get_variations(char):
   import unicodedata
   name = unicodedata.name(char)
   chars = char
   for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
      try: 
          chars += unicodedata.lookup("%s %s" % (name, variation))
      except KeyError:
          pass
   return chars
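
The lookup idea above can be made runnable if the elided list is replaced by an explicit one. The following is a minimal sketch only: `SAMPLE_SUFFIXES` and `get_variations_by_name` are hypothetical names, and the four suffixes are an illustrative subset, not the complete list the question asks about.

```python
import unicodedata

# Illustrative subset only (an assumption for this sketch); the Unicode
# database contains many more "WITH ..." suffixes than these four.
SAMPLE_SUFFIXES = ["WITH GRAVE", "WITH ACUTE", "WITH CEDILLA", "WITH MACRON"]

def get_variations_by_name(char, suffixes=SAMPLE_SUFFIXES):
    """Return char followed by every '<name of char> <suffix>' that exists."""
    name = unicodedata.name(char)
    chars = char
    for suffix in suffixes:
        try:
            chars += unicodedata.lookup("%s %s" % (name, suffix))
        except KeyError:
            # No character with this exact name exists; skip it.
            pass
    return chars
```

With this subset, `get_variations_by_name("a")` finds the grave, acute and macron forms but skips the cedilla, because no "LATIN SMALL LETTER A WITH CEDILLA" exists. The weakness of the approach is visible here: the result is only as complete as the hand-written suffix list.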

5 answers:

Answer 0 (score: 6)

First, get the set of Unicode combining diacritical marks; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))

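Now define a function that tries to compose each of them with a base ASCII character; whenever the composed normal form has length 1 (meaning the ASCII character plus the combining mark became a single Unicode ordinal), save it: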

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)

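The advantage of this is that it doesn't perform manual string lookups in the unicodedata DB, and it doesn't need to hardcode all possible descriptions of the combining characters. Anything that composes to a single character is included. The check runs in under 50 µs on my machine, so the cost is reasonable if you don't do it too often (you can decorate the function with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing the result each time).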
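
As a sketch of the caching suggestion, the function can be wrapped with `functools.lru_cache`. The name `get_unicode_variations_cached` is hypothetical. This variant also skips repeated results, which is an addition not present in the answer: a few marks in the range (for example U+0340 and U+0341) normalize to other marks, so the same composed character can otherwise appear twice.

```python
import functools
import unicodedata

# Combining Diacritical Marks block: U+0300..U+036F (768..879 inclusive).
COMBINING_CHARS = ''.join(map(chr, range(0x300, 0x370)))

@functools.lru_cache(maxsize=None)
def get_unicode_variations_cached(letter):
    """Compose letter with each combining mark; keep single-character results."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    seen = set()
    for combiner in COMBINING_CHARS:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        # Length 1 means letter + mark composed into one precomposed character.
        if len(normalized) == 1 and normalized not in seen:
            seen.add(normalized)
            variations.append(normalized)
    return ''.join(variations)
```

Repeated calls with the same letter are then served from the cache, which `get_unicode_variations_cached.cache_info()` can confirm.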

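If you want to get everything built out of one of these characters, a more exhaustive search is possible, but it takes longer (functools.lru_cache is nearly mandatory unless the function is only ever called once per argument):

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # + 1 so that the last code point (sys.maxunicode) is also checked
    for testlet in map(chr, range(sys.maxunicode + 1)):
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
            variations.append(testlet)
    return ''.join(variations)

This looks for any character that decomposes into a form containing the target letter. It does mean that the first search takes roughly a third of a second, and the results include more than just modified versions of the character (for example, the result for 'L' contains the Roman numeral 'Ⅼ', U+216C, which is not really a modified 'L'), but it is as exhaustive as it gets.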

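Answer 1 (score: 1)

I don't think you can do this in Python in a truly generic way.

See what a "complete list" of characters looks like based on: https://unicodelookup.com/

I think you would be surprised by how many characters a complete list contains.

However, you can definitely build a table tailored specifically to your application.

Answer 2 (score: 1)

I don't know of one, but you can build one yourself. Just look up the start and end code points of the special characters; you can use a unicode character table for that. Then create a list for each character from these numbers: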

ranges = {
  'A': (192, 199),
  'B': (0, 0),
  'E': (200, 204),
  ...
}

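# Named variations_map to avoid shadowing the built-in map()
variations_map = {}
for char, rng in ranges.items():
  start, end = rng
  # range() excludes end, so each tuple is (first, last + 1)
  variations_map[char] = char + ''.join([chr(i) for i in range(start, end)])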

This produces a mapping like:

{
  'A': 'AÀÁÂÃÄÅÆ',
  'B': 'B',
  'E': 'EÈÉÊË',
  ...
}
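
A self-contained sketch of this range-table idea follows. Only two entries are filled in, purely for illustration; `RANGES` and `build_variation_table` are hypothetical names, and the ranges are half-open `(first, last + 1)` pairs taken from the Latin-1 Supplement block.

```python
# Hand-maintained table: base letter -> half-open (start, end) code point range.
# Two illustrative entries only; a real table would need one per letter and
# cannot express variants that are not contiguous in Unicode.
RANGES = {
    'A': (192, 198),  # U+00C0..U+00C5
    'E': (200, 204),  # U+00C8..U+00CB
}

def build_variation_table(ranges):
    """Map each base letter to itself followed by the characters in its range."""
    table = {}
    for base, (start, end) in ranges.items():
        table[base] = base + ''.join(chr(cp) for cp in range(start, end))
    return table

table = build_variation_table(RANGES)
```

Variants outside a single contiguous range, such as the Latin Extended forms, would need extra ranges per letter, which is the main cost of maintaining such a table by hand.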

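Answer 3 (score: 1)

You can use the decomposition mappings of the Unicode database directly. The following code checks all mappings to find characters that decompose into something starting with a given letter: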

import unicodedata

def get_unicode_variations(letter):
    letter_code = ord(letter)
    # For some characters, you might want to check all
    # code points up to 0x10FFFF
    for i in range(65536):
        decomp = unicodedata.decomposition(chr(i))
        # Mappings starting with '<...>' indicate a
        # compatibility mapping (NFKD, NFKC) which we ignore.
        while decomp != '' and not decomp.startswith('<'):
            first_code = int(decomp.split()[0], 16)
            if first_code == letter_code:
                print(chr(i), unicodedata.name(chr(i)))
                break
            # Try to decompose further
            decomp = unicodedata.decomposition(chr(first_code))

However, this is inefficient if you need to process multiple characters. For the letter a, the code above prints

à LATIN SMALL LETTER A WITH GRAVE
á LATIN SMALL LETTER A WITH ACUTE
â LATIN SMALL LETTER A WITH CIRCUMFLEX
ã LATIN SMALL LETTER A WITH TILDE
ä LATIN SMALL LETTER A WITH DIAERESIS
å LATIN SMALL LETTER A WITH RING ABOVE
ā LATIN SMALL LETTER A WITH MACRON
ă LATIN SMALL LETTER A WITH BREVE
ą LATIN SMALL LETTER A WITH OGONEK
ǎ LATIN SMALL LETTER A WITH CARON
ǟ LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
ǡ LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
ǻ LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
ȁ LATIN SMALL LETTER A WITH DOUBLE GRAVE
ȃ LATIN SMALL LETTER A WITH INVERTED BREVE
ȧ LATIN SMALL LETTER A WITH DOT ABOVE
ḁ LATIN SMALL LETTER A WITH RING BELOW
ạ LATIN SMALL LETTER A WITH DOT BELOW
ả LATIN SMALL LETTER A WITH HOOK ABOVE
ấ LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
ầ LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
ẩ LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
ẫ LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
ậ LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
ắ LATIN SMALL LETTER A WITH BREVE AND ACUTE
ằ LATIN SMALL LETTER A WITH BREVE AND GRAVE
ẳ LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
ẵ LATIN SMALL LETTER A WITH BREVE AND TILDE
ặ LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
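
When several letters are needed, one way to avoid rescanning the database per letter is to build an index in a single pass. The sketch below is a variant, not the answer's code: it uses `unicodedata.normalize('NFD', ...)`, which applies canonical decompositions recursively, and groups each decomposable character under the first code point of its decomposed form. `build_variation_index` and its `limit` parameter are hypothetical names.

```python
import unicodedata
from collections import defaultdict

def build_variation_index(limit=0x10000):
    """Map each base character to the characters that canonically decompose
    into a sequence starting with it. limit is an exclusive upper bound on the
    code points scanned (default: the BMP, like range(65536) above)."""
    index = defaultdict(list)
    for code in range(limit):
        ch = chr(code)
        decomposed = unicodedata.normalize('NFD', ch)
        # Unchanged text means ch has no canonical decomposition.
        if decomposed != ch:
            index[decomposed[0]].append(ch)
    return index

index = build_variation_index()
```

After the one-time scan, `''.join(index['a'])` or `''.join(index['e'])` is a plain dictionary lookup. Compatibility mappings are ignored here, as in the code above, because NFD only applies canonical decompositions.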

Answer 4 (score: 0)

Use unichars

› unichars -a | grep -i 'Latin Small Letter A with'
 à  U+000E0 LATIN SMALL LETTER A WITH GRAVE
 á  U+000E1 LATIN SMALL LETTER A WITH ACUTE
 â  U+000E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
 ã  U+000E3 LATIN SMALL LETTER A WITH TILDE
 ä  U+000E4 LATIN SMALL LETTER A WITH DIAERESIS
 å  U+000E5 LATIN SMALL LETTER A WITH RING ABOVE
 ā  U+00101 LATIN SMALL LETTER A WITH MACRON
 ă  U+00103 LATIN SMALL LETTER A WITH BREVE
 ą  U+00105 LATIN SMALL LETTER A WITH OGONEK
 ǎ  U+001CE LATIN SMALL LETTER A WITH CARON
 ǟ  U+001DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
 ǡ  U+001E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
 ǻ  U+001FB LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
 ȁ  U+00201 LATIN SMALL LETTER A WITH DOUBLE GRAVE
 ȃ  U+00203 LATIN SMALL LETTER A WITH INVERTED BREVE
 ȧ  U+00227 LATIN SMALL LETTER A WITH DOT ABOVE
 ᶏ  U+01D8F LATIN SMALL LETTER A WITH RETROFLEX HOOK
 ◌ᷲ  U+01DF2 COMBINING LATIN SMALL LETTER A WITH DIAERESIS
 ḁ  U+01E01 LATIN SMALL LETTER A WITH RING BELOW
 ẚ  U+01E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
 ạ  U+01EA1 LATIN SMALL LETTER A WITH DOT BELOW
 ả  U+01EA3 LATIN SMALL LETTER A WITH HOOK ABOVE
 ấ  U+01EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
 ầ  U+01EA7 LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
 ẩ  U+01EA9 LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
 ẫ  U+01EAB LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
 ậ  U+01EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
 ắ  U+01EAF LATIN SMALL LETTER A WITH BREVE AND ACUTE
 ằ  U+01EB1 LATIN SMALL LETTER A WITH BREVE AND GRAVE
 ẳ  U+01EB3 LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
 ẵ  U+01EB5 LATIN SMALL LETTER A WITH BREVE AND TILDE
 ặ  U+01EB7 LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
 ⱥ  U+02C65 LATIN SMALL LETTER A WITH STROKE
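
A rough Python counterpart to this name search, using only the standard unicodedata module, is sketched below. `find_by_name` and its `limit` parameter are hypothetical names, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

def find_by_name(fragment, limit=sys.maxunicode + 1):
    """Return (character, name) pairs whose Unicode name contains fragment.
    The match is case-insensitive, similar to grep -i; limit is an exclusive
    upper bound on the code points scanned."""
    fragment = fragment.upper()
    matches = []
    for code in range(limit):
        ch = chr(code)
        # name() returns the default '' for unnamed or unassigned code points.
        name = unicodedata.name(ch, '')
        if fragment in name:
            matches.append((ch, name))
    return matches

if __name__ == "__main__":
    for ch, name in find_by_name('Latin Small Letter A with'):
        print(ch, 'U+%05X' % ord(ch), name)
```

Unlike the decomposition-based answers, a name search also returns characters with no decomposition at all, such as LATIN SMALL LETTER A WITH STROKE.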