Unicode table information about a character in Python

Date: 2018-01-02 09:25:02

Tags: python unicode glyph

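Is there a way in Python to get the technical information for a given character, as it is displayed in the Unicode table? (cf. https://unicode-table.com/en/)

Example: for the letter "Ȅ"

  • Name > LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
  • Unicode number > U+0204
  • HTML-code > &#516;
  • Bloc > Latin Extended-B
  • Lowercase > ȅ

What I actually need is to get, for any Unicode number (like U+0204 here), the corresponding name (LATIN CAPITAL LETTER E WITH DOUBLE GRAVE) and the lowercase version (here "ȅ").

Roughly:
input = a Unicode number
output = the corresponding information

The closest thing I have found is the fontTools library, but I can't seem to find any tutorial/documentation on how to use it.

Thank you.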

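3 Answers:

Answer 0: (score: 4)

The standard module unicodedata defines a lot of properties, but not all of them. A quick look at its source confirms this.

Fortunately UnicodeData.txt, the data file this comes from, is not hard to parse. Each line consists of exactly 15 elements, separated by ;, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each element is explained on the same page.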
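To make the 15-field layout concrete, here is a minimal sketch that splits a single line into named fields. The sample line is written out by hand in the file's format and should be treated as illustrative, and the field labels are shorthand chosen for this sketch rather than official identifiers.

```python
# Field labels are shorthand chosen for this sketch (an assumption),
# following the order described for UnicodeData.txt.
FIELDS = [
    'code', 'name', 'category', 'combining', 'bidirectional',
    'decomposition', 'decimal', 'digit', 'numeric', 'mirrored',
    'unicode1_name', 'comment', 'uppercase', 'lowercase', 'titlecase',
]

# Hand-written sample in the file's format; check it against the real file.
SAMPLE = '0204;LATIN CAPITAL LETTER E WITH DOUBLE GRAVE;Lu;0;L;0045 030F;;;;N;;;;0205;'


def parse_line(line):
    """Split one semicolon-separated line into a dict of 15 named fields."""
    values = line.strip().split(';')
    if len(values) != len(FIELDS):
        raise ValueError('expected %d fields, got %d' % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


record = parse_line(SAMPLE)
print(record['name'])                 # the character name
print(int(record['lowercase'], 16))   # lowercase mapping as an integer
```

Empty fields come back as empty strings, which is why the full program below converts them to None before use.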

Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them in the same folder as this program.

Code (tested with Python 2.7 and 3.6):

# -*- coding: utf-8 -*-

class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None

    def __getitem__(self, item):
        return getattr(self, item)

    def __repr__(self):
        return '{'+self.name+'}'

class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'

    def __repr__(self):
        return '{'+self.name+'}'

class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
            # make 100% sure it's sorted, for quicker look-up later
            # (it is usually sorted in the file, but better make sure)
            self.blocklist.sort(key=lambda x: x.first)

    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None

class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org

    These files must appear in the same directory as this program.

    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.

    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).

    Initialize with

        uclist = UnicodeList()
    """
    def __init__(self):

        # we need this first
        blocklist = BlockList()
        bpos = 0

        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    #  ONE QUARTER ... 1/4
                    # let's make it Python 2.7 compatible :)
                    if '/' in rawdata[8]:
                        # fractions such as 1/4: divide as floats instead of using eval
                        numerator, denominator = rawdata[8].split('/')
                        parsed.asNumeric = float(numerator) / float(denominator)
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)

    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).

        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None

    def find_char(self,str):
        """Find the Unicode information for a codepoint (as character).

        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(str) > 1:
            result = [self.find_code(ord(x)) for x in str]
            return result
        else:
            return self.find_code(ord(str))

With it loaded, you can now look up a character code with

>>> ul = UnicodeList()     # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}

By default this displays the name of the character (Unicode calls this a 'code point'), but you can retrieve other properties as well:

>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B

and (as long as you don't get a None) even chain them:

>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}

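It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be sure to get the most recent information:

>>> import unicodedata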
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}

(Tested with Python 3.5.3.)
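To check which Unicode version a given Python build was frozen at, the standard unicodedata module exposes unidata_version, and unicodedata.name accepts a default value so that unknown code points do not raise. A small sketch, assuming Python 3 (safe_name is a hypothetical helper name):

```python
import unicodedata


def safe_name(char, fallback='<no name in this build>'):
    """Return the character's name, or a fallback instead of raising ValueError."""
    return unicodedata.name(char, fallback)


# Version of the Unicode database compiled into this interpreter.
print(unicodedata.unidata_version)
print(safe_name(u'\u0204'))
# Whether this one is known depends on the Unicode version of the build.
print(safe_name(u'\U0001F903'))
```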

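Two lookup functions are currently defined:

  • find_code(int) looks up character information by code point, given as an integer.
  • find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.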
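Both functions scan the whole list on every call. If that turns out to be slow, one option is to index the records by code point in a dictionary. The sketch below is not part of the module above; CodeIndex and the stand-in Record type are hypothetical names used only to keep the example self-contained.

```python
from collections import namedtuple

# Stand-in for UnicodeCharacter, reduced to two attributes for this sketch.
Record = namedtuple('Record', ['code', 'name'])


class CodeIndex:
    """Dictionary-backed lookup from code point (int) to record."""

    def __init__(self, records):
        self._by_code = {}
        for record in records:
            # Keep the first record for a code point, like val[0] in find_code.
            self._by_code.setdefault(record.code, record)

    def find_code(self, codepoint):
        """Return the record for a code point, or None if it is unknown."""
        return self._by_code.get(codepoint)


index = CodeIndex([
    Record(0x204, 'LATIN CAPITAL LETTER E WITH DOUBLE GRAVE'),
    Record(0x205, 'LATIN SMALL LETTER E WITH DOUBLE GRAVE'),
])
print(index.find_code(0x204).name)
print(index.find_code(0x41))  # None: not in this tiny sample
```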

After from unicodelist import UnicodeList (assuming you saved this as unicodelist.py), you can use

>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'

to look up the hex code for any character, and a list comprehension such as

>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']

for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:

 l = [hex(ord(x)) for x in 'Hello']

The purpose of this module is to give you easy access to other Unicode properties. A longer example:

str = 'Héllo...'
dest = ''
for i in str:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)

HÉLLO...

and to show a list of properties for a character, per your example:

letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04X' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))

(I left out HTML; these names are not defined in the Unicode standard.)

Answer 1: (score: 3)

The unicodedata documentation shows how to do most of this.

The Unicode block name is apparently not available, but another Stack Overflow question has a solution of sorts, and another has some additional approaches using regex.
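If a copy of Blocks.txt is at hand, the block can also be looked up directly. A minimal sketch, assuming the "start..end; name" line format of that file; the sample lines are embedded by hand so the example runs standalone, and load_blocks and block_of are hypothetical helper names.

```python
import bisect

# A few lines in the Blocks.txt format, embedded so the sketch runs standalone.
SAMPLE_BLOCKS = """\
# comment lines start with '#'
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
"""


def load_blocks(text):
    """Return a list of (first, last, name) tuples sorted by first code point."""
    blocks = []
    for line in text.splitlines():
        line = line.split('#')[0].strip()
        if not line:
            continue
        span, name = line.split(';')
        first, last = span.split('..')
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    blocks.sort()
    return blocks


def block_of(blocks, codepoint):
    """Binary-search the sorted blocks; return the block name or None."""
    starts = [first for first, _, _ in blocks]
    pos = bisect.bisect_right(starts, codepoint) - 1
    if pos >= 0 and blocks[pos][0] <= codepoint <= blocks[pos][1]:
        return blocks[pos][2]
    return None


blocks = load_blocks(SAMPLE_BLOCKS)
print(block_of(blocks, ord(u'\u0204')))  # Latin Extended-B
```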

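The uppercase/lowercase mapping and the character number information aren't particularly Unicode-specific; just use the regular Python string functions.

In summary: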

>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'&#203;'
>>> 'Ë'.lower()
'ë'

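The U+%04X format is correct in that it pads to at least four digits and simply prints the whole hex number, without extra padding, for code points with a value higher than 65,535. Note that some other formats require %08X padding in this scenario (notably the \U00010000 format in Python).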
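Putting these pieces together for the original question (a Unicode number in, the information out), here is a minimal sketch using only the standard library and assuming Python 3. The function name describe and the returned keys are illustrative, and the lowercase value comes from str.lower(), which can return more than one character for a few code points.

```python
import unicodedata


def describe(codepoint):
    """Collect name, U+ notation, HTML code and lowercase for a code point (int)."""
    char = chr(codepoint)
    return {
        'name': unicodedata.name(char, None),   # None if this build has no name
        'number': 'U+%04X' % codepoint,         # at least 4 hex digits
        'html': '&#%i;' % codepoint,            # decimal character reference
        'lowercase': char.lower(),
        'escape': '\\U%08X' % codepoint,        # Python's 8-digit escape form
    }


info = describe(0x204)
print(info['name'])     # LATIN CAPITAL LETTER E WITH DOUBLE GRAVE
print(info['number'])   # U+0204
print(info['html'])     # &#516;
print(describe(0x10000)['number'])  # U+10000
```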

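Answer 2: (score: -1)

You can do this in a few ways:

1 - Create an API yourself (I couldn't find anything that does this)
2 - Create a table in a database or an Excel file
3 - Load and parse a website

I think the third way is very easy. Take a look at This Page. You can find some information at Unicodes.

Get your Unicode number, then find it in the web page using parsing tools such as LXML, Scrapy, Selenium, etc.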