Separating vowels and consonants in Indic / abugida scripts using Python

Asked: 2017-05-17 12:57:20

Tags: python unicode python-unicode indic

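I am trying to build a program that helps me convert text in a Unicode abugida script into a list of vowels and consonants. Using the following script, taken from "Playing around with Devanagari characters", I have managed to split the text into syllable clusters: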
#!/usr/bin/python
# -*- coding: utf-8 -*-

import unicodedata, sys

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)

    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        cat = unicodedata.category(c)[0]
        # A mark always joins the cluster; a letter joins only after a virama.
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

name_in_indic = raw_input('Enter your name in devanagari: ').decode('utf8')

print (','.join(list(splitclusters(name_in_indic))))

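However, my goal is to go one step further and separate out all the vowels and consonants.

E.g. हिंदी = ह + इ + न + द + ई

This is the same as "hindi" becoming h + i + n + d + i, only in an Indic script, where every phoneme is treated as its own character.

How can I do this?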
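For reference, here is a minimal sketch of the output I am after, written for Python 3 and not a complete solution. The names `independent_vowel` and `split_phonemes` are made up for illustration. It assumes that each dependent vowel sign can be mapped to its independent vowel by swapping "VOWEL SIGN" for "LETTER" in the Unicode character name, and that the anusvara can be rendered as न, as in the example above (in real text its value depends on the following consonant). It does not emit the inherent vowel अ, and other combining marks such as the nukta are simply kept attached to the preceding element.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"
ANUSVARA = "\N{DEVANAGARI SIGN ANUSVARA}"


def independent_vowel(ch):
    """Return the independent vowel for a dependent vowel sign, else None."""
    name = unicodedata.name(ch, "")
    if not name.startswith("DEVANAGARI VOWEL SIGN "):
        return None
    try:
        # e.g. "DEVANAGARI VOWEL SIGN I" -> "DEVANAGARI LETTER I"
        return unicodedata.lookup(name.replace("VOWEL SIGN", "LETTER", 1))
    except KeyError:
        return None  # this sign has no independent counterpart


def split_phonemes(text, anusvara_as="\N{DEVANAGARI LETTER NA}"):
    """Split Devanagari text into a flat list of consonants and vowels.

    Assumption: the anusvara is always rendered as `anusvara_as`.
    The inherent vowel is not emitted.
    """
    result = []
    for ch in text:
        vowel = independent_vowel(ch)
        if vowel is not None:
            result.append(vowel)          # matra -> independent vowel
        elif ch == VIRAMA:
            continue                      # virama only suppresses the inherent vowel
        elif ch == ANUSVARA:
            result.append(anusvara_as)    # simplification, see above
        elif unicodedata.category(ch).startswith("L"):
            result.append(ch)             # consonant or independent vowel
        elif unicodedata.category(ch).startswith("M") and result:
            result[-1] += ch              # keep nukta etc. with the previous element
        # anything else (spaces, punctuation, digits) is dropped
    return result


print(" + ".join(split_phonemes("हिंदी")))  # ह + इ + न + द + ई
```

Looking characters up by name avoids a hand-written table of matras, but it only covers Devanagari as written; other Indic scripts would need their own name prefix and their own anusvara handling.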

0 answers:

No answers yet