Question

How can I split Tamil characters in a string?

When I use preg_match_all('/./u', $str, $results),
I get the characters "த", "ம", "ி", "ழ" and "்".

How can I get the combined characters "த", "மி" and "ழ்" instead?

Answer 1

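I think you should be able to use the grapheme_extract function to iterate over the combined characters (technically called "grapheme clusters").

Alternatively, if you prefer a regex approach, I think you can use this: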

preg_match_all('/\pL\pM*|./u', $str, $results)

where \pL means a Unicode "letter" and \pM means a Unicode "mark".

(Disclaimer: I have not tested either of these approaches.)
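
For readers following along in Python rather than PHP, the same "letter followed by any number of marks" idea can be sketched with the standard unicodedata module. This is only an illustration of the pattern above, not a full grapheme-cluster implementation; the function name split_clusters is made up for this example.

```python
import unicodedata


def split_clusters(text):
    """Group each letter with the combining marks that directly follow it.

    Mirrors the pattern \\pL\\pM*|. : a mark is attached only when the
    current cluster starts with a letter; anything else stands alone.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        starts_with_letter = bool(clusters) and unicodedata.category(
            clusters[-1][0]
        ).startswith("L")
        if is_mark and starts_with_letter:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(split_clusters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Like the regex, this attaches marks only to a preceding letter, so a stray mark at the start of the string ends up as its own element.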

Answer 2

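If I understand your question correctly, you have a Unicode string containing code points and you want to convert it into an array of graphemes?

I'm working on an open-source Python library that performs tasks like this for a Tamil language website.

I haven't used PHP in a while, so I'll post the logic. You can look at the code in the split_letters() function of the amuthaa/TamilWord.py file.

As ruakh mentioned, Tamil letters are built from code points.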

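Vowels (உயிர்எழுத்து), the aytham (ஆய்தத்து - ஃ) and all combinations (உயிர்-மெய்எழுத்து) in the 'a' column (அவரி, i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use a single code point.
Every consonant is made up of two code points: the a-combination letter + the pulli, e.g. ப் = ப + ்
Every combination other than the a-combinations is also made up of two code points: the a-combination letter + a marking, e.g. பி = ப + ி, தை = த + ை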
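
The composition described above can be checked by listing the code points behind a few letters. The snippet below is a small sketch using Python's standard unicodedata module; the helper name codepoint_names is made up for this example.

```python
import unicodedata


def codepoint_names(letter):
    """Return the Unicode name of every code point in a Tamil letter."""
    return [unicodedata.name(cp) for cp in letter]


# An a-combination, a pure consonant, and two other combinations.
for letter in ["ப", "ப்", "பி", "தை"]:
    print(letter, len(letter), codepoint_names(letter))
```

The a-combination ப reports one code point, while ப், பி and தை each report two: the base letter followed by the pulli (named TAMIL SIGN VIRAMA in Unicode) or a vowel sign.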

So your logic could be something like this:

initialize an empty array

for each codepoint in word:

    if the codepoint is a vowel, a-combination or aytham, it is also its grapheme, so add it to the array

    otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g.  ி or  ை), so append it to the end of the last element of the array

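This of course assumes that your string is well-formed and that you don't have things like two markings in a row.

Here's the Python code, in case you find it useful. If you'd like to help us port it to PHP, please let me know: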

@staticmethod
def split_letters(word=u''):
    """ Returns the graphemes (i.e. the Tamil characters) in a given word as a list """

    # ensure that the word is a valid word
    TamilWord.validate(word)

    # list (which will be returned to user)
    letters = []

    # a tuple of all combination endings and of all அ combinations
    combination_endings = TamilLetter.get_combination_endings()
    a_combinations = TamilLetter.get_combination_column(u'அ').values()

    # loop through each codepoint in the input string
    for codepoint in word:

        # if codepoint is an அ combination, a vowel, aytham or a space,
        # add it to the list
        if codepoint in a_combinations or \
            TamilLetter.is_whitespace(codepoint) or \
            TamilLetter.is_vowel(codepoint) or \
            TamilLetter.is_aytham(codepoint):

            letters.append(codepoint)

        # if codepoint is a combination ending or a pulli ('்'), add it
        # to the end of the previously-added codepoint
        elif codepoint in combination_endings or \
            codepoint == TamilLetter.get_pulli():

            # ensure that at least one character already exists
            if len(letters) > 0:
                letters[-1] = letters[-1] + codepoint

            # otherwise raise an Error. However, validate_word()
            # should catch this
            else:
                raise ValueError("""%s cannot be first character of a word""" % (codepoint))

    return letters
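
The method above relies on the library's TamilWord and TamilLetter helpers, which are not shown here. For experimenting without the library, the following is a rough standalone sketch of the same logic. The helper names is_base and is_ending and the code-point ranges are assumptions taken from the Tamil Unicode block, not the library's actual implementation, and unlike the original it raises an error on any code point it does not recognise (for example the AU length mark U+0BD7).

```python
# Hypothetical stand-ins for the TamilLetter helpers, based on the
# Tamil Unicode block; NOT the library's actual code.
AYTHAM = "\u0b83"


def is_base(ch):
    """Aytham, vowel (U+0B85..U+0B94), a-combination (U+0B95..U+0BB9) or whitespace."""
    return ch == AYTHAM or "\u0b85" <= ch <= "\u0bb9" or ch.isspace()


def is_ending(ch):
    """Vowel sign (U+0BBE..U+0BCC) or pulli (U+0BCD)."""
    return "\u0bbe" <= ch <= "\u0bcd"


def split_letters(word):
    """Return the Tamil letters in word as a list of strings."""
    letters = []
    for ch in word:
        if is_base(ch):
            # a base code point starts a new letter
            letters.append(ch)
        elif is_ending(ch):
            # an ending attaches to the previously added letter
            if not letters:
                raise ValueError("%r cannot be the first character of a word" % ch)
            letters[-1] += ch
        else:
            raise ValueError("unexpected code point %r" % ch)
    return letters


print(split_letters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```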
