Question

我正在执行一个NLP任务，该任务需要使用称为Yoruba的语言的语料库。约鲁巴语是一种在字母表中带有变音符号的语言。如果我在python环境中读取任何文本/语料库，则某些上级变音符号会移位/移位，尤其是对于字母ẹ和ọ：

an image description

对于characters顶部带有音素符号的字符displaced，它们会移位。拥有：ẹ́ ẹ̀也会发生ọ同样的事情。（（́ ọ̀）

def readCorpus(directory="news_sites.txt"):
with open(directory, 'r',encoding="utf8", errors='replace') as doc:
data = doc.readlines()
return data

预期结果是将变音符号正确放置在顶部（我很惊讶stackoverflow能够解决变音符号）。

后来变音符号被视为标点符号，因此被删除（通过我的NLP处理功能），从而影响了整个任务。

Answer 1

像noted in the comments一样，您的文本是正确的，只是显示错误（您的控制台，无论它是什么，都不支持Unicode，并且不能正确处理Unicode组合字符）。

在您的情况下，您涉及多个组合变音符号，并且没有一个Unicode序数表示它们全部组合在一起，因此使用unicodedata.normalize('NFC', originalstring)将无济于事（它将组合其中一个组合ọ́的字符，但不能同时使用两个字符。

因此，唯一真正的解决方案是修复标点符号过滤器，以免组合字符掉线。这应该不太困难，只需从以下位置更改过滤代码即可：

if is_punctuation(letter):  # is_punctuation is whatever test you're using to find punctuation
    # drop letter
else:
    # keep letter

to（在文件顶部添加import unicodedata之后）：

if is_punctuation(letter) and not unicodedata.combining(letter):
    # drop letter
else:
    # keep letter

unicodedata.combining returns 0 for combining characters, non-0 otherwise，因此即使使用其他标准将其算作标点符号，您也可以使用它来确保不会删除组合字符。

Answer 2

我使用reg ex'cos删除了数字，有些字符串包含数字，但是使用了字符串库中的maketrans方法来删除标点符号。

import string
out = re.sub(r'[0-9\.]+', '', ins)
punct = str.maketrans({k: None for k in string.punctuation})
new_s = out.translate(punct)