Question

I have been using NLTK in a research project to tokenize Arabic text and analyze it. The problem appears when I run this code:

import nltk

bsm = 'بسم الله الرحمن الريحم'
wordsBsm = nltk.tokenize.wordpunct_tokenize(bsm)
print " ".join(wordsBsm)

I get this as output:

� � س� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

I don't know how to fix this!

Answer 1

If you are using Python 2.x, then as bobince said, declaring the string as a Unicode literal should work:

bsm = u'بسم الله الرحمن الريحم'

If you are using Python 3.x, it should work without the 'u' prefix, because string literals are already Unicode there. For details, see Python 2's Unicode HOWTO.
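
To see why the Unicode string matters, here is a minimal sketch for Python 3 that needs only the standard library. It assumes that wordpunct_tokenize splits on the regular expression `\w+|[^\w\s]+`; the name `WORDPUNCT` is made up for this illustration and is not part of NLTK.

```python
import re

text = "بسم الله الرحمن الريحم"   # Unicode text: one element per character
raw = text.encode("utf-8")         # the same text as UTF-8 bytes: several bytes per letter

# Assumed stand-in for wordpunct_tokenize: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORDPUNCT.findall(text)

print(len(text), len(raw))   # the byte string is longer than the character string
print(" ".join(tokens))
```

Because the pattern runs on characters rather than bytes, every Arabic word comes back whole. A Python 2 byte string hands the tokenizer the individual UTF-8 bytes instead, which is most likely what produced the replacement characters in the question.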

Answer 2

Also, if you are reading Arabic text from a file, you can decode it like this:

unicode(open('arabic.txt', 'r').read(), 'utf-8')

Or, depending on your file's encoding:

unicode(open('arabic.txt', 'r').read(), 'Windows-1256')
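
The same idea can be sketched with `io.open`, which decodes while reading and is available on both Python 2.6+ and Python 3. The helper `read_text`, the sample string, and the temporary file names below are hypothetical and exist only to make the example self-contained; `cp1256` is Python's codec name for Windows-1256.

```python
import io
import os
import tempfile

sample = u"بسم الله الرحمن"   # illustrative text, written out and read back below

def read_text(path, encoding):
    """Read a whole file and decode it with the given encoding."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()

tmp_dir = tempfile.mkdtemp()
results = {}
for encoding in ("utf-8", "cp1256"):
    path = os.path.join(tmp_dir, "arabic_" + encoding + ".txt")
    # Write the sample in this encoding so there is something to read back.
    with io.open(path, "w", encoding=encoding) as f:
        f.write(sample)
    results[encoding] = read_text(path, encoding)
```

The encoding passed when reading has to match the one the file was actually saved in; a mismatch either raises a decoding error or silently yields the wrong characters.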