Question

I have to read a text document in Python that contains both English and non-English (specifically Malayalam) text. This is what I see:

>>>text_english = 'Today is a good day'
>>>text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'

Now, if I write code to extract the first letter:

>>>print(text_english[0])
T

But when I run the same thing on the Malayalam text, I get:

>>>print(text_non_english[0])
�

To get the first letter, I have to write the following:

>>>print(text_non_english[0:3])
ആ

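Why does this happen? My goal is to extract the words in the text so that I can feed them to a TF-IDF transformer. When I build a TF-IDF vocabulary from the Malayalam text, it contains two-letter entries that are not correct words; they are actually fragments of complete words. What should I do so that the TF-IDF transformer uses complete Malayalam words instead of two-letter fragments?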

I used the following code:

>>>useful_text_1[1:3] # contains both English and Malayalam text

>>>vectorizer = TfidfVectorizer(sublinear_tf=True,max_df=0.5,stop_words='english')

# Learn vocabulary and idf, return term-document matrix
>>>vect_2 = vectorizer.fit_transform(useful_text_1[1:3])
>>>vectorizer.vocabulary_

Some of the entries in the vocabulary are:

ഷമ
സന
സഹ
ർക
ർത

The vocabulary is incorrect: it does not contain whole words. How can I fix this?

Answer 1

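The text has to be decoded from UTF-8. A Python 2 str holds raw bytes, and each Malayalam letter takes 3 bytes in UTF-8, so you need to convert the string with the unicode function: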

In[36]: tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
In[37]: tne=unicode(tn, encoding='utf-8')
In[38]: print(tne[0])
ആ
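
For readers on Python 3, where unicode() no longer exists, here is a minimal sketch of the same idea. It assumes the input is UTF-8; the file name sample.txt and the variable names are made up for illustration. It shows that a byte string is indexed byte by byte, that a decoded string is indexed character by character, and that opening a file with an explicit encoding gives decoded text directly.

```python
import io
import os
import tempfile

text = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'

# Encoded form: roughly what a Python 2 str literal holds.
raw = text.encode('utf-8')

# Indexing bytes works per byte, so the first letter needs a 3-byte slice.
first_bytes = raw[0:3]

# Indexing a decoded string works per code point.
first_char = raw.decode('utf-8')[0]
print(first_bytes.decode('utf-8'), first_char)

# Reading from a file: pass the encoding explicitly so read() returns
# decoded text. 'sample.txt' is a hypothetical file created for this demo.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, encoding='utf-8') as f:
    from_file = f.read()
print(from_file[0])
```

io.open with an encoding argument also works on Python 2, where it returns unicode objects instead of byte strings.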

Answer 2

Using a dummy tokenizer that simply splits on whitespace actually worked for me:

vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)

>>> tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
>>> vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(),min_df=1)
>>> vect_2 = vectorizer.fit_transform(tn.split())
>>> for x in vectorizer.vocabulary_:
...     print(x)
... 
സന്തോഷമാഗ്രഹിക്കാത്തത
ആരാണു
>>>
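
A likely reason the whitespace tokenizer helps: the vectorizer's default token_pattern is (?u)\b\w\w+\b, and in Python's re module \w does not match combining marks such as Malayalam vowel signs or the virama, so words get cut at those characters. The sketch below reproduces the effect with the standard library only (Python 3 assumed); the variable names are illustrative.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

text = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
word = text.split()[1]  # the long second word

# \w skips vowel signs and the virama, so only short runs of
# base consonants survive as tokens.
default_tokens = re.findall(DEFAULT_PATTERN, text)

# Treating any run of non-whitespace as a token keeps words whole.
whitespace_tokens = re.findall(r"(?u)\S+", text)

print(default_tokens)
print(whitespace_tokens)
```

If this is indeed the cause, passing token_pattern=r"(?u)\S+" to TfidfVectorizer should have the same effect as the tokenizer=lambda x: x.split() approach. Note that splitting on whitespace also keeps punctuation attached to words.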

Can I use TfidfVectorizer in scikit-learn for non-English languages? Also, how do I read non-English text in Python?
