Python-based Naive Bayes classifier for a new language

Asked: 2014-07-28 05:00:33

Tags: python unicode nlp scikit-learn nltk

I'm not trying to build a whole new Naive Bayes classifier. There are already plenty of implementations; for example, scikit-learn has Naive Bayes implementations and NLTK has its own NaiveBayesClassifier.

I have 1000+ training sentences and 300+ sentences for my test set, in my language (an Indian language). All I need to do is pick a classifier (a Naive Bayes implementation), train it, and test its accuracy.

The problem is that the text is not English but in Devanagari unicode.

I'm looking for suggestions on which classifier would be suitable and would sidestep my main problem right now, which is working with unicode.

3 Answers:

Answer 0 (score: 4):

Naive Bayes in scikit-learn operates on numeric vectors, which we can obtain (for example) from some vectorizer. For text classification I often use TfidfVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Among its constructor parameters, TfidfVectorizer has the following: encoding : string, 'utf-8' by default. If bytes or files are given to analyze, this encoding is used to decode.

You can use this parameter with your own encoding, and you can also supply your own preprocessor and analyzer functions (these are useful too).
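As a rough illustration (not from the original answer), here is a minimal sketch of that approach, assuming the training and test sentences are available as lists of unicode strings with one label per sentence; the data below is only a placeholder:

# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder data: replace with the 1000+ training and 300+ test sentences.
train_sents = [u'पूर्ण प्रत', u'মহিষের সন্']
train_labels = ['class_a', 'class_b']
test_sents = [u'पूर्ण']
test_labels = ['class_a']

# Character n-grams avoid the need for a language-specific tokenizer.
clf = Pipeline([
    ('tfidf', TfidfVectorizer(encoding='utf-8', analyzer='char_wb',
                              ngram_range=(1, 3))),
    ('nb', MultinomialNB()),
])

clf.fit(train_sents, train_labels)
print clf.score(test_sents, test_labels)  # accuracy on the held-out sentences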

Answer 1 (score: 1):

Ideally, I would use scikit-learn to build a language ID system, as I have done before; see https://github.com/alvations/bayesline

That said, it is entirely possible to build a language ID system with unicode data using nothing but the simple classification modules in NLTK.

There is no need to do anything special to the NLTK code; it can be used as is. (This may be useful for how to build a classifier in NLTK: {{3}})

Now, to show that it really is possible to do language ID out of the box with NLTK on unicode data, see below.

First, for language ID there is a subtle difference between using unicode character features and byte strings in feature extraction:

from nltk.corpus import indian

# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Decodes the UTF-8 byte strings into unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out first 10 unicode char (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

[OUT]:

hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ

hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్ 

Now that you can see the difference between using byte strings and unicode, let's train a classifier.

from itertools import chain
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc

# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract ngrams
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

vocabulary = set(chain(*[ngrams(txt, 2) for txt,tag in training]))

feature_set = [({i:(i in ngrams(sentence, 2)) for i in vocabulary},tag) for sentence, tag in training]

classifer = nbc.train(feature_set)

test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent =  {i:(i in ngrams(testdoc,2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifer.classify(featurized_test_sent)
    print

[OUT]:

[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]

[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]

test sent: पूर्ण प्रत
tag: hi

test sent: মহিষের সন্
tag: ba

test sent: सनातनवा
tag: ma

test sent: ఆడిట్ 
tag: te

Here is the full code:

# -*- coding: utf-8 -*-

from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc


# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Decodes the UTF-8 byte strings into unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out first 10 unicode char (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract ngrams
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

vocabulary = set(chain(*[ngrams(txt, 2) for txt,tag in training]))

feature_set = [({i:(i in ngrams(sentence, 2)) for i in vocabulary},tag) for sentence, tag in training]

classifer = nbc.train(feature_set)

test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent =  {i:(i in ngrams(testdoc,2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifer.classify(featurized_test_sent)
    print
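Since the original question is about measuring accuracy, a held-out set of labelled sentences can be featurized in the same way and scored with NLTK's accuracy helper. This is a minimal sketch (not part of the original answer); the gold pairs below just reuse the four demo strings as placeholders:

from nltk.classify import accuracy

# Placeholder gold data: (sentence, label) pairs; use the 300+ test sentences here.
gold = [(test1, 'hi'), (test2, 'ba'), (test3, 'ma'), (test4, 'te')]

# Featurize exactly as the training data was featurized above.
test_set = [({i: (i in ngrams(sent, 2)) for i in vocabulary}, tag)
            for sent, tag in gold]

print accuracy(classifer, test_set)  # fraction of correctly classified sentences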

Answer 2 (score: 0):

The question is poorly formulated, but there is a chance that it is really about language identification rather than sentence classification.

If that is the case, there is still a long way to go before applying something like Naive Bayes or any other classifier. Take a look at the character n-gram approach used in Damir Cavar's LID, implemented in Python.
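LID itself is a separate tool, but the core idea is easy to sketch: build a character n-gram frequency profile per language and score new text against each profile. The following is only a rough illustration of that idea (not Cavar's actual implementation), with made-up data:

# -*- coding: utf-8 -*-
from collections import Counter

def char_ngrams(text, n=3):
    # All overlapping character n-grams of a unicode string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_profiles(labelled_texts, n=3):
    # One n-gram frequency profile (Counter) per language label.
    profiles = {}
    for text, lang in labelled_texts:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text, n))
    return profiles

def identify(text, profiles, n=3):
    # Pick the language whose profile shares the most n-gram mass with the text.
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda lang: sum(profiles[lang][g] for g in grams))

# Hypothetical usage with two of the strings from the other answer:
profiles = train_profiles([(u'पूर्ण प्रत', 'hi'), (u'মহিষের সন্', 'ba')])
print identify(u'पूर्ण', profiles)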