Most informative features code not working

Time: 2017-11-30 01:14:28

Tags: python-3.x machine-learning scikit-learn naivebayes

I want to implement a "most informative features" function for a binary Naive Bayes classifier in scikit-learn. I am using Python 3.

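First, I understand that the question of implementing some kind of "informative features" function for scikit-learn's multinomial NB has been asked before. However, I have tried those answers and had no luck, so I think either scikit-learn has been updated or I am doing something very wrong. I am using tobigue's answer here for the function.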

from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split



#Array contains a list of (headline, source) tuples where there are two sources.
#I want to classify each headline as belonging to a given source. 
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]


#I want to classify each headline as belonging to a given source. 
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1,headlines,sources)

    #Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)


def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print ("Accuracy: {}".format(classifier.score(X_test, y_test)))


#tobigue's code: 
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))


def main():
    scikit_naivebayes(array)


main()

#ERROR: 
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'

1 answer:

Answer 0 (score: 0)

You need to fit the CountVectorizer before calling vectorizer.get_feature_names(). In your code, you only pass the class CountVectorizer to another function, which does not accomplish anything.

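You should try creating a vectorizer with CountVectorizer independently of the pipeline, then call fit on the text, and finally use the function you were given, although you will need to adapt it further to your problem.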

It should be easy to see that the function you are using expects an instantiated object, not a class. Tell me if that is not clear.
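To make the class-versus-instance point concrete, here is a minimal sketch that reproduces the same TypeError without scikit-learn. ToyVectorizer and both helper functions are hypothetical names made up for illustration; only the error mechanism is the same as in the question.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a vectorizer; not part of scikit-learn."""

    def fit(self, texts):
        # Learn a sorted vocabulary from whitespace-separated tokens.
        self.vocabulary_ = sorted({tok for text in texts for tok in text.split()})
        return self

    def get_feature_names(self):
        return self.vocabulary_


def call_on_class():
    """Call the method on the class itself and return the error message."""
    try:
        ToyVectorizer.get_feature_names()  # no instance, so 'self' is missing
    except TypeError as exc:
        return str(exc)
    return None


def call_on_instance(texts):
    """Call the same method on a fitted instance."""
    return ToyVectorizer().fit(texts).get_feature_names()


if __name__ == "__main__":
    print(call_on_class())
    print(call_on_instance(["office sex summons", "downing street blasts"]))
```

Passing CountVectorizer instead of a fitted CountVectorizer() instance fails for the same reason: the method is looked up on the class, so nothing is bound to self.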

Edit

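coef_ is an attribute that is only available on estimators, i.e. classifiers (and not all of them). Pipeline is a sklearn object that chains different steps together to produce a classifier. Typically, a bag-of-words pipeline consists of a feature extractor and a classifier (here, logistic regression):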

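from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # pass any vectorizer arguments here
    ('classifier', LogisticRegression()),
])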

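So in your case, you should either avoid the pipeline (which is what I recommend to begin with), or use the pipeline's get_params() method to access the classifier.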
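If you do keep the pipeline, the following sketch shows one way to get at the fitted steps. It assumes the step names 'vect', 'tfidf' and 'clf' from the question and uses a small made-up subset of the headlines. named_steps is used here alongside get_params(), since both expose the same step objects.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data in the same (headline, source) shape as the question.
headlines = [
    "office sex summons",
    "downing street blasts",
    "federal judge blocks new texas anti-abortion law",
    "just look at these fancy port-a-potties",
]
sources = ["drudge", "drudge", "npr", "npr"]

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(headlines, sources)

# Fitted step instances, looked up by the names given in the pipeline.
fitted_vectorizer = text_clf.named_steps["vect"]
fitted_classifier = text_clf.named_steps["clf"]

# get_params() exposes the same objects under the step names.
same_object = text_clf.get_params()["clf"] is fitted_classifier

print(type(fitted_vectorizer).__name__, len(fitted_vectorizer.vocabulary_))
print(list(fitted_classifier.classes_), same_object)
```

These two fitted objects, rather than the CountVectorizer class and the whole pipeline, are what a function like show_most_informative_features needs to receive.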

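I suggest you fit_transform the text, feed the transformed result to a logistic regression or Naive Bayes classifier, and then call the function you already have: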

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines, sources)
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)

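Try this first; if it works, you will have a better understanding of how to work with pipelines. Note that a pipeline should not consist only of feature extractors: its last step should be an estimator. If you want to stack feature extractors, take a look at FeatureUnion.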
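For the FeatureUnion remark, here is a small sketch of stacking two feature extractors in front of a single estimator. The choice of a word-level and a character n-gram CountVectorizer, the step names, and the toy headlines are assumptions made for illustration and are not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy data (hypothetical, shortened from the question's headlines).
headlines = [
    "stocks soar records smashed",
    "downing street blasts",
    "federal judge blocks law",
    "uncertainty reigns virginia house",
]
sources = ["drudge", "drudge", "npr", "npr"]

# Two feature extractors whose outputs are concatenated column-wise.
features = FeatureUnion([
    ("words", CountVectorizer()),
    ("chars", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# The last step of the pipeline is the estimator.
model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(headlines, sources)

# Inspect the fitted union through the pipeline.
fitted_union = model.named_steps["features"]
n_words = len(fitted_union.transformer_list[0][1].vocabulary_)
n_chars = len(fitted_union.transformer_list[1][1].vocabulary_)
n_total = fitted_union.transform(headlines).shape[1]

print(n_words, n_chars, n_total)
print(list(model.predict(["virginia judge blocks vote"])))
```

The combined feature matrix has one column per word feature followed by one per character n-gram, so n_total equals n_words + n_chars.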