I am new to Python, and I need to classify unknown tweets by their sentiment. I have a huge dataset (about ten million tweets). When I run Naive Bayes, about one million of them cannot be classified. The sentiments are Bearish and Bullish, and their ratio in the training set is about 1:5.
I tried splitting the whole dataset into smaller chunks. With a 200k test set, 39,700 TweetMessages still could not be classified. I also tried SMOTE, because the training set is imbalanced, but it turned out to only affect accuracy. I also tried reclassifying the 39,700 TweetMessages by reading in a CSV file containing only those unclassified messages, but all 39,700 of them still came back unclassified.
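For context, the SMOTE step on its own looks roughly like this (a minimal sketch on synthetic data; the random feature matrix and the 1:5 class ratio are stand-ins for my real TF-IDF features):

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.RandomState(0)
X = rng.rand(600, 10)                                 # stand-in feature vectors
y = np.array(['Bearish'] * 100 + ['Bullish'] * 500)   # roughly my 1:5 ratio

smt = SMOTE(random_state=0)
X_res, y_res = smt.fit_resample(X, y)                 # fit_sample in older imblearn
print(np.unique(y_res, return_counts=True))           # both classes now have 500 rows

On toy data like this, SMOTE simply oversamples the minority class up to the majority count; on my real data it changed accuracy but not the number of unclassified tweets.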
data = TwitterData_Initialize()
data.initialize("Subset_small.csv")
cleaned_data = TwitterData_Cleansing(data)
cleaned_data.cleanup(TwitterCleanuper())
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words must be the string 'english' to select the built-in list;
# a set like {'english'} would be treated as a one-word custom stop list.
vec = TfidfVectorizer(min_df=5, max_df=500,
                      tokenizer=nltk.word_tokenize, stop_words='english')
result = vec.fit_transform(cleaned_data.processed_data['TweetMessage'])
sentiment = cleaned_data.processed_data['Sentiment']
# This is the training data set.
from imblearn.over_sampling import SMOTE

smt = SMOTE()
result, sentiment = smt.fit_sample(result, sentiment)  # fit_resample in newer imblearn versions
# This is a 200K test data set.
test_data = TwitterData_Initialize()
test_data.initialize("Subset_small.csv", is_testing_set=True)
cleaned_test_data = TwitterData_Cleansing(test_data)
cleaned_test_data.cleanup(TwitterCleanuper())
test = vec.transform(cleaned_test_data.processed_data['TweetMessage'])
test
With the following lines of code, the test data set should be classified successfully, but 39,700 Sentiment values remain NaN:
from sklearn.naive_bayes import BernoulliNB

Classifier = BernoulliNB()
Classifier.fit(result, sentiment)
Prediction_Sentiment = Classifier.predict(test)
cleaned_test_data.processed_data['Sentiment'] = pd.DataFrame(Prediction_Sentiment)
cleaned_test_data.processed_data['Sentiment'].value_counts(normalize=True).round(3)
Bullish 0.702
Bearish 0.298
Name: Sentiment, dtype: float64
cleaned_test_data.processed_data['Sentiment'].isnull().sum()
39704
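Since value_counts(normalize=True) silently drops the NaN rows, I also wonder whether the column assignment via pd.DataFrame is index-sensitive. A minimal diagnostic along these lines (a sketch, assuming processed_data is a plain pandas DataFrame) is what I would run next:

# Sketch of a diagnostic, assuming processed_data is a pandas DataFrame.
import pandas as pd

df = cleaned_test_data.processed_data
print(len(df), len(Prediction_Sentiment))        # do the row counts match?
print(df.index.equals(pd.RangeIndex(len(df))))   # is the index still 0..n-1?

# pd.DataFrame(Prediction_Sentiment) gets a fresh 0..n-1 index, and column
# assignment aligns on index labels, so any row of df whose label is missing
# from that range would end up NaN.
print(df[df['Sentiment'].isnull()].index[:10])   # which labels got no match?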