I am new to Python, and I need to classify unknown tweets by their sentiment. I have a huge dataset (about ten million tweets). When I run Naive Bayes, about one million of them cannot be classified. The sentiments are Bearish and Bullish, and their ratio in the training set is about 1:5.
I tried splitting the whole dataset into smaller chunks. With a 200k test set, 39,700 TweetMessages still could not be classified. I also tried SMOTE, because the training set is imbalanced, but it turned out to only affect accuracy. I also tried reclassifying the 39,700 TweetMessages by reading in a CSV file containing only those unclassified messages, but all 39,700 of them still came back unclassified.
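For context, the SMOTE step on its own looks roughly like this (a minimal sketch on synthetic data; the random feature matrix and the 1:5 class ratio are stand-ins for my real TF-IDF features):

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.RandomState(0)
X = rng.rand(600, 10)                                 # stand-in feature vectors
y = np.array(['Bearish'] * 100 + ['Bullish'] * 500)   # roughly my 1:5 ratio

smt = SMOTE(random_state=0)
X_res, y_res = smt.fit_resample(X, y)                 # fit_sample in older imblearn
print(np.unique(y_res, return_counts=True))           # both classes now have 500 rows

On toy data like this, SMOTE simply oversamples the minority class up to the majority count; on my real data it changed accuracy but not the number of unclassified tweets.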
data = TwitterData_Initialize()
data.initialize("Subset_small.csv")
cleaned_data = TwitterData_Cleansing(data)
cleaned_data.cleanup(TwitterCleanuper())
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words must be the string 'english' to select the built-in list;
# a set like {'english'} would be treated as a one-word custom stop list.
vec = TfidfVectorizer(min_df=5, max_df=500,
                      tokenizer=nltk.word_tokenize, stop_words='english')
result = vec.fit_transform(cleaned_data.processed_data['TweetMessage'])
sentiment = cleaned_data.processed_data['Sentiment']
# This is the training data set.
from imblearn.over_sampling import SMOTE

smt = SMOTE()
result, sentiment = smt.fit_sample(result, sentiment)  # fit_resample in newer imblearn versions
# This is a 200K test data set.
test_data = TwitterData_Initialize()
test_data.initialize("Subset_small.csv", is_testing_set=True)
cleaned_test_data = TwitterData_Cleansing(test_data)
cleaned_test_data.cleanup(TwitterCleanuper())
test = vec.transform(cleaned_test_data.processed_data['TweetMessage'])
test
With the following lines of code, the test data set should be classified successfully, but 39,700 Sentiment values remain NaN:
from sklearn.naive_bayes import BernoulliNB

Classifier = BernoulliNB()
Classifier.fit(result, sentiment)
Prediction_Sentiment = Classifier.predict(test)
cleaned_test_data.processed_data['Sentiment'] = pd.DataFrame(Prediction_Sentiment)
cleaned_test_data.processed_data['Sentiment'].value_counts(normalize=True).round(3)
Bullish 0.702
Bearish 0.298
Name: Sentiment, dtype: float64
cleaned_test_data.processed_data['Sentiment'].isnull().sum()
39704
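Since value_counts(normalize=True) silently drops the NaN rows, I also wonder whether the column assignment via pd.DataFrame is index-sensitive. A minimal diagnostic along these lines (a sketch, assuming processed_data is a plain pandas DataFrame) is what I would run next:

# Sketch of a diagnostic, assuming processed_data is a pandas DataFrame.
import pandas as pd

df = cleaned_test_data.processed_data
print(len(df), len(Prediction_Sentiment))        # do the row counts match?
print(df.index.equals(pd.RangeIndex(len(df))))   # is the index still 0..n-1?

# pd.DataFrame(Prediction_Sentiment) gets a fresh 0..n-1 index, and column
# assignment aligns on index labels, so any row of df whose label is missing
# from that range would end up NaN.
print(df[df['Sentiment'].isnull()].index[:10])   # which labels got no match?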