Question

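This is my first attempt at document classification with ML and Python.

I first query my database to extract 5000 articles related to money laundering and convert them to a pandas df
Then I extract 500 articles not related to money laundering and also convert them to a pandas df
I concatenate both dfs and label them either 'money laundering' or 'other'
I do preprocessing (removing punctuation and stopwords, lowercasing, etc.)

and then feed the model based on the bag-of-words principle as follows: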

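import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer(analyzer = "word",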
                     tokenizer = None,    
                     preprocessor = None, 
                     stop_words = None,   
                     max_features = 5000) 

text_features = vectorizer.fit_transform(full_df["processed full text"])    
text_features = text_features.toarray()    
labels = np.array(full_df['category'])
X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)    
forest = RandomForestClassifier(n_estimators = 100)     
forest = forest.fit(X_train, y_train)    
y_pred = forest.predict(X_test)    
accuracy_score(y_pred=y_pred, y_true=y_test)

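So far it works fine (even though it gives me an accuracy of 99%, which is too high). But I would now like to test it on a completely new text document. If I vectorize it and call forest.predict(test), it obviously says: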

ValueError: Number of features of the model must  match the input. Model n_features is 5000 and  input n_features is 45

I am not sure how to overcome this so that I can classify totally new articles.
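
For readers who want to reproduce the mismatch, here is a minimal sketch. The toy documents are made up for illustration and are not from the original data. It only shows that a vectorizer fitted on the new text alone learns its own, smaller vocabulary, so its column count differs from the one the model was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the training corpus and one new article
train_docs = ["bank transfers hid illegal cash",
              "the team won the football match",
              "shell companies moved illegal funds"]
new_doc = ["police traced illegal cash through a bank"]

# Vocabulary learned from the training corpus
train_features = CountVectorizer().fit_transform(train_docs)

# A second vectorizer fitted only on the new text builds a different vocabulary
new_features = CountVectorizer().fit_transform(new_doc)

# The two matrices have different numbers of columns, which is what
# forest.predict complains about
print(train_features.shape[1], new_features.shape[1])
```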

Answer 1

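First of all, even though my proposal may work, I strongly emphasize that this solution has some statistical and computational consequences that you need to understand before running this code. Assume you have an initial corpus of texts full_df["processed full text"], and test is the new text you want to classify. Then define full_added as the corpus made of the full_df texts plus test.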

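text_features = vectorizer.fit_transform(full_added)
text_features = text_features.toarray()

You can use the rows that come from full_df as the training set and keep the remaining row as the new document. Assuming test was appended as the last entry of full_added, that is X_train = text_features[:-1], test = text_features[-1:] and y_train = np.array(full_df['category']). Then you can run

forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(test)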

Of course, this solution assumes that you have already settled on your parameters and that you consider your model robust on new data.

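Another remark: if you have a stream of new texts to analyze as input, this solution would be horrible, because the time needed to compute a new vectorizer.fit_transform(full_added) would increase dramatically.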
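
For the streaming case, one common alternative (not the approach described above) is to fit the vectorizer once on the training corpus and call only transform on each incoming text. The sketch below assumes a tiny made-up corpus in place of full_df; the classify helper and the sample texts are hypothetical names used for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for full_df
train_docs = ["bank transfers hid illegal cash",
              "shell companies moved illegal funds",
              "the team won the football match",
              "football fans watched the match"]
train_labels = ["money laundering", "money laundering", "other", "other"]

# The vocabulary is learned once, from the training corpus only
vectorizer = CountVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_docs)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, train_labels)

def classify(new_text):
    """Vectorize one new document with the fitted vocabulary and predict its label."""
    features = vectorizer.transform([new_text])  # same columns as X_train
    return forest.predict(features)[0]

for text in ["illegal funds moved through a bank", "fans at the match"]:
    print(classify(text))
```

Words in a new text that are not in the fitted vocabulary are simply ignored by transform, so the feature count always matches what the model was trained on and nothing has to be refitted per document.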

I hope it helps.

Answer 2

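My first implementation of Naive Bayes was from the TextBlob library. It was very slow and my machine eventually ran out of memory.

The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from the sklearn.naive_bayes library. It worked like a charm: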

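from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

#initialize vectorizer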
count_vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 5000)
counts = count_vectorizer.fit_transform(df['processed full text'].values)
targets = df['category'].values

#divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)

#create classifier
classifier = MultinomialNB()

classifier.fit(X_train, y_train)

#check accuracy
y_pred = classifier.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)

#check on completely new example
new_counts = count_vectorizer.transform([processed_test_string])
prediction = classifier.predict(new_counts)
prediction

Output:

array(['money laundering'], 
      dtype='<U16')

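The accuracy is approximately 91%, which is more realistic than 99.96%.

Exactly what I wanted. It would also be nice to see the most informative features; I'll try to work that out. Thanks everybody.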

Classifying new documents - Random Forest, bag of words