I am currently working on text classification (trying to predict whether a Twitter response was generated by a human or by a bot). The task is actually a closed Kaggle competition; more details, as well as the dataset used, can be found here: enter link description here
My problem is that when I submit my solution on the site, I cannot get above 50% accuracy, even though I have tried several well-known techniques to improve performance. Because of this, I suspect there is a bug in my code, or even a conceptual mistake: perhaps the techniques I am using are simply not suitable for this case.
What I have tried so far:
Using the built-in stop_words of CountVectorizer.
Getting rid of features that occur with very low or very high frequency (I passed the max_df=0.3 and min_df=0.05 parameters to the CountVectorizer object; see the sketch after this list).
Using bi-grams.
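For reference, this is roughly how those options combine on a single CountVectorizer (a sketch of one configuration I tried, not necessarily the exact one in the code below):

from sklearn.feature_extraction.text import CountVectorizer

# English stop words, bi-grams only, and document-frequency cut-offs
# to drop very rare and very common features
vectorizer = CountVectorizer(stop_words='english',
                             ngram_range=(2, 2),
                             max_df=0.3,
                             min_df=0.05)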
Below you can find my entire code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import csv
from csv import DictReader
import numpy

target_values = []
target_values_validation = []
list_response_train = []
list_response_validation = []
predictions = []
ids = []  # renamed from `id`, which shadows the Python built-in
# Placeholder tokens added by the dataset's preprocessing; strip them out
placeholders = ['@@ ', '<at>', '<url>', '<number>',
                '<first_speaker>', '<second_speaker>', '<third_speaker>']

with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values.append(int(row['human-generated']))
        response = row['response']
        for token in placeholders:
            response = response.replace(token, '')
        list_response_train.append(response)

# Keep the labels as a 1-D vector: scikit-learn expects shape (n_samples,)
y_train = numpy.asarray(target_values)
count_vector = CountVectorizer(stop_words='english', ngram_range=(2, 2))
X_train_counts = count_vector.fit_transform(list_response_train)
print(X_train_counts.shape)
print(y_train.shape)

tf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

target_names = ['chatbot text', 'human text']
clf = MultinomialNB().fit(X_train_tf, y_train)
with open('validation.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values_validation.append(int(row['human-generated']))
        response = row['response']
        for token in placeholders:
            response = response.replace(token, '')
        list_response_validation.append(response)

y_validation = numpy.asarray(target_values_validation)
# Transform (not fit) the validation text with the vectorizer and
# transformer fitted on the training data
X_new_counts = count_vector.transform(list_response_validation)
X_new_tfidf = tf_transformer.transform(X_new_counts)
print(X_new_tfidf.shape)

predicted = clf.predict_proba(X_new_tfidf)
print(predicted)
print(predicted.shape)
print(y_validation.shape)

# Column 1 of predict_proba holds P(label == 1), i.e. human-generated
for j in range(predicted.shape[0]):
    predictions.append(predicted[j][1])
for k in range(len(predictions)):
    ids.append(k)

# Note: mode 'a' appends, so re-running keeps adding rows to submit.csv
with open('submit.csv', 'a') as f:
    writer = csv.writer(f)
    for row in zip(ids, predictions):
        writer.writerow(row)
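For completeness, the same steps can also be expressed with scikit-learn's Pipeline, which makes it easy to check accuracy locally before submitting (a minimal sketch, assuming list_response_train, y_train, list_response_validation and y_validation are built exactly as above):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Chain the same three steps; the vectorizer and transformer are
# fitted on the training data only and reused for validation
pipeline = Pipeline([
    ('counts', CountVectorizer(stop_words='english', ngram_range=(2, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(list_response_train, y_train)

# Hard predictions for a local accuracy estimate
val_pred = pipeline.predict(list_response_validation)
print(accuracy_score(y_validation, val_pred))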
Any suggestion is highly appreciated.