I am currently working on text classification (trying to predict whether a Twitter response was generated by a human or by a bot). The task is actually a closed Kaggle competition; more details, as well as the dataset used, can be found here: enter link description here
My problem is that when I submit my solution on the site, I cannot get above 50% accuracy, even though I have tried several well-known techniques to improve performance. Because of this, I suspect there is a bug in my code, or even a conceptual mistake: perhaps the techniques I am using are simply not suitable for this case.
What I have tried so far:
Using the built-in stop_words of CountVectorizer.
Getting rid of features that occur with very low or very high frequency (I passed the max_df=0.3 and min_df=0.05 parameters to the CountVectorizer object; see the sketch after this list).
Using bi-grams.
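For reference, this is roughly how those options combine on a single CountVectorizer (a sketch of one configuration I tried, not necessarily the exact one in the code below):

from sklearn.feature_extraction.text import CountVectorizer

# English stop words, bi-grams only, and document-frequency cut-offs
# to drop very rare and very common features
vectorizer = CountVectorizer(stop_words='english',
                             ngram_range=(2, 2),
                             max_df=0.3,
                             min_df=0.05)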
Below you can find my entire code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import csv
from csv import DictReader
import numpy

target_values = []
target_values_validation = []
list_response_train = []
list_response_validation = []
predictions = []
ids = []  # renamed from `id`, which shadows the Python built-in
# Placeholder tokens added by the dataset's preprocessing; strip them out
placeholders = ['@@ ', '<at>', '<url>', '<number>',
                '<first_speaker>', '<second_speaker>', '<third_speaker>']

with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values.append(int(row['human-generated']))
        response = row['response']
        for token in placeholders:
            response = response.replace(token, '')
        list_response_train.append(response)

# Keep the labels as a 1-D vector: scikit-learn expects shape (n_samples,)
y_train = numpy.asarray(target_values)
count_vector = CountVectorizer(stop_words='english', ngram_range=(2, 2))
X_train_counts = count_vector.fit_transform(list_response_train)
print(X_train_counts.shape)
print(y_train.shape)

tf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

target_names = ['chatbot text', 'human text']
clf = MultinomialNB().fit(X_train_tf, y_train)
with open('validation.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values_validation.append(int(row['human-generated']))
        response = row['response']
        for token in placeholders:
            response = response.replace(token, '')
        list_response_validation.append(response)

y_validation = numpy.asarray(target_values_validation)
# Transform (not fit) the validation text with the vectorizer and
# transformer fitted on the training data
X_new_counts = count_vector.transform(list_response_validation)
X_new_tfidf = tf_transformer.transform(X_new_counts)
print(X_new_tfidf.shape)

predicted = clf.predict_proba(X_new_tfidf)
print(predicted)
print(predicted.shape)
print(y_validation.shape)

# Column 1 of predict_proba holds P(label == 1), i.e. human-generated
for j in range(predicted.shape[0]):
    predictions.append(predicted[j][1])
for k in range(len(predictions)):
    ids.append(k)

# Note: mode 'a' appends, so re-running keeps adding rows to submit.csv
with open('submit.csv', 'a') as f:
    writer = csv.writer(f)
    for row in zip(ids, predictions):
        writer.writerow(row)
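For completeness, the same steps can also be expressed with scikit-learn's Pipeline, which makes it easy to check accuracy locally before submitting (a minimal sketch, assuming list_response_train, y_train, list_response_validation and y_validation are built exactly as above):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Chain the same three steps; the vectorizer and transformer are
# fitted on the training data only and reused for validation
pipeline = Pipeline([
    ('counts', CountVectorizer(stop_words='english', ngram_range=(2, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(list_response_train, y_train)

# Hard predictions for a local accuracy estimate
val_pred = pipeline.predict(list_response_validation)
print(accuracy_score(y_validation, val_pred))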
Any suggestion is highly appreciated.