Trying to obtain higher performance in a text classification task

Asked: 2017-08-09 11:55:48

Tags: python machine-learning scikit-learn text-classification

I am currently working on text classification (trying to predict whether a Twitter response was generated by a human or by a bot). The task is actually a closed competition; more details, as well as the dataset used, can be found here: enter link description here

My problem is that when I submit my solution on the website I cannot get an accuracy above 50%, even though I have tried several well-known techniques to improve performance. Because of this, I suspect there is a mistake in my code, perhaps even a conceptual one, or that the techniques I am using are simply not suitable for my case.

What I have tried so far:

  1. Used the built-in stop_words list of CountVectorizer.

  2. Tried to get rid of features with very low and very high frequency (I passed the max_df=0.3 and min_df=0.05 parameters to the CountVectorizer object).

  3. Used bi-grams (a combined sketch of points 1–3 is shown after the code in point 4).

  4. Below you can find my entire code:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_extraction.text import TfidfTransformer
        from sklearn.naive_bayes import MultinomialNB
        import csv
        import numpy
        from csv import DictReader

        target_values = []
        target_values_validation = []
        list_response_train = []
        list_response_validation = []
        predictions = []
        ids = []

        # Read the training set and strip the dataset's special tokens from every response
        with open('train.txt') as f:
            reader = DictReader(f, delimiter='\t')
            for row in reader:
                target_values.append(int(row['human-generated']))
                row['response'] = row['response'].replace('@@ ', '')
                row['response'] = row['response'].replace('<at>', '')
                row['response'] = row['response'].replace('<url>', '')
                row['response'] = row['response'].replace('<number>', '')
                row['response'] = row['response'].replace('<first_speaker>', '')
                row['response'] = row['response'].replace('<second_speaker>', '')
                row['response'] = row['response'].replace('<third_speaker>', '')
                list_response_train.append(row['response'])

        # scikit-learn expects a 1-D array of labels
        y_train = numpy.asarray(target_values)

        # Bag-of-words counts on bi-grams, with English stop words removed
        count_vector = CountVectorizer(stop_words='english', ngram_range=(2, 2))
        X_train_counts = count_vector.fit_transform(list_response_train)
        print(X_train_counts.shape)
        print(y_train.shape)

        # Re-weight the raw counts with tf-idf
        tf_transformer = TfidfTransformer().fit(X_train_counts)
        X_train_tf = tf_transformer.transform(X_train_counts)
        print(X_train_tf.shape)
        target_names = ['chatbot text', 'human text']

        # Train a multinomial naive Bayes classifier
        clf = MultinomialNB().fit(X_train_tf, y_train)

        # Read and clean the validation set exactly like the training set
        with open('validation.txt') as f:
            reader = DictReader(f, delimiter='\t')
            for row in reader:
                target_values_validation.append(int(row['human-generated']))
                row['response'] = row['response'].replace('<first_speaker>', '')
                row['response'] = row['response'].replace('<second_speaker>', '')
                row['response'] = row['response'].replace('<third_speaker>', '')
                row['response'] = row['response'].replace('@@ ', '')
                row['response'] = row['response'].replace('<at>', '')
                row['response'] = row['response'].replace('<url>', '')
                row['response'] = row['response'].replace('<number>', '')
                list_response_validation.append(row['response'])

        y_validation = numpy.asarray(target_values_validation)

        # Transform the validation set with the already-fitted vectorizer and transformer
        X_new_counts = count_vector.transform(list_response_validation)
        X_new_tfidf = tf_transformer.transform(X_new_counts)
        print(X_new_tfidf.shape)

        # Probability estimates for both classes on the validation set
        predicted = clf.predict_proba(X_new_tfidf)

        print(predicted)
        print(predicted.shape)
        print(y_validation.shape)

        # Keep the probability of the positive (human-generated) class
        m, n = predicted.shape
        for j in range(0, m):
            predictions.append(predicted[j][1])

        for k in range(0, len(predictions)):
            ids.append(k)

        # Write id/probability pairs to the submission file
        with open("submit.csv", "a", newline='') as f:
            writer = csv.writer(f)
            for row in zip(ids, predictions):
                writer.writerow(row)
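
For reference, the code above uses only stop words and pure bi-grams; below is a minimal sketch of what a single CountVectorizer combining points 1–3 could look like. The min_df=0.05 and max_df=0.3 values are the ones from point 2; using ngram_range=(1, 2) (uni- plus bi-grams) instead of (2, 2) is an assumption, not something taken from the original code.

    # Sketch only: combines the stop-word removal, document-frequency cut-offs
    # and n-grams from points 1-3; reuses list_response_train from the script above.
    # ngram_range=(1, 2) is an assumption - the posted code uses (2, 2).
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    count_vector = CountVectorizer(stop_words='english',
                                   min_df=0.05,         # drop very rare terms
                                   max_df=0.3,          # drop very frequent terms
                                   ngram_range=(1, 2))
    X_train_counts = count_vector.fit_transform(list_response_train)
    X_train_tf = TfidfTransformer().fit_transform(X_train_counts)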
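
Since the script already reads the validation labels into y_validation but never compares them with the predictions, it may help to measure accuracy locally before submitting. A minimal sketch, assuming clf, X_new_tfidf and y_validation from the code above are still in scope (the 0.5 decision threshold is an assumption):

    # Sketch only: quick local evaluation on the validation split.
    from sklearn.metrics import accuracy_score, roc_auc_score

    proba = clf.predict_proba(X_new_tfidf)[:, 1]   # probability of the human-generated class
    labels = (proba >= 0.5).astype(int)            # 0.5 threshold is an assumption

    print("validation accuracy:", accuracy_score(y_validation, labels))
    print("validation ROC AUC:", roc_auc_score(y_validation, proba))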

Every suggestion is highly appreciated.

0 Answers:

No answers yet.