Python beginner: preprocessing French text with Python and computing polarity with a lexicon

Asked: 2019-05-22 10:08:52

Tags: pandas nlp nltk sentiment-analysis treetagger

I am writing an algorithm in Python that processes a column of sentences and outputs the polarity (positive or negative) of each cell in that column. The script uses the lists of negative and positive words from the NRC emotion lexicon (French version). I am having trouble writing the preprocessing function. I have already written the counting functions and the polarity function, but since the preprocessing function is giving me difficulty, I am not sure whether those functions actually work.

The positive and negative words are in the same file (the lexicon), but I export the positive and negative words separately because I don't know how to use the lexicon as a single dictionary.
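One way to keep the lexicon in a single file is to load it into two sets in one pass. This is only a sketch: the `word;polarity` line format and the file name `lexicon.txt` are assumptions, so the split would need to be adapted to the actual NRC file layout.

```python
# Hypothetical sketch: one lexicon file, loaded into two sets.
# The "word;polarity" line format is an assumption -- adapt the
# split() call to the real NRC file layout.
sample = "affable;positive\nabsurde;negative\nagressif;negative\n"
with open("lexicon.txt", "w", encoding="utf-8") as f:
    f.write(sample)

positive, negative = set(), set()
with open("lexicon.txt", encoding="utf-8") as f:
    for line in f:
        word, polarity = line.strip().split(";")
        # route each word into the matching set
        (positive if polarity == "positive" else negative).add(word)

print(positive)  # {'affable'}
print(negative)  # {'absurde', 'agressif'}
```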

My function that counts positive and negative occurrences does not work, and I don't know why it always returns 0. I added positive words to each sentence, so they should show up in the dataframe:

Output:


   id                                           Verbatim      ...       word_positive  word_negative
0  15  Je n'ai pas bien compris si c'était destiné a ...      ...                   0              0
1  44  Moi aérien affable affaire agent de conservati...      ...                   0              0
2  45  Je affectueux affirmative te hais et la Foret ...      ...                   0              0
3  47  Je absurde accidentel accusateur accuser affli...      ...                   0              0

[4 rows x 6 columns]

def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]


    return len(intersection)
csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))

My csv data: rows 44 and 45 contain positive words and row 47 contains more negative words, but the word_positive and word_negative columns are always empty; the function never returns the word counts, and the polarity column is always positive even though the last sentence is negative.
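To narrow this down I tried the counting logic on its own (with illustrative words); it works when the lexicon is a set of clean strings, which makes me suspect how the lexicon file is passed in. Iterating an open file object yields lines ending in `"\n"`, so bare words never match, and the file can only be iterated once:

```python
# Minimal sketch (illustrative words): membership tests against an open
# file object return 0, because each line keeps its trailing "\n" and
# the file handle is exhausted after the first pass.
words = ["affable", "admirateur"]

with open("positive.txt", "w", encoding="utf-8") as f:
    f.write("affable\nadmirateur\n")

handle = open("positive.txt", encoding="utf-8")
count_from_handle = sum(1 for w in words if w in handle)  # lines are "affable\n" etc.
handle.close()

with open("positive.txt", encoding="utf-8") as f:
    lexicon = {line.strip() for line in f}  # strip newlines, store in a set
count_from_set = sum(1 for w in words if w in lexicon)

print(count_from_handle, count_from_set)  # 0 2
```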

id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue 

Full code:

# -*- coding: UTF-8 -*-
import codecs 
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")

from itertools import islice
from collections import defaultdict, Counter



csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())

stopWords = set(stopwords.words('french'))  
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')     
def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''

    text_preprocess =[]
    text_without_stopwords= []

    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)


    text_without_stopwords= [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords

csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)


lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''

    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)


#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)

lexiconneg = open('negative.txt', 'r', encoding='utf-8')

def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)

def polarity_score(text):   
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text =count_occurences_pos(text, lexiconpos)
    negatives_text =count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text :
        return "positive"
    else : 
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)

It would also help me if you could check whether the rest of the code looks right.

1 answer:

Answer 0 (score: 1)

I found your error! It comes from the polarity_score function.

It is just a typo: in your if statement you are comparing count_occurences_Pos and count_occurences_Neg, which are functions, instead of comparing the results of calling count_occurences_pos and count_occurences_neg.

Your code should look like this:

def Polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    count_text_pos = count_occurences_pos(text, word_list)
    count_text_neg = count_occurences_neg(text, word_list)
    if count_text_pos > count_text_neg :
        return "Positive"
    else : 
        return "negative"

In the future, you should learn to use meaningful names for your variables to avoid this kind of mistake. With proper variable names, your function should be:

def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text = count_occurences_pos(text, word_list)
    negatives_text = count_occurences_neg(text, word_list)
    if positives_text > negatives_text :
        return "Positive"
    else : 
        return "negative"

Another improvement you can make in the count_occurences_pos and count_occurences_neg functions is to use a set instead of a list. Your text and word_list can be converted to sets, and you can retrieve the shared words with a set intersection, because membership tests on sets are much faster than on lists.
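A sketch of that set-based variant (the words here are illustrative). Note one trade-off: a set intersection only gives the unique matching words, so if a repeated word should count more than once, keep the comprehension but make the lexicon a set:

```python
# Set-based counting sketch (illustrative words).
lexicon = {"affable", "admirateur", "belle"}        # lexicon as a set: O(1) lookups
tokens = ["je", "affable", "belle", "affable"]      # tokenized sentence

unique_hits = set(tokens) & lexicon                 # intersection: unique matches only
total_hits = sum(1 for t in tokens if t in lexicon) # keeps repeated matches

print(sorted(unique_hits), total_hits)  # ['affable', 'belle'] 3
```

With a list lexicon, each `t in lexicon` test scans the whole list (O(n)); with a set it is a hash lookup, so the counting stays fast even for a large lexicon.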