Tag all words between a negation word (such as don't or never) and the next punctuation mark as negated

Time: 2015-10-08 17:10:03

Tags: python nlp text-mining regex-negation sentiment-analysis

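I'm trying to build a regex match-and-replace routine that takes every word occurring between a negation word and the next punctuation mark and appends a _NEG suffix to it.

For example:

Text: "I don't want to go there: it might be dangerous"
Output: "I don't want_NEG to_NEG go_NEG there_NEG: it might be dangerous"

I have tried almost everything, but without success. Below is a snapshot of the code I'm working with: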

regex1 = "(never|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint|n't)(.*)[.:;!?]"                    
regcom = re.compile(regex1)
def tag(text):
    negative = []
    matching = regcom.findall(text)
    if len(matching)==0:
        return(text)
    matching = list(matching[0])
    matching = matching [0] + " " + matching [1]
    matching = matching .split()
    for neg in matching :
        negative.append(neg)
    for neg in negative:
        text = re.sub(neg + '(?!_NEG)', neg + '_NEG ', text)
    return text

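Running the code above with text = "I don't want to go there: it might be dangerous" only partially works. When I apply it to general text, it also gives me many logic and syntax errors. Any help would be greatly appreciated.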
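One likely contributor to the partial result is that `(.*)` is greedy, so when the text contains more than one punctuation mark the capture runs to the last one rather than the first. The sketch below illustrates this with a shortened pattern (only the `n't` alternative) and a sample sentence with a trailing period; both are assumptions made for the demonstration, not the original pattern or data.

```python
import re

# Sample sentence with two punctuation marks (illustrative input)
text = "I don't want to go there: it might be dangerous."

# Greedy: .* extends as far as possible, so it stops at the LAST punctuation mark
greedy = re.search(r"n't(.*)[.:;!?]", text).group(1)

# Lazy: .*? stops at the FIRST punctuation mark after the negation
lazy = re.search(r"n't(.*?)[.:;!?]", text).group(1)

print(repr(greedy))  # ' want to go there: it might be dangerous'
print(repr(lazy))    # ' want to go there'
```

Note also that the later `re.sub(neg + ...)` calls are applied to the whole text with no word boundaries, so a short token such as `to` can match inside other words and before the negation as well.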

1 Answer:

Answer 0: (score: 0)

import re

def tag_words(sentence):
    # Put _NEG tags on the words that follow a negative word, up to the
    # next punctuation mark (or the end of the sentence if there is none)
    keywordSet = {"never", "nothing", "nowhere", "noone", "none", "not",
                  "hasn't","hadn't","can't","couldn't","shouldn't","won't",
                  "wouldn't","don't","doesn't","didn't","isn't","aren't","ain't"}
    # create word set from sentence
    wordSet = { x for x in re.split("[.:;!?, ]",sentence) if x }
    # find negative words in sentence
    neg_words = wordSet & keywordSet
    if neg_words:
        for word in neg_words:
            # end of the first occurrence of the negative word as a whole
            # token (delimited the same way as in the split above)
            neg_end = re.search(r"(?<![^.:;!?, ])" + re.escape(word) + r"(?![^.:;!?, ])",
                                sentence).end()
            # first punctuation mark after the negative word, if any
            punct = re.search(r'[.:;!?]', sentence[neg_end:])
            punct_pos = neg_end + punct.start() if punct else len(sentence)
            start_to_w = sentence[:neg_end]
            # put tags to words after the negative word
            w_to_punct =  re.sub(r'\b([A-Za-z\']+)\b',r'\1_NEG',
                               sentence[neg_end:punct_pos])
            punct_to_end = sentence[punct_pos:]
            print(start_to_w + w_to_punct + punct_to_end)
    else:
        print("no negative words found ...")


s1 = "I don't want to go there: it might be dangerous"
tag_words(s1)
# I don't want_NEG to_NEG go_NEG there_NEG: it might be dangerous
s2 = "I want never to go there: it might be dangerous"
tag_words(s2)
# I want never to_NEG go_NEG there_NEG: it might be dangerous
s3 = "I couldn't to go there! it might be dangerous"
tag_words(s3)
# I couldn't to_NEG go_NEG there_NEG! it might be dangerous
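As a possible variation on the function above, the same tagging can be sketched as a single `re.sub` pass with a callback, which returns the result instead of printing it and handles several negation scopes in one text. The names `NEGATIONS`, `SCOPE` and `tag_negation` are hypothetical and chosen for illustration only; the word list is copied from the answer's keyword set, and scopes are assumed to end at the same punctuation characters `.:;!?`.

```python
import re

# Hypothetical helper names; the word list mirrors the keyword set above.
NEGATIONS = ["never", "nothing", "nowhere", "noone", "none", "not",
             "hasn't", "hadn't", "can't", "couldn't", "shouldn't", "won't",
             "wouldn't", "don't", "doesn't", "didn't", "isn't", "aren't", "ain't"]

# Group 1: the negation word; group 2: everything up to the next punctuation mark
SCOPE = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in NEGATIONS) + r")\b([^.:;!?]*)"
)

def tag_negation(text):
    """Return text with _NEG appended to each word inside a negation scope."""
    def add_suffix(match):
        # Tag every word in the scope; the negation word itself stays untagged
        tagged = re.sub(r"[A-Za-z']+", lambda w: w.group(0) + "_NEG", match.group(2))
        return match.group(1) + tagged
    return SCOPE.sub(add_suffix, text)

print(tag_negation("I don't want to go there: it might be dangerous"))
# I don't want_NEG to_NEG go_NEG there_NEG: it might be dangerous
print(tag_negation("I can't go. She won't stay!"))
# I can't go_NEG. She won't stay_NEG!
```

In this sketch a second negation word that falls inside an open scope is tagged like any other word, and matching is case-sensitive; both choices are assumptions that may need adjusting for real data.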