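I am trying to build a regex match-and-replace routine that marks every word appearing between a negation word and the next punctuation mark by appending a _NEG suffix to it.
For example:
Text: I don't want to go there: it might be dangerous
Output: I don't want_NEG to_NEG go_NEG there_NEG: it might be dangerous
I have tried almost everything, but without success. Below is a snapshot of the code I am working with: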
import re

regex1 = "(never|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint|n't)(.*)[.:;!?]"
regcom = re.compile(regex1)

def tag(text):
    negative = []
    matching = regcom.findall(text)
    if len(matching) == 0:
        return text
    matching = list(matching[0])
    matching = matching[0] + " " + matching[1]
    matching = matching.split()
    for neg in matching:
        negative.append(neg)
    for neg in negative:
        text = re.sub(neg + '(?!_NEG)', neg + '_NEG ', text)
    return text
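When I try the code above with text = "I don't want to go there: it might be dangerous", it only partially works. If I apply it to general text, it also produces many logic and syntax errors. Any help would be greatly appreciated.
Answer 0: (score: 0)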
def tag_words(sentence):
    import re
    # tag the words that follow a negative word,
    # up to the first punctuation mark (punct)
    # find the first punctuation mark in the sentence
    # (raises IndexError if the sentence has none)
    punct = re.findall(r'[.:;!?]', sentence)[0]
    # create word set from sentence
    wordSet = {x for x in re.split("[.:;!?, ]", sentence) if x}
    keywordSet = {"never", "nothing", "nowhere", "noone", "none", "not",
                  "hasn't", "hadn't", "can't", "couldn't", "shouldn't", "won't",
                  "wouldn't", "don't", "doesn't", "didn't", "isn't", "aren't", "ain't"}
    # find negative words in sentence
    neg_words = wordSet & keywordSet
    if neg_words:
        for word in neg_words:
            start_to_w = sentence[:sentence.find(word) + len(word)]
            # put tags on the words after the negative word
            w_to_punct = re.sub(r'\b([A-Za-z\']+)\b', r'\1_NEG',
                                sentence[sentence.find(word) + len(word):sentence.find(punct)])
            punct_to_end = sentence[sentence.find(punct):]
            print(start_to_w + w_to_punct + punct_to_end)
    else:
        print("no negative words found ...")

s1 = "I don't want to go there: it might be dangerous"
tag_words(s1)
# I don't want_NEG to_NEG go_NEG there_NEG: it might be dangerous
s2 = "I want never to go there: it might be dangerous"
tag_words(s2)
# I want never to_NEG go_NEG there_NEG: it might be dangerous
s3 = "I couldn't to go there! it might be dangerous"
tag_words(s3)
# I couldn't to_NEG go_NEG there_NEG! it might be dangerous
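
The function above prints one line per negative word and only looks at the first punctuation mark, so it is limited to single sentences. As a possible alternative, here is a minimal sketch that does the tagging in one pass with `re.sub` and a replacement function, and returns the result instead of printing it. The names `NEG_SPAN`, `WORD` and `tag_negation` are made up for this illustration, and the cue list (including `\w*n't` as a catch-all for contractions) is an assumption based on the lists in the question and the answer, not a definitive set.

```python
import re

# Group 1: a negation cue. Group 2: everything up to the next punctuation mark.
# The cue list is an assumption; `\w*n't` is meant to cover "don't", "couldn't", etc.
NEG_SPAN = re.compile(
    r"\b(never|nothing|nowhere|noone|none|not|\w*n't)\b"
    r"([^.:;!?]*)",
    re.IGNORECASE,
)
# A "word" here is a run of letters and apostrophes, as in the answer above.
WORD = re.compile(r"[A-Za-z']+")


def tag_negation(text):
    """Append _NEG to every word between a negation cue and the next punctuation mark."""
    def mark_scope(match):
        cue, scope = match.group(1), match.group(2)
        # Leave the cue itself untouched and tag each word in its scope.
        return cue + WORD.sub(lambda w: w.group(0) + "_NEG", scope)

    return NEG_SPAN.sub(mark_scope, text)


print(tag_negation("I don't want to go there: it might be dangerous"))
# I don't want_NEG to_NEG go_NEG there_NEG: it might be dangerous
```

Because the scope is matched with `[^.:;!?]*` rather than the greedy `(.*)` from the question, tagging stops at the first punctuation mark after each cue, and a text with several sentences is handled sentence by sentence. If there is no punctuation after a cue, the scope simply runs to the end of the text.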