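I want to do data augmentation for a sentiment analysis task by replacing words with their synonyms from WordNet. At the moment the replacement is random. Instead, I want to iterate over the synonyms and replace the word with every one of them, so that each synonym produces a new sentence and the amount of data increases.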
import nltk
from nltk.corpus import wordnet
from random import randint

# pos_df, normalize() and tokenize() are not shown here
sentences = []
for index, row in pos_df.iterrows():
    text = normalize(row['text'])
    words = tokenize(text)
    output = ""
    # Identify the parts of speech
    tagged = nltk.pos_tag(words)
    for i in range(0, len(words)):
        replacements = []
        # Only replace nouns with nouns, verbs with verbs etc.
        for syn in wordnet.synsets(words[i]):
            # Do not attempt to replace proper nouns or determiners
            if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT':
                break
            # The tagger returns strings like NNP, VBP etc
            # but the wordnet synsets have names like dog.n.01
            # So we extract the first character from NNP ie n
            # then we check if the synset name has a .n. or not
            word_type = tagged[i][1][0].lower()
            # str.find() returns -1 (truthy) on a miss, so test with "in"
            if "." + word_type + "." in syn.name():
                # extract the word only
                synonym = syn.name()[0:syn.name().find(".")]
                replacements.append(synonym)
        if len(replacements) > 0:
            # Choose a random replacement
            replacement = replacements[randint(0, len(replacements) - 1)]
            print(replacement)
            output = output + " " + replacement
        else:
            # If no replacement could be found, then just use the
            # original word
            output = output + " " + words[i]
    sentences.append([output, 'positive'])
Answer 0 (score: 0)
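I am also working on a similar project: generating new sentences from a given input without changing the context of the input text. While working on this problem I came across a data augmentation technique that seems to work well for the augmentation part. It is called EDA (Easy Data Augmentation) and is described in a paper; see [https://github.com/jasonwei20/eda_nlp].
Hope this helps.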
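
For the part of the question about iterating over all synonyms instead of picking one at random, here is a minimal sketch. It is not taken from the EDA repository. The names `wordnet_synonyms` and `augment_all` are made up for illustration, and the tag mapping (Penn Treebank first letter to WordNet part of speech) is an assumption about how you want to filter. It emits one new sentence per (word position, synonym) pair, changing a single word at a time.

```python
from typing import Callable, Iterator, List


def wordnet_synonyms(word: str, pos: str) -> List[str]:
    """Distinct WordNet lemma names for `word`, limited to WordNet POS `pos`."""
    # Imported lazily so the rest of the sketch works without the corpus;
    # this needs nltk.download("wordnet") to have been run once.
    from nltk.corpus import wordnet

    found: List[str] = []
    for syn in wordnet.synsets(word, pos=pos):
        for lemma in syn.lemma_names():
            candidate = lemma.replace("_", " ")  # multiword lemmas use "_"
            if candidate.lower() != word.lower() and candidate not in found:
                found.append(candidate)
    return found


def augment_all(
    words: List[str],
    tags: List[str],
    get_synonyms: Callable[[str, str], List[str]] = wordnet_synonyms,
) -> Iterator[str]:
    """Yield one sentence for every (position, synonym) pair."""
    # Assumed mapping: first letter of the Penn Treebank tag -> WordNet POS
    tag_to_pos = {"N": "n", "V": "v", "J": "a", "R": "r"}
    for i, (word, tag) in enumerate(zip(words, tags)):
        # Skip proper nouns and determiners, as in the question
        if tag in ("NNP", "NNPS", "DT"):
            continue
        pos = tag_to_pos.get(tag[:1])
        if pos is None:
            continue
        for synonym in get_synonyms(word, pos):
            # Replace only position i; every other word stays as it was
            yield " ".join(words[:i] + [synonym] + words[i + 1:])
```

In your loop you would call something like `for s in augment_all(words, [t for _, t in tagged]): sentences.append([s, 'positive'])`. Changing one word per sentence keeps the output size equal to the total number of synonyms; replacing every word at once with every combination of synonyms would grow as a product and can explode quickly on long sentences.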