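I have a corpus of 30,000 customer reviews in a CSV (or TXT) file, with each review on its own line of the text file. Some examples are:
I want to change these texts to the following:
I have two separate lists (lexicons) of positive and negative words. For example, one text file contains positive words such as:
And another text file contains negative words such as: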
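So I want a Python script that reads the customer reviews: when it finds any positive word, it inserts "POSITIVE" after that positive word; when it finds any negative word, it inserts "NEGATIVE" after that negative word.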
Here is the code I have tested so far. It works (see my comments in the code below), but it needs improvement to meet the needs described above.
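Specifically, my_escaper works (this code finds words such as cheap and good and replaces them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about a thousand positive/negative words. So what I want is for the code to read those word lists from the lexicons, search for them in the corpus, and replace those words in the corpus (for example, from "good" to "good POSITIVE", from "bad" to "bad NEGATIVE").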
# adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
import re

def multiple_replacer(*key_values):
    # Map each search string to its replacement.
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    # One alternation pattern that matches any of the escaped keys.
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# my_escaper works for this hard-coded list (cheap -> cheap POSITIVE, good -> good POSITIVE),
# but the real lexicons hold about a thousand words each and should be read from files.
my_escaper = multiple_replacer(('cheap', 'cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

# Read the reviews, one per line.
d = []
with open("review.txt", "r") as file:
    for line in file:
        review = line.strip()
        d.append(review)

for line in d:
    print(my_escaper(line))
Answer 0 (score: 1)
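A straightforward way to code this is to load the positive and negative words from the lexicons into separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list, then join the list to build the final string.
Example: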
import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

for sentence in reviews:
    tagged = []
    # Split on runs of non-word characters (this drops punctuation).
    for word in re.split(r'\W+', sentence):
        tagged.append(word)
        # Append the sentiment label right after a matching word.
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))
While this approach is simple, it has a drawback: punctuation is lost because of re.split().
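One possible way around the punctuation loss is to leave the sentence intact and let re.sub() rewrite only the word tokens through a callback. The following is a minimal sketch, not part of the original answer; the names tag_sentiment, add_label and WORD_RE are made up for illustration, and the two sets are the same sample words used above.

```python
import re

# Sample lexicons, copied from the example above.
positive_words = {"amazing", "great", "awesome", "reasonable"}
negative_words = {"poor", "bad", "rude"}

# Match runs of word characters; everything else (spaces, commas, ...) is left untouched.
WORD_RE = re.compile(r"\w+")

def tag_sentiment(sentence):
    """Return the sentence with a label appended after each lexicon word."""
    def add_label(match):
        word = match.group(0)
        key = word.lower()  # case-insensitive lookup
        if key in positive_words:
            return word + " POSITIVE"
        if key in negative_words:
            return word + " NEGATIVE"
        return word  # not in either lexicon: keep as-is
    return WORD_RE.sub(add_label, sentence)

print(tag_sentiment("This bike is amazing, but the brake is very poor"))
# This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
```

Because only the matched tokens are replaced, commas and other separators stay where they were, and each lookup is still a set membership test.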
Answer 1 (score: 0)
If I understand correctly, you need something like this:
if word in POSITIVE_LIST:
    pattern.sub(replacement_function, word+" POSITIVE")
if word in NEGATIVE_LIST:
    pattern.sub(replacement_function, word+" NEGATIVE")
Would that work for you?
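The membership checks above presuppose that the two word lists have already been read from the lexicon files. A rough sketch of that step follows, assuming each lexicon file holds one word per line; load_lexicon and tag_line are hypothetical helper names, and the file names in the usage note are placeholders rather than anything given in the question.

```python
import io

def load_lexicon(path):
    """Read one word per line into a lowercase set, skipping blank lines.

    Assumption: the lexicon file is UTF-8 text with one word per line.
    """
    with io.open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def tag_line(line, positive_list, negative_list):
    """Append POSITIVE or NEGATIVE after each whitespace-separated lexicon word."""
    tagged = []
    for word in line.split():
        tagged.append(word)
        # Ignore punctuation attached to the token when looking it up.
        key = word.strip(".,;:!?\"'()").lower()
        if key in positive_list:
            tagged.append("POSITIVE")
        elif key in negative_list:
            tagged.append("NEGATIVE")
    return " ".join(tagged)
```

With lexicons stored in, say, positive.txt and negative.txt (placeholder names), one would call load_lexicon() once per file and then tag_line() for each line of review.txt. Note that in this variant the label lands after any punctuation attached to the word, for example "awesome, POSITIVE".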