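I have a dataset of reviews, which contains reviews such as:
Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.
(This is from the original dataset. In the processed dataset I have removed all punctuation and converted everything to lowercase.)
What I want to do is replace some words with 1 (according to my dictionary) and all other words with 0. My dictionary is: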
dict = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
I want the output to look like this:
0010000000000001000000000100000
I used the following code:
df['newreviews'] = df['reviews'].map(dict).fillna("0")
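This always returns 0 as output. I did not want that, so I made the 1s and 0s strings, but I still get the same result. Any suggestions on how to solve this?
Answer 0 (score: 1)
You can do it like this: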
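import re

# sent is one review from the question; mydict is the question's dictionary
# (renamed so it does not shadow the builtin dict)
sent = 'Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.'
# clean the sentence by removing the periods
sent = re.sub(r'\.','',sent)
# convert to a list of lowercase words
sent = sent.lower().split()
# emit 1 for each word found in the dict, 0 otherwise
new_sent = ''.join([str(1) if x in mydict else str(0) for x in sent])
print(new_sent)
# 001100000000000000000000100000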
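The same idea can be wrapped in a small reusable function. This is only a sketch: the name encode_review and the shortened lexicon are made up for illustration, and unlike the snippet above it replaces punctuation with a space (rather than deleting it) so that "date.Amazing" splits into two words.

```python
import re

def encode_review(text, lexicon):
    """Return one character per word: '1' if the word is in lexicon, else '0'."""
    # Replace punctuation with spaces so "date.Amazing" becomes two words
    cleaned = re.sub(r"[^\w\s]+", " ", text).lower()
    return "".join("1" if word in lexicon else "0" for word in cleaned.split())

# Hypothetical, shortened stand-in for the dictionary in the question
lexicon = {"amazing", "best", "i"}
review = "Simply the best. I bought this last year. No problems faced till date.Amazing battery life."
print(encode_review(review, lexicon))  # 0011000000000100
```

Only membership matters here, so a set works as well as a dict whose values are all "1".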
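Answer 1 (score: 1)
First, do not use dict as a variable name, because it shadows the Python builtin. Then use a list comprehension with get, which replaces unmatched words with 0.
Note:
If the data contains text like date.Amazing, with no space after the punctuation, the punctuation must be replaced with a space rather than removed.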
import pandas as pd

df = pd.DataFrame({'reviews':['Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.']})
d = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
# replace punctuation with a space, then lowercase; regex=True is required in recent pandas versions
df['reviews'] = df['reviews'].str.replace(r'[^\w\s]+', ' ', regex=True).str.lower()
Alternatives:
df['newreviews'] = [''.join(d.get(y, '0') for y in x.split()) for x in df['reviews']]
df['newreviews'] = df['reviews'].apply(lambda x: ''.join(d.get(y, '0') for y in x.split()))
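To see why the code in the question always produced 0, it may help to compare the two lookups side by side. The sketch below uses a shortened, hypothetical dictionary and a made-up two-sentence review. Series.map(d) looks up each whole review string as a single key, finds nothing, and yields NaN, which fillna then turns into "0". The per-word version splits the review first.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the dictionary in the question
d = {"best": "1", "amazing": "1"}

df = pd.DataFrame({"reviews": ["Simply the best. Amazing battery life."]})
clean = df["reviews"].str.replace(r"[^\w\s]+", " ", regex=True).str.lower()

# Whole-string lookup: the full review is not a key in d, so the result is NaN, then "0"
whole = clean.map(d).fillna("0")

# Per-word lookup: split each review and look up every word separately
per_word = clean.apply(lambda s: "".join(d.get(w, "0") for w in s.split()))

print(whole.tolist())     # ['0']
print(per_word.tolist())  # ['001100']
```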
Answer 2 (score: 0)
You can do this with:
df.replace(repl, regex=True, inplace=True)
where df is your DataFrame and repl is your dictionary.