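I built a custom dictionary and now want to map it onto my tweet dataframe. How do I do that?
Basically, I have three dictionaries: positive, negative, and neutral words. I have a Twitter dataset, and I want to map my dictionaries onto it to determine the sentiment of each tweet. This is what I have so far.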
from collections import Counter

import nltk

# Sentiment labels, kept as strings as in the original post.
positive = '1'
negative = '-1'
neutral = '0'

pos_Words = {'good', 'beautiful', 'best'}
neg_Words = {'bad', 'suck', 'damn'}


def sentiment(words):
    # Compare how many distinct positive and negative words appear.
    pslen = len(pos_Words.intersection(words))
    nglen = len(neg_Words.intersection(words))
    if pslen > nglen:
        return positive
    elif pslen < nglen:
        return negative
    else:
        return neutral


def count_senti(sentences):
    sents = Counter()  # number of sentences per label
    words = Counter()  # number of tokens per label
    for sentence in sentences:
        senti = sentiment(sentence)
        sents[senti] += 1
        words[senti] += len(sentence)
    return sents, words


def parse_senti(text):
    # Split the text into sentences, then into lowercased tokens.
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]
    sents, words = count_senti(sentences)
    total = sum(words.values())
    for label, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[label]
        print(pcent, label, nsents)


parse_senti('good. bad')
The output is:
66.66666666666666 1 1
33.33333333333333 -1 1
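The percentages here are shares of tokens, not of sentences. A minimal sketch of the arithmetic, assuming NLTK split 'good. bad' into the token lists ['good', '.'] and ['bad'] (an assumption that is consistent with the output above; the token lists and labels are hard-coded so the snippet runs without NLTK):

```python
from collections import Counter

# Assumed tokenization of 'good. bad': two sentences, with the period
# counted as its own token in the first one.
sentences = [(['good', '.'], '1'), (['bad'], '-1')]

sents = Counter()  # sentences per label
words = Counter()  # tokens per label
for tokens, label in sentences:
    sents[label] += 1
    words[label] += len(tokens)

total = sum(words.values())
shares = {label: count / total * 100 for label, count in words.items()}
for label, pcent in shares.items():
    print(pcent, label, sents[label])
```

Under that assumption the positive sentence holds 2 of the 3 tokens, which is where the roughly 66.7 / 33.3 split comes from.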
But I want it mapped onto every tweet in my Twitter dataframe, which is stored as a CSV file.
Any ideas?
I tried parse_senti('dataframe')
and got the error: expected string or bytes-like object
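That message is the TypeError Python's `re` module raises when it is handed something that is not a string, and NLTK's tokenizers are regex-based. A minimal sketch that reproduces it with the standard library only; `tokenize` is a hypothetical stand-in for the NLTK tokenizers, and the non-string inputs are illustrative (an empty CSV cell, for example, is read by pandas as a float NaN):

```python
import re


def tokenize(text):
    # Hypothetical stand-in for an NLTK tokenizer: it is regex-based,
    # so it only accepts a string.
    return re.findall(r"[a-z']+", text)


def try_tokenize(value):
    # Return the tokens, or the name of the exception that was raised.
    try:
        return tokenize(value)
    except TypeError as exc:
        return type(exc).__name__


print(try_tokenize("good. bad"))      # a single string works
print(try_tokenize(["good", "bad"]))  # a container of strings does not
print(try_tokenize(float("nan")))     # neither does a missing value
```

So the function has to be called once per tweet string rather than once on the whole dataframe.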
Answer 0 (score: 0)
How silly of me. I solved it.
Simply applying the function to each row of the dataframe fixes the problem.
df['sentiment'] = df[0].apply(parse_senti)
      0       sentiment
0   bad  (100.0, -1, 1)
1  good   (100.0, 1, 1)
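As posted in the question, parse_senti prints its results and returns None, so for the new column to hold values the function has to return them. A minimal sketch of a per-tweet variant that returns a single label; `label_tweet` and the regex tokenizer are assumptions for illustration (the question uses NLTK), it returns integer labels rather than the question's strings, and the word lists are copied from the question:

```python
import re

pos_words = {'good', 'beautiful', 'best'}
neg_words = {'bad', 'suck', 'damn'}


def label_tweet(text):
    # Treat missing or non-string cells as neutral instead of raising.
    if not isinstance(text, str):
        return 0
    # Simple regex tokenizer (assumption; nltk.word_tokenize would also work).
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    pos = len(pos_words & tokens)
    neg = len(neg_words & tokens)
    # 1 if more positive words, -1 if more negative, 0 on a tie.
    return (pos > neg) - (pos < neg)


tweets = ['bad', 'good', 'Best day ever, but the traffic was bad', None]
labels = [label_tweet(t) for t in tweets]
print(labels)  # [-1, 1, 0, 0]
```

With the tweets in column 0 as above, df['sentiment'] = df[0].apply(label_tweet) would then store one label per row.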