Mapping a custom dictionary to a tweet DataFrame

Time: 2019-07-21 22:29:41

Tags: dictionary mapping nltk lexicon

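I have built a custom dictionary, and now I want to map it onto my tweet DataFrame. How can I do that?

Basically, I have three dictionaries: positive, negative and neutral words. I also have a Twitter dataset, and I want to map the dictionaries onto it to determine the sentiment of each tweet. This is what I have done so far.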

# Sentiment labels
positive = '1'
negative = '-1'
neutral = '0'

# Custom word lists
pos_Words = {'good', 'beautiful', 'best'}
neg_Words = {'bad', 'suck', 'damn'}

def sentiment(words):
    # Count the distinct dictionary words that occur in the token list.
    pslen = len(pos_Words.intersection(words))
    nglen = len(neg_Words.intersection(words))

    if pslen > nglen:
        return positive
    elif pslen < nglen:
        return negative
    else:
        return neutral

from collections import Counter

def count_senti(sentences):
    sents = Counter()  # number of sentences per sentiment label
    words = Counter()  # number of tokens per sentiment label

    for sentence in sentences:
        senti = sentiment(sentence)
        sents[senti] += 1
        words[senti] += len(sentence)
    return sents, words

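import nltk

def parse_senti(text):
    # Split the text into sentences, then each sentence into lowercase tokens.
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    sents, words = count_senti(sentences)
    total = sum(words.values())
    if total == 0:
        return  # empty text: nothing to report, and avoids dividing by zero

    # Loop variable is `label` so that it does not shadow the sentiment() function.
    for label, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[label]
        # share of tokens (%), sentiment label, number of sentences
        print(pcent, label, nsents)

parse_senti('good. bad')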

The result is:

66.66666666666666 1 1
33.33333333333333 -1 1
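The 2:1 split for two one-word sentences can look surprising. It comes from weighting each label by token count rather than by sentence count. Here is a small sketch of that arithmetic, assuming the tokenizer keeps the full stop after "good" as its own token (the token lists are written out by hand so the snippet runs without NLTK):

```python
from collections import Counter

# Assumption: tokens as NLTK would split 'good. bad';
# the full stop counts as a token of the first sentence.
sentences = [["good", "."], ["bad"]]
labels = ["1", "-1"]  # sentiment label of each sentence

# Accumulate token counts per label, as count_senti does.
words = Counter()
for label, tokens in zip(labels, sentences):
    words[label] += len(tokens)

total = sum(words.values())
shares = {label: count / total * 100 for label, count in words.items()}
print(total, shares)
```

So the positive sentence accounts for two of the three tokens, and the negative one for the remaining token.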

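But I want it applied to every tweet in my Twitter DataFrame, which I loaded from a CSV file.

Any ideas?

I tried parse_senti('dataframe')

and got the error: expected string or bytes-like object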
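This message most likely comes from the regular expressions used during tokenization, which accept a single string and not a whole DataFrame or column. The sketch below reproduces the same kind of failure with a hypothetical regex-based `tokenize` function standing in for the NLTK tokenizers (an assumption, so that it runs without NLTK), and a plain list standing in for the DataFrame:

```python
import re


def tokenize(text):
    # Hypothetical stand-in for an NLTK tokenizer: regex-based, so it needs a str.
    return re.findall(r"\w+", text)


not_a_string = ["good", "bad"]  # stands in for a whole DataFrame or column

error_message = None
try:
    tokenize(not_a_string)  # passing the container itself fails
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Passing one string at a time works.
tokens = [tokenize(item) for item in not_a_string]
print(tokens)
```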

1 Answer:

Answer 0: (score: 0)

How silly of me. I solved it.

Simply applying the function to each row of the DataFrame solves the problem.

df['sentiment'] = df[0].apply(parse_senti)


      0       sentiment
0   bad  (100.0, -1, 1)
1  good   (100.0, 1, 1)
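Note that `parse_senti` as posted in the question only prints its results and therefore returns None, so the function has to return a value for the new column to hold anything. Below is a minimal sketch of the per-row approach under some assumptions: `tweet_sentiment` and `tokenize` are hypothetical names, a simple regex tokenizer stands in for `nltk.word_tokenize` so the snippet runs without NLTK data, the labels are integers rather than strings, and the column is named `0` as in the answer.

```python
import re

import pandas as pd

POS_WORDS = {"good", "beautiful", "best"}
NEG_WORDS = {"bad", "suck", "damn"}


def tokenize(text):
    # Assumption: regex tokenizer used in place of nltk.word_tokenize.
    return re.findall(r"[a-z']+", text.lower())


def tweet_sentiment(text):
    """Return 1, -1 or 0 for one tweet (hypothetical helper)."""
    if not isinstance(text, str):
        return 0  # e.g. NaN from an empty CSV cell
    words = set(tokenize(text))
    pos = len(POS_WORDS & words)
    neg = len(NEG_WORDS & words)
    # 1 if more positive than negative words, -1 if fewer, 0 on a tie
    return (pos > neg) - (pos < neg)


df = pd.DataFrame({0: ["bad", "good", "The best day, not bad at all, good!"]})
# Series.apply calls the function once per row and collects the return values.
df["sentiment"] = df[0].apply(tweet_sentiment)
print(df)
```

Returning a single label per tweet keeps the new column easy to filter or aggregate. A tuple of statistics, as in the output above, would work the same way with `apply`.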