Counting word occurrences from a Pandas DataFrame in Python

Time: 2019-12-17 01:00:59

Tags: python pandas counter

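Here is a sample of my Pandas DataFrame, which contains 30,000 rows (excluding the column headers). The Expression column contains two classes, mainly Sad and Happy.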

Expression              Description
Sad                     "people are sad because they got no money."
Happy                   "people are happy because ..."
Sad                     "people are miserable because they broke up"
Happy                   "They got good money"

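Based on the sample above, I want to compute word frequencies: a dictionary that counts how many times each word appears under the Sad and Happy expressions. For example: {sad: {people: 2}, happy: {happy: 1}}

Here is my code: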

def calculate_word_frequency(lst, classes):
    # variables
    wordlist = []
    dict_output = {}
    count = 0
    term = ""

    data = [lst.columns.values.tolist()] + lst.values.tolist()  # convert into a list

    for i in range(1, len(data)):
        if data[i][0] == classes[0]:
            wordlist = data[i][1].lower().split(" ")

            for words in wordlist:
                wordlist.append(words)

                for word in wordlist:
                    if word in dict_output:
                        dict_output[wordlist] += 1
                    else:
                        dict_output[wordlist] == 1
                        print(dict_output)

The expected output is based on the number of times each word appears under each expression.

# Test case:
words, freqs_per_expression = calculate_word_frequency(social_df, ["Sad", "Happy"])
# output: 538212

print(freqs_per_expression["sad"]["people"])  # output: 203

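Because of the size of the dataset, VS frequently hangs and lags, so I cannot retrieve any results. Is there a better technique I can use to get the {word: count} data I need?

Thanks!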
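Editor's note on the code above: the loop `for words in wordlist: wordlist.append(words)` appends to the list it is iterating over, which on its own never terminates; `dict_output[wordlist]` indexes the dictionary with the list instead of the word (lists are not hashable); and `== 1` compares instead of assigning. The following is only a minimal sketch of what the function seems to be aiming for, using `collections.Counter`. The return shape (a total word count plus a nested dict keyed by lowercased class name) is an assumption inferred from the test case, and the four-row `social_df` is a stand-in for the real data.

```python
from collections import Counter

import pandas as pd


def calculate_word_frequency(df, classes):
    """Return (total word count, {class: {word: count}}) for the given classes."""
    freqs_per_expression = {}
    total_words = 0
    for cls in classes:
        # Select the descriptions that belong to this class.
        descriptions = df.loc[df["Expression"] == cls, "Description"]
        counter = Counter()
        for text in descriptions:
            # Lowercase and split on whitespace, as the original code intended.
            counter.update(str(text).lower().split())
        # Assumption: keys are lowercased class names, as in the test case.
        freqs_per_expression[cls.lower()] = dict(counter)
        total_words += sum(counter.values())
    return total_words, freqs_per_expression


# Stand-in for the 30,000-row frame from the question.
social_df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ...",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

words, freqs_per_expression = calculate_word_frequency(social_df, ["Sad", "Happy"])
print(words)
print(freqs_per_expression["sad"]["people"])
```

Each description is visited once and each word is counted once, so the work grows linearly with the amount of text rather than with repeated passes over a growing list.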

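2 answers:

Answer 0 (score: 0)

We can reach the expected result in a few steps. If you are on pandas >= 0.25 you can use the new explode function; otherwise the solution below achieves what you want.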

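from collections import defaultdict

import pandas as pd

# df is the DataFrame from the question, with columns 'Expression' and 'Description'.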

exploded = df.set_index('Expression') \
             .stack() \
             .str.split(' ', expand=True) \
             .stack() \
             .reset_index() \
             .drop(['level_1', 'level_2'], axis=1) \
             .rename(columns={0: 'Word'})

print(exploded)

   Expression       Word
0         Sad     people
1         Sad        are
2         Sad        sad
3         Sad    because
4         Sad       they
...

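# Count each word per expression. Naming the counts 'Count' explicitly avoids
# depending on the name value_counts() gives its result in a given pandas version.
counts = exploded.groupby('Expression')['Word'].value_counts() \
                 .rename('Count').reset_index().to_dict('records')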

d = defaultdict(dict)

for rec in counts:
    key = rec.get('Expression')
    word = rec.get('Word')
    count = rec.get('Count')
    d[key].update({word: count})

print(d)

defaultdict(dict,
            {'Happy': {'...': 1,
              'They': 1,
              'are': 1,
              'because': 1,
              'good': 1,
              'got': 1,
              'happy': 1,
              'money': 1,
              'people': 1},
             'Sad': {'are': 2,
              'because': 2,
              'broke': 1,
              'got': 1,
              'miserable': 1,
              'money.': 1,
              'no': 1,
              'people': 2,
              'sad': 1,
              'they': 2,
              'up': 1}})
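For pandas >= 0.25, the explode route mentioned at the top of this answer might look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's own code: the small `df` stands in for the real frame, and `nested` is a hypothetical name for the resulting dictionary.

```python
import pandas as pd

# Stand-in for the question's DataFrame.
df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ...",
        "people are miserable because they broke up",
        "They got good money",
    ],
})

# Split each description into a list of words, then give every word its own row.
exploded = (
    df.assign(Word=df["Description"].str.split())
      .explode("Word")[["Expression", "Word"]]
      .reset_index(drop=True)
)

# Count words within each expression and fold the result into a nested dict.
counts = exploded.groupby("Expression")["Word"].value_counts()
nested = {}
for (expression, word), n in counts.items():
    nested.setdefault(expression, {})[word] = int(n)

print(nested["Sad"]["people"])
```

As in the stack-based version, words are split on whitespace only, so 'They' and 'they', or 'money.' and 'money', are still counted separately.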

Answer 1 (score: -2)

Here is an example that may help you complete your code:

from collections import Counter
from io import StringIO
import pandas as pd

data = """
Expression,Description
Sad,"people are sad because they got no money."
Happy,"people are happy because ..really."
Sad,"people are miserable because they broke up"
Happy,"They got good money"
"""
#read csv
df = pd.read_csv(StringIO(data),sep=',')
#Only select result where Expression = 'Sad'
dfToList=df[df['Expression']=='Sad']['Description'].tolist()
# All dict 
print(dict(Counter(" ".join(dfToList).split(" ")).items()))

words=dict(Counter(" ".join(dfToList).split(" ")).items())

for key in words:
  # Here your conditions what you want
  print(key, '->', words[key])

You can also use isin() to match several values at once, e.g. Happy, Bad, and so on:

dfToList=df[df['Expression'].isin(['Bad', 'Happy'])]['Description'].tolist()

Output (for the Sad filter):

{'people': 2, 'are': 2, 'sad': 1, 'because': 2, 'they': 2, 'got': 1, 'no': 1, 'money.': 1, 'miserable': 1, 'broke': 1, 'up': 1}
people -> 2
are -> 2
sad -> 1
because -> 2
they -> 2
got -> 1
no -> 1
money. -> 1
miserable -> 1
broke -> 1
up -> 1
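The output above treats 'money.' as its own word, and it covers one class at a time. A possible extension, sketched here under assumptions, is to normalise the text before counting and to loop over every class with groupby. The `tokenize` helper and its regular expression are hypothetical choices for illustration, not part of the answer, and the pattern only keeps ASCII letters, digits, and apostrophes.

```python
import re
from collections import Counter

import pandas as pd

# Same sample rows as in this answer's CSV snippet.
df = pd.DataFrame({
    "Expression": ["Sad", "Happy", "Sad", "Happy"],
    "Description": [
        "people are sad because they got no money.",
        "people are happy because ..really.",
        "people are miserable because they broke up",
        "They got good money",
    ],
})


def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


freqs = {}
# Iterating a grouped Series yields (group name, Series of descriptions) pairs.
for expression, descriptions in df.groupby("Expression")["Description"]:
    counter = Counter()
    for text in descriptions:
        counter.update(tokenize(text))
    freqs[expression.lower()] = dict(counter)

print(freqs["sad"]["money"])
print(freqs["happy"]["they"])
```

With this tokenizer, trailing punctuation is dropped and case is ignored, so 'money.' is counted as 'money' and 'They' as 'they'.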