Question

例如，我有一个看起来像这样的数据框： First Image

我想创建一个新的数据框，该框显示一个单词被标记为垃圾邮件或火腿的次数。我希望它看起来像这样： Second image

我已经尝试使用以下代码列出要测试的单词的仅垃圾邮件数量列表，但它似乎无法正常工作，并且会使Jupyter Notebook上的内核崩溃：

words = []
for word in df["Message"]:
    words.extend(word.split())

sentences = []
for word in df["Message"]:
    sentences.append(word.split())        

spam = []
ham = []

for word in words:
    sc = 0
    hc = 0
    for index,sentence in enumerate(sentences):
        if word in sentence:
            print(word)
            if(df["Category"][index])=="ham":
                hc+=1
            else:
                sc+=1
    spam.append(sc)
spam

df是第一张图片中显示的数据框。我该怎么做呢？

Answer 1

您可以将垃圾邮件和火腿这两个词典组合起来，以存储垃圾邮件/火腿邮件中出现的不同单词的数量。

from collections import defaultdict as dd
spam = dd(int)
ham = dd(int)
for i in range(len(sentences)):
    if df['Category'][i] == 'ham':
        p = sentences[i]
        for x in p:
            ham[x] += 1
    else:
        p = sentences[i]
        for x in p:
            spam[x] += 1

从上面的代码获得的与您类似的输入的输出如下。

>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wtf': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})

现在可以处理数据并以所需的格式导出数据。

编辑：

answer = []
for x in spam:
    answer.append([x,spam[x],ham[x]])

for x in ham:
    if x not in spam:
        answer.append([x,spam[x],ham[x]])

因此，此处答案列表中的行数等于所有消息中不同单词的数。每行的第一列是我们正在谈论的词，第二列和第三列是该词在垃圾邮件和火腿邮件中的出现次数。

我的代码获得的输出如下。

['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]

Answer 2

这样会更好： https://docs.python.org/3.8/library/collections.html#collections.Counter

from collections import Counter
import pandas as pd

df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))

def func(df: pd.DataFrame):
    for category, data in df.groupby('Category'):
        count = Counter()
        for var in data.Counter:
            count += var
        cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
        yield cur

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)

编辑：

from collections import Counter
import pandas as pd

df # the data frame in your first image. Suit both cases(whether it is a slice of the complete data frame or not)
def func(df: pd.DataFrame):
    res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
    for category, count in res.to_dict().items():
        yield pd.DataFrame.from_dict(count, orient='index', columns=[category])

demo = func(df)
df2 = next(demo)
for cur in demo:
    df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)

基于熊猫中列值的频率计数

2 个答案: