例如,我有一个看起来像这样的数据框: First Image
我想创建一个新的数据框,该框显示一个单词被标记为垃圾邮件或火腿的次数。我希望它看起来像这样: Second image
我已经尝试使用以下代码列出要测试的单词的仅垃圾邮件数量列表,但它似乎无法正常工作,并且会使Jupyter Notebook上的内核崩溃:
words = []
for word in df["Message"]:
words.extend(word.split())
sentences = []
for word in df["Message"]:
sentences.append(word.split())
spam = []
ham = []
for word in words:
sc = 0
hc = 0
for index,sentence in enumerate(sentences):
if word in sentence:
print(word)
if(df["Category"][index])=="ham":
hc+=1
else:
sc+=1
spam.append(sc)
spam
df
是第一张图片中显示的数据框。
我该怎么做呢?
答案 0 :(得分:1)
您可以将垃圾邮件和火腿这两个词典组合起来,以存储垃圾邮件/火腿邮件中出现的不同单词的数量。
from collections import defaultdict as dd
spam = dd(int)
ham = dd(int)
for i in range(len(sentences)):
if df['Category'][i] == 'ham':
p = sentences[i]
for x in p:
ham[x] += 1
else:
p = sentences[i]
for x in p:
spam[x] += 1
从上面的代码获得的与您类似的输入的输出如下。
>>> spam
defaultdict(<class 'int'>, {'ok': 1, 'lar': 1, 'joking': 1, 'wtf': 1, 'u': 1, 'oni': 1, 'free': 1, 'entry': 1, 'in': 1, '2': 1, 'a': 1, 'wkly': 1, 'comp': 1})
>>> ham
defaultdict(<class 'int'>, {'go': 1, 'until': 1, 'jurong': 1, 'crazy': 1, 'available': 1, 'only': 1, 'in': 1, 'u': 1, 'dun': 1, 'say': 1, 's': 1, 'oearly': 1, 'nah': 1, 'I': 1, 'don’t': 1, 'think': 1, 'he': 1, 'goes': 1, 'to': 1, 'usf': 1})
现在可以处理数据并以所需的格式导出数据。
编辑:
answer = []
for x in spam:
answer.append([x,spam[x],ham[x]])
for x in ham:
if x not in spam:
answer.append([x,spam[x],ham[x]])
因此,此处答案列表中的行数等于所有消息中不同单词的数。每行的第一列是我们正在谈论的词,第二列和第三列是该词在垃圾邮件和火腿邮件中的出现次数。
我的代码获得的输出如下。
['ok', 1, 0]
['lar', 1, 0]
['joking', 1, 0]
['wif', 1, 0]
['u', 1, 1]
['oni', 1, 0]
['free', 1, 0]
['entry', 1, 0]
['in', 1, 1]
答案 1 :(得分:0)
这样会更好: https://docs.python.org/3.8/library/collections.html#collections.Counter
from collections import Counter
import pandas as pd
df # the data frame in your first image
df['Counter'] = df.Message.apply(lambda x: Counter(x.split()))
def func(df: pd.DataFrame):
for category, data in df.groupby('Category'):
count = Counter()
for var in data.Counter:
count += var
cur = pd.DataFrame.from_dict(count, orient='index', columns=[category])
yield cur
demo = func(df)
df2 = next(demo)
for cur in demo:
df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)
编辑:
from collections import Counter
import pandas as pd
df # the data frame in your first image. Suit both cases(whether it is a slice of the complete data frame or not)
def func(df: pd.DataFrame):
res = df.groupby('Category').Message.apply(' '.join).str.split().apply(Counter)
for category, count in res.to_dict().items():
yield pd.DataFrame.from_dict(count, orient='index', columns=[category])
demo = func(df)
df2 = next(demo)
for cur in demo:
df2 = df2.merge(cur, how='outer', left_index=True, right_index=True)