I am trying to count how often each word occurs in the lists stored in a DataFrame column.
import pandas as pd

data = {'H':[['the', 'brown', 'fox'], ['the', 'weather', 'is'],['she', 'sells', 'sea']], 'marks':['a', 'b', 'c']}
df = pd.DataFrame(data)
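I want the word counts split by whether the mark is a, b, or c. I know I could build three separate DataFrames, but I am looking for more concise code.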
freq = {}

def count_freq(words):
    # Tally every word from one row's list into the shared freq dict
    for w in words:
        if w in freq:
            freq[w] += 1
        else:
            freq[w] = 1

df.H.apply(count_freq)
Then I tried the following, but I messed it up:
df['marks'] = z.apply(lambda row: 0 if row['marks'] in ("a")
else if row['marks'] in ("b")
else row['marks'] in ("c")
Edit: expected result
Frequency-a Frequency-b Frequency-c
the 1 1
quick 1
brown 1
fox 1
she 1
sells 1
sea 1
weather 1
is 1
Answer 0 (score: 2)
You can use get_dummies and transpose the result:
df['H'].str.join(',').str.get_dummies(sep=',').set_index(df['marks']).T
marks a b c
brown 1 0 0
fox 1 0 0
is 0 1 0
sea 0 0 1
sells 0 0 1
she 0 0 1
the 1 1 0
weather 0 1 0
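Two caveats may be worth keeping in mind with the snippet above: `str.get_dummies` produces 0/1 indicators rather than counts, and `set_index(df['marks'])` gives one column per row, so it only lines up with the expected result when each mark appears on exactly one row. If several rows share a mark, a minimal sketch like the following, which groups by mark and sums the indicators, may be closer to a per-mark frequency. The fourth row with mark 'a' is made-up sample data added purely for illustration, and the `Frequency-` column names simply mirror the expected result in the question.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical extra row with mark 'a'
data = {
    'H': [['the', 'brown', 'fox'], ['the', 'weather', 'is'],
          ['she', 'sells', 'sea'], ['the', 'fox']],
    'marks': ['a', 'b', 'c', 'a'],
}
df = pd.DataFrame(data)

# One 0/1 indicator column per word, one row per original row
dummies = df['H'].str.join(',').str.get_dummies(sep=',')

# Sum the indicators within each mark, then transpose so words are on the rows
counts = dummies.groupby(df['marks']).sum().T
counts.columns = ['Frequency-' + m for m in counts.columns]
print(counts)
```

Note that a word repeated inside a single list is still counted once for that row, because the indicators only record presence.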
Answer 1 (score: 2)
Using MultiLabelBinarizer from sklearn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
print(pd.DataFrame(mlb.fit_transform(df['H'].values), columns=mlb.classes_, index=df.marks).T)
marks a b c
brown 1 0 0
fox 1 0 0
is 0 1 0
sea 0 0 1
sells 0 0 1
she 0 0 1
the 1 1 0
weather 0 1 0
Answer 2 (score: 2)
You can unnest the lists and then use crosstab:
u = unnesting(df, 'H')
pd.crosstab(u.H, u.marks)
marks a b c
H
brown 1 0 0
fox 1 0 0
is 0 1 0
sea 0 0 1
sells 0 0 1
she 0 0 1
the 1 1 0
weather 0 1 0
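`unnesting` is not a pandas built-in; the answer appears to rely on a helper defined elsewhere that expands each list into one row per word. As a sketch under that assumption, `DataFrame.explode` (available in pandas 0.25 and later) can play the same role:

```python
import pandas as pd

# Sample data from the question
data = {
    'H': [['the', 'brown', 'fox'], ['the', 'weather', 'is'],
          ['she', 'sells', 'sea']],
    'marks': ['a', 'b', 'c'],
}
df = pd.DataFrame(data)

# Expand each list in 'H' into one row per word; 'marks' is repeated alongside
u = df.explode('H')

# Count how often each word occurs under each mark
result = pd.crosstab(u['H'], u['marks'])
print(result)
```

Because every list element becomes its own row before counting, this variant tallies each occurrence, so a word repeated within one list or across rows with the same mark is counted each time.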