I have a list of 'words' that I want to count, shown below:
word_list = ['one','two','three']
I have a column in a pandas DataFrame that contains the text below.
TEXT | USER
-------------------------------------------|---------------
"Perhaps she'll be the one for me." | User 1
"Is it two or one?" | User 1
"Mayhaps it be three afterall..." | User 2
"Three times and it's a charm." | User 2
"One fish, two fish, red fish, blue fish." | User 2
"There's only one cat in the hat." | User 3
"One does not simply code into pandas." | User 3
"Two nights later..." | User 1
"Quoth the Raven... nevermore." | User 2
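The output I want is shown below: using the data in the "TEXT" column, I want to count, for each word in word_list, the number of unique users whose text contains that word.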
Word  | Unique User Count
one   | 3   User 1/2/3 here
two   | 2   User 1/2 here
three | 1   User 2 here
Is there a way to do this in Python 2.7?
Answer 0 (score: 1)
import numpy as np
import pandas as pd
# One boolean column per word; str.find is case-sensitive and matches substrings
df[word_list]=df.TEXT.apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg([lambda x : ','.join(x),len])
Out[31]:
                       <lambda>  len
level_1
one      User 1, User 1, User 3    3
three                    User 2    1
two              User 1, User 2    2
Edit: lowercase the text before matching, and count unique users rather than matching rows
df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ','.join(set(x))],'Unique':[lambda x : x.nunique()]})
Out[50]:
          Unique              User Count
        <lambda>                <lambda>
level_1
one            3  User 2, User 3, User 1
three          1                  User 2
two            2          User 2, User 1
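A note for newer pandas: as far as I know, passing a dict of new column names to `.agg` on a grouped Series, as in the edit above, raises `SpecificationError` from pandas 1.0 onward. The sketch below shows the same aggregation written with named aggregation (available since pandas 0.25). The `pairs` frame is a hypothetical, hand-built stand-in for the stacked word/user matches, with `level_1` already renamed to `Words`; the column name `Users` is also my own choice.

```python
import pandas as pd

# Hypothetical long-form frame: one row per (word, user) match in the sample data.
pairs = pd.DataFrame({
    "Words": ["one", "one", "one", "one", "one",
              "two", "two", "two", "three", "three"],
    "USER": ["User 1", "User 1", "User 2", "User 3", "User 3",
             "User 1", "User 2", "User 1", "User 2", "User 2"],
})

# Named aggregation: keyword = output column name, value = aggregation to apply.
summary = pairs.groupby("Words")["USER"].agg(
    Users=lambda x: ",".join(sorted(set(x))),  # sorted so the order is stable
    Unique="nunique",
)
print(summary)
```

Because the output columns are named directly, there is no second column level of `<lambda>` labels to drop afterwards.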
Edit 2: flatten the column labels and keep only the unique count
df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
Target=df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ','.join(set(x))],'Unique':[lambda x : x.nunique()]})
Target.columns=Target.columns.droplevel(1)
Target.drop('User Count',axis=1).reset_index().rename(columns={'level_1':'Words'})
Out[94]:
   Words  Unique
0    one       3
1  three       1
2    two       2
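For comparison, here is a shorter sketch that skips the intermediate boolean columns. It is not the method used above: it swaps `str.find` for `Series.str.contains` with a word-boundary regex, so "one" would not match inside a longer word such as "someone". The helper name `unique_user_counts` and the output labels are my own, and the code assumes Python 3 with a reasonably recent pandas rather than Python 2.7.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "TEXT": [
        "Perhaps she'll be the one for me.",
        "Is it two or one?",
        "Mayhaps it be three afterall...",
        "Three times and it's a charm.",
        "One fish, two fish, red fish, blue fish.",
        "There's only one cat in the hat.",
        "One does not simply code into pandas.",
        "Two nights later...",
        "Quoth the Raven... nevermore.",
    ],
    "USER": ["User 1", "User 1", "User 2", "User 2", "User 2",
             "User 3", "User 3", "User 1", "User 2"],
})
word_list = ["one", "two", "three"]


def unique_user_counts(frame, words):
    """Count distinct USER values whose TEXT contains each word (whole word, any case)."""
    counts = {}
    for word in words:
        # \b anchors the match to word boundaries; re.escape guards special characters.
        pattern = r"\b" + re.escape(word) + r"\b"
        mask = frame["TEXT"].str.contains(pattern, case=False, regex=True)
        counts[word] = frame.loc[mask, "USER"].nunique()
    return pd.Series(counts, name="Unique User Count").rename_axis("Word")


result = unique_user_counts(df, word_list)
print(result)
```

On the sample data this gives 3 for one, 2 for two and 1 for three, which matches the output requested in the question.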