Count unique values in column A based on a substring filter on column B

Time: 2017-10-24 17:19:45

Tags: python-2.7 pandas

I have a list of 'words' that I want to count, shown below:

word_list = ['one','two','three']

I have a column of text in a pandas DataFrame, shown below.

TEXT                                       | USER
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | User 1
"Is it two or one?"                        | User 1
"Mayhaps it be three afterall..."          | User 2
"Three times and it's a charm."            | User 2
"One fish, two fish, red fish, blue fish." | User 2
"There's only one cat in the hat."         | User 3
"One does not simply code into pandas."    | User 3
"Two nights later..."                      | User 1
"Quoth the Raven... nevermore."            | User 2
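
For anyone who wants to reproduce this, the table above can be built as a DataFrame roughly as follows. This is only a sketch: the variable name `df` is an assumption (it is the name the answer below uses), and the surrounding quotation marks are treated as table decoration rather than part of the text.

```python
import pandas as pd

word_list = ['one', 'two', 'three']

# Sample data copied from the table above; `df` is an assumed name
df = pd.DataFrame({
    'TEXT': [
        "Perhaps she'll be the one for me.",
        "Is it two or one?",
        "Mayhaps it be three afterall...",
        "Three times and it's a charm.",
        "One fish, two fish, red fish, blue fish.",
        "There's only one cat in the hat.",
        "One does not simply code into pandas.",
        "Two nights later...",
        "Quoth the Raven... nevermore.",
    ],
    'USER': [
        'User 1', 'User 1', 'User 2', 'User 2', 'User 2',
        'User 3', 'User 3', 'User 1', 'User 2',
    ],
})

print(df)
```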

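The output I want is shown below. Using the data in the "TEXT" column, I want to count, for each word in word_list, the number of unique users whose text contains that word.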

Word  | Unique User Count | Users
one   |         3         | User 1, 2, 3
two   |         2         | User 1, 2
three |         1         | User 2

Is there a way to do this in Python 2.7?

1 Answer:

Answer 0 (score: 1)

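import numpy as np
import pandas as pd
# Flag which words occur as substrings of each TEXT (case-sensitive), then list and count the matching users per word
df[word_list]=df.TEXT.apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg([lambda x : ', '.join(x),len])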

Out[31]: 
                        <lambda>  len
level_1                              
one       User 1, User 1, User 3    3
three                     User 2    1
two               User 1, User 2    2
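
Note that in the output above the row for `one` lists User 1 twice and misses User 2. Judging from the code, the likely reasons are that `str.find` is case-sensitive (so "One fish..." does not match `'one'`) and that `len` counts matching rows rather than distinct users. A minimal sketch of both effects in plain Python, using one sentence from the question; the variable names are only illustrative:

```python
# Sample sentence taken from the question's TEXT column
text = "One fish, two fish, red fish, blue fish."

# str.find returns -1 when the substring is absent
case_sensitive = text.find('one')            # no lowercase "one" in this sentence
case_insensitive = text.lower().find('one')  # found after lower-casing

# Users as listed in the row for "one" in the output above
users = ['User 1', 'User 1', 'User 3']
row_count = len(users)            # counts rows, so a repeated user is counted twice
distinct_count = len(set(users))  # counts each user once

print(case_sensitive, case_insensitive, row_count, distinct_count)
```

Lower-casing the text and counting distinct users are the two changes made in the edit that follows.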

Edit

df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ', '.join(set(x))],'Unique':[lambda x : x.nunique()]})


Out[50]: 
          Unique               User Count
        <lambda>                 <lambda>
level_1                                  
one            3   User 2, User 3, User 1
three          1                   User 2
two            2           User 2, User 1
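
A caveat on the `agg` call above: as far as I know, passing a dict of new column names to a `SeriesGroupBy.agg` was deprecated in pandas 0.20 and removed in 1.0, so it fails on current versions. A rough equivalent using named aggregation is sketched below. It assumes pandas 0.25 or later, which no longer supports Python 2.7, so it does not apply to the asker's setup as stated. The `matches` table and the column names `word`, `users` and `unique` are hypothetical; the rows were derived by hand from the question's sample data after lower-casing.

```python
import pandas as pd

# Hypothetical long-format table: one row per (word, user) match,
# hand-derived from the question's sample data after lower-casing
matches = pd.DataFrame({
    'word': ['one', 'one', 'one', 'one', 'one',
             'two', 'two', 'two',
             'three', 'three'],
    'USER': ['User 1', 'User 1', 'User 2', 'User 3', 'User 3',
             'User 1', 'User 2', 'User 1',
             'User 2', 'User 2'],
})

# Named aggregation: each keyword becomes an output column
summary = matches.groupby('word')['USER'].agg(
    users=lambda s: ', '.join(sorted(set(s))),  # sorted for a stable order
    unique='nunique',                           # number of distinct users
)

print(summary)
```

Sorting the set before joining keeps the user list in a predictable order, which a bare `set` does not guarantee.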

Edit 2

df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
Target=df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ', '.join(set(x))],'Unique':[lambda x : x.nunique()]})
Target.columns=Target.columns.droplevel(1)
Target.drop('User Count',axis=1).reset_index().rename(columns={'level_1':'Words'})

Out[94]: 
   Words  Unique
0    one       3
1  three       1
2    two       2
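
If only the final counts are needed, a shorter route is to build one boolean mask per word and count distinct users under it. The sketch below is an alternative, not the code above: the helper name `unique_user_counts` is hypothetical, and it swaps the substring test (`find`) for whole-word matching with `\b`, so `'one'` would not match inside a word such as "someone". On the question's sample data both approaches give the same counts.

```python
import re

import pandas as pd

word_list = ['one', 'two', 'three']

# Sample data from the question
df = pd.DataFrame({
    'TEXT': [
        "Perhaps she'll be the one for me.",
        "Is it two or one?",
        "Mayhaps it be three afterall...",
        "Three times and it's a charm.",
        "One fish, two fish, red fish, blue fish.",
        "There's only one cat in the hat.",
        "One does not simply code into pandas.",
        "Two nights later...",
        "Quoth the Raven... nevermore.",
    ],
    'USER': [
        'User 1', 'User 1', 'User 2', 'User 2', 'User 2',
        'User 3', 'User 3', 'User 1', 'User 2',
    ],
})


def unique_user_counts(frame, words):
    """Count distinct USER values whose TEXT contains each word (hypothetical helper)."""
    lowered = frame['TEXT'].str.lower()
    rows = []
    for word in words:
        # \b restricts the match to whole words; re.escape guards special characters
        pattern = r'\b' + re.escape(word) + r'\b'
        mask = lowered.str.contains(pattern, regex=True)
        rows.append({'Words': word, 'Unique': frame.loc[mask, 'USER'].nunique()})
    return pd.DataFrame(rows, columns=['Words', 'Unique'])


result = unique_user_counts(df, word_list)
print(result)
```

The rows come back in the order of `word_list` rather than alphabetically, which matches the layout asked for in the question.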