Pandas: how to get the unique values of a column that contains a list of values?

Time: 2016-09-14 21:55:34

Tags: python pandas

Consider the following dataframe

import pandas as pd

df = pd.DataFrame({'name' : [['one two','three four'], ['one'],[], [],['one two'],['three']],
                   'col' : ['A','B','A','B','A','B']})
df.sort_values(by='col',inplace=True)

df
Out[62]: 
  col                   name
0   A  [one two, three four]
2   A                     []
4   A              [one two]
1   B                  [one]
3   B                     []
5   B                [three]

I would like to get a column that keeps track of all the unique strings contained in name for each value of col.

That is, the expected output is

df
Out[62]: 
  col                   name            unique_list
0   A  [one two, three four]  [one two, three four]
2   A                     []  [one two, three four]
4   A              [one two]  [one two, three four]
1   B                  [one]           [one, three]
3   B                     []           [one, three]
5   B                [three]           [one, three]

Indeed, for group A, you can see that the unique set of strings contained in [one two, three four], [] and [one two] is [one two, three four].

I can get the corresponding number of unique values using the approach from "Pandas : how to get the unique number of values in cells when cells contain lists?":

df['count_unique']=df.groupby('col')['name'].transform(lambda x: list(pd.Series(x.apply(pd.Series).stack().reset_index(drop=True, level=1).nunique())))

df
Out[65]: 
  col                   name count_unique
0   A  [one two, three four]            2
2   A                     []            2
4   A              [one two]            2
1   B                  [one]            2
3   B                     []            2
5   B                [three]            2

But replacing nunique with unique fails.

Any ideas? Thanks!

2 Answers:

Answer 0: (score: 2)

Try:

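from functools import reduce  # reduce is no longer a builtin in Python 3

# Concatenate the lists within each group, then deduplicate with set()
uniq_df = df.groupby('col')['name'].apply(lambda x: list(set(reduce(lambda y,z: y+z,x)))).reset_index()
uniq_df.columns = ['col','uniq_list']
pd.merge(df,uniq_df, on='col', how='left')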

Desired output:

  col                   name              uniq_list
0   A  [one two, three four]  [one two, three four]
1   A                     []  [one two, three four]
2   A              [one two]  [one two, three four]
3   B                  [one]           [three, one]
4   B                     []           [three, one]
5   B                [three]           [three, one]
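
Note that set() does not preserve order, which is why group B shows up as [three, one] above. As a minimal sketch (not part of the original answer, and assuming Python 3.7+ so that dict keys keep insertion order), the same idea can be written with itertools.chain and dict.fromkeys to keep the first-seen order. The name unique_per_group is only illustrative.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'name': [['one two', 'three four'], ['one'], [], [], ['one two'], ['three']],
    'col': ['A', 'B', 'A', 'B', 'A', 'B'],
})
df = df.sort_values(by='col')

# Flatten the lists of each group and drop duplicates while keeping
# the order in which the strings first appear.
unique_per_group = df.groupby('col')['name'].apply(
    lambda lists: list(dict.fromkeys(chain.from_iterable(lists)))
)

# Broadcast the per-group list back onto every row of that group.
df['unique_list'] = df['col'].map(unique_per_group)
```

Mapping the per-group result back through col keeps the original index of df, whereas the merge above resets it to 0..5.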

Answer 1: (score: 2)

Here is a solution

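import numpy as np

# Summing the lists of each group concatenates them; np.unique then deduplicates (and sorts)
df['unique_list'] = df.col.map(df.groupby('col')['name'].sum().apply(np.unique))
df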
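
On newer pandas versions (0.25 or later, which is an assumption about your environment), a similar result can probably be obtained with DataFrame.explode instead of summing lists. The following is only a sketch of that alternative, not the method from the answer above; exploded and unique_per_group are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name': [['one two', 'three four'], ['one'], [], [], ['one two'], ['three']],
    'col': ['A', 'B', 'A', 'B', 'A', 'B'],
})

# One row per string; empty lists turn into NaN and are dropped.
exploded = df.explode('name').dropna(subset=['name'])

# Unique strings per group, converted from NumPy arrays to plain lists.
unique_per_group = exploded.groupby('col')['name'].unique().apply(list)

# Map the per-group lists back onto the original rows.
df['unique_list'] = df['col'].map(unique_per_group)
```

A group whose rows contain only empty lists would disappear after dropna and end up with NaN in unique_list, so that case would need separate handling.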
