Ignoring duplicate values in pandas

Time: 2017-10-20 23:12:51

Tags: python pandas

I am trying to implement a simple voting score on a csv file using pandas. Basically, if dataframe['C'] == 'Active' and dataframe['Count'] == 0, then dataframe['Combo'] should be 0. If dataframe['C'] == 'Active' and dataframe['Count'] == 1, then dataframe['Combo'] should be 1. If dataframe['C'] == 'Active' and dataframe['Count'] == 2, then dataframe['Combo'] should be 2, and so on.

Here is my dataframe:

A        B          C           Count Combo
Ptn1    Lig1        Inactive    0      
Ptn1    Lig1        Inactive    1      
Ptn1    Lig1        Active      2      2
Ptn2    Lig2        Active      0      0
Ptn2    Lig2        Inactive    1       
Ptn3    Lig3        Active      0      0
Ptn3    Lig3        Inactive    1       
Ptn3    Lig3        Inactive    2       
Ptn3    Lig3        Inactive    3      
Ptn3    Lig3        Active      4      3

Here is my code so far, for clarity:

import pandas as pd

df = pd.read_csv('affinity.csv')
VOTE = 0
df['Combo'] = ''  # stays blank unless one of the rules below applies
# Column 'C' holds the classification (Active / Inactive)
df.loc[(df['C'] == 'Active') & (df['Count'] == 0), 'Combo'] = VOTE
df.loc[(df['C'] == 'Active') & (df['Count'] == 1), 'Combo'] = VOTE + 1
df.loc[(df['C'] == 'Active') & (df['Count'] == 2), 'Combo'] = VOTE + 2
df.loc[(df['C'] == 'Active') & (df['Count'] > 3), 'Combo'] = VOTE + 3

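My code does this correctly. However, the Ptn3-Lig3 pair has two 'Active' values: one at dataframe['Count'] == 0 and another at dataframe['Count'] == 4. Is there a way to ignore the second value (i.e., consider only the smallest dataframe['Count'] value) and add the corresponding number to dataframe['Combo']? I know pandas.DataFrame.drop_duplicates() might be a way to select only unique values, but I would prefer to avoid deleting any rows.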

1 answer:

Answer 0: (score: 1)

You can use groupby + apply:

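import numpy as np
import pandas as pd

def foo(x):
    # Boolean mask of the Active rows within this (A, B) group
    m = x['C'].eq('Active')
    if m.any():
        # Every Active row gets the Count of the group's first Active row
        return pd.Series(np.where(m, x.loc[m, 'Count'].head(1), np.nan))
    else:
        return pd.Series([np.nan] * len(x))

df['Combo'] = df.groupby(['A', 'B'], group_keys=False).apply(foo).values
print(df)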

      A     B         C  Count Combo
0  Ptn1  Lig1  Inactive      0      
1  Ptn1  Lig1  Inactive      1      
2  Ptn1  Lig1    Active      2     2
3  Ptn2  Lig2    Active      0     0
4  Ptn2  Lig2  Inactive      1      
5  Ptn3  Lig3    Active      0     0
6  Ptn3  Lig3  Inactive      1      
7  Ptn3  Lig3  Inactive      2      
8  Ptn3  Lig3  Inactive      3      
9  Ptn3  Lig3    Active      4     0
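The same idea can also be sketched without apply, by masking Count on the Active rows and broadcasting each group's minimum with transform. This is only a sketch under a few assumptions: the dataframe from the question is rebuilt inline instead of being read from affinity.csv, the helper names (active, active_count, group_min) are made up for illustration, and the group minimum is used in place of the first Active row, which gives the same values here only because Count increases within each group. Combo comes out as a float column with NaN rather than blanks.

```python
import pandas as pd

# Data from the question, rebuilt inline (assumption: stands in for affinity.csv)
df = pd.DataFrame({
    'A': ['Ptn1'] * 3 + ['Ptn2'] * 2 + ['Ptn3'] * 5,
    'B': ['Lig1'] * 3 + ['Lig2'] * 2 + ['Lig3'] * 5,
    'C': ['Inactive', 'Inactive', 'Active',
          'Active', 'Inactive',
          'Active', 'Inactive', 'Inactive', 'Inactive', 'Active'],
    'Count': [0, 1, 2, 0, 1, 0, 1, 2, 3, 4],
})

active = df['C'].eq('Active')

# Count on Active rows, NaN on Inactive rows
active_count = df['Count'].where(active)

# Smallest Active Count per (A, B) pair, broadcast back to every row of the pair
group_min = active_count.groupby([df['A'], df['B']]).transform('min')

# Keep the value on Active rows only; Inactive rows stay NaN
df['Combo'] = group_min.where(active)
print(df)
```

No rows are dropped or reordered, since transform returns a result aligned with the original index.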

Another option, using groupby + merge:

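# Select the columns with a list (double brackets); tuple-style ['C', 'Count'] is rejected by newer pandas
df = df.groupby(['A', 'B', 'C'])[['C', 'Count']]\
       .apply(lambda x: x['Count'].values[0] if x['C'].eq('Active').any() else np.nan)\
       .reset_index(name='Combo').fillna('').merge(df)
print(df)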

      A     B         C Combo  Count
0  Ptn1  Lig1    Active     2      2
1  Ptn1  Lig1  Inactive            0
2  Ptn1  Lig1  Inactive            1
3  Ptn2  Lig2    Active     0      0
4  Ptn2  Lig2  Inactive            1
5  Ptn3  Lig3    Active     0      0
6  Ptn3  Lig3    Active     0      4
7  Ptn3  Lig3  Inactive            1
8  Ptn3  Lig3  Inactive            2
9  Ptn3  Lig3  Inactive            3

Note that this ends up sorting your groups.
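If the original row order matters, one possible workaround is to aggregate only the Active rows and left-merge the result back onto the original frame, since a left merge keeps the order of the left-hand rows. This is a minimal sketch and not part of the answer above: the data is rebuilt inline, and the names first_active and out are illustrative.

```python
import pandas as pd

# Data from the question, rebuilt inline (assumption: stands in for affinity.csv)
df = pd.DataFrame({
    'A': ['Ptn1'] * 3 + ['Ptn2'] * 2 + ['Ptn3'] * 5,
    'B': ['Lig1'] * 3 + ['Lig2'] * 2 + ['Lig3'] * 5,
    'C': ['Inactive', 'Inactive', 'Active',
          'Active', 'Inactive',
          'Active', 'Inactive', 'Inactive', 'Inactive', 'Active'],
    'Count': [0, 1, 2, 0, 1, 0, 1, 2, 3, 4],
})

# Smallest Count among the Active rows of each (A, B) pair
first_active = (
    df[df['C'].eq('Active')]
    .groupby(['A', 'B'], as_index=False)['Count']
    .min()
    .rename(columns={'Count': 'Combo'})
)
first_active['C'] = 'Active'  # so the merge only matches Active rows

# how='left' keeps every row of df, in df's original order
out = df.merge(first_active, on=['A', 'B', 'C'], how='left')
print(out)
```

Inactive rows get NaN in Combo; chain .fillna('') onto the merge if blanks are preferred, as in the answer above.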