How can I find duplicates in a pandas DataFrame based on a condition?

Time: 2019-08-21 13:31:07

Tags: python pandas python-2.7 dataframe

I have a pandas DataFrame:

RTYPE  PERIOD_ID    STORE_ID                       MKT MTYPE  RGROUP  RZF  RXF
0    MKT   20171411  3102300001  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
1    MKT   20171411  3102300002  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
2    MKT   20171411  3104001193              PM Provision  CELL     NaN  NaN  NaN
3    MKT   20171411  3104001193  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
4    MKT   20171411  3104001193    Provision including MM  CELL     NaN  NaN  NaN
5    MKT   20171411  3104001641              PM Provision  CELL     NaN  NaN  NaN
6    MKT   20171411  3104001641  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
7    MKT   20171411  3104001641    Provision including MM  CELL     NaN  NaN  NaN
8    MKT   20171411  3104001682              PM Provision  CELL     NaN  NaN  NaN
9    MKT   20171411  3104001682  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
10   MKT   20171411  3104001682    Provision including MM  CELL     NaN  NaN  NaN
11   MKT   20171412  3104001682                   Alcohol  CELL     NaN  NaN  NaN
12   MKT   20171412  3104001682                      Fish  CELL     NaN  NaN  NaN
13   MKT   20171412  3104001684                   Alcohol  CELL     NaN  NaN  NaN
14   MKT   20171412  3104001684                      Fish  CELL     NaN  NaN  NaN

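I need to find duplicate MKTs under the following condition: within a given PERIOD_ID, two MKTs are duplicates if they cover exactly the same set of STORE_IDs. So in this case, for period 20171411 the duplicates are 'PM Provision' and 'Provision including MM', and for period 20171412 the duplicates are 'Alcohol' and 'Fish'.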
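The rule above can be sketched directly: collect the set of stores for each (PERIOD_ID, MKT) pair, then flag MKTs whose store set is shared with another MKT in the same period. This is a minimal sketch, not any answerer's method; the `rows` list is a hand-typed copy of the three relevant columns of the sample data, and the names `store_sets`, `STORES`, and `is_dup` are made up for illustration.

```python
import pandas as pd

# Hand-typed copy of the sample data (only the columns the rule uses).
rows = [
    (20171411, 3102300001, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3102300002, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001193, "PM Provision"),
    (20171411, 3104001193, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001193, "Provision including MM"),
    (20171411, 3104001641, "PM Provision"),
    (20171411, 3104001641, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001641, "Provision including MM"),
    (20171411, 3104001682, "PM Provision"),
    (20171411, 3104001682, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001682, "Provision including MM"),
    (20171412, 3104001682, "Alcohol"),
    (20171412, 3104001682, "Fish"),
    (20171412, 3104001684, "Alcohol"),
    (20171412, 3104001684, "Fish"),
]
df = pd.DataFrame(rows, columns=["PERIOD_ID", "STORE_ID", "MKT"])

# One row per (PERIOD_ID, MKT), holding the set of stores that MKT covers.
# frozenset is hashable, so it can be compared by duplicated() below.
store_sets = (
    df.groupby(["PERIOD_ID", "MKT"])["STORE_ID"]
      .agg(frozenset)
      .reset_index(name="STORES")
)

# An MKT is a duplicate when another MKT in the same period has the same store set.
is_dup = store_sets.duplicated(["PERIOD_ID", "STORES"], keep=False)

# Period as key, sorted tuple of duplicate MKTs as value.
result = {
    int(period): tuple(sorted(mkts))
    for period, mkts in store_sets[is_dup].groupby("PERIOD_ID")["MKT"]
}
print(result)
```

On the sample data, 'PM KA+PM PROV+SMKT+PETRO' is left out because it also covers stores 3102300001 and 3102300002, so its store set matches no other MKT.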

Here is what I have tried so far:

df1 = newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'], keep=False)]
d1 = {k:tuple(set(v)) for k, v in df1.groupby('PERIOD_ID')['MKT']}
print (d1)

which returns:

{20171411L: ('Provision including MM', 'PM Provision', 'PM KA+PM PROV+SMKT+PETRO'), 20171412L: ('Fish', 'Alcohol')}

The output above does not contain the duplicates; it only contains the unique set of MKTs for each period.
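A likely reason is that `duplicated(['PERIOD_ID','STORE_ID'], keep=False)` flags every row whose store appears more than once in a period, regardless of whether the MKT's full store set matches another MKT's. The sketch below uses a hypothetical miniature dataset (MKT names "A", "B", "C" and the store numbers are invented) to show the effect.

```python
import pandas as pd

# Hypothetical miniature of the data: "A" covers one store more than "B" and "C",
# so only "B" and "C" share an identical store set.
df = pd.DataFrame({
    "PERIOD_ID": [1, 1, 1, 1, 1, 1, 1],
    "STORE_ID":  [10, 20, 20, 20, 30, 30, 30],
    "MKT":       ["A", "A", "B", "C", "A", "B", "C"],
})

# Flags every row whose (PERIOD_ID, STORE_ID) pair occurs more than once.
flagged = df[df.duplicated(["PERIOD_ID", "STORE_ID"], keep=False)]

# "A" is flagged at stores 20 and 30 even though it also covers store 10.
flagged_mkts = set(flagged["MKT"])
print(flagged_mkts)
```

Here all three MKTs end up in `flagged_mkts`, which mirrors how 'PM KA+PM PROV+SMKT+PETRO' appears in the output above.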

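What I need is something like this, with the period as the key and that period's duplicate MKTs as the value (the duplicate condition is the one described above):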

{20171411L: ('Provision including MM', 'PM Provision'), 20171412L: ('Fish', 'Alcohol')}

I am really new to pandas and have only a basic understanding of Python. Any help would be great.

3 answers:

Answer 0: (score: 0)

I hope I have understood you correctly; if I have missed something or misunderstood, please feel free to comment.

df_grouped = df.groupby(['PERIOD_ID','STORE_ID','MKT'],
                    as_index=False)\
                    .agg({'MTYPE':'count'})\
                    .rename(columns={'MTYPE': 'count'})

df_grouped[df_grouped['count'] > 1]\
           .groupby('PERIOD_ID')\
           .agg({'MKT':lambda x: list(set(x))}).to_dict()['MKT']

Answer 1: (score: 0)

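This should work for your case. I simply removed, from the duplicate MKTs that were found, the MKTs that also occur on non-duplicated rows.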

# Rows whose (PERIOD_ID, STORE_ID) pair occurs more than once
mask = newdf.duplicated(['PERIOD_ID', 'STORE_ID'], keep=False)
duplicate = {k: set(v) for k, v in newdf[mask].groupby('PERIOD_ID')['MKT']}
unique = {k: set(v) for k, v in newdf[~mask].groupby('PERIOD_ID')['MKT']}

final = dict()
for k in duplicate:
    if k in unique:
        final[k] = tuple(duplicate[k] - unique[k])
    else:
        final[k] = tuple(duplicate[k])

print(final)

Answer 2: (score: 0)

I was able to solve this with the following code:

    df1 = df[['PERIOD_ID', 'STORE_ID', 'MKT']]
    df1 = df1.sort_values(['PERIOD_ID', 'STORE_ID'], ascending=True)
    # One row per (PERIOD_ID, MKT); STORE_ID becomes a comma-joined string of that MKT's stores
    duplicatedf = df1.groupby(['PERIOD_ID', 'MKT'])['STORE_ID'].agg(lambda STORE_ID: ','.join(STORE_ID.astype(str).replace(' ', '').unique())).reset_index()
    # Keep the MKTs whose store string is shared with another MKT in the same period
    duplicates = duplicatedf[duplicatedf.duplicated(['PERIOD_ID', 'STORE_ID'], keep=False)]
    # Join the duplicate MKTs of each (PERIOD_ID, STORE_ID) group into one string
    duplicates = duplicates.groupby(['PERIOD_ID', 'STORE_ID']).agg(lambda MKT: ','.join(MKT.astype(str))).reset_index()
    print(duplicates)

    # Convert the DataFrame into a list of dicts
    dupdictdf = duplicates[['PERIOD_ID', 'MKT']]
    dicta = dupdictdf.to_dict("records")
    print(dicta)
    print (dicta)