I have a pandas DataFrame:
    RTYPE  PERIOD_ID  STORE_ID    MKT                       MTYPE  RGROUP  RZF  RXF
0   MKT    20171411   3102300001  PM KA+PM PROV+SMKT+PETRO  CELL   NaN     NaN  NaN
1   MKT    20171411   3102300002  PM KA+PM PROV+SMKT+PETRO  CELL   NaN     NaN  NaN
2   MKT    20171411   3104001193  PM Provision              CELL   NaN     NaN  NaN
3   MKT    20171411   3104001193  PM KA+PM PROV+SMKT+PETRO  CELL   NaN     NaN  NaN
4   MKT    20171411   3104001193  Provision including MM    CELL   NaN     NaN  NaN
5   MKT    20171411   3104001641  PM Provision              CELL   NaN     NaN  NaN
6   MKT    20171411   3104001641  PM KA+PM PROV+SMKT+PETRO  CELL   NaN     NaN  NaN
7   MKT    20171411   3104001641  Provision including MM    CELL   NaN     NaN  NaN
8   MKT    20171411   3104001682  PM Provision              CELL   NaN     NaN  NaN
9   MKT    20171411   3104001682  PM KA+PM PROV+SMKT+PETRO  CELL   NaN     NaN  NaN
10  MKT    20171411   3104001682  Provision including MM    CELL   NaN     NaN  NaN
11  MKT    20171412   3104001682  Alcohol                   CELL   NaN     NaN  NaN
12  MKT    20171412   3104001682  Fish                      CELL   NaN     NaN  NaN
13  MKT    20171412   3104001684  Alcohol                   CELL   NaN     NaN  NaN
14  MKT    20171412   3104001684  Fish                      CELL   NaN     NaN  NaN
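I need to find duplicate MKTs based on this condition: within a given PERIOD_ID, two MKTs are duplicates if they have exactly the same set of STORE_IDs. So in this case, for period 20171411 the duplicates are "PM Provision" and "Provision including MM", and for period 20171412 the duplicates are "Alcohol" and "Fish".
So far I have tried: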
df1 = newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'], keep=False)]
d1 = {k:tuple(set(v)) for k, v in df1.groupby('PERIOD_ID')['MKT']}
print (d1)
which returns:
{20171411L: ('Provision including MM', 'PM Provision', 'PM KA+PM PROV+SMKT+PETRO'), 20171412L: ('Fish', 'Alcohol')}
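The output above does not return the duplicates; it only returns the unique set of MKTs for each period.
What I need is something like the following, with the period as the key and that period's duplicate MKTs as the value (using the duplicate condition described above):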
{20171411L: ('Provision including MM', 'PM Provision'), 20171412L: ('Fish', 'Alcohol')}
I am really new to pandas and have only a basic understanding of Python. Any help would be great.
Answer 0 (score: 0)
I hope I understood you correctly; feel free to comment if I forgot something or misunderstood.
df_grouped = df.groupby(['PERIOD_ID','STORE_ID','MKT'],
                        as_index=False)\
    .agg({'MTYPE':'count'})\
    .rename(columns={'MTYPE': 'count'})
df_grouped[df_grouped['count'] > 1]\
    .groupby('PERIOD_ID')\
    .agg({'MKT':lambda x: list(set(x))}).to_dict()['MKT']
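Answer 1 (score: 0)
This should work for your case. I simply took the duplicated MKTs that were found and removed from them the MKTs that also occur on non-duplicated rows.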
duplicate = {k:set(v) for k, v in newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'],
             keep=False)].groupby('PERIOD_ID')['MKT']}
unique = {k:set(v) for k, v in newdf[newdf.duplicated(['PERIOD_ID','STORE_ID'],
          keep=False) == False].groupby('PERIOD_ID')['MKT']}
final = dict()
for k in duplicate:
    if k in unique:
        final[k] = tuple(duplicate[k] - unique[k])
    else:
        final[k] = tuple(duplicate[k])
print(final)
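As a small worked illustration of the subtraction step above: assuming the two dict comprehensions are run on the sample rows from the question, they should produce the sets hard-coded below (these values are derived by hand, not copied from a run). The sketch also shows that `dict.get` with an empty-set default can stand in for the if/else branch.

```python
# Hand-derived from the question's sample rows (an assumption, not captured output):
# MKTs on rows whose (PERIOD_ID, STORE_ID) pair occurs more than once.
duplicate = {
    20171411: {"PM Provision", "PM KA+PM PROV+SMKT+PETRO", "Provision including MM"},
    20171412: {"Alcohol", "Fish"},
}
# MKTs on rows whose (PERIOD_ID, STORE_ID) pair occurs exactly once.
unique = {20171411: {"PM KA+PM PROV+SMKT+PETRO"}}

# Subtract per period; a period missing from `unique` subtracts the empty set.
final = {k: tuple(sorted(v - unique.get(k, set()))) for k, v in duplicate.items()}
print(final)
```

Note that this subtraction is a heuristic that happens to fit the sample. It does not compare store sets directly, so an MKT that shares every one of its stores with other MKTs would be kept even if no other MKT has exactly the same store set.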
Answer 2 (score: 0)
I was able to solve this using the following code:
df1=df[['PERIOD_ID','STORE_ID','MKT']]
df1=df1.sort_values(['PERIOD_ID','STORE_ID'],ascending=True)
duplicatedf = df1.groupby(['PERIOD_ID','MKT'])['STORE_ID'].agg(lambda STORE_ID: ','.join(STORE_ID.astype(str).replace(' ','').unique())).reset_index()
duplicates =duplicatedf[ duplicatedf.duplicated(['PERIOD_ID','STORE_ID'],keep='first') | duplicatedf.duplicated(['PERIOD_ID','STORE_ID'],keep='last')]
duplicates= duplicates.groupby(['PERIOD_ID','STORE_ID']).agg(lambda MKT: ','.join(MKT.astype(str))).reset_index()
print (duplicates)
#Converting the df into dict
dupdictdf=duplicates[['PERIOD_ID','MKT']]
dicta=dupdictdf.to_dict("records")
print (dicta)
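A more direct way to express the condition from the question (same set of STORE_IDs within a PERIOD_ID) is to key each MKT by a frozenset of its stores instead of a joined string. The following is only a minimal sketch, not the code from any of the answers; the function name `duplicate_mkts` is hypothetical, and the sample frame keeps just the three columns the check needs.

```python
from collections import defaultdict

import pandas as pd

# Sample rows from the question, reduced to (PERIOD_ID, STORE_ID, MKT).
rows = [
    (20171411, 3102300001, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3102300002, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001193, "PM Provision"),
    (20171411, 3104001193, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001193, "Provision including MM"),
    (20171411, 3104001641, "PM Provision"),
    (20171411, 3104001641, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001641, "Provision including MM"),
    (20171411, 3104001682, "PM Provision"),
    (20171411, 3104001682, "PM KA+PM PROV+SMKT+PETRO"),
    (20171411, 3104001682, "Provision including MM"),
    (20171412, 3104001682, "Alcohol"),
    (20171412, 3104001682, "Fish"),
    (20171412, 3104001684, "Alcohol"),
    (20171412, 3104001684, "Fish"),
]
df = pd.DataFrame(rows, columns=["PERIOD_ID", "STORE_ID", "MKT"])


def duplicate_mkts(frame):
    """Map each PERIOD_ID to the MKTs that share an identical store set (hypothetical helper)."""
    # Bucket MKTs by (period, frozenset of stores); frozensets are hashable and order-free.
    buckets = defaultdict(list)
    for (period, mkt), stores in frame.groupby(["PERIOD_ID", "MKT"])["STORE_ID"]:
        buckets[(int(period), frozenset(stores))].append(mkt)

    # A bucket holding more than one MKT means those MKTs cover exactly the same stores.
    found = defaultdict(list)
    for (period, _stores), mkts in buckets.items():
        if len(mkts) > 1:
            found[period].extend(mkts)
    return {period: tuple(sorted(mkts)) for period, mkts in found.items()}


result = duplicate_mkts(df)
print(result)
```

On the sample rows this should give `{20171411: ('PM Provision', 'Provision including MM'), 20171412: ('Alcohol', 'Fish')}`. If a period contains several separate groups of duplicates, this sketch merges them into one tuple; returning a list of tuples per period would keep the groups apart.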