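I am new to Python and am working with Pandas and NumPy. I have a DataFrame df, and I want to find the values of the column OZNAKA_PARTIJE for which the values of KLIJENT_ID are not unique, and delete the rows that carry those values.
I try to avoid loops wherever possible, but the condition here seems too complex for the methods I know. Is it possible to write a vectorized version of this code using Pandas or NumPy functions?
The loop below takes a very long time to run and ends with a MemoryError.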
party_labels = df['OZNAKA_PARTIJE'].unique().tolist()
for i in party_labels:
    extracted_party_label = df.loc[df['OZNAKA_PARTIJE'] == i]
    # check if you can use the drop method below
    if not extracted_party_label.index.is_unique:
        print('Drop method might not work properly')
    # if there exists multiple client ids for given party label
    # (nunique() counts distinct ids; is_unique would instead flag repeated ids)
    if extracted_party_label['KLIJENT_ID'].nunique() > 1:
        # delete rows with that party label in the original dataset
        df.drop(extracted_party_label.index, inplace=True)
Update: answered!
Based on the answer posted by @Chris, I came up with this.
df2 = df.copy()
# number of distinct client ids for each party label
gb = df2.groupby('OZNAKA_PARTIJE')['KLIJENT_ID'].nunique()
# look up that number for every row and keep the rows where it is exactly 1
mask = gb.loc[df2['OZNAKA_PARTIJE']] == 1
df2 = df2[mask.values]
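For comparison, here is a minimal sketch of the same filter written with groupby(...).transform('nunique'), which returns the per-label count already aligned with the original rows, so no separate lookup step is needed. The sample data is hypothetical; only the two column names come from my real dataset.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame(
    {
        "OZNAKA_PARTIJE": [1, 1, 2, 3, 3],
        "KLIJENT_ID": [2, 2, 3, 4, 8],
    }
)

# Number of distinct client ids per party label, repeated on every row of that label.
n_clients = df.groupby("OZNAKA_PARTIJE")["KLIJENT_ID"].transform("nunique")

# Keep only the rows whose party label maps to exactly one client id.
result = df[n_clients == 1]
print(result)
```

In this sample, party label 3 has two different client ids (4 and 8), so both of its rows are dropped, while label 1 is kept because its two rows share the same client id.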
Answer 0 (score: 0)
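If there is a third column that holds any values, you can group by the first and second columns and count the third. If a combination of the first two columns is unique, the count is 1. If the combination is duplicated, the count is higher. You can use that to create a boolean mask and then filter df with it.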
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 'a'], [1, 2, 'b'], [2, 3, 'c'], [3, 4, 'd'], [3, 8, 'e']],
    columns=['OZNAKA_PARTIJE', 'KLIJENT_ID', 'OTHER'],
)
# True where a (OZNAKA_PARTIJE, KLIJENT_ID) pair occurs exactly once
df = df.groupby(['OZNAKA_PARTIJE', 'KLIJENT_ID'])['OTHER'].count() == 1
df = df.reset_index()
# keep only the pairs flagged True
df[df['OTHER']]
   OZNAKA_PARTIJE  KLIJENT_ID  OTHER
1               2           3   True
2               3           4   True
3               3           8   True
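Note that the snippet above replaces df with the grouped result, so the original rows and the contents of OTHER are lost. If you would rather keep the original rows, one possible variant (a sketch on the same toy data, not tested against the asker's dataset) is to use transform('count'), which returns the group count aligned with the original index so it can serve directly as a row mask. The names pair_count and unique_pairs are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "a"], [1, 2, "b"], [2, 3, "c"], [3, 4, "d"], [3, 8, "e"]],
    columns=["OZNAKA_PARTIJE", "KLIJENT_ID", "OTHER"],
)

# Count of non-null OTHER values in each (party label, client id) group,
# repeated on every row that belongs to the group.
pair_count = df.groupby(["OZNAKA_PARTIJE", "KLIJENT_ID"])["OTHER"].transform("count")

# Rows whose (party label, client id) pair occurs exactly once.
unique_pairs = df[pair_count == 1]
print(unique_pairs)
```

This keeps the rows with OTHER equal to 'c', 'd' and 'e', the same pairs as in the output above, but with their original index and OTHER values intact.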