Vectorizing a complex operation on a DataFrame with Python

Time: 2019-04-22 17:46:10

Tags: python pandas numpy

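I am new to Python and am working with Pandas and NumPy. I have a DataFrame df, and I want to find the values of OZNAKA_PARTIJE for which the values in the column KLIJENT_ID are not unique (that is, party labels associated with more than one client ID) and delete those rows.

I try to avoid loops wherever possible, but the condition here seems too complex for the methods I know. Is it possible to write a vectorized version of this code using some Pandas or NumPy functions?

The loop below takes a very long time to run and ends with a MemoryError.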

party_labels = df['OZNAKA_PARTIJE'].unique().tolist()

for i in party_labels:
    extracted_party_label = df.loc[df['OZNAKA_PARTIJE'] == i]

    # drop() below removes rows by index label, so duplicate index labels
    # would remove more rows than intended
    if not extracted_party_label.index.is_unique:
        print('Drop method might not work properly')

    # if there exist multiple distinct client ids for the given party label
    # (is_unique would instead flag a single client id repeated across rows)
    if extracted_party_label['KLIJENT_ID'].nunique() > 1:
        # delete rows with that party label in the original dataset
        df.drop(extracted_party_label.index, inplace=True)

Update: answered!

Based on the answer posted by @Chris, I came up with this.

df2 = df.copy()
# number of distinct client ids for each party label
gb = df2.groupby('OZNAKA_PARTIJE')['KLIJENT_ID'].nunique()
# look up that number for every row and keep labels with exactly one client id
mask = df2['OZNAKA_PARTIJE'].map(gb) == 1
df2 = df2[mask]
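
For comparison, here is a minimal sketch of the same idea written with `groupby(...).transform('nunique')`, which returns the per-label count already aligned to the original rows, so no separate lookup step is needed. The helper name `drop_ambiguous_labels` and the sample frame are hypothetical and only there for illustration; the column names come from the question.

```python
import pandas as pd


def drop_ambiguous_labels(df, label_col="OZNAKA_PARTIJE", id_col="KLIJENT_ID"):
    """Keep only rows whose party label maps to exactly one distinct client id.

    Hypothetical helper for illustration, not part of pandas.
    """
    # For every row, the number of distinct client ids within its label group.
    distinct_ids = df.groupby(label_col)[id_col].transform("nunique")
    return df[distinct_ids == 1]


# Small made-up sample: label 3 has two different client ids (4 and 8).
sample = pd.DataFrame(
    [[1, 2, "a"], [1, 2, "b"], [2, 3, "c"], [3, 4, "d"], [3, 8, "e"]],
    columns=["OZNAKA_PARTIJE", "KLIJENT_ID", "OTHER"],
)

result = drop_ambiguous_labels(sample)
print(result)
```

On this sample the two rows with label 3 are removed, while both rows with label 1 are kept because they share a single client id.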

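1 Answer:

Answer 0: (score: 0)

If the third column has a value in every row, you can group by the first and second columns and count the third. If a combination of the first two is unique, the count is 1. If it is duplicated, the count is higher. You can use this to create a boolean mask and then filter df with it.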

import pandas as pd

df = pd.DataFrame([[1,2,'a'],[1,2,'b'],[2,3,'c'],[3,4,'d'],[3,8,'e']], columns=['OZNAKA_PARTIJE', 'KLIJENT_ID', 'OTHER'])

# True where an (OZNAKA_PARTIJE, KLIJENT_ID) pair occurs exactly once
df = df.groupby(['OZNAKA_PARTIJE','KLIJENT_ID'])['OTHER'].count() == 1
df = df.reset_index()
# keep only the pairs flagged True
df[df['OTHER']]


   OZNAKA_PARTIJE  KLIJENT_ID  OTHER
1               2           3   True
2               3           4   True
3               3           8   True
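
The snippet above replaces df with the grouped result, so the original rows are gone. As a rough sketch of how a count-based boolean mask could instead be applied back to the original rows, one option is to count distinct (label, client id) pairs per label and filter with `isin`. This is an assumption about how the idea might be adapted to the question, not the answerer's code; the intermediate names (`pairs`, `ids_per_label`, `single_id_labels`) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 'a'], [1, 2, 'b'], [2, 3, 'c'], [3, 4, 'd'], [3, 8, 'e']],
    columns=['OZNAKA_PARTIJE', 'KLIJENT_ID', 'OTHER'],
)

# One row per distinct (party label, client id) pair.
pairs = df[['OZNAKA_PARTIJE', 'KLIJENT_ID']].drop_duplicates()

# Number of distinct client ids seen for each party label.
ids_per_label = pairs['OZNAKA_PARTIJE'].value_counts()

# Party labels tied to exactly one client id.
single_id_labels = ids_per_label[ids_per_label == 1].index

# Boolean mask over the original rows; df itself is left untouched.
mask = df['OZNAKA_PARTIJE'].isin(single_id_labels)
filtered = df[mask]
print(filtered)
```

Here label 3 is linked to client ids 4 and 8, so its rows are masked out, and the remaining rows keep their original index and their OTHER values.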