Delete certain rows based on a condition check after grouping

Time: 2019-05-16 21:49:49

Tags: python-3.x

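I have two columns. I want to group by column A and then check whether each group contains three distinct values in column B; if it does not, every row of that group should be deleted.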

Please check the image for input and output required

In the required output, ABC has to be dropped because it only has 1 and 2, whereas I need 1, 2 and 3 to each appear at least once.

ColA    ColB

ABC     1
ABC     2
XYZ     1
PQR     1
PQR     2
XYZ     2
XYZ     3
PQR     3
PQR     2
XYZ     1
ABC     2

Output

ColA     ColB

XYZ       1
          2
          3

PQR       1
          2
          3

I tried using a for loop, but it did not work.

1 Answer:

Answer 0: (score: 0)

import pandas as pd

data = [ ('ABC', 1),
         ('ABC', 2),
         ('XYZ', 1),
         ('PQR', 1),
         ('PQR', 2),
         ('XYZ', 2),
         ('XYZ', 3),
         ('PQR', 3),
         ('PQR', 2),
         ('XYZ', 1),
         ('ABC', 2)
    ]
# create the DataFrame
data_df = pd.DataFrame(data, columns=['col_a', 'col_b'])


# collect every unique col_b value in the data set; this is used to find
# which col_a values do not contain all of them
dfSetB = set(data_df['col_b'])

#sort by column a
dfSorted = data_df.sort_values('col_a')

# collect the unique col_a values; the loop below filters the data by each one
dfColAValues = set(data_df['col_a'])

# check each col_a value to see whether it contains all unique col_b values
inclusion_list = []
# work through each unique col_a entry
for col_item in dfColAValues:
    # filter the data set down to the rows for this col_a value
    dfTemp = data_df.loc[data_df['col_a'] == col_item]
    # collect the unique col_b values for this col_a entry
    dfSetTemp = set(dfTemp['col_b'])
    # if this entry has every col_b value seen in the whole data set,
    # add it to the inclusion list
    if dfSetTemp == dfSetB:
        inclusion_list.append(col_item)

# keep only the col_a values that contain every unique col_b value, then drop duplicate rows
dfFinal = data_df.loc[data_df['col_a'].isin(inclusion_list)].drop_duplicates().sort_values(['col_a', 'col_b'])
print(dfFinal)

Output:

  col_a  col_b
3   PQR      1
4   PQR      2
7   PQR      3
2   XYZ      1
5   XYZ      2
6   XYZ      3
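
The loop above can probably be condensed with pandas' own `groupby(...).filter(...)`. The following is only a sketch under one assumption that the question implies but the answer does not use: the required values are exactly 1, 2 and 3 (the hypothetical `required` set), rather than "every value that happens to occur in col_b". The column names `col_a` and `col_b` are taken from the answer.

```python
import pandas as pd

# Same sample data as in the question.
data_df = pd.DataFrame(
    {
        "col_a": ["ABC", "ABC", "XYZ", "PQR", "PQR", "XYZ",
                  "XYZ", "PQR", "PQR", "XYZ", "ABC"],
        "col_b": [1, 2, 1, 1, 2, 2, 3, 3, 2, 1, 2],
    }
)

# Assumption: a group is kept only if it contains all of these values.
required = {1, 2, 3}

# filter() keeps all rows of every group for which the function returns True.
kept = data_df.groupby("col_a").filter(
    lambda group: required <= set(group["col_b"])
)

# Remove repeated (col_a, col_b) pairs and order the result like the answer does.
result = kept.drop_duplicates().sort_values(["col_a", "col_b"])
print(result)
```

With this sample data the two approaches should agree, since 1, 2 and 3 are the only values in col_b. They would differ if some group contained an extra value such as 4: the answer's equality check would then reject every group that lacks 4, while the subset check here only requires 1, 2 and 3.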