Python: Delete dataframe rows using two specific conditions and keep the rest

Posted: 2020-11-07 10:56:29

Tags: python pandas dataframe duplicates rows

Let's say I have this dataframe:

import pandas as pd

Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']

df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})

print(df)

          Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3    ID_bal_amt         12  EUR     1
4   ID_bal_time       June  EUR     1
5      Dan_city     Berlin          1
6    ID_bal_mod       OPBD  EUR     1
7   Dan_country         55          1
8   ID_bal_type       CRDT          1
9    ID_bal_amt        432          2
10  ID_bal_time     August  EUR     2
11   ID_bal_mod       CLBD  EUR     2
12  ID_bal_type        DBT  USD     2
13      Dan_sex          M  USD     2
14      Dan_Age         22  USD     2
15  Dan_country        FRA          2
16      Dan_sex          M  CHF     3
17     Dan_city     Madrid          3
18  Dan_country        ESP  DKN     3
19   ID_bal_amt        432          4
20  ID_bal_time      March          4
21   ID_bal_mod       FABD  USD     4
22  ID_bal_type       CRDT  CHF     4 
  

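I want to reduce this dataframe, but only among the rows whose Name contains the string 'bal': of those, I want to keep only the subgroup associated with the mode 'CLBD'. That is, I look for the row where Name is 'ID_bal_mod' and Value is 'CLBD', and then keep all the other 'bal' Names (ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type) that are in the same Group. In our example, these are the Names in Group 2.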

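In addition, I want to change the Group value of those kept rows to 0, without changing the Group of any Name that does not contain the string 'bal'. So Dan_sex, Dan_Age and Dan_country should stay in Group 2.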

So in the end I would like to obtain this new dataframe, with the index reset as well.

          Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3      Dan_city     Berlin          1
4   Dan_country         55          1
5    ID_bal_amt        432          0
6   ID_bal_time     August  EUR     0
7    ID_bal_mod       CLBD  EUR     0
8   ID_bal_type        DBT  USD     0
9       Dan_sex          M  USD     2
10      Dan_Age         22  USD     2
11  Dan_country        FRA          2
12      Dan_sex          M  CHF     3
13     Dan_city     Madrid          3
14  Dan_country        ESP  DKN     3

My attempt:

# keeps only the rows with the string 'bal'
di = df[df['Name'].str.contains('bal')]

# return true or false if they are in the group that contains the mode 'CLBD'
di=[di['Value'].eq('CLBD').groupby(di['Group']).transform('any')]

[3     False
4     False
6     False
8     False
9      True
10     True
11     True
12     True
19    False
20    False
21    False
22    False
Name: Value, dtype: bool]

Does anyone have an efficient idea? Sorry if this is not well explained; English is not my first language.

Thanks

1 Answer:

Answer 0 (score: 1)

IIUC, try the following:

m1 = df['Value'].eq('CLBD').groupby(df['Group']).transform('any')
m2 = ~df['Name'].str.contains('bal')
df_out = df[m1 | m2].copy()
df_out['Group'] = df_out['Group'].mask(df_out['Name'].str.contains('bal'), 0)
df_out

Output:

           Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
5      Dan_city     Berlin          1
7   Dan_country         55          1
9    ID_bal_amt        432          0
10  ID_bal_time     August  EUR     0
11   ID_bal_mod       CLBD  EUR     0
12  ID_bal_type        DBT  USD     0
13      Dan_sex          M  USD     2
14      Dan_Age         22  USD     2
15  Dan_country        FRA          2
16      Dan_sex          M  CHF     3
17     Dan_city     Madrid          3
18  Dan_country        ESP  DKN     3
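
Note that the output above keeps the original index labels, while the question asks for the index to be reset. The following is a minimal sketch of one way to get there, not necessarily what the answerer intended. It makes two assumptions that are not in the answer: the target groups are found by requiring the 'CLBD' value to sit on an 'ID_bal_mod' row, as the question describes, and the new Group value is written as the string '0', because the Group column in the question's data holds strings. The names `is_bal`, `target_groups` and `in_target` are introduced here for illustration only.

```python
import pandas as pd

# Data copied from the question.
Name = ['ID', 'Country', 'IBAN', 'ID_bal_amt', 'ID_bal_time', 'Dan_city',
        'ID_bal_mod', 'Dan_country', 'ID_bal_type', 'ID_bal_amt',
        'ID_bal_time', 'ID_bal_mod', 'ID_bal_type', 'Dan_sex', 'Dan_Age',
        'Dan_country', 'Dan_sex', 'Dan_city', 'Dan_country', 'ID_bal_amt',
        'ID_bal_time', 'ID_bal_mod', 'ID_bal_type']
Value = ['TAMARA_CO', 'GERMANY', 'FR56', '12', 'June', 'Berlin', 'OPBD', '55',
         'CRDT', '432', 'August', 'CLBD', 'DBT', 'M', '22', 'FRA', 'M',
         'Madrid', 'ESP', '432', 'March', 'FABD', 'CRDT']
Ccy = ['', '', '', 'EUR', 'EUR', '', 'EUR', '', '', '', 'EUR', 'EUR', 'USD',
       'USD', 'USD', '', 'CHF', '', 'DKN', '', '', 'USD', 'CHF']
Group = ['0', '0', '0', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2',
         '2', '2', '3', '3', '3', '4', '4', '4', '4']

df = pd.DataFrame({'Name': Name, 'Value': Value, 'Ccy': Ccy, 'Group': Group})

# True for rows whose Name contains 'bal'.
is_bal = df['Name'].str.contains('bal')

# Groups in which an 'ID_bal_mod' row carries the Value 'CLBD'.
target_groups = df.loc[
    df['Name'].eq('ID_bal_mod') & df['Value'].eq('CLBD'), 'Group'
].unique()
in_target = df['Group'].isin(target_groups)

# Keep every non-'bal' row, plus the 'bal' rows of the target group(s).
df_out = df[~is_bal | in_target].copy()

# Move the kept 'bal' rows to Group '0' (a string, matching the column).
df_out.loc[df_out['Name'].str.contains('bal'), 'Group'] = '0'

# Renumber the index from 0, as requested in the question.
df_out = df_out.reset_index(drop=True)

print(df_out)
```

If an integer Group column is preferred instead, converting it up front with `df['Group'].astype(int)` and assigning `0` would avoid mixing strings and integers in one column.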