Sample DF:
ID Name Price Condition Fit_Test
1 Apple 10 Good Super_Fit
2 Apple 10 OK Super_Fit
3 Apple 10 Bad Super_Fit
4 Orange 12 Good Not_Fit
5 Orange 12 OK Not_Fit
6 Banana 15 OK Medium_Fit
7 Banana 15 Bad Medium_Fit
8 Pineapple 25 OK Medium_Fit
9 Pineapple 25 OK Medium_Fit
10 Cherry 30 Bad Medium_Fit
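To make the sample reproducible, the table above could be built as follows. This is a minimal sketch that assumes pandas is imported as pd and that the variable is named df, as in the answers.

```python
import pandas as pd

# Sample data from the table above, one list per column.
df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "Name": ["Apple"] * 3 + ["Orange"] * 2 + ["Banana"] * 2
            + ["Pineapple"] * 2 + ["Cherry"],
    "Price": [10] * 3 + [12] * 2 + [15] * 2 + [25] * 2 + [30],
    "Condition": ["Good", "OK", "Bad", "Good", "OK",
                  "OK", "Bad", "OK", "OK", "Bad"],
    "Fit_Test": ["Super_Fit"] * 3 + ["Not_Fit"] * 2 + ["Medium_Fit"] * 5,
})

print(df.to_string(index=False))
```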
Expected DF:
ID Name Price Condition Fit_Test
1 Apple 10 Good Super_Fit
2 Apple 10 OK Super_Fit
3 Apple 10 Bad Super_Fit
4 Orange 12 Good Not_Fit
6 Banana 15 OK Medium_Fit
8 Pineapple 25 OK Medium_Fit
9 Pineapple 25 OK Medium_Fit
10 Cherry 30 Bad Medium_Fit
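Problem statement:
I want to group by Name and Price, then filter the rows of each group based on Condition.
If a Name and Price group contains all 3 conditions (Good, OK and Bad), keep only the Good row. This applies only when Fit_Test is not Super_Fit.
If a Name and Price group contains Good and OK, keep only the Good row (of ID 4,5 only ID 4 is expected). This applies only when Fit_Test is not Super_Fit.
If a Name and Price group contains OK and Bad, keep only the OK row (of ID 6,7 only ID 6 is expected). This applies only when Fit_Test is not Super_Fit.
If a Name and Price group contains only OK rows, only Good rows, or only Bad rows, do nothing (ID 8,9,10 stay as ID 8,9,10). This applies only when Fit_Test is not Super_Fit.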
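The rules above can be written as a small lookup on the set of conditions found in one group. This is only a sketch: the function name condition_to_keep is hypothetical, and treating any combination not listed above (for example Good with Bad) as "do nothing" is an assumption, since the question does not cover it.

```python
def condition_to_keep(conditions):
    """Return the condition to keep for one (Name, Price) group.

    `conditions` is the set of Condition values present in the group.
    None means "leave the group unchanged".
    """
    if conditions == {"Good", "OK", "Bad"} or conditions == {"Good", "OK"}:
        return "Good"
    if conditions == {"OK", "Bad"}:
        return "OK"
    # Single-condition groups, plus anything unspecified (assumption).
    return None


print(condition_to_keep({"Good", "OK", "Bad"}))  # Good
print(condition_to_keep({"OK", "Bad"}))          # OK
print(condition_to_keep({"OK"}))                 # None
```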
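Update:
The first answer and its test edit apply the filter to every row of df, whatever the value of Fit_Test. With that version the expected DF would not contain rows 2 and 3. The edit update shown in the answer applies the filter only when Fit_Test is not Super_Fit. In both solutions, the row filtering based on the Condition column and the grouping by 2 columns are the same.
I found examples of filter + group by on numeric columns, but nothing for string columns.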
Answer 0 (score: 2)
The idea is to create a set of the conditions in each group and use it for comparison:
a = df.join(df.groupby(['Price','Name'])['Condition'].apply(set).rename('m'),
on=['Price','Name'])['m']
print (a)
0 {Bad, Good, OK}
1 {Bad, Good, OK}
2 {Bad, Good, OK}
3 {Good, OK}
4 {Good, OK}
5 {Bad, OK}
6 {Bad, OK}
7 {OK}
8 {OK}
9 {Bad}
Name: m, dtype: object
m1 = (a == {'Bad', 'Good', 'OK'}) | (a == {'Good', 'OK'})
m2 = a == {'Bad', 'OK'}
#check if unique value - length of set is 1
m3 = a.str.len() == 1
m4 = df['Condition'] == 'Good'
m5 = df['Condition'] == 'OK'
df = df[(m1 & m4) | (m2 & m5) | m3]
print (df)
ID Name Price Condition
0 1 Apple 10 Good
3 4 Orange 12 Good
5 6 Banana 15 OK
7 8 Pineapple 25 OK
8 9 Pineapple 25 OK
9 10 Cherry 30 Bad
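Edit (testing):
To inspect the masks, use assign:
# m is the combined mask; run this on df before it is filtered
m = (m1 & m4) | (m2 & m5) | m3
print (df.assign(sets=a, m1 = m1, m2=m2, m3=m3, m4=m4, m5=m5, m=m))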
ID Name Price Condition sets m1 m2 m3 \
0 1 Apple 10 Good {Bad, Good, OK} True False False
1 2 Apple 10 OK {Bad, Good, OK} True False False
2 3 Apple 10 Bad {Bad, Good, OK} True False False
3 4 Orange 12 Good {Good, OK} True False False
4 5 Orange 12 OK {Good, OK} True False False
5 6 Banana 15 OK {Bad, OK} False True False
6 7 Banana 15 Bad {Bad, OK} False True False
7 8 Pineapple 25 OK {OK} False False True
8 9 Pineapple 25 OK {OK} False False True
9 10 Cherry 30 Bad {Bad} False False True
m4 m5 m
0 True False True
1 False True False
2 False False False
3 True False True
4 False True False
5 False True True
6 False False False
7 False True True
8 False True True
9 False False True
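Edit (update):
For the new condition (rows with Fit_Test equal to Super_Fit are always kept), apply this to the original df: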
m6 = df['Fit_Test'] == 'Super_Fit'
df = df[((m1 & m4) | (m2 & m5) | m3) | m6]
print (df)
ID Name Price Condition Fit_Test
0 1 Apple 10 Good Super_Fit
1 2 Apple 10 OK Super_Fit
2 3 Apple 10 Bad Super_Fit
3 4 Orange 12 Good Not_Fit
5 6 Banana 15 OK Medium_Fit
7 8 Pineapple 25 OK Medium_Fit
8 9 Pineapple 25 OK Medium_Fit
9 10 Cherry 30 Bad Medium_Fit
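For reference, the whole approach can be run end to end on the sample data. The sketch below rebuilds the sample frame and follows the same masks, but writes the set tests with map() so that each comparison is an explicit per-row check. The names group_sets, keep, without_exemption and with_exemption are chosen here for illustration and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "Name": ["Apple"] * 3 + ["Orange"] * 2 + ["Banana"] * 2
            + ["Pineapple"] * 2 + ["Cherry"],
    "Price": [10] * 3 + [12] * 2 + [15] * 2 + [25] * 2 + [30],
    "Condition": ["Good", "OK", "Bad", "Good", "OK",
                  "OK", "Bad", "OK", "OK", "Bad"],
    "Fit_Test": ["Super_Fit"] * 3 + ["Not_Fit"] * 2 + ["Medium_Fit"] * 5,
})

# One set of conditions per (Price, Name) group, broadcast back to each row.
group_sets = df.groupby(["Price", "Name"])["Condition"].apply(set).rename("m")
a = df.join(group_sets, on=["Price", "Name"])["m"]

# Row-wise tests on the group's set of conditions.
m1 = a.map(lambda s: s in ({"Bad", "Good", "OK"}, {"Good", "OK"})).astype(bool)
m2 = a.map(lambda s: s == {"Bad", "OK"}).astype(bool)
m3 = a.map(len) == 1  # group holds a single distinct condition

# Row-wise tests on the row's own values.
m4 = df["Condition"] == "Good"
m5 = df["Condition"] == "OK"
m6 = df["Fit_Test"] == "Super_Fit"

keep = (m1 & m4) | (m2 & m5) | m3
without_exemption = df[keep]       # filter applied to every row
with_exemption = df[keep | m6]     # Super_Fit rows are always kept

print(without_exemption["ID"].tolist())  # [1, 4, 6, 8, 9, 10]
print(with_exemption["ID"].tolist())     # [1, 2, 3, 4, 6, 8, 9, 10]
```

All masks are computed from the unfiltered frame, so they share its index and can be combined freely before the final selection.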
Answer 1 (score: 0)
If there are only a few conditions (here just 3), the following is a simple workaround:
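# Encode each condition as a rank so that max() picks the best one per group.
df.loc[df["Condition"] == 'Good',"Condition"] = 3
df.loc[df["Condition"] == 'OK',"Condition"] = 2
df.loc[df["Condition"] == 'Bad',"Condition"] = 1
# Note: max() is computed per column and collapses each group to one row.
df = df.groupby(['Name','Price']).max()
# Decode the ranks, assigning to the Condition column only.
df.loc[df["Condition"] == 3,"Condition"] = "Good"
df.loc[df["Condition"] == 2,"Condition"] = "OK"
df.loc[df["Condition"] == 1,"Condition"] = "Bad"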
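A variant of the same ranking idea that keeps the original rows intact is sketched below. Instead of aggregating, it compares each row's rank with the best rank in its group using transform, so duplicate rows and the other columns survive. The names rank, score, best and result are illustrative, and two points are assumptions rather than part of this answer: Super_Fit rows are always kept, and a group holding Good and Bad (not covered by the question) keeps only Good.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "Name": ["Apple"] * 3 + ["Orange"] * 2 + ["Banana"] * 2
            + ["Pineapple"] * 2 + ["Cherry"],
    "Price": [10] * 3 + [12] * 2 + [15] * 2 + [25] * 2 + [30],
    "Condition": ["Good", "OK", "Bad", "Good", "OK",
                  "OK", "Bad", "OK", "OK", "Bad"],
    "Fit_Test": ["Super_Fit"] * 3 + ["Not_Fit"] * 2 + ["Medium_Fit"] * 5,
})

# A higher rank means a better condition.
rank = {"Bad": 1, "OK": 2, "Good": 3}
score = df["Condition"].map(rank)

# Best rank within each (Name, Price) group, aligned to the original rows.
best = score.groupby([df["Name"], df["Price"]]).transform("max")

# Keep rows that match their group's best rank, plus every Super_Fit row.
keep = (score == best) | (df["Fit_Test"] == "Super_Fit")
result = df[keep]

print(result["ID"].tolist())  # [1, 2, 3, 4, 6, 8, 9, 10]
```

Because the comparison is row-wise, the two Pineapple rows with OK are both kept, which matches the "do nothing" rule for groups with a single condition.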