I have the following DataFrame df:
Subject Marks1 Marks2
English 1 10
English 1.5 20
English 1.7 30
English 3 40
Science 1 10
Science 1.5 20
Science 1.7 15
Science 3 35
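I want to group by Subject and check whether Marks2 strictly increases as Marks1 increases. If it does not, I want to remove that group from df and put it in a separate issues DataFrame. So in the end I will have df: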
Subject Marks1 Marks2
English 1 10
English 1.5 20
English 1.7 30
English 3 40
and issues:
Subject Marks1 Marks2
Science 1 10
Science 1.5 20
Science 1.7 15
Science 3 35
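Answer 0 (score: 2)
Use DataFrameGroupBy.diff on all columns, compare the result against 0 with DataFrame.le, and reduce each row with DataFrame.any. Then take the Subject values of the flagged rows as vals and filter with Series.isin: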
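# True for rows where Marks1 or Marks2 did not increase relative to the previous row of the same Subject
m = df.groupby('Subject').diff().le(0).any(axis=1)
# subjects that contain at least one such row
vals = df.loc[m, 'Subject']
mask = df['Subject'].isin(vals)
# groups with issues
df1 = df[mask]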
print (df1)
Subject Marks1 Marks2
4 Science 1.0 10
5 Science 1.5 20
6 Science 1.7 15
7 Science 3.0 35
df2 = df[~mask]
print (df2)
Subject Marks1 Marks2
0 English 1.0 10
1 English 1.5 20
2 English 1.7 30
3 English 3.0 40
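The steps above can be put together as a self-contained sketch. The helper name split_by_strict_increase and its parameters are assumptions made for illustration and are not part of the original answer. Like the answer, it assumes the rows of each group are already ordered by Marks1.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": ["English"] * 4 + ["Science"] * 4,
    "Marks1": [1, 1.5, 1.7, 3, 1, 1.5, 1.7, 3],
    "Marks2": [10, 20, 30, 40, 10, 20, 15, 35],
})

def split_by_strict_increase(frame, key, value_cols):
    """Hypothetical helper: return (ok, issues) split by group."""
    # Row-to-row differences inside each group; the first row of a group is NaN.
    diffs = frame.groupby(key)[value_cols].diff()
    # NaN <= 0 evaluates to False, so group-leading rows are never flagged.
    bad_rows = diffs.le(0).any(axis=1)
    # Every group that owns at least one flagged row is moved to issues.
    bad_keys = frame.loc[bad_rows, key].unique()
    mask = frame[key].isin(bad_keys)
    return frame[~mask], frame[mask]

ok, issues = split_by_strict_increase(df, "Subject", ["Marks1", "Marks2"])
print(ok)
print(issues)
```

Because the check covers both value columns, a group is also flagged when Marks1 itself repeats or decreases between consecutive rows.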
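Edit: The bottleneck is computing diff separately for each group. If the whole DataFrame can be sorted so that each group's rows are contiguous, performance can be improved like this: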
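# columns that identify a group (the ones passed to groupby()); only 'Subject' in this example
cols = ['Subject']
# sort so that each group's rows are contiguous (if possible and if necessary);
# mergesort is stable, so the existing row order inside each group is kept
df = df.sort_values(cols, kind='mergesort')
# duplicated(cols) is False on the first row of each group, which drops the diff taken across a group boundary
m = df[['Marks1','Marks2']].diff().le(0).any(axis=1) & df.duplicated(cols)
vals = df.loc[m, 'Subject']
mask = df['Subject'].isin(vals)
df1 = df[mask]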
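As a rough check of the idea, the sketch below compares the single global diff with the per-group diff on the sample data. It assumes the frame is sorted by Subject and then by Marks1, which is a choice made here for illustration. The names group_cols, value_cols, ordered, fast and slow are likewise hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": ["English"] * 4 + ["Science"] * 4,
    "Marks1": [1, 1.5, 1.7, 3, 1, 1.5, 1.7, 3],
    "Marks2": [10, 20, 30, 40, 10, 20, 15, 35],
})

group_cols = ["Subject"]
value_cols = ["Marks1", "Marks2"]

# Assumption: ordering by group and then by Marks1 is acceptable for the data.
ordered = df.sort_values(group_cols + ["Marks1"])

# False on the first row of each group, True on every later row.
same_group_as_prev = ordered.duplicated(group_cols)

# One diff over the whole frame; rows that start a new group are masked out.
fast = ordered[value_cols].diff().le(0).any(axis=1) & same_group_as_prev

# Reference result from the per-group diff.
slow = ordered.groupby(group_cols)[value_cols].diff().le(0).any(axis=1)

print(fast.equals(slow))
```

The global diff avoids per-group work, at the cost of one sort and of relying on each group's rows being contiguous.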
Answer 1 (score: 0)
Use .filter() with a lambda function that calls .diff() to identify the groups with issues:
# a group has an issue if Marks2 fails to increase (diff <= 0) at any step where Marks1 increases
issues=df.groupby('Subject').filter(lambda x : ((x.Marks1.diff()>0)&(x.Marks2.diff()<=0)).any())
print(issues)
Subject Marks1 Marks2
4 Science 1.0 10
5 Science 1.5 20
6 Science 1.7 15
7 Science 3.0 35
Noissues=df[~df.index.isin(issues.index)]
print(Noissues)
Subject Marks1 Marks2
0 English 1.0 10
1 English 1.5 20
2 English 1.7 30
3 English 3.0 40
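A different way to phrase the same test, not taken from either answer, is to sort each group by Marks1 and ask whether Marks2 is monotonically increasing with no repeated values, which together amount to a strict increase. The function name strictly_increasing is hypothetical, and the sketch assumes ties in Marks1 do not need special handling.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": ["English"] * 4 + ["Science"] * 4,
    "Marks1": [1, 1.5, 1.7, 3, 1, 1.5, 1.7, 3],
    "Marks2": [10, 20, 30, 40, 10, 20, 15, 35],
})

def strictly_increasing(group):
    # Order the group by Marks1, then inspect Marks2 in that order.
    marks2 = group.sort_values("Marks1")["Marks2"]
    # Non-decreasing plus no duplicates is equivalent to strictly increasing.
    return marks2.is_monotonic_increasing and marks2.is_unique

# Keep only the groups that pass; everything else goes to issues.
good = df.groupby("Subject").filter(strictly_increasing)
issues = df.drop(good.index)

print(good)
print(issues)
```

Unlike the diff-based versions, this one does not depend on the rows already being ordered by Marks1, because each group is sorted inside the function.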