I have a DataFrame:
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['James', 'James', 'James',
           'Max', 'Max', 'Max', 'Max', 'Max',
           'Park', 'Park', 'Park',
           'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [78, 420, 'Started', 298, 78, 36, 298, 'Started', 28, 311, 'Started', 60, 520, 99, 'Started'],
    'To_num': [96, 78, 420, 36, 78, 78, 36, 298, 112, 28, 311, 150, 520, 78, 99],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})
It looks like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 78 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
8 Park 28 112 2020-05-21
9 Park 311 28 2019-11-22
10 Park Started 311 2019-04-12
11 Tom 60 150 2019-10-16
12 Tom 520 520 2019-08-26
13 Tom 99 78 2018-12-11
14 Tom Started 99 2018-10-09
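I want to create a new DataFrame that keeps all rows for each ID (person's name) whose group contains the number 78 in either column (whether 78 appears in From_num, in To_num, or in both), and drops the people for whom neither column contains 78, which in this case is "Park". I have written this code: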
find_nn = df2.groupby('ID').apply(lambda x: x[['From_num', 'To_num']].isin([78]).any())
find_nn.columns = ['from_bool', 'to_bool']
find_nn['bool_result'] = find_nn['from_bool'] | find_nn['to_bool']
bool_nn = find_nn['bool_result'].reset_index()
df2_new = pd.merge(left=df2, right=bool_nn, on='ID', copy=False)
df2_new = df2_new[df2_new['bool_result'] == True]
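It works, but it is verbose and slow, because my real dataset is more complex. If you have a better idea, please help. Thanks a lot! The expected output looks like this: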
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 78 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
11 Tom 60 150 2019-10-16
12 Tom 520 520 2019-08-26
13 Tom 99 78 2018-12-11
14 Tom Started 99 2018-10-09
Answer 0 (score: 7)
Let's try filter:
df1 = df2.groupby('ID').filter(lambda x : x[['From_num','To_num']].eq(78).any().any())
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 78 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
11 Tom 60 150 2019-10-16
12 Tom 520 520 2019-08-26
13 Tom 99 78 2018-12-11
14 Tom Started 99 2018-10-09
For speed, build a boolean mask with transform instead:
m=df2[['From_num','To_num']].eq(78).any(axis=1).groupby(df2.ID).transform('any')
df1=df2[m]
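To make the one-liner above easier to follow, here is a minimal sketch that splits it into its three steps. The small frame `df` and the names `row_hit` and `group_hit` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Small made-up frame (not the asker's data) with the same column layout.
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B', 'C'],
    'From_num': [78, 'Started', 5, 'Started', 'Started'],
    'To_num': [1, 78, 6, 5, 78],
})

# Step 1: row-level flag, True where either column equals 78.
row_hit = df[['From_num', 'To_num']].eq(78).any(axis=1)

# Step 2: broadcast "any hit in this group" back onto every row of the group.
group_hit = row_hit.groupby(df['ID']).transform('any')

# Step 3: boolean indexing keeps whole groups and preserves the original index.
result = df[group_hit]
print(result)
```

Because the mask is aligned with the original index, no merge is needed and no Python-level lambda runs once per group, which is probably where most of the speed-up over filter comes from.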
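Answer 1 (score: 6)
Here is a simpler way to get the same data. You can apply two filters to df2. The first line filters df2 to the rows where From_num or To_num equals 78 and then takes the IDs of those rows. The next line filters df2 by those IDs.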
ids = df2[(df2.From_num == 78) | (df2.To_num == 78)]['ID'].unique()
df2_new = df2[df2['ID'].isin(ids)]
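If you need this for more than one column set or more than one target value, the same two-step idea can be wrapped in a small function. This is only a sketch: `keep_groups_containing` and the `demo` frame are hypothetical names introduced here, not something from pandas or from the question.

```python
import pandas as pd

def keep_groups_containing(df, key, cols, values):
    """Hypothetical helper: keep every row whose `key` group has any of
    `values` in any of the columns listed in `cols`."""
    # Rows where at least one of the chosen columns holds a target value.
    hit_rows = df[cols].isin(values).any(axis=1)
    # Keys (e.g. person names) that own at least one such row.
    ids = df.loc[hit_rows, key].unique()
    # Keep all rows belonging to those keys.
    return df[df[key].isin(ids)]

# Made-up demo data with the same column layout as the question.
demo = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'C'],
    'From_num': [78, 'Started', 'Started', 'Started'],
    'To_num': [1, 78, 5, 99],
})

only_78 = keep_groups_containing(demo, 'ID', ['From_num', 'To_num'], [78])
either = keep_groups_containing(demo, 'ID', ['From_num', 'To_num'], [78, 99])
print(only_78)
print(either)
```

Using isin instead of chained == comparisons means adding another value or column is just a longer list.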
Answer 2 (score: 4)
Here is a nice one for you:
df2[df2['ID'].isin((df2.set_index(['ID','Date']).stack() == 78).groupby(level=0).any().loc[lambda x: x].index)]
Output:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 78 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
11 Tom 60 150 2019-10-16
12 Tom 520 520 2019-08-26
13 Tom 99 78 2018-12-11
14 Tom Started 99 2018-10-09
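As a sanity check, the sketch below runs the filter, transform-mask, and isin approaches from this thread on the question's data and compares the surviving row indices. The variable names (`by_filter`, `by_mask`, `by_isin`, `cols`) are my own and only serve the comparison.

```python
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['James', 'James', 'James',
           'Max', 'Max', 'Max', 'Max', 'Max',
           'Park', 'Park', 'Park',
           'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [78, 420, 'Started', 298, 78, 36, 298, 'Started',
                 28, 311, 'Started', 60, 520, 99, 'Started'],
    'To_num': [96, 78, 420, 36, 78, 78, 36, 298, 112, 28, 311, 150, 520, 78, 99],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})
cols = ['From_num', 'To_num']

# Approach 1: groupby + filter (one lambda call per group).
by_filter = df2.groupby('ID').filter(lambda g: g[cols].eq(78).any().any())

# Approach 2: row-level mask broadcast per group with transform.
mask = df2[cols].eq(78).any(axis=1).groupby(df2['ID']).transform('any')
by_mask = df2[mask]

# Approach 3: collect matching IDs, then filter with isin.
ids = df2.loc[(df2['From_num'] == 78) | (df2['To_num'] == 78), 'ID'].unique()
by_isin = df2[df2['ID'].isin(ids)]

# Compare which rows survive under each approach.
print(by_filter.index.equals(by_mask.index))
print(by_mask.index.equals(by_isin.index))
print(sorted(by_mask['ID'].unique()))
```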