My application works with the following pandas DataFrame:
+----------+-------------+
| UserName | StartEdit |
+----------+-------------+
| John | 12-Jul-2015 |
| David | 16-Aug-2015 |
| Katie | 20-Aug-2015 |
| Cristin | 2-Sep-2015 |
| Katie | 12-Sep-2015 |
| John | 23-Nov-2015 |
| David | 2-Jan-2016 |
| David | 3-Jan-2016 |
| John | 10-Feb-2016 |
| Steven | 13-Mar-2016 |
| Steven | 14-Mar-2016 |
+----------+-------------+
I want to create another column, UserTeam. I know that Katie, Cristin and Steven have each stayed on the same team the whole time:
owners_teams = {"Katie":"A", "Cristin":"B", "Steven":"C"}
So when I run
df["UserTeam"] = df["UserName"].map(owners_teams)
I get:
+----------+-------------+----------+
| UserName | StartEdit | UserTeam |
+----------+-------------+----------+
| John | 12-Jul-2015 | NaN |
| David | 16-Aug-2015 | NaN |
| Katie | 20-Aug-2015 | A |
| Cristin | 2-Sep-2015 | B |
| Katie | 12-Sep-2015 | A |
| John | 23-Nov-2015 | NaN |
| David | 2-Jan-2016 | NaN |
| David | 3-Jan-2016 | NaN |
| John | 10-Feb-2016 | NaN |
| Steven | 13-Mar-2016 | C |
| Steven | 14-Mar-2016 | C |
+----------+-------------+----------+
Now, I also know that:
John was on team A and moved to team C on 01-Jan-2016
David was on team B and moved to team C on 12-Dec-2015
changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]
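I know how to loop with apply and hard-code all the rules, but I don't think that is efficient. How can I do this in a vectorized way for a large number of users?
Expected result: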
+----------+-------------+----------+
| UserName | StartEdit | UserTeam |
+----------+-------------+----------+
| John | 12-Jul-2015 | A |
| David | 16-Aug-2015 | B |
| Katie | 20-Aug-2015 | A |
| Cristin | 2-Sep-2015 | B |
| Katie | 12-Sep-2015 | A |
| John | 23-Nov-2015 | A |
| David | 2-Jan-2016 | C |
| David | 3-Jan-2016 | C |
| John | 10-Feb-2016 | C |
| Steven | 13-Mar-2016 | C |
| Steven | 14-Mar-2016 | C |
+----------+-------------+----------+
Answer 0 (score: 4)
pd.merge_asof
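This is a perfect use case for pd.merge_asof, but it requires you to keep track of the changes. Set up another dataframe, teams, that does that tracking.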
Note the earliest date in df (2015-07-12): it is used below as the start date of each user's initial team.
teams = pd.DataFrame([
['Katie', 'A', pd.Timestamp('2015-07-12')],
['Cristin', 'B', pd.Timestamp('2015-07-12')],
['Steven', 'C', pd.Timestamp('2015-07-12')],
['John', 'A', pd.Timestamp('2015-07-12')],
['David', 'B', pd.Timestamp('2015-07-12')],
['David', 'C', pd.Timestamp('2015-12-12')],
['John', 'C', pd.Timestamp('2016-01-01')],
], columns=['UserName', 'Team', 'StartEdit'])
teams
UserName Team StartEdit
0 Katie A 2015-07-12
1 Cristin B 2015-07-12
2 Steven C 2015-07-12
3 John A 2015-07-12
4 David B 2015-07-12
5 David C 2015-12-12
6 John C 2016-01-01
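For a large number of users, typing the teams table by hand does not scale. The following is a minimal sketch of deriving it from the owners_teams dict and the changes list given in the question. It assumes each user appears at most once in changes and that 2015-07-12 (the earliest StartEdit in df) is an acceptable start date for every initial assignment; the names start and rows are illustrative only.

```python
import pandas as pd

owners_teams = {"Katie": "A", "Cristin": "B", "Steven": "C"}
changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

# Assumption: every initial assignment starts at the earliest StartEdit in df.
start = pd.Timestamp("2015-07-12")

# Users who never moved get a single row.
rows = [(name, team, start) for name, team in owners_teams.items()]

# Users who moved get two rows: the old team, then the new one from the move date.
# Assumption: at most one move per user.
for name, date, before, after in changes:
    rows.append((name, before, start))
    rows.append((name, after, pd.Timestamp(date)))

teams = (
    pd.DataFrame(rows, columns=["UserName", "Team", "StartEdit"])
    .sort_values("StartEdit", kind="stable")  # merge_asof needs a sorted key
    .reset_index(drop=True)
)
print(teams)
```

A user with several moves would need one row per assignment period rather than the before/after pair used here.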
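Per the documentation, make sure both dataframes are sorted by the relevant date column (StartEdit here, which must hold datetime values in both frames, not strings).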
pd.merge_asof(df, teams, on='StartEdit', by='UserName')
UserName StartEdit Team
0 John 2015-07-12 A
1 David 2015-08-16 B
2 Katie 2015-08-20 A
3 Cristin 2015-09-02 B
4 Katie 2015-09-12 A
5 John 2015-11-23 A
6 David 2016-01-02 C
7 David 2016-01-03 C
8 John 2016-02-10 C
9 Steven 2016-03-13 C
10 Steven 2016-03-14 C
Answer 1 (score: 3)
One approach is to use pd.DataFrame.loc inside a for loop. As for efficiency, you should test and decide whether this suits your use case.
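changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]
# Assumes df['StartEdit'] has already been converted with pd.to_datetime
for name, date, before, after in changes:
    date = pd.Timestamp(date)
    name_mask = df['UserName'] == name
    # Rows before the move get the old team; rows on or after it get the new one
    df.loc[name_mask & (df['StartEdit'] < date), 'UserTeam'] = before
    df.loc[name_mask & (df['StartEdit'] >= date), 'UserTeam'] = after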
You can also perform an equivalent mapping with numpy.where.
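As a rough illustration of that numpy.where variant, the sketch below rebuilds the question's data and applies each change with a nested np.where instead of two .loc assignments. It assumes StartEdit is parsed with pd.to_datetime first; the variable names moved and name_mask are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "UserName": ["John", "David", "Katie", "Cristin", "Katie", "John",
                 "David", "David", "John", "Steven", "Steven"],
    "StartEdit": ["12-Jul-2015", "16-Aug-2015", "20-Aug-2015", "2-Sep-2015",
                  "12-Sep-2015", "23-Nov-2015", "2-Jan-2016", "3-Jan-2016",
                  "10-Feb-2016", "13-Mar-2016", "14-Mar-2016"],
})
df["StartEdit"] = pd.to_datetime(df["StartEdit"], format="%d-%b-%Y")

# Users who never changed team are mapped directly; the rest start as NaN.
owners_teams = {"Katie": "A", "Cristin": "B", "Steven": "C"}
df["UserTeam"] = df["UserName"].map(owners_teams)

changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]
for name, date, before, after in changes:
    name_mask = df["UserName"] == name
    moved = df["StartEdit"] >= pd.Timestamp(date)
    # Inner where picks old vs new team by date; outer where leaves
    # every other user's existing value untouched.
    df["UserTeam"] = np.where(name_mask,
                              np.where(moved, after, before),
                              df["UserTeam"])

print(df)
```

This still loops once per change, so its cost grows with the number of rules rather than the number of rows.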