Efficient way to set values conditionally

Asked: 2018-05-23 14:42:57

Tags: python python-3.x pandas dataframe

I have a pandas DataFrame in my application:

+----------+-------------+
| UserName |  StartEdit  |
+----------+-------------+
| John     | 12-Jul-2015 |
| David    | 16-Aug-2015 |
| Katie    | 20-Aug-2015 |
| Cristin  | 2-Sep-2015  |
| Katie    | 12-Sep-2015 |
| John     | 23-Nov-2015 |
| David    | 2-Jan-2016  |
| David    | 3-Jan-2016  |
| John     | 10-Feb-2016 |
| Steven   | 13-Mar-2016 |
| Steven   | 14-Mar-2016 |
+----------+-------------+

I want to create another column, UserTeam. I know that Katie, Cristin and Steven have each stayed on the same team the whole time:

owners_teams = {"Katie":"A", "Cristin":"B", "Steven":"C"}

So when I run df["UserTeam"] = df["UserName"].map(owners_teams), I get:

+----------+-------------+----------+
| UserName |  StartEdit  | UserTeam |
+----------+-------------+----------+
| John     | 12-Jul-2015 | NaN      |
| David    | 16-Aug-2015 | NaN      |
| Katie    | 20-Aug-2015 | A        |
| Cristin  | 2-Sep-2015  | B        |
| Katie    | 12-Sep-2015 | A        |
| John     | 23-Nov-2015 | NaN      |
| David    | 2-Jan-2016  | NaN      |
| David    | 3-Jan-2016  | NaN      |
| John     | 10-Feb-2016 | NaN      |
| Steven   | 13-Mar-2016 | C        |
| Steven   | 14-Mar-2016 | C        |
+----------+-------------+----------+
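For anyone who wants to reproduce the table above, here is a minimal sketch of the setup. The date format string and the `unmapped` helper variable are assumptions added for illustration; converting StartEdit to datetimes is not stated in the question, but the date comparisons discussed later rely on it.

```python
import pandas as pd

# Rebuild the sample data from the table in the question.
df = pd.DataFrame({
    "UserName": ["John", "David", "Katie", "Cristin", "Katie", "John",
                 "David", "David", "John", "Steven", "Steven"],
    "StartEdit": ["12-Jul-2015", "16-Aug-2015", "20-Aug-2015", "2-Sep-2015",
                  "12-Sep-2015", "23-Nov-2015", "2-Jan-2016", "3-Jan-2016",
                  "10-Feb-2016", "13-Mar-2016", "14-Mar-2016"],
})

# Assumption: parse the strings so later comparisons are date-based.
df["StartEdit"] = pd.to_datetime(df["StartEdit"], format="%d-%b-%Y")

# Users missing from the dict map to NaN.
owners_teams = {"Katie": "A", "Cristin": "B", "Steven": "C"}
df["UserTeam"] = df["UserName"].map(owners_teams)

# Hypothetical helper: which users still need a team assigned?
unmapped = df.loc[df["UserTeam"].isna(), "UserName"].unique()
```

Here `unmapped` holds the users whose team depends on the edit date, which is the part the rest of the question is about.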

Now, I also know that:

  • John moved from team A to team C on 01-Jan-2016
  • David moved from team B to team C on 12-Dec-2015

changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

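I know how to do this by looping with apply and hard-coding all the rules, but I don't think that is efficient. How can I do this in a vectorized way for a large number of users?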

Expected result:

+----------+-------------+----------+
| UserName |  StartEdit  | UserTeam |
+----------+-------------+----------+
| John     | 12-Jul-2015 | A        |
| David    | 16-Aug-2015 | B        |
| Katie    | 20-Aug-2015 | A        |
| Cristin  | 2-Sep-2015  | B        |
| Katie    | 12-Sep-2015 | A        |
| John     | 23-Nov-2015 | A        |
| David    | 2-Jan-2016  | C        |
| David    | 3-Jan-2016  | C        |
| John     | 10-Feb-2016 | C        |
| Steven   | 13-Mar-2016 | C        |
| Steven   | 14-Mar-2016 | C        |
+----------+-------------+----------+

2 Answers:

Answer 0 (score: 4)

pd.merge_asof

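This is a perfect use case for pd.merge_asof, but it requires you to keep track of the changes. Set up another DataFrame, teams, that does this tracking.

Notes

  • The initial date I use is the earliest date in df.
  • I give John and David the initial teams you mentioned, along with that initial date.
  • I add one more entry for each of them to show when John and David switched teams.
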
teams = pd.DataFrame([
    ['Katie', 'A', pd.Timestamp('2015-07-12')],
    ['Cristin', 'B', pd.Timestamp('2015-07-12')],
    ['Steven', 'C', pd.Timestamp('2015-07-12')],
    ['John', 'A', pd.Timestamp('2015-07-12')],
    ['David', 'B', pd.Timestamp('2015-07-12')],
    ['David', 'C', pd.Timestamp('2015-12-12')],
    ['John', 'C', pd.Timestamp('2016-01-01')],
], columns=['UserName', 'Team', 'StartEdit'])

teams

  UserName Team  StartEdit
0    Katie    A 2015-07-12
1  Cristin    B 2015-07-12
2   Steven    C 2015-07-12
3     John    A 2015-07-12
4    David    B 2015-07-12
5    David    C 2015-12-12
6     John    C 2016-01-01

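As the documentation requires, make sure both DataFrames are sorted by the relevant date column (here StartEdit). That column must also be a datetime in both frames, not a string.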
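The sorting requirement can be sketched as follows. This is a minimal illustration on a small, deliberately unsorted sample; the `prepare` helper and the date format string are assumptions, not part of the original answer.

```python
import pandas as pd

# Small unsorted sample with string dates, in the question's format.
df = pd.DataFrame({
    "UserName": ["John", "David", "John"],
    "StartEdit": ["10-Feb-2016", "16-Aug-2015", "23-Nov-2015"],
})
teams = pd.DataFrame({
    "UserName": ["John", "John", "David"],
    "Team": ["C", "A", "B"],
    "StartEdit": ["01-Jan-2016", "12-Jul-2015", "12-Jul-2015"],
})

def prepare(frame):
    """Hypothetical helper: parse StartEdit and sort by it."""
    out = frame.copy()
    out["StartEdit"] = pd.to_datetime(out["StartEdit"], format="%d-%b-%Y")
    return out.sort_values("StartEdit", kind="mergesort").reset_index(drop=True)

df_sorted = prepare(df)
teams_sorted = prepare(teams)

# For each edit, pick the latest team entry at or before the edit date,
# matching only rows with the same UserName.
result = pd.merge_asof(df_sorted, teams_sorted, on="StartEdit", by="UserName")
```

Note that the result comes back in date order rather than in the original row order of df, so keep the original index around if that order matters.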

pd.merge_asof(df, teams, on='StartEdit', by='UserName')

   UserName  StartEdit Team
0      John 2015-07-12    A
1     David 2015-08-16    B
2     Katie 2015-08-20    A
3   Cristin 2015-09-02    B
4     Katie 2015-09-12    A
5      John 2015-11-23    A
6     David 2016-01-02    C
7     David 2016-01-03    C
8      John 2016-02-10    C
9    Steven 2016-03-13    C
10   Steven 2016-03-14    C
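With many users, writing the teams frame by hand does not scale. One possible way to derive it from the question's owners_teams dict and changes list is sketched below. The `build_teams` function is hypothetical, and it assumes each user in changes switches teams at most once.

```python
import pandas as pd

owners_teams = {"Katie": "A", "Cristin": "B", "Steven": "C"}
changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

def build_teams(owners_teams, changes, start):
    """Hypothetical helper: one row per (user, team, date the team starts).

    Assumes at most one change per user in `changes`.
    """
    # Users who never moved get a single row dated at `start`.
    rows = [(name, team, start) for name, team in owners_teams.items()]
    for name, date, before, after in changes:
        # Team before the move, valid from `start`.
        rows.append((name, before, start))
        # Team after the move, valid from the change date.
        rows.append((name, after, pd.to_datetime(date, format="%d-%b-%Y")))
    teams = pd.DataFrame(rows, columns=["UserName", "Team", "StartEdit"])
    # merge_asof needs the date column sorted; a stable sort keeps ties in order.
    return teams.sort_values("StartEdit", kind="mergesort").reset_index(drop=True)

# The start date is the earliest date in df, as in the answer above.
teams = build_teams(owners_teams, changes, pd.Timestamp("2015-07-12"))
```

The resulting frame has the same seven rows as the hand-written teams above and can be passed to pd.merge_asof in the same way.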

Answer 1 (score: 3)

One way is to use pd.DataFrame.loc inside a for loop. As for efficiency, you should test it and decide whether it suits your use case.

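changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

# Assumes df['StartEdit'] already holds datetimes (e.g. via pd.to_datetime).
# Each change date is parsed so the comparison is date-based, not string-based.
for name, date, before, after in changes:
    cutoff = pd.to_datetime(date)
    name_mask = df['UserName'] == name
    df.loc[name_mask & (df['StartEdit'] < cutoff), 'UserTeam'] = before
    df.loc[name_mask & (df['StartEdit'] >= cutoff), 'UserTeam'] = after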

You can also perform an equivalent mapping with numpy.where.
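A minimal sketch of what the numpy.where variant might look like, on a shortened sample of the question's data. The nesting of the two numpy.where calls and the `cutoff` name are assumptions; the answer does not spell this version out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "UserName": ["John", "David", "Katie", "David", "John"],
    "StartEdit": pd.to_datetime(
        ["12-Jul-2015", "16-Aug-2015", "20-Aug-2015", "2-Jan-2016", "10-Feb-2016"],
        format="%d-%b-%Y",
    ),
})
df["UserTeam"] = df["UserName"].map({"Katie": "A", "Cristin": "B", "Steven": "C"})

changes = [("John", "01-Jan-2016", "A", "C"), ("David", "12-Dec-2015", "B", "C")]

for name, date, before, after in changes:
    cutoff = pd.to_datetime(date, format="%d-%b-%Y")
    # Team this user would have on each row's date.
    team = np.where(df["StartEdit"] < cutoff, before, after)
    # Apply it only to this user's rows; keep everyone else unchanged.
    df["UserTeam"] = np.where(df["UserName"] == name, team, df["UserTeam"])
```

This still loops over changes, so the work is vectorized per rule rather than across rules; with a very large number of rules the merge_asof approach is likely the better fit, though that is worth measuring.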