I have a large DataFrame (over 1,000,000 rows) with information about employees.
It contains the employee ID, the record date, and the turnover status. If turnover is not equal to 1, the employee is still working.
Example:
import pandas as pd

test_df = pd.DataFrame({'empl_id': [1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3],
                        'rec_date': pd.to_datetime(['20080131', '20080131', '20080131',
                                                    '20080229', '20080229', '20080229',
                                                    '20080331', '20080331',
                                                    '20080430', '20080430',
                                                    '20080531', '20080531', '20080531'],
                                                   format='%Y%m%d'),
                        'turnover': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]})
+----+-----------+---------------------+------------+
| | empl_id | rec_date | turnover |
+====+===========+=====================+============+
| 0 | 1 | 2008-01-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 1 | 2 | 2008-01-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 2 | 3 | 2008-01-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 3 | 1 | 2008-02-29 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 4 | 2 | 2008-02-29 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 5 | 3 | 2008-02-29 00:00:00 | 1 |
+----+-----------+---------------------+------------+
| 6 | 1 | 2008-03-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 7 | 2 | 2008-03-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 8 | 1 | 2008-04-30 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 9 | 2 | 2008-04-30 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 10 | 1 | 2008-05-31 00:00:00 | 1 |
+----+-----------+---------------------+------------+
| 11 | 2 | 2008-05-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
| 12 | 3 | 2008-05-31 00:00:00 | 0 |
+----+-----------+---------------------+------------+
I need to indicate whether an employee left the company within, say, two months of the date given in each record.
I found a solution, but it is far too slow: for a DataFrame of this size it would take more than 54 hours!
Here is my script:
from datetime import datetime, date, timedelta
import calendar
import pandas as pd
import numpy as np

# look only at employees with turnover
res = test_df.groupby('empl_id')['turnover'].sum()
keys_with_turn = res[res > 0].index

# function for adding months
def add_months(sourcedate, months):
    month = sourcedate.month - 1 + months
    year = sourcedate.year + month // 12
    month = month % 12 + 1
    day = min(sourcedate.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

# add 2 months and convert to timestamp
test_df['rec_date_plus_2'] = test_df['rec_date'].apply(lambda x: add_months(x, 2))
test_df['rec_date_plus_2'] = pd.to_datetime(test_df['rec_date_plus_2'])

test_df['turn_nxt_2'] = np.nan
for i in range(len(keys_with_turn)):  # loop over employee ids
    for index, row in test_df[test_df['empl_id'] == keys_with_turn[i]].iterrows():  # loop over all recs of this employee
        a = row['rec_date']
        b = row['rec_date_plus_2']
        turn_coef = test_df[(test_df['empl_id'] == keys_with_turn[i]) &
                            ((test_df['rec_date'] >= a) & (test_df['rec_date'] <= b))]['turnover'].sum()
        test_df.loc[(test_df['rec_date'] == a) &
                    (test_df['empl_id'] == keys_with_turn[i]), 'turn_nxt_2'] = 0 if turn_coef == 0 else 1
test_df['turn_nxt_2'].fillna(0, inplace=True)
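(Side note: the add_months helper appears to do the same thing as pandas' own DateOffset arithmetic, i.e. add calendar months and clamp the day to the end of a shorter month; a minimal sketch of that assumption, which would replace the apply() step:)

# Sketch: pd.DateOffset(months=2) adds calendar months and clamps the day
# to the last day of the target month (e.g. 2008-02-29 -> 2008-04-29).
test_df['rec_date_plus_2'] = test_df['rec_date'] + pd.DateOffset(months=2)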
The result I am looking for:
+----+-----------+---------------------+------------+--------------+
| | empl_id | rec_date | turnover | turn_nxt_2 |
+====+===========+=====================+============+==============+
| 0 | 1 | 2008-01-31 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 1 | 2 | 2008-01-31 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 2 | 3 | 2008-01-31 00:00:00 | 0 | 1 |
+----+-----------+---------------------+------------+--------------+
| 3 | 1 | 2008-02-29 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 4 | 2 | 2008-02-29 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 5 | 3 | 2008-02-29 00:00:00 | 1 | 1 |
+----+-----------+---------------------+------------+--------------+
| 6 | 1 | 2008-03-31 00:00:00 | 0 | 1 |
+----+-----------+---------------------+------------+--------------+
| 7 | 2 | 2008-03-31 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 8 | 1 | 2008-04-30 00:00:00 | 0 | 1 |
+----+-----------+---------------------+------------+--------------+
| 9 | 2 | 2008-04-30 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 10 | 1 | 2008-05-31 00:00:00 | 1 | 1 |
+----+-----------+---------------------+------------+--------------+
| 11 | 2 | 2008-05-31 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
| 12 | 3 | 2008-05-31 00:00:00 | 0 | 0 |
+----+-----------+---------------------+------------+--------------+
How can I do this in a faster, more efficient way?
Answer 0 (score: 1)
A simpler approach is to make a duplicate DataFrame and merge on the appropriate keys.
I put together a simple demonstration; it can certainly be improved, but here it is.
Starting from your original dataset, we import one more library and convert the date type so we can do arithmetic on it later:
import pandas as pd
from dateutil.relativedelta import relativedelta

DF_1 = pd.DataFrame({'empl_id': [1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2],
                     'rec_date': pd.to_datetime(['20080131', '20080131', '20080131',
                                                 '20080229', '20080229', '20080229',
                                                 '20080331', '20080331',
                                                 '20080430', '20080430',
                                                 '20080531', '20080531'],
                                                format='%Y%m%d'),
                     'turnover': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

print(type(DF_1.rec_date[0]))
DF_1.rec_date = DF_1.rec_date.map(lambda X: X.date())
print(type(DF_1.rec_date[0]))
Now we make a duplicate DataFrame with a merge column that holds, for each entry, the date it should be merged against:
DF_2 = DF_1.copy()
DF_2['merge_value'] = DF_2.rec_date - relativedelta(months=2)
We also create a merge column on the original DataFrame, so it is easier to refer to in pd.merge:
DF_1['merge_value'] = DF_1.rec_date.values
Now all we have to do is merge!
DF_1.merge(DF_2, on=['empl_id','merge_value'])
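This merge only lines up each record with the one exactly two months later. If the goal is the turn_nxt_2 column from the question (turnover at the record date or within the following two months), a rough sketch of one way to extend the idea is below; it matches on year-month periods instead of exact dates to avoid end-of-month mismatches, and the names result, future and turnover_future are just illustrative, not from the original code:

import pandas as pd

result = DF_1[['empl_id', 'rec_date', 'turnover']].copy()
# month key avoids end-of-month alignment problems (30th vs 31st etc.)
result['rec_month'] = pd.to_datetime(result.rec_date).dt.to_period('M')
result['turn_nxt_2'] = 0

for months in (0, 1, 2):
    future = result[['empl_id', 'rec_month', 'turnover']].copy()
    # key of the record `months` earlier that should "see" this row's turnover;
    # assumes at most one record per employee per month
    future['rec_month'] = future['rec_month'] - months
    merged = result.merge(future, on=['empl_id', 'rec_month'],
                          how='left', suffixes=('', '_future'))
    result['turn_nxt_2'] = ((result['turn_nxt_2'] == 1) |
                            (merged['turnover_future'].fillna(0) == 1)).astype(int)

On the sample data this should give the turn_nxt_2 values shown in the question for the rows present in DF_1.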
One more piece of advice: try this on a smaller sample first. Merges can be tricky when the columns you merge on are not a primary key! (In this case, that would mean multiple entries for the same combination of ['empl_id', 'merge_value'].)
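A quick way to check that before merging (a minimal sketch; prints True if the merge keys are duplicated):

print(DF_1.duplicated(subset=['empl_id', 'merge_value']).any())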
Hope this helps!