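I'm fairly new to Pandas, so before I start coding I'd like to know whether a particular operation is possible.
I have a set of employee working-time records like this (this is only a sample; the real data has thousands and thousands of records)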
ID Name Date Hour Type
0 123 Bob 01/01/2018 09:00 In
1 123 Bob 01/01/2018 09:30 Out
2 123 Bob 01/01/2018 10:00 In
3 123 Bob 01/01/2018 12:00 Out
4 123 Bob 01/01/2018 13:00 In
5 123 Bob 01/01/2018 17:00 Out
6 456 Max 01/01/2018 09:00 In
7 456 Max 01/01/2018 12:00 Out
8 456 Max 01/01/2018 13:00 In
9 456 Max 01/01/2018 17:00 Out
10 123 Bob 02/01/2018 09:00 In
11 123 Bob 02/01/2018 09:30 Out
12 123 Bob 02/01/2018 10:00 In
13 123 Bob 02/01/2018 17:00 Out
14 456 Max 02/01/2018 10:00 In
15 456 Max 02/01/2018 17:00 Out
I know how powerful Python and Pandas are at data handling, and I'd like to know whether this output can be obtained without writing explicit iteration
ID Name Date HourWorked
0 123 Bob 01/01/2018 06:30
1 456 Max 01/01/2018 07:00
2 123 Bob 02/01/2018 07:30
3 456 Max 02/01/2018 07:00
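In the end, I need to compute (for each employee ID) the hours/minutes worked on each day.
I've looked at a lot of GroupBy examples, but I haven't found anything useful.
TIA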
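Answer 0 (score: 4)
Convert Hour to datetime, group each 'In' with the 'Out' that follows it, and take the difference. Then group those differences by 'ID' and 'Date' and sum them, i.e.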
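import pandas as pd
df['Hour'] = pd.to_datetime(df['Hour'])
# every 'In' row starts a new group, so diff() gives Out minus In for each pair
df['diff'] = df.groupby((df['Type'] == 'In').cumsum())['Hour'].diff()
df_new = df.groupby(['ID','Name','Date'])['diff'].sum().to_frame('Hours Worked')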
Hours Worked
ID Name Date
123 Bob 01/01/2018 06:30:00
02/01/2018 07:30:00
456 Max 01/01/2018 07:00:00
02/01/2018 07:00:00
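For a quick check, the steps above can be run end to end on the sample data from the question. This is only a sketch under a few assumptions: the `rows` list retypes the question's table, `fmt_hhmm` is a hypothetical helper (not part of pandas) that renders a timedelta as HH:MM, and `sort=False` is added here just to keep the groups in order of first appearance.

```python
import pandas as pd

# Sample rows retyped from the question (ID, Name, Date, Hour, Type)
rows = [
    (123, "Bob", "01/01/2018", "09:00", "In"),
    (123, "Bob", "01/01/2018", "09:30", "Out"),
    (123, "Bob", "01/01/2018", "10:00", "In"),
    (123, "Bob", "01/01/2018", "12:00", "Out"),
    (123, "Bob", "01/01/2018", "13:00", "In"),
    (123, "Bob", "01/01/2018", "17:00", "Out"),
    (456, "Max", "01/01/2018", "09:00", "In"),
    (456, "Max", "01/01/2018", "12:00", "Out"),
    (456, "Max", "01/01/2018", "13:00", "In"),
    (456, "Max", "01/01/2018", "17:00", "Out"),
    (123, "Bob", "02/01/2018", "09:00", "In"),
    (123, "Bob", "02/01/2018", "09:30", "Out"),
    (123, "Bob", "02/01/2018", "10:00", "In"),
    (123, "Bob", "02/01/2018", "17:00", "Out"),
    (456, "Max", "02/01/2018", "10:00", "In"),
    (456, "Max", "02/01/2018", "17:00", "Out"),
]
df = pd.DataFrame(rows, columns=["ID", "Name", "Date", "Hour", "Type"])

# Parse the clock times; the date part is irrelevant because only differences are used
df["Hour"] = pd.to_datetime(df["Hour"], format="%H:%M")

# Each "In" row opens a new session, so diff() inside a session is Out minus In
session = (df["Type"] == "In").cumsum()
df["diff"] = df.groupby(session)["Hour"].diff()

# Sum the session lengths per employee and day (NaT on the "In" rows is skipped)
worked = (
    df.groupby(["ID", "Name", "Date"], sort=False)["diff"]
    .sum()
    .reset_index(name="HourWorked")
)


def fmt_hhmm(td):
    """Hypothetical helper: format a Timedelta as HH:MM."""
    minutes = int(td.total_seconds() // 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


worked["HourWorked"] = worked["HourWorked"].map(fmt_hhmm)
print(worked)
```

Like the answer itself, this relies on every 'In' row being immediately followed by its matching 'Out' row.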
Answer 1 (score: 2)
Use groupby + a custom function. This assumes your "In" & "Out" times are correctly paired and sorted.
import pandas as pd
# convert series to timedelta
df['Hour'] = pd.to_timedelta(df['Hour']+':00')
# define total time calculation
def total_time(x):
    return (x.iloc[1::2].values - x.iloc[::2].values).sum()
# apply groupby and convert to dataframe
res = df.groupby(['ID', 'Name', 'Date'])['Hour'].apply(total_time)\
        .to_frame('Hours Worked').reset_index()
print(res)
ID Name Date Hours Worked
0 123 Bob 01/01/2018 06:30:00
1 123 Bob 02/01/2018 07:30:00
2 456 Max 01/01/2018 07:00:00
3 456 Max 02/01/2018 07:00:00
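Since this approach depends on the "In" and "Out" rows being paired and sorted, it may be worth checking that before trusting the totals. The following is a minimal sketch, not part of the answer: `find_unpaired_groups` is a hypothetical helper name, and it assumes `Hour` is still a zero-padded "HH:MM" string so that sorting it as text gives chronological order.

```python
import pandas as pd


def find_unpaired_groups(df):
    """Return the (ID, Date) keys whose Type sequence is not In, Out, In, Out, ..."""
    ordered = df.sort_values(["ID", "Date", "Hour"])

    def is_paired(types):
        vals = types.tolist()
        return (
            len(vals) % 2 == 0
            and all(v == "In" for v in vals[::2])
            and all(v == "Out" for v in vals[1::2])
        )

    ok = ordered.groupby(["ID", "Date"])["Type"].apply(is_paired)
    return [key for key, good in ok.items() if not good]


# Made-up sample: the second day is missing its final "Out" record
sample = pd.DataFrame({
    "ID":   [123, 123, 123, 123, 123],
    "Date": ["01/01/2018", "01/01/2018", "02/01/2018", "02/01/2018", "02/01/2018"],
    "Hour": ["09:00", "17:00", "09:00", "12:00", "13:00"],
    "Type": ["In", "Out", "In", "Out", "In"],
})
bad = find_unpaired_groups(sample)
print(bad)
```

Any group reported here would produce a wrong or failing result in `total_time`, so those records need fixing or excluding first.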
Answer 2 (score: 0)
Note, however, that this solution assumes your Type values always come in "In-Out" order
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [123,123,123,123,456,456, 123,123, 456,456],
                   "Date": ["01/01/2018","01/01/2018", "01/01/2018", "01/01/2018", "01/01/2018", "01/01/2018",
                            "02/01/2018", "02/01/2018", "02/01/2018", "02/01/2018"],
                   "Hour": ["09:00","09:30","10:00","12:00","13:00","17:00", "10:00","12:00","13:00","17:00"],
                   "Type": ["In","Out","In","Out","In","Out", "In","Out","In","Out"]})
df["DateTime"] = pd.to_datetime(df["Hour"] + " " + df["Date"])
df.groupby(["ID", "Date"])["DateTime"].apply(list).\
    apply(lambda x: [x[i+1] - x[i] for i in range(len(x) - 1)]).str[0::2].\
    apply(lambda x: np.sum(x))
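A possible variation that avoids building lists, offered only as a sketch and not taken from the answer above: count each "In" time as negative and each "Out" time as positive, then sum per group. It assumes every "In" has a matching "Out" for the same ID and Date, but it does not depend on row order. The names `offsets`, `sign`, and `total` are illustrative, and the data is the same shortened sample used in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [123, 123, 123, 123, 456, 456, 123, 123, 456, 456],
    "Date": ["01/01/2018"] * 6 + ["02/01/2018"] * 4,
    "Hour": ["09:00", "09:30", "10:00", "12:00", "13:00", "17:00",
             "10:00", "12:00", "13:00", "17:00"],
    "Type": ["In", "Out", "In", "Out", "In", "Out", "In", "Out", "In", "Out"],
})

# Time of day as an offset from midnight
offsets = pd.to_timedelta(df["Hour"] + ":00")

# "In" subtracts its time, "Out" adds its time
sign = df["Type"].map({"In": -1, "Out": 1})

# Sum of Out times minus sum of In times = total time worked per ID and Date
total = (offsets * sign).groupby([df["ID"], df["Date"]]).sum()
print(total)
```

If a record is missing, the total for that group will be wrong without raising an error, so the pairing should still be validated separately.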