Pandas dataframe group by and retrieve date range

Time: 2018-10-26 11:04:01

Tags: python pandas

This is the DataFrame I am working with. There are two pay periods defined: the first 15 days and the last 15 days of each month.

         date  employee_id hours_worked   id job_group  report_id
0  2016-11-14            2         7.50  385         B         43
1  2016-11-15            2         4.00  386         B         43
2  2016-11-30            2         4.00  387         B         43
3  2016-11-01            3        11.50  388         A         43
4  2016-11-15            3         6.00  389         A         43
5  2016-11-16            3         3.00  390         A         43
6  2016-11-30            3         6.00  391         A         43

I need to group by employee_id and job_group, but at the same time I also have to get the date range for each grouped row. That means, for example, the grouped result should look like the following:

Expected output:

         date  employee_id hours_worked  job_group  report_id
1  2016-11-15            2         11.50        B         43
2  2016-11-30            2         4.00         B         43
4  2016-11-15            3         17.50        A         43
5  2016-11-16            3         9.00         A         43

Is this possible with a pandas DataFrame groupby? Any help is appreciated; let me know if the question is unclear.
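For reference, a minimal sketch to rebuild the sample frame above (column order is taken from the printout; dtypes are assumed, with date kept as plain strings):

import pandas as pd

# Reconstruction of the question's sample data (dtypes assumed)
df = pd.DataFrame({
    'date': ['2016-11-14', '2016-11-15', '2016-11-30',
             '2016-11-01', '2016-11-15', '2016-11-16', '2016-11-30'],
    'employee_id': [2, 2, 2, 3, 3, 3, 3],
    'hours_worked': [7.5, 4.0, 4.0, 11.5, 6.0, 3.0, 6.0],
    'id': [385, 386, 387, 388, 389, 390, 391],
    'job_group': ['B', 'B', 'B', 'A', 'A', 'A', 'A'],
    'report_id': [43] * 7,
})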

2 answers:

Answer 0: (score: 1)

Use Grouper with the SM (semi-month) frequency, and at the end add SemiMonthEnd:

import pandas as pd

# parse dates, then group by employee, job group and semi-monthly period
df['date'] = pd.to_datetime(df['date'])

d = {'hours_worked': 'sum', 'report_id': 'first'}
df = (df.groupby(['employee_id', 'job_group',
                  pd.Grouper(freq='SM', key='date', closed='right')])
        .agg(d)
        .reset_index())

# move each group label onto the semi-month end (15th or last day of the month)
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print(df)
   employee_id job_group       date  hours_worked  report_id
0            2         B 2016-11-15          11.5         43
1            2         B 2016-11-30           4.0         43
2            3         A 2016-11-15          17.5         43
3            3         A 2016-11-30           9.0         43
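This works because SemiMonthEnd(1) advances a timestamp to the next semi-month end, i.e. the 15th or the last day of the month, which is what pushes the group labels onto 2016-11-15 and 2016-11-30. A quick check, with the values I would expect (assuming the default anchoring on day 15) in the comments:

import pandas as pd

# SemiMonthEnd rolls forward to the next 15th / month end
print(pd.Timestamp('2016-11-01') + pd.offsets.SemiMonthEnd(1))  # expected: 2016-11-15
print(pd.Timestamp('2016-11-16') + pd.offsets.SemiMonthEnd(1))  # expected: 2016-11-30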

Answer 1: (score: 1)

a. First, (for each employee_id) use multiple pd.Grouper (one on employee_id, one on date with freq='SM') together with .sum() on the hours_worked column. Second, use DateOffset(days=14) to shift the grouped date column. After these two steps, assign the date in the grouped DF based on two brackets (date ranges): if the day of month (from the date column) is <= 15, set the day to 15, otherwise set the day to the last day of the month. Then use this day to assemble the new date; the month-end day is computed with MonthEnd.

b. (for each employee_id) take the .last() record of the job_group and report_id columns.

c. Merge a. and b. on the employee_id key.

# make sure the date column is a real datetime before grouping
df['date'] = pd.to_datetime(df['date'])

# a.
hours = (df.groupby([pd.Grouper(key='employee_id'),
                     pd.Grouper(key='date', freq='SM')])['hours_worked']
           .sum()
           .reset_index())

hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)

# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd

hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
                                    month=hours['date'].dt.month,
                                    day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)

# b.
others = (df.groupby('employee_id')['job_group', 'report_id']
            .last()
            .reset_index())

# c.
merged = hours.merge(others, how='inner', on='employee_id')
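Two building blocks in that snippet may be worth checking in isolation: MonthEnd(0) rolls a date forward to the end of its own month (a date already on the month end stays put), and pd.to_datetime can assemble dates from a dict of year/month/day columns. A small sketch, with the values I would expect in the comments:

import pandas as pd
from pandas.tseries.offsets import MonthEnd

print(pd.Timestamp('2016-11-16') + MonthEnd(0))  # expected: 2016-11-30
print(pd.Timestamp('2016-11-30') + MonthEnd(0))  # expected: 2016-11-30 (already a month end)

# assemble a datetime column from separate year/month/day values
print(pd.to_datetime(dict(year=[2016], month=[11], day=[15])))  # expected: 2016-11-15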
