Find the index of the closest time in each pandas group

Time: 2018-02-08 18:49:16

Tags: python pandas

import pandas as pd
df = pd.DataFrame({'date': ['2014-06-22 17:46:00', '2014-06-24 16:52:00', '2014-06-25 20:02:00', '2014-06-25 17:55:00', '2014-07-02 11:36:00', '2014-07-06 12:40:00', '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']})

>>> df
                  date type
0  2014-06-22 17:46:00    A
1  2014-06-24 16:52:00    A
2  2014-06-25 20:02:00    A
3  2014-06-25 17:55:00    B
4  2014-07-02 11:36:00    B
5  2014-07-06 12:40:00    C
6  2014-07-05 12:46:00    C
7  2014-07-27 15:12:00    C

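How can I get, for each group, the index of the row whose time of day is closest to a given time, e.g. 17:00 (ignoring the date)? The expected result would be: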

>>> df.groupby('type').date. ???
type
A    1
B    3
C    7
Name: date, dtype: int64

Also, what if I want the closest time that is earlier than the given time? Again for 17:00, it should return:

>>> df.groupby('type').date. ???
type
A    1
B    4
C    7
Name: date, dtype: int64

3 answers:

Answer 0 (score: 1)

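Here is one way, using idxmin:

# the date column holds strings, so convert it before using the .dt accessor
df['date'] = pd.to_datetime(df['date'])
# put every time of day on one dummy date and take the absolute distance to 17:00
df['New']=abs(pd.to_datetime('2018-02-08'+' '+df['date'].dt.time.astype(str))-pd.to_datetime('2018-02-08 17:00'))


df.groupby('type').New.idxmin()
Out[123]: 
type
A    1
B    3
C    7
Name: New, dtype: int64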

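To consider only times earlier than 17:00:

df['New']=(pd.to_datetime('2018-02-08'+' '+df['date'].dt.time.astype(str))-pd.to_datetime('2018-02-08 17:00'))
# keep only the negative differences (times before 17:00); later times become NaT and are skipped by idxmin
df['New']=df['New'].where(df['New'].dt.total_seconds()<0).abs()
df.groupby('type').New.idxmin()
Out[134]: 
type
A    1
B    4
C    7
Name: New, dtype: int64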
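
As a self-contained sketch of the same idea, the block below starts from the question's string column and avoids building dummy-date strings. It assumes a different way of dropping the date than the answer uses: subtracting `dt.normalize()` (midnight of the same day) to get the time of day as an offset, then converting to seconds. The variable names `time_of_day`, `offset`, `closest` and `closest_earlier` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2014-06-22 17:46:00', '2014-06-24 16:52:00',
             '2014-06-25 20:02:00', '2014-06-25 17:55:00',
             '2014-07-02 11:36:00', '2014-07-06 12:40:00',
             '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
})
df['date'] = pd.to_datetime(df['date'])

# Time of day as an offset from midnight, so the calendar date drops out.
time_of_day = df['date'] - df['date'].dt.normalize()

# Signed distance to 17:00 in seconds (negative means earlier than 17:00).
offset = (time_of_day - pd.Timedelta(hours=17)).dt.total_seconds()

# Closest time per group, in either direction.
closest = offset.abs().groupby(df['type']).idxmin()

# Closest time per group among rows strictly before 17:00;
# later rows become NaN and are skipped by idxmin.
closest_earlier = offset.where(offset < 0).abs().groupby(df['type']).idxmin()

print(closest)
print(closest_earlier)
```

Working in plain seconds keeps the grouping step on ordinary floats, which is a design choice of this sketch rather than something the answer requires.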

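Answer 1 (score: 1)

Convert the time of day of each row to a datetime on a default date, then subtract the target time t to get the differences:

For the first question, take the absolute values and get the index of the minimum per group with DataFrameGroupBy.idxmin. For the second, replace the positive values with NaT using mask, then get the index of the largest negative value per group with DataFrameGroupBy.idxmax: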

df = pd.DataFrame({'date': ['2014-06-22 17:46:00', '2014-06-22 16:52:00', 
                            '2014-06-25 20:02:00', '2014-06-25 17:55:00', 
                            '2014-07-02 11:36:00', '2014-07-06 12:40:00', 
                            '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']})
#convert column to datetimes
df['date'] = pd.to_datetime(df.date)

t = '17:00:00'
a = pd.to_datetime(df['date'].dt.strftime('%H:%M:%S')) - pd.to_datetime(t)
print (a)
0            00:46:00
1   -1 days +23:52:00
2            03:02:00
3            00:55:00
4   -1 days +18:36:00
5   -1 days +19:40:00
6   -1 days +19:46:00
7   -1 days +22:12:00
Name: date, dtype: timedelta64[ns]


b = a.abs().groupby(df['type']).idxmin()
print (b)
type
A    1
B    3
C    7
Name: date, dtype: int64

c = a.mask(a > pd.Timedelta(0)).groupby(df['type']).idxmax()
print (c)
type
A    1
B    4
C    7
Name: date, dtype: int64

Detail:

df1 = pd.concat([df, a, a.abs(), a.mask(a >  pd.Timedelta(0))], axis=1)
df1.columns = ['date','type','diff','absolute diff','max negative']
print (df1)
                 date type              diff absolute diff      max negative
0 2014-06-22 17:46:00    A          00:46:00      00:46:00               NaT
1 2014-06-22 16:52:00    A -1 days +23:52:00      00:08:00 -1 days +23:52:00
2 2014-06-25 20:02:00    A          03:02:00      03:02:00               NaT
3 2014-06-25 17:55:00    B          00:55:00      00:55:00               NaT
4 2014-07-02 11:36:00    B -1 days +18:36:00      05:24:00 -1 days +18:36:00
5 2014-07-06 12:40:00    C -1 days +19:40:00      04:20:00 -1 days +19:40:00
6 2014-07-05 12:46:00    C -1 days +19:46:00      04:14:00 -1 days +19:46:00
7 2014-07-27 15:12:00    C -1 days +22:12:00      01:48:00 -1 days +22:12:00
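
The difference `a` above relies on `pd.to_datetime` filling in the same default date for both the column and `t`. A hedged variant, not part of the answer, is to parse both sides as durations with `pd.to_timedelta`, so no calendar date is involved at all. The sketch below also shows that the labels returned by `idxmin`/`idxmax` can be passed to `df.loc` to fetch the matching rows; `secs`, `rows_closest` and `rows_earlier` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2014-06-22 17:46:00', '2014-06-22 16:52:00',
             '2014-06-25 20:02:00', '2014-06-25 17:55:00',
             '2014-07-02 11:36:00', '2014-07-06 12:40:00',
             '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
})
df['date'] = pd.to_datetime(df['date'])

t = '17:00:00'
# Parse both sides as durations since midnight instead of datetimes.
a = pd.to_timedelta(df['date'].dt.strftime('%H:%M:%S')) - pd.to_timedelta(t)
secs = a.dt.total_seconds()

# Same two lookups as in the answer, on the differences in seconds.
b = secs.abs().groupby(df['type']).idxmin()
c = secs.mask(secs > 0).groupby(df['type']).idxmax()

# idxmin/idxmax return index labels, so .loc retrieves the matching rows.
rows_closest = df.loc[b]
rows_earlier = df.loc[c]
print(rows_closest)
print(rows_earlier)
```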

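Answer 2 (score: 0)

Building on the logic of @Wen's and @jezrael's solutions, and while waiting for their edits to iron out a few small problems, I found another solution that works: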

df = pd.DataFrame({'date': ['2014-06-22 17:46:00', '2014-06-24 16:52:00', '2014-06-25 20:02:00', '2014-06-25 17:55:00', '2014-07-02 11:36:00', '2014-07-06 12:40:00', '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']})

print(df)
                  date type
0  2014-06-22 17:46:00    A
1  2014-06-24 16:52:00    A
2  2014-06-25 20:02:00    A
3  2014-06-25 17:55:00    B
4  2014-07-02 11:36:00    B
5  2014-07-06 12:40:00    C
6  2014-07-05 12:46:00    C
7  2014-07-27 15:12:00    C
Question 1:

#convert str to datetime type
df['dateDT'] = pd.to_datetime(df.date)
#create a column holding 17:00 on each row's own date
df['5pm'] = pd.to_datetime(df.dateDT.dt.date.astype(str) + ' 17:00:00')
#find time difference in seconds
df['tDiff5pm'] = abs((df.dateDT - df['5pm']).dt.total_seconds())
#get min diff per group 
print(df.tDiff5pm.abs().groupby(df['type']).idxmin())
type
A    1
B    3
C    7
Name: tDiff5pm, dtype: int64
Question 2:

#as above but no absolute values
df['tDiff5pm2'] = (df.dateDT - df['5pm']).dt.total_seconds()
#set later times to NaN, then take abs (got this from @Wen's answer)
df['onlyEarlier']=df['tDiff5pm2'].where(df['tDiff5pm2']<0).abs()
#get min diff per group 
print(df.groupby('type').onlyEarlier.idxmin())
type
A    1
B    4
C    7
Name: onlyEarlier, dtype: int64
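
The two cases above can be folded into one function. The following is only a sketch: `closest_index_per_group` and its parameters (`earlier_only`, `date_col`, `group_col`) are hypothetical names, not part of pandas or of any answer here. It also makes one assumption the answers do not address: rows that are not earlier than the target are dropped before grouping, so a group with no earlier time is simply missing from the result.

```python
import pandas as pd


def closest_index_per_group(df, time, earlier_only=False,
                            date_col='date', group_col='type'):
    """Hypothetical helper: index label of the row closest to `time` per group.

    `time` is an 'HH:MM:SS' string. With earlier_only=True, only rows whose
    time of day is strictly before `time` are considered.
    """
    dates = pd.to_datetime(df[date_col])
    # Signed distance to the target in seconds, ignoring the calendar date.
    time_of_day = dates - dates.dt.normalize()
    offset = (time_of_day - pd.to_timedelta(time)).dt.total_seconds()
    if earlier_only:
        # Drop later (or equal) times instead of turning them into NaN.
        offset = offset[offset < 0]
    groups = df.loc[offset.index, group_col]
    return offset.abs().groupby(groups).idxmin()


df = pd.DataFrame({
    'date': ['2014-06-22 17:46:00', '2014-06-24 16:52:00',
             '2014-06-25 20:02:00', '2014-06-25 17:55:00',
             '2014-07-02 11:36:00', '2014-07-06 12:40:00',
             '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
})

print(closest_index_per_group(df, '17:00:00'))                     # A 1, B 3, C 7
print(closest_index_per_group(df, '17:00:00', earlier_only=True))  # A 1, B 4, C 7
```

Dropping rows rather than masking them is a design choice of this sketch; callers who need every group in the output could reindex the result against the full list of group labels.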