Pandas daily groupby with a condition based on the first higher value

Time: 2017-02-02 23:08:20

Tags: python pandas dataframe group-by

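Question:

How can I find, for each day, the first row where num_2 > num_1? I need a daily groupby whose condition is based on the first higher value, as in the example below.

Data: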

import pandas as pd

df = pd.DataFrame({
    'num_1':[1,2,3,4,5,6,7,8,9,10,11,12],
    'num_2':[1,2,10,5,5,6,7,8,100,101,102,15],
    'dates':pd.date_range('1/1/2011', periods=12, freq='8h')})

df

                 dates  num_1  num_2
0  2011-01-01 00:00:00      1      1
1  2011-01-01 08:00:00      2      2
2  2011-01-01 16:00:00      3     10
3  2011-01-02 00:00:00      4      5
4  2011-01-02 08:00:00      5      5
5  2011-01-02 16:00:00      6      6
6  2011-01-03 00:00:00      7      7
7  2011-01-03 08:00:00      8      8
8  2011-01-03 16:00:00      9    100
9  2011-01-04 00:00:00     10    101
10 2011-01-04 08:00:00     11    102
11 2011-01-04 16:00:00     12     15

I have highlighted the rows in this data where the condition is True:

[image: numberimagedf]

Desired output:

A new column showing 1 where the condition is True for the first time that day, and 0 otherwise.

2 Answers:

Answer 0 (score: 2)

Solution:

In [85]: df['result'] = \
    ...:     df.dates.isin(
    ...:         df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
    ...:           .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']).astype(int)
    ...:

In [86]: df
Out[86]:
                 dates  num_1  num_2  result
0  2011-01-01 00:00:00      1      1       0
1  2011-01-01 08:00:00      2      2       0
2  2011-01-01 16:00:00      3     10       1
3  2011-01-02 00:00:00      4      5       1
4  2011-01-02 08:00:00      5      5       0
5  2011-01-02 16:00:00      6      6       0
6  2011-01-03 00:00:00      7      7       0
7  2011-01-03 08:00:00      8      8       0
8  2011-01-03 16:00:00      9    100       1
9  2011-01-04 00:00:00     10    101       1
10 2011-01-04 08:00:00     11    102       0
11 2011-01-04 16:00:00     12     15       0

Explanation, step by step:

In [80]: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False) \
    ...:   .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))
    ...:
Out[80]:
                  dates  num_1  num_2  result
0 2 2011-01-01 16:00:00      3     10       1
1 3 2011-01-02 00:00:00      4      5       1
2 8 2011-01-03 16:00:00      9    100       1
3 9 2011-01-04 00:00:00     10    101       1

In [81]: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False) \
    ...:   .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']
    ...:
Out[81]:
0  2   2011-01-01 16:00:00
1  3   2011-01-02 00:00:00
2  8   2011-01-03 16:00:00
3  9   2011-01-04 00:00:00
Name: dates, dtype: datetime64[ns]

In [82]: df.dates.isin(
    ...:     df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
    ...:       .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates'])
    ...:
Out[82]:
0     False
1     False
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9      True
10    False
11    False
Name: dates, dtype: bool

In [83]: df.dates.isin(
    ...:     df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
    ...:       .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']).astype(int)
    ...:
Out[83]:
0     0
1     0
2     1
3     1
4     0
5     0
6     0
7     0
8     1
9     1
10    0
11    0
Name: dates, dtype: int32
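
As a side note, the same flags can probably be computed without `apply`. The following is a minimal sketch of an alternative, not the method used in this answer: it counts hits cumulatively within each day and keeps the row where the count first reaches 1. The names `mask`, `day` and `hits_so_far` are made up for illustration. Because it works on row positions rather than on `isin` over timestamps, it should also behave sensibly if timestamps were ever duplicated.

```python
import pandas as pd

df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'num_2': [1, 2, 10, 5, 5, 6, 7, 8, 100, 101, 102, 15],
    'dates': pd.date_range('1/1/2011', periods=12, freq='8h')})

# Row-wise condition: True wherever num_2 exceeds num_1.
mask = df['num_2'] > df['num_1']

# Calendar day of each row (timestamp truncated to midnight).
day = df['dates'].dt.normalize()

# Running count of hits within each day. It equals 1 from the first hit
# until the second hit, so combining it with the mask keeps only the first hit.
hits_so_far = mask.astype(int).groupby(day).cumsum()

df['result'] = (mask & (hits_so_far == 1)).astype(int)
print(df)
```

On the sample data this should mark rows 2, 3, 8 and 9, matching the output above.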

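Answer 1 (score: 2)

You can apply a lambda that evaluates the condition and use idxmax to return the index label where it first holds within each day, then set those rows to 1: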

In [36]:
# assign default value, this sets the dtype to int so we don't have to convert and fillna after the following line
df['result'] = 0
df.loc[df.groupby(df['dates'].dt.date).apply(lambda x: (x['num_2'] > x['num_1']).idxmax()),'result'] = 1
df

Out[36]:
                 dates  num_1  num_2  result
0  2011-01-01 00:00:00      1      1       0
1  2011-01-01 08:00:00      2      2       0
2  2011-01-01 16:00:00      3     10       1
3  2011-01-02 00:00:00      4      5       1
4  2011-01-02 08:00:00      5      5       0
5  2011-01-02 16:00:00      6      6       0
6  2011-01-03 00:00:00      7      7       0
7  2011-01-03 08:00:00      8      8       0
8  2011-01-03 16:00:00      9    100       1
9  2011-01-04 00:00:00     10    101       1
10 2011-01-04 08:00:00     11    102       0
11 2011-01-04 16:00:00     12     15       0
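
One caveat worth keeping in mind: `idxmax` on an all-False group still returns that group's first label, so a day with no row satisfying the condition would be flagged by mistake. Every day in the sample data has at least one hit, so the output above is unaffected. The sketch below shows one possible guard, using hypothetical data in which the first day has no hit; the names `hits`, `grouped` and `first_idx` are illustrative only.

```python
import pandas as pd

# Hypothetical data: no row on 2011-01-01 satisfies num_2 > num_1.
df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6],
    'num_2': [1, 2, 3, 4, 9, 9],
    'dates': pd.date_range('1/1/2011', periods=6, freq='8h')})

# 1 where the condition holds, 0 elsewhere, grouped by calendar day.
hits = (df['num_2'] > df['num_1']).astype(int)
grouped = hits.groupby(df['dates'].dt.date)

# idxmax gives the label of the first maximum per day; on a day without
# any hit the maximum is 0, so it would point at that day's first row.
first_idx = grouped.idxmax()

# Keep only the days that actually contain a hit.
first_idx = first_idx[grouped.max() == 1]

df['result'] = 0
df.loc[first_idx.to_numpy(), 'result'] = 1
print(df)
```

Here only row 4 (2011-01-02 08:00:00) should end up with `result` equal to 1; without the filter on `grouped.max()`, row 0 would be flagged as well.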