Applying a custom function in groupby returns NaN

Time: 2016-05-26 09:16:30

Tags: python python-3.x pandas aggregate series

Given a dict, performance, whose values are Series like this one:

2015-02-28           NaN
2015-03-02    100.000000
2015-03-03     98.997117
2015-03-04     98.909215
2015-03-05     99.909979
2015-03-06    100.161486
2015-03-09    100.502772
2015-03-10    101.685314
2015-03-11    102.518433
2015-03-12    102.427237
2015-03-13    103.424257
2015-03-16    102.669184
2015-03-17    102.181841
2015-03-18    102.436339
2015-03-19    102.672482
2015-03-20    102.238386
2015-03-23    101.460082
...

I want to group each of them by month, but keep only the first value in each month's data that is not np.nan:

for perf in performance:
    performance[perf] = performance[perf].groupby(performance[perf].index.month).apply(return_first)


def return_first(array_like):
    # Return data from 1st of month, or first value that is not np.nan
    for i in range(len(array_like)):
        if np.isnan(array_like[i]):
            continue
        else:
            return(array_like[i])

However, this returns NaN values:

2015-02-28   NaN
2015-03-02   NaN
2015-03-03   NaN
2015-03-04   NaN
2015-03-05   NaN
2015-03-06   NaN
2015-03-09   NaN
2015-03-10   NaN
2015-03-11   NaN
2015-03-12   NaN
2015-03-13   NaN
2015-03-16   NaN
2015-03-17   NaN
2015-03-18   NaN
2015-03-19   NaN
2015-03-20   NaN
2015-03-23   NaN
...

It should be:

2015-03-02   100   
...

I have no reason to suspect my index, which looks like a perfectly good pd.DatetimeIndex:

DatetimeIndex(['2015-02-28', '2015-03-02', '2015-03-03', '2015-03-04',
           '2015-03-05', '2015-03-06', '2015-03-09', '2015-03-10',
           '2015-03-11', '2015-03-12',
           ...
           '2016-02-16', '2016-02-17', '2016-02-18', '2016-02-19',
           '2016-02-22', '2016-02-23', '2016-02-24', '2016-02-25',
           '2016-02-26', '2016-02-29'],
          dtype='datetime64[ns]', length=265, freq=None)

Where did I go wrong?

1 answer:

Answer 0 (score: 1)

If every month has at least one non-NaN value, use first_valid_index:

print(df.b.groupby(df.index.month).apply(lambda x: x[x.first_valid_index()]))
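As a minimal illustration of what first_valid_index returns: it gives the index label of the first non-NaN entry, or None when every entry is NaN, which is why the one-liner above needs at least one valid value per month. The series s and all_nan below are made up for this sketch and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only to illustrate the method
idx = pd.to_datetime(['2015-02-28', '2015-03-02', '2015-03-03'])
s = pd.Series([np.nan, 100.0, 98.997117], index=idx)
all_nan = pd.Series([np.nan, np.nan], index=idx[:2])

first_label = s.first_valid_index()          # label of the first non-NaN entry
first_value = s[first_label]                 # look the value up by that label
missing_label = all_nan.first_valid_index()  # None: there is no valid entry

print(first_label, first_value, missing_label)
```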

A more general solution, which returns NaN if all values in a month are NaN:

def f(x):
    if x.first_valid_index() is None:
        return np.nan
    else:
        return x[x.first_valid_index()]

print(df.b.groupby(df.index.month).apply(f))

2      NaN
3    100.0
Name: b, dtype: float64
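As an alternative sketch that is not part of the original answer: the built-in GroupBy.first() returns the first non-null entry of each group, so it should give the same result as f without a custom function. The sample series below is hypothetical, and f is restated only so that the comparison is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical data: February is all NaN, April starts with a NaN
idx = pd.to_datetime(['2015-02-25', '2015-03-02', '2015-03-03',
                      '2015-04-05', '2015-04-09'])
s = pd.Series([np.nan, 100.0, 98.997117, np.nan, 100.502772],
              index=idx, name='b')

def f(x):
    # Same logic as the answer's f: NaN when the group has no valid value
    label = x.first_valid_index()
    return np.nan if label is None else x[label]

by_apply = s.groupby(s.index.month).apply(f)
by_first = s.groupby(s.index.month).first()  # skips NaN within each group

print(by_first)
```

Both approaches rely on the index being in chronological order, since "first" means first in the order the rows appear.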

If you want to group by both year and month, use to_period:

print(df.b.groupby(df.index.to_period('M')).apply(f))

2015-02      NaN
2015-03    100.0
Freq: M, Name: b, dtype: float64
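To illustrate why to_period matters once the data spans more than one year, here is a small sketch with hypothetical values: index.month puts March 2015 and March 2016 into the same group, while to_period('M') keeps them apart. GroupBy.first() is used here only for brevity.

```python
import pandas as pd

# Hypothetical data covering the same calendar month in two different years
idx = pd.to_datetime(['2015-03-02', '2015-03-03', '2016-03-01', '2016-03-02'])
s = pd.Series([100.0, 98.997117, 105.0, 106.0], index=idx)

by_month = s.groupby(s.index.month).first()             # one group: month 3
by_period = s.groupby(s.index.to_period('M')).first()   # two groups: 2015-03, 2016-03

print(by_month)
print(by_period)
```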

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'b': pd.Series({
    pd.Timestamp('2015-02-25'): np.nan,
    pd.Timestamp('2015-03-02'): 100.0,
    pd.Timestamp('2015-03-03'): 98.997117,
    pd.Timestamp('2015-03-04'): 98.909215,
    pd.Timestamp('2015-04-05'): np.nan,
    pd.Timestamp('2015-04-06'): np.nan,
    pd.Timestamp('2015-04-09'): 100.502772,
    pd.Timestamp('2015-04-10'): 101.685314,
    pd.Timestamp('2015-05-11'): 102.518433,
    pd.Timestamp('2015-05-12'): 102.427237,
    pd.Timestamp('2015-05-13'): 103.424257,
    pd.Timestamp('2015-06-16'): 102.669184,
    pd.Timestamp('2015-06-17'): np.nan,
    pd.Timestamp('2015-06-18'): 102.436339,
    pd.Timestamp('2015-07-19'): 102.672482,
    pd.Timestamp('2015-07-20'): 102.238386,
    pd.Timestamp('2015-08-23'): 101.460082,
})})
# Sort by date so that "first" means chronologically first;
# recent pandas versions keep dict insertion order instead of sorting keys
df = df.sort_index()
print(df)

                     b
2015-02-25         NaN
2015-03-02  100.000000
2015-03-03   98.997117
2015-03-04   98.909215
2015-04-05         NaN
2015-04-06         NaN
2015-04-09  100.502772
2015-04-10  101.685314
2015-05-11  102.518433
2015-05-12  102.427237
2015-05-13  103.424257
2015-06-16  102.669184
2015-06-17         NaN
2015-06-18  102.436339
2015-07-19  102.672482
2015-07-20  102.238386
2015-08-23  101.460082

def f(x):
    if x.first_valid_index() is None:
        return np.nan
    else:
        return x[x.first_valid_index()]

print(df.b.groupby(df.index.to_period('M')).apply(f))

2015-02           NaN
2015-03    100.000000
2015-04    100.502772
2015-05    102.518433
2015-06    102.669184
2015-07    102.672482
2015-08    101.460082
Freq: M, Name: b, dtype: float64