Using pd.DataFrame.agg to create feature vectors

Time: 2017-10-12 10:04:44

Tags: pandas

I want to calculate some features for a collection of time series (stored as the columns of a DataFrame).

I know I can use pandas.DataFrame.agg for that, but I can't seem to be able to give custom names to the resulting columns/rows of the DataFrame.

Note: This is just an example. I know I can pass ['sum', 'std', 'mean'] etc. to agg, but I'd like to do this for arbitrary aggregation functions.

The code below does what I want:

import pandas as pd
import numpy as np

n_series = 5
n_time_samples = 10

data = np.random.rand(n_time_samples, n_series)
columns = ['s{:d}'.format(i) for i in range(n_series)]

df = pd.DataFrame(data, columns=columns)

df.agg([lambda x: x.mean(), 
        lambda x: x.std()], axis=0).T

The result is a feature vector for each time series:

    <lambda>  <lambda>
s0  0.406411  0.330624
s1  0.446666  0.301839
s2  0.498958  0.159052
s3  0.613881  0.353684
s4  0.455623  0.287457

However, I'd like to have a proper name for the features. It is not possible to pass a dictionary in order to do that:

# Throws KeyError
df.agg({'f1': lambda x: x.mean(), 
        'f2': lambda x: x.std()}, axis=0).T

I know I can just rename the columns by setting df.columns, but I was wondering if I can solve this by using agg only.

As a side note: setting axis=1 will also fail:

df.agg([lambda x: x.mean(), 
        lambda x: x.std()], axis=1).T

This throws:

TypeError: ("'list' object is not callable", 'occurred at index 0')

but

# Note transpose
df.T.agg([lambda x: x.mean(), 
          lambda x: x.std()], axis=0).T

will work?
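For reference, a minimal runnable sketch of the transpose workaround, producing one feature vector per time sample (row). The function names row_mean and row_std and the seeded random data are assumptions made for illustration; named functions are used so that the resulting labels are readable.

```python
import numpy as np
import pandas as pd

# Hypothetical named aggregation functions (not from the original post).
def row_mean(x):
    return x.mean()

def row_std(x):
    return x.std()

# Same shape as the example data: 10 time samples, 5 series.
rng = np.random.default_rng(0)
columns = ['s{:d}'.format(i) for i in range(5)]
df = pd.DataFrame(rng.random((10, 5)), columns=columns)

# Transpose so each time sample becomes a column, aggregate along axis=0,
# then transpose back so each row holds the features of one time sample.
row_features = df.T.agg([row_mean, row_std], axis=0).T
print(row_features)
```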

1 answer:

Answer 0 (score: 0)

Here's one way: agg labels each result with the function's __name__, so use named functions instead of lambdas.

In [1023]: def f1(x):
      ...:     return x.mean()
      ...:

In [1024]: def f2(x):
      ...:     return x.std()
      ...:

In [1025]: df.agg([f1, f2], axis=0).T
Out[1025]:
          f1        f2
s0  0.593445  0.282322
s1  0.554996  0.247396
s2  0.441740  0.321923
s3  0.379589  0.295618
s4  0.602647  0.259439

To use lambda functions, set their __name__ attribute:

In [1042]: f1_ = lambda x: x.mean()

In [1043]: f2_ = lambda x: x.std()

In [1044]: f1_.__name__ = 'f1x'

In [1045]: f2_.__name__ = 'f2x'

In [1046]: df.agg([f1_, f2_], axis=0).T
Out[1046]:
         f1x       f2x
s0  0.593445  0.282322
s1  0.554996  0.247396
s2  0.441740  0.321923
s3  0.379589  0.295618
s4  0.602647  0.259439
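Building on the __name__ trick, the dictionary from the question can be kept as the single place where feature names are defined. The sketch below is one possible way to do that, not a pandas feature: the helper named and the dictionary feature_funcs are hypothetical names introduced here, and the data is seeded random data for illustration only.

```python
import numpy as np
import pandas as pd

def named(name, func):
    """Hypothetical helper: set func.__name__ (mutating func) and return it."""
    func.__name__ = name
    return func

rng = np.random.default_rng(0)
columns = ['s{:d}'.format(i) for i in range(5)]
df = pd.DataFrame(rng.random((10, 5)), columns=columns)

# Feature name -> aggregation function, as in the question's dict attempt.
feature_funcs = {
    'f1': lambda x: x.mean(),
    'f2': lambda x: x.std(),
}

# Pass a list of renamed functions to agg; the dict itself is never given
# to agg, because agg would treat its keys as column labels.
features = df.agg([named(name, func) for name, func in feature_funcs.items()],
                  axis=0).T
print(features)
```

The result has one row per series and one column per key of feature_funcs, in the dictionary's insertion order.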