Cannot 'merge' a 'DataFrameGroupBy'

Time: 2018-03-13 19:28:59

Tags: python pandas pandas-groupby

I have a dataframe in which one column holds categorical data and the rest are floats. I split it into two dataframes based on dtype. Both dataframes have timestamps as their index.

I am trying to aggregate statistics for the numeric data, as well as the most frequent label for the categorical data, over 5-minute intervals. I handle each type separately, but I cannot combine the two results.

Telemetry = All[FLOATTYPE]
grouped = Telemetry.groupby(Telemetry.index.floor('5T'))
# computing various stats
grouped1 = grouped.agg(['mean', 'std'])
Category = All[CATEGORICALTYPE]
grouped2 = Category.groupby(Category.index.floor('5T'))
grouped2 = grouped2.agg(lambda x: x.value_counts().index[0] if len(x.dropna()) != 0 else np.nan)

grouped = grouped.merge(grouped2, axis=1)

AttributeError: Cannot access callable attribute 'merge' of 'DataFrameGroupBy' objects, try using the 'apply' method

Is there a way to avoid this problem with a single line of code:

grouped1 = grouped.agg(lambda x: [ 'mean','std'] if x.astype(float) else  (x.value_counts().index[0] if len(x.dropna())!=0 else np.nan) )

Or to combine the two results on their shared index and avoid the error?

1 answer:

Answer 0: (score: 1)

Consider join instead of merge, which by default aligns the two dataframes on their indexes.

final = grouped1.join(grouped2)

However, you will want to flatten the hierarchical columns produced by the multiple-aggregation groupby of the telemetry data, to avoid the mismatched column levels causing unexpected results (and raising a warning):

grouped1 = Telemetry.groupby(Telemetry.index.floor('5T')).agg(['mean', 'std'])

from itertools import product

newcols = [str(i[0])+'_'+i[1]
           for i in list(product(grouped1.columns.levels[0], grouped1.columns.levels[1]))]
grouped1.columns = newcols
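
An alternative to the product-of-levels flattening above (a minimal sketch, not from the original answer) is to join each column tuple directly, which does not depend on the order of columns.levels matching the actual column order:

# join each (column, statistic) tuple, e.g. ('NUM1', 'mean') -> 'NUM1_mean'
grouped1.columns = ['_'.join(col) for col in grouped1.columns]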

A demonstration with a reproducible example follows.

Data

import numpy as np
import pandas as pd
import datetime as dt
import time

LETTERS = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')    
epoch_time = int(time.time())

np.random.seed(1001)
ALL = pd.DataFrame({'NUM1': np.random.randn(50)*100,
                    'NUM2': np.random.uniform(0,1,50),                   
                    'NUM3': np.random.randint(100, size=50),                                             
                    'CAT1': ["".join(np.random.choice(LETTERS,1)) for _ in range(50)],
                    'CAT2': ["".join(np.random.choice(['pandas', 'r', 'julia', 'sas', 'stata', 'spss'],1)) for _ in range(50)],              
                    'CAT3': ["".join(np.random.choice(['postgres', 'mysql', 'sqlite', 'oracle', 'sql server', 'db2'],1)) for _ in range(50)]}, 
                   index=[dt.datetime.fromtimestamp(np.random.randint(epoch_time - 5000, epoch_time)) for _ in range(50)])

Aggregation

from itertools import product

# NUMERIC COLS --------------------------------------------------
Telemetry = ALL.filter(regex='NUM', axis=1)

grouped1 = Telemetry.groupby(Telemetry.index.floor('5T')).agg(['mean', 'std'])

newcols = [str(i[0])+'_'+i[1]
           for i in list(product(grouped1.columns.levels[0], grouped1.columns.levels[1]))]
grouped1.columns = newcols

# CATEGORY COLS -------------------------------------------------
Category = ALL.filter(regex='CAT', axis=1)

grouped2 = Category.groupby(Category.index.floor('5T'))\
                   .agg(lambda x: x.value_counts().index[0] if len(x.dropna())!=0 else np.nan)

final = grouped1.join(grouped2)

Output

print(final)

#                       NUM1_mean    NUM1_std  NUM2_mean  NUM2_std  NUM3_mean   NUM3_std CAT1    CAT2      CAT3
# 2018-03-13 13:55:00  -17.516103   59.562954   0.530788  0.217159  67.000000  17.568912    I   julia    sqlite
# 2018-03-13 14:00:00   85.189272         NaN   0.842956       NaN  43.000000        NaN    Y     sas    oracle
# 2018-03-13 14:05:00  -16.833329  201.004717   0.737183  0.332332  55.500000  38.890873    M    spss  postgres
# 2018-03-13 14:10:00   84.936984   80.634218   0.754657  0.110415  80.600000  17.213367    V   stata    oracle
# 2018-03-13 14:15:00   99.512503   11.492072   0.521307  0.250584  23.500000  30.405592    E  pandas    sqlite
# 2018-03-13 14:20:00  -90.749756   65.721659   0.459464  0.377603  35.250000  30.192438    G     sas       db2
# 2018-03-13 14:25:00  -56.271685  104.440802   0.496268  0.348611  56.500000  28.310775    K    spss  postgres
# 2018-03-13 14:30:00   55.369341   50.215679   0.600296  0.399855  65.000000  41.004065    P       r       db2
# 2018-03-13 14:40:00  184.546043         NaN   0.892016       NaN  84.000000        NaN    W   julia       db2
# 2018-03-13 14:45:00  -93.886027   61.489475   0.498042  0.286001  48.000000  25.429641    W   stata  postgres
# 2018-03-13 14:50:00  122.819400         NaN   0.168059       NaN  41.000000        NaN    S   stata    sqlite
# 2018-03-13 14:55:00  -34.318532   40.225336   0.756454  0.335583  26.666667  21.385353    F   stata    sqlite
# 2018-03-13 15:00:00   -2.329881         NaN   0.894770       NaN  73.000000        NaN    Y     sas  postgres
# 2018-03-13 15:05:00  -86.408659   31.446422   0.618246  0.158136  52.000000  59.396970    G   julia  postgres
# 2018-03-13 15:10:00  -20.309460  121.773576   0.479996  0.394707  52.000000  42.585209    U     sas    oracle
# 2018-03-13 15:15:00   -5.493293  217.143835   0.478187  0.530773  59.500000  55.861436    E   stata  postgres
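
As for the one-liner asked about in the question, one possible single-pass alternative (a sketch only, assuming the combined frame ALL built in the Data section above; top_label is a hypothetical helper name, not part of the answer) is to build a per-column aggregation map and call groupby/agg once:

import numpy as np
import pandas as pd

# hypothetical helper: most frequent non-null value in a group, NaN if all values are null
def top_label(x):
    return x.value_counts().index[0] if len(x.dropna()) != 0 else np.nan

# map numeric columns to ['mean', 'std'] and the remaining (categorical) columns to top_label
agg_map = {col: (['mean', 'std'] if pd.api.types.is_numeric_dtype(ALL[col]) else top_label)
           for col in ALL.columns}

final_alt = ALL.groupby(ALL.index.floor('5T')).agg(agg_map)

# flatten the resulting MultiIndex columns, e.g. ('NUM1', 'mean') -> 'NUM1_mean'
final_alt.columns = ['_'.join(map(str, col)) for col in final_alt.columns]

This keeps the numeric statistics and the modal labels in a single groupby call, at the cost of still having to flatten the MultiIndex columns afterwards.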