I have a dataframe in which one column holds categorical data and the rest are floats. I split the two apart by data type. Both resulting dataframes share the same timestamps, and the timestamps are their index.
I am trying to aggregate statistics for the numeric data, and the most frequent label for the categorical data, over 5-minute windows. I can handle each type separately, but I cannot combine the two results.
Telemetry = All[FLOATTYPE]
grouped = Telemetry.groupby(Telemetry.index.floor('5T'))
# compute various stats
grouped1 = grouped.agg(['mean', 'std'])

Category = All[CATEGORICALTYPE]
grouped2 = Category.groupby(Category.index.floor('5T'))
grouped2 = grouped2.agg(lambda x: x.value_counts().index[0] if len(x.dropna()) != 0 else np.nan)

grouped = grouped.merge(grouped2, axis=1)
AttributeError: Cannot access callable attribute 'merge' of 'DataFrameGroupBy' objects, try using the 'apply' method
Is there a way to avoid this problem with a single line of code:
grouped1 = grouped.agg(lambda x: ['mean', 'std'] if x.astype(float) else (x.value_counts().index[0] if len(x.dropna()) != 0 else np.nan))
or to combine the two results by their index and avoid the error?
Answer (score: 1)
Consider join instead of merge, which by default aligns on the indexes of the two dataframes:
final = grouped1.join(grouped2)
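If you prefer merge, an equivalent call (a sketch, not part of the original answer) joins explicitly on both indexes:

# equivalent to the join above: merge on the indexes of both dataframes
final = grouped1.merge(grouped2, left_index=True, right_index=True, how='left')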
However, you will want to flatten the hierarchical columns produced by the multiple aggregations of the Telemetry groupby; otherwise the differing column levels lead to unexpected results (and raise a warning):
grouped1 = Telemetry.groupby(Telemetry.index.floor('5T')).agg(['mean', 'std'])

from itertools import product
newcols = [str(i[0]) + '_' + i[1]
           for i in product(grouped1.columns.levels[0], grouped1.columns.levels[1])]
grouped1.columns = newcols
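An alternative flattening (a sketch, not in the original answer) builds the names directly from the column tuples, which avoids relying on the ordering of columns.levels:

# flatten the MultiIndex columns straight from the (column, stat) tuples
grouped1.columns = ['{}_{}'.format(col, stat) for col, stat in grouped1.columns]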
Demonstrated below with a reproducible example.
Data
import numpy as np
import pandas as pd
import datetime as dt
import time

LETTERS = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
epoch_time = int(time.time())

np.random.seed(1001)
ALL = pd.DataFrame({'NUM1': np.random.randn(50)*100,
                    'NUM2': np.random.uniform(0, 1, 50),
                    'NUM3': np.random.randint(100, size=50),
                    'CAT1': ["".join(np.random.choice(LETTERS, 1)) for _ in range(50)],
                    'CAT2': ["".join(np.random.choice(['pandas', 'r', 'julia', 'sas', 'stata', 'spss'], 1)) for _ in range(50)],
                    'CAT3': ["".join(np.random.choice(['postgres', 'mysql', 'sqlite', 'oracle', 'sql server', 'db2'], 1)) for _ in range(50)]},
                   index=[dt.datetime.fromtimestamp(np.random.randint(epoch_time - 5000, epoch_time))
                          for _ in range(50)])
Aggregation
from itertools import product

# NUMERIC COLS --------------------------------------------------
Telemetry = ALL.filter(regex='NUM', axis=1)
grouped1 = Telemetry.groupby(Telemetry.index.floor('5T')).agg(['mean', 'std'])

newcols = [str(i[0]) + '_' + i[1]
           for i in product(grouped1.columns.levels[0], grouped1.columns.levels[1])]
grouped1.columns = newcols

# CATEGORY COLS -------------------------------------------------
Category = ALL.filter(regex='CAT', axis=1)
grouped2 = Category.groupby(Category.index.floor('5T'))\
                   .agg(lambda x: x.value_counts().index[0] if len(x.dropna()) != 0 else np.nan)

final = grouped1.join(grouped2)
Output
print(final)
# NUM1_mean NUM1_std NUM2_mean NUM2_std NUM3_mean NUM3_std CAT1 CAT2 CAT3
# 2018-03-13 13:55:00 -17.516103 59.562954 0.530788 0.217159 67.000000 17.568912 I julia sqlite
# 2018-03-13 14:00:00 85.189272 NaN 0.842956 NaN 43.000000 NaN Y sas oracle
# 2018-03-13 14:05:00 -16.833329 201.004717 0.737183 0.332332 55.500000 38.890873 M spss postgres
# 2018-03-13 14:10:00 84.936984 80.634218 0.754657 0.110415 80.600000 17.213367 V stata oracle
# 2018-03-13 14:15:00 99.512503 11.492072 0.521307 0.250584 23.500000 30.405592 E pandas sqlite
# 2018-03-13 14:20:00 -90.749756 65.721659 0.459464 0.377603 35.250000 30.192438 G sas db2
# 2018-03-13 14:25:00 -56.271685 104.440802 0.496268 0.348611 56.500000 28.310775 K spss postgres
# 2018-03-13 14:30:00 55.369341 50.215679 0.600296 0.399855 65.000000 41.004065 P r db2
# 2018-03-13 14:40:00 184.546043 NaN 0.892016 NaN 84.000000 NaN W julia db2
# 2018-03-13 14:45:00 -93.886027 61.489475 0.498042 0.286001 48.000000 25.429641 W stata postgres
# 2018-03-13 14:50:00 122.819400 NaN 0.168059 NaN 41.000000 NaN S stata sqlite
# 2018-03-13 14:55:00 -34.318532 40.225336 0.756454 0.335583 26.666667 21.385353 F stata sqlite
# 2018-03-13 15:00:00 -2.329881 NaN 0.894770 NaN 73.000000 NaN Y sas postgres
# 2018-03-13 15:05:00 -86.408659 31.446422 0.618246 0.158136 52.000000 59.396970 G julia postgres
# 2018-03-13 15:10:00 -20.309460 121.773576 0.479996 0.394707 52.000000 42.585209 U sas oracle
# 2018-03-13 15:15:00 -5.493293 217.143835 0.478187 0.530773 59.500000 55.861436 E stata postgres
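Finally, to come back to the one-liner the question asked for: the whole thing can also be done in a single groupby().agg() call by building an aggregation spec per column based on its dtype. This is a sketch (not part of the original answer) that reuses ALL from the reproducible example; its result keeps hierarchical columns, so the same flattening step shown above would still apply:

def most_freq(x):
    # most common label in the window, or NaN if all values are null
    return x.value_counts().index[0] if len(x.dropna()) != 0 else np.nan

# pick ['mean', 'std'] for numeric columns and most_freq for everything else
aggspec = {c: ['mean', 'std'] if pd.api.types.is_numeric_dtype(ALL[c]) else most_freq
           for c in ALL.columns}
final_onepass = ALL.groupby(ALL.index.floor('5T')).agg(aggspec)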