Question

I need to group a pandas DataFrame using the following criteria; it is similar to an OHLC aggregation:

open = last where volume > 0, in case there is no entry with volume > 0 use overall last
high = max
low = min
close = last
volume = max

My current implementation for this kind of operation (OHLC aggregation) is:

ohlc_dict = {
'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'sum',
}

df = df.groupby(pd.Grouper(freq='1Min',level=0,label='left')).agg(ohlc_dict)

How can I solve this? Thanks.

Example:

                     fi  ts     open     high      low    close  volume
datetime                                                               
2017-11-17 12:35:00   0   0  0.96214  0.96214  0.96214  0.96214       0
2017-11-17 12:35:00   0   0  0.96214  0.96214  0.96214  0.96214       0
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96213  0.96220  0.96213  0.96219      19
2017-11-17 12:35:00   0   0  0.96214  0.96220  0.96214  0.96219       0
2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96222       0
2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96222  0.96214  0.96221       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96223       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96221       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96220       0
2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96220      29
2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96220      29
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96221       0
2017-11-17 12:35:00   0   0  0.96214  0.96223  0.96214  0.96222       0

Expected output:

                     fi  ts     open     high      low    close  volume
datetime                                                               
2017-11-17 12:35:00   0   0  0.96213  0.96223  0.96213  0.96222      29

Additional information:

There are two data sources, which can be told apart by their volume value:

a. Volume = 0 (more frequent, less reliable)
b. Volume > 0 (less frequent, more reliable)

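Type 'b' is more reliable, so its open value should be used in preference to the type 'a' open value.

As for using last as the aggregation, it honestly does not matter much. Other aggregations (first, max, min) would work too, because the open value is the first quoted value within the minute (in this example) and never changes.

The wrong values show up when the connection to the server is interrupted. Type 'a' data cannot recover from this and gives me wrong values; type 'b' data can, and gives me the correct values.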

Answer 1

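You can use boolean indexing to select the rows with the maximum volume, and tail(1) to take the last open value among them, since you have a duplicated index, i.e.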

ohlc_dict = {
   'high': 'max',
   'low': 'min',
   'close': 'last',
   'volume': 'max',
}
grp = df.groupby(pd.Grouper(freq='1Min',level=0,label='left'))
ndf = grp.agg(ohlc_dict)

# select the columns with a list; tuple-style grp['open','volume'] is rejected by recent pandas
ndf['open'] = grp[['open','volume']].apply(lambda x : x[x['volume'] == x['volume'].max()].tail(1)['open'])

Output:

                        low  volume    close     high     open
datetime                                                       
2017-11-17 12:35:00  0.96213      29  0.96222  0.96223  0.96213
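
For anyone who wants to try this on a recent pandas version, here is a minimal self-contained sketch of the same idea. It assumes a four-row subset of the sample data, and the helper name open_at_max_volume is made up for illustration. Unlike the lambda above, it returns a scalar per group so the result aligns directly with ndf. If every row in a group has volume 0, the maximum is 0, all rows match, and the overall last open is used.

```python
import pandas as pd

# Four rows from the sample, all in the same minute (duplicated index).
idx = pd.to_datetime(["2017-11-17 12:35:00"] * 4)
df = pd.DataFrame(
    {
        "open": [0.96214, 0.96213, 0.96213, 0.96214],
        "high": [0.96214, 0.96220, 0.96223, 0.96223],
        "low": [0.96214, 0.96213, 0.96213, 0.96214],
        "close": [0.96214, 0.96219, 0.96220, 0.96222],
        "volume": [0, 19, 29, 0],
    },
    index=idx,
)
df.index.name = "datetime"


def open_at_max_volume(g):
    # Hypothetical helper: last open among the rows whose volume equals the group maximum.
    top = g.loc[g["volume"] == g["volume"].max(), "open"]
    return top.iloc[-1] if len(top) else float("nan")


grp = df.groupby(pd.Grouper(freq="1min", level=0, label="left"))
ndf = grp.agg({"high": "max", "low": "min", "close": "last", "volume": "max"})
ndf["open"] = grp[["open", "volume"]].apply(open_at_max_volume)
print(ndf)
```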

Answer 2

You can first aggregate the open column with last:

ohlc_dict = {
   'high': 'max',
   'low': 'min',
   'close': 'last',
   'open':'last',
   'volume':'sum'
}

g = df.groupby(pd.Grouper(freq='1Min',level=0,label='left'))
df2 = g.agg(ohlc_dict)
print (df2)
                         low    close     high     open  volume
datetime                                                       
2017-11-17 12:35:00  0.96213  0.96222  0.96223  0.96215      77

Then filter out all rows with zero volume and aggregate only the last value of open:

g1 = df[df['volume'] > 0].groupby(pd.Grouper(freq='1Min',level=0,label='left'))
df1 = g1['open'].last().reindex(df2.index)
print (df1)
datetime
2017-11-17 12:35:00    0.96213
Freq: T, Name: open, dtype: float64

Finally, merge the two into one DataFrame using to_frame and combine_first:

df3 = df1.to_frame().combine_first(df2)
print (df3)
                       close     high      low     open  volume
datetime                                                       
2017-11-17 12:35:00  0.96222  0.96223  0.96213  0.96213    77.0
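
To see the fallback at work, here is a small self-contained sketch of the same steps. The five-row data set is an assumption made up for illustration: the 12:35 minute has one row with volume > 0, and the 12:36 minute has none. It also uses 'max' for volume, which is what the question asks for, where the code above uses 'sum'.

```python
import pandas as pd

idx = pd.to_datetime(["2017-11-17 12:35:00"] * 3 + ["2017-11-17 12:36:00"] * 2)
df = pd.DataFrame(
    {
        "open": [0.96214, 0.96213, 0.96214, 0.96230, 0.96231],
        "high": [0.96220, 0.96223, 0.96223, 0.96235, 0.96236],
        "low": [0.96214, 0.96213, 0.96214, 0.96230, 0.96229],
        "close": [0.96220, 0.96220, 0.96222, 0.96233, 0.96234],
        "volume": [0, 29, 0, 0, 0],
    },
    index=idx,
)
df.index.name = "datetime"

# Step 1: plain aggregation, with open = overall last per minute.
g = df.groupby(pd.Grouper(freq="1min", level=0, label="left"))
df2 = g.agg(
    {"open": "last", "high": "max", "low": "min", "close": "last", "volume": "max"}
)

# Step 2: last open among rows with volume > 0; minutes without such rows become NaN.
g1 = df[df["volume"] > 0].groupby(pd.Grouper(freq="1min", level=0, label="left"))
df1 = g1["open"].last().reindex(df2.index)

# Step 3: prefer df1 where it has a value, otherwise fall back to df2.
df3 = df1.to_frame().combine_first(df2)
df3 = df3[["open", "high", "low", "close", "volume"]]
print(df3)
```

The 12:35 row takes its open from the volume > 0 row, while the 12:36 row keeps the overall last open because df1 is NaN there.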

Alternatively, use a custom function with the condition (slower):

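def ohlc_func(x):
    # prefer the last open among rows with volume > 0, else the overall last open
    a = x.loc[x['volume'] > 0, 'open']
    # use .iloc for positional access; [0] on a datetime-indexed Series is deprecated
    a = a.iloc[-1] if len(a) > 0 else x['open'].iloc[-1]
    b = x['high'].max()
    c = x['low'].min()
    d = x['close'].iloc[-1]
    e = x['volume'].sum()
    col = ['open','high','low','close','volume']
    return pd.Series([a,b,c,d,e], index=col)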


df = df.groupby(pd.Grouper(freq='1Min',level=0,label='left')).apply(ohlc_func)
print (df)
                        open     high      low    close  volume
datetime                                                       
2017-11-17 12:35:00  0.96213  0.96223  0.96213  0.96222    77.0
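
If apply turns out to be too slow, one possible alternative (not part of the answer above, just a sketch) is to blank out open on the zero-volume rows and rely on the fact that GroupBy.last skips missing values. The function name ohlc_prefer_traded_open, the temporary column names, and the sample data are all assumptions for illustration; volume uses 'max' as in the question.

```python
import pandas as pd


def ohlc_prefer_traded_open(df, freq="1min"):
    # Hypothetical helper: open_traded is NaN on rows with volume == 0.
    tmp = df.assign(open_traded=df["open"].where(df["volume"] > 0))
    out = tmp.groupby(pd.Grouper(freq=freq, level=0, label="left")).agg(
        open_traded=("open_traded", "last"),  # last non-null open with volume > 0
        open_any=("open", "last"),            # overall last open, used as fallback
        high=("high", "max"),
        low=("low", "min"),
        close=("close", "last"),
        volume=("volume", "max"),
    )
    out["open"] = out["open_traded"].fillna(out["open_any"])
    return out[["open", "high", "low", "close", "volume"]]


idx = pd.to_datetime(["2017-11-17 12:35:00"] * 3 + ["2017-11-17 12:36:00"] * 2)
sample = pd.DataFrame(
    {
        "open": [0.96214, 0.96213, 0.96214, 0.96230, 0.96231],
        "high": [0.96220, 0.96223, 0.96223, 0.96235, 0.96236],
        "low": [0.96214, 0.96213, 0.96214, 0.96230, 0.96229],
        "close": [0.96220, 0.96220, 0.96222, 0.96233, 0.96234],
        "volume": [0, 29, 0, 0, 0],
    },
    index=idx,
)
result = ohlc_prefer_traded_open(sample)
print(result)
```

Everything here goes through built-in aggregations, so no Python function is called per group.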

Groupby using conditional statements from different columns
