Pandas can't filter out empty groups using DataFrameGroupBy.filter

Asked: 2019-02-04 21:38:56

Tags: python pandas pandas-groupby

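I group a DataFrame with a datetime index into 10-minute buckets. Then I want to check the length of each bucket and discard every bucket that holds fewer samples than the largest one.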

However, groupby keeps creating an empty group that contains no elements, so that group never gets removed.

See the code below:

import pandas as pd
import numpy as np
import datetime as dt

# Generate test dataframe
rng = pd.date_range('2018-11-26 16:17:43.510000', periods=90000, freq='0.04S')
df = pd.DataFrame({'a':np.random.randn(len(rng)),'b':np.random.randn(len(rng))}, index=rng)

# Set interval and start time of the buckets
interval = dt.timedelta(minutes=10)
t0 = df.index[0]
base = t0.minute + (t0.second +t0.microsecond/1e6)/60

# Group df
groups = df.groupby(pd.Grouper(freq=interval, base=base))

print(len(groups)) 
#7

print(groups.size())

#2018-11-26 16:17:43.510    15000
#2018-11-26 16:27:43.510    15000
#2018-11-26 16:37:43.510    15000
#2018-11-26 16:47:43.510    15000
#2018-11-26 16:57:43.510    15000
#2018-11-26 17:07:43.510    15000
#2018-11-26 17:17:43.510        0 <- I want to remove this group

# Remove all buckets that hold fewer samples than the largest one
maxSize = max(groups.size())
def ismaxlen(x):
    print(len(x) == maxSize)
    return len(x) == maxSize

df = groups.filter(ismaxlen) #Prints 6 times True and one time False
                             #This should have removed the last group!
# Group the data again
groups = df.groupby(pd.Grouper(freq=interval, base=base))

print(len(groups)) 
#Prints again 7!! The 7th ghost group is still there

print(groups.size())

#2018-11-26 16:17:43.510    15000
#2018-11-26 16:27:43.510    15000
#2018-11-26 16:37:43.510    15000
#2018-11-26 16:47:43.510    15000
#2018-11-26 16:57:43.510    15000
#2018-11-26 17:07:43.510    15000
#2018-11-26 17:17:43.510        0  <- This group is still here



#Some more weirdness...

print(groups.groups)

#{Timestamp('2018-11-26 16:17:43.510000'): 15000,
# Timestamp('2018-11-26 16:27:43.510000'): 30000,
# Timestamp('2018-11-26 16:37:43.510000'): 45000,
# Timestamp('2018-11-26 16:47:43.510000'): 60000,
# Timestamp('2018-11-26 16:57:43.510000'): 75000,
# Timestamp('2018-11-26 17:07:43.510000'): 90000, <-
# Timestamp('2018-11-26 17:17:43.510000'): 90000} <- the last two groups end at the same index!

print(df.index[-1])
#2018-11-26 17:17:43.470000
# The last sample's index is earlier than the start of the last group. The last group should not even exist!
# Why is a group starting at 17:17:43.510 created if the last sample is at 17:17:43.470?

print(len(groups.indices)) 
#Prints 6. I have 7 groups, but only 6 indices! 7th group doesn't even exist!

How can I avoid this behavior? Why does it happen? Is it a bug?

1 Answer:

Answer 0 (score: 0)

This problem is caused by the base option. Depending on the value of base, groupby fails to create the correct number of groups.

Since the last group has no members, filter removes nothing, and the second groupby simply repeats what the first one did.
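To see why filter cannot help here: it hands back rows, not groups, so an empty bucket has nothing to drop, and regrouping with the same Grouper recreates it. Below is a minimal sketch with made-up data (two bursts of samples with a gap between them, so that an empty bucket shows up on current pandas as well). It assumes pandas >= 1.1, where the origin argument replaces base (base was later removed in pandas 2.0). Empty buckets are discarded on the size Series rather than through filter.

```python
import numpy as np
import pandas as pd

t0 = pd.Timestamp('2018-11-26 16:17:43.510000')
one_second = pd.Timedelta(seconds=1)

# Made-up data: 100 samples starting at t0, then 100 more starting 20 minutes later.
idx = pd.date_range(t0, periods=100, freq=one_second).append(
    pd.date_range(t0 + pd.Timedelta(minutes=20), periods=100, freq=one_second))
df = pd.DataFrame({'a': np.arange(len(idx))}, index=idx)

# origin anchors the bucket edges at t0 (assumption: pandas >= 1.1).
sizes = df.groupby(pd.Grouper(freq='10min', origin=t0)).size()

# The middle bucket holds no rows but is still listed with size 0;
# masking the size Series is what actually discards it.
non_empty = sizes[sizes > 0]

print(sizes)
print(non_empty)
```

The same mask works for the question's case, e.g. sizes[sizes == sizes.max()] to keep only the full buckets.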

This problem only occurs on Python 3 with pandas versions < 0.24.

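It can be reproduced with two cases, case1 and case2:

[reproduction snippet missing from the source]

This produces 2 groups in case1 (one of them empty), but only 1 in case2.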

This was fixed in pandas 0.24 and is discussed here: https://github.com/pandas-dev/pandas/issues/25161
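If upgrading is not an option, one possible workaround is to skip Grouper altogether and label each row with an integer bucket number counted from the first timestamp. Grouping by labels that actually occur in the data cannot produce an empty group. This is only a sketch: keep_full_buckets is a hypothetical helper name, and freq='40ms' is assumed to be equivalent to the question's '0.04S'.

```python
import numpy as np
import pandas as pd


def keep_full_buckets(df, interval, t0=None):
    """Keep only the rows whose bucket is as large as the largest bucket.

    Hypothetical helper, not part of pandas.
    """
    if t0 is None:
        t0 = df.index[0]
    # Integer bucket number for every row, counted from t0.
    bucket = (df.index - t0) // pd.Timedelta(interval)
    # Only labels present in the data become groups, so none is empty.
    sizes = df.groupby(bucket).size()
    full = sizes[sizes == sizes.max()].index
    return df[bucket.isin(full)]


# Test data like the question's, plus 100 extra samples forming a short 7th bucket.
rng = pd.date_range('2018-11-26 16:17:43.510000', periods=90100, freq='40ms')
df = pd.DataFrame({'a': np.random.randn(len(rng)),
                   'b': np.random.randn(len(rng))}, index=rng)

kept = keep_full_buckets(df, '10min')

# Regroup the result the same way to inspect the remaining bucket sizes.
buckets = (kept.index - df.index[0]) // pd.Timedelta('10min')
print(kept.groupby(buckets).size())
```

The integer labels can be turned back into bucket start times with t0 + label * pd.Timedelta(interval) if timestamps are needed.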