Question

I have a pandas DataFrame with 'data' and 'cond'(-ition) columns. I need the mean value (of the data column) of the rows that form the longest run of CONTINUOUS True values in 'cond'.

    Example DataFrame:

        cond  data
    0   True  0.20
    1  False  0.30
    2   True  0.90
    3   True  1.20
    4   True  2.30
    5  False  0.75
    6   True  0.80

    Result = 1.466, which is the mean value of row-indexes 2:4 with 3 True

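I was not able to find a "vectorized" solution using a groupby or pivot method, so I wrote a function that loops over the rows. Unfortunately, this takes about an hour for 1 million rows, which is far too long. The @jit decorator did not reduce the duration significantly either.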

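The data I want to analyze comes from a monitoring project that ran for one year, and every 3 hours I have a DataFrame with millions of rows. That makes about 3000 such files.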

An efficient solution is very important. I would also be very grateful for a NumPy solution.

Answer 1

Using the approach from "Calculating the number of specific consecutive equal values in a vectorized way in pandas":

df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])[lambda x: x['count']==x['count'].max()]
Out: 
      count      mean
cond                 
3         3  1.466667

Indexing with a callable requires pandas 0.18.0. For earlier versions, you can do the following:

res = df['data'].groupby((df['cond'] != df['cond'].shift()).cumsum()).agg(['count', 'mean'])

res[res['count'] == res['count'].max()]
Out: 
      count      mean
cond                 
3         3  1.466667
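
For readers who want to run the snippet as is, here is a minimal self-contained sketch. It rebuilds the example DataFrame from the question; the names `run_id` and `longest` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Example DataFrame from the question.
df = pd.DataFrame({
    "cond": [True, False, True, True, True, False, True],
    "data": [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80],
})

# Give every run of equal consecutive 'cond' values its own integer label.
run_id = (df["cond"] != df["cond"].shift()).cumsum()

# Row count and mean of 'data' per run, then keep the longest run(s).
res = df["data"].groupby(run_id).agg(["count", "mean"])
longest = res[res["count"] == res["count"].max()]
print(longest)
```

If several runs share the maximum length, `longest` contains one row per run.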

How it works:

The first part, df['cond'] != df['cond'].shift(), returns a boolean array:

df['cond'] != df['cond'].shift()
Out: 
0     True
1     True
2     True
3    False
4    False
5     True
6     True
Name: cond, dtype: bool

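Whenever a row is the same as the row above it, the value is False. That means that if you take the cumulative sum, these (consecutive) rows will all get the same number: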

(df['cond'] != df['cond'].shift()).cumsum()
Out: 
0    1
1    2
2    3
3    3
4    3
5    4
6    5
Name: cond, dtype: int32

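Since groupby accepts any Series to group on (you do not have to pass a column; you can pass an arbitrary list), this can be used to group the results. The .agg(['count', 'mean']) part just gives the count and the mean for each group, and at the end the group with the highest count is selected.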
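
The steps described above can be laid out one at a time. This is only a sketch on the question's example data, with illustrative variable names (`changed`, `run_id`, `stats`, `same`); it also shows that a plain list of labels works as a grouping key.

```python
import pandas as pd

df = pd.DataFrame({
    "cond": [True, False, True, True, True, False, True],
    "data": [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80],
})

# Step 1: True wherever a row differs from the row above it.
changed = df["cond"] != df["cond"].shift()

# Step 2: the cumulative sum turns each run into one integer label.
run_id = changed.cumsum()

# Step 3: count and mean of 'data' for every run, not only the longest.
stats = df["data"].groupby(run_id).agg(["count", "mean"])
print(stats)

# The grouping key does not have to be a Series: a list of labels
# with one entry per row gives the same groups.
same = df["data"].groupby(run_id.tolist()).agg(["count", "mean"])
```

Printing `stats` shows five groups, of which only group 3 has more than one row.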

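Note that this also groups consecutive False values together. If you only want to consider consecutive True values, you can change the grouping Series to: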

((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()

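Since the flag must be False only while the condition stays True, the test becomes "not equal to the row above OR not True", so every False row starts a group of its own. The original line would then change to: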

df['data'].groupby(((df['cond'] != df['cond'].shift()) | (df['cond'] != True)).cumsum()).agg(['count', 'mean'])[lambda x: x['count']==x['count'].max()]
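
To see the difference between the two grouping keys, here is a small sketch on made-up data in which the longest run overall consists of False rows. The DataFrame and the helper `longest_run` are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical data: the longest run overall is a run of three False rows.
df = pd.DataFrame({
    "cond": [True, True, False, False, False, True],
    "data": [1.0, 3.0, 10.0, 20.0, 30.0, 5.0],
})

def longest_run(key):
    """Return count and mean of 'data' for the longest group(s) under `key`."""
    res = df["data"].groupby(key).agg(["count", "mean"])
    return res[res["count"] == res["count"].max()]

cond = df["cond"]
all_runs = (cond != cond.shift()).cumsum()
true_runs = ((cond != cond.shift()) | (cond != True)).cumsum()

print(longest_run(all_runs))   # selects the three False rows
print(longest_run(true_runs))  # selects the two leading True rows
```

One caveat with the second key: every False row becomes a group of size 1, so if no run of True values is longer than one row, single False rows tie with it. Filtering to the rows where `cond` is True before picking the maximum would avoid that.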

Answer 2

Here is a NumPy-based approach -

# Extract the relevant cond column as a 1D NumPy array and pad it with False at
# either end, because later on we look for the start (rising edge)
# and stop (falling edge) of each interval of True values
arr = np.concatenate(([False],df.cond.values,[False]))

# Determine the rising and falling edges as start and stop 
start = np.nonzero(arr[1:] > arr[:-1])[0]
stop = np.nonzero(arr[1:] < arr[:-1])[0]

# Get the interval lengths and determine the largest interval ID
maxID = (stop - start).argmax()

# With maxID get max interval range and thus get mean on the second col
out = df.data.iloc[start[maxID]:stop[maxID]].mean()
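
The edge-detection steps above can be wrapped into a self-contained function. This is a sketch rather than the answer's exact code: the name `longest_true_run_mean` is hypothetical, and returning NaN when 'cond' contains no True value is an assumption added here, since `argmax` raises on an empty array.

```python
import numpy as np
import pandas as pd

def longest_true_run_mean(df):
    """Mean of 'data' over the first longest run of True values in 'cond'."""
    cond = df["cond"].to_numpy(dtype=bool)
    # Pad with False so that runs touching either end still have both edges.
    arr = np.concatenate(([False], cond, [False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]  # rising edges (inclusive)
    stop = np.nonzero(arr[1:] < arr[:-1])[0]   # falling edges (exclusive)
    if start.size == 0:
        # Assumption: no True value at all yields NaN instead of an error.
        return float("nan")
    k = (stop - start).argmax()  # the first longest run wins on ties
    return df["data"].iloc[start[k]:stop[k]].mean()

df = pd.DataFrame({
    "cond": [True, False, True, True, True, False, True],
    "data": [0.20, 0.30, 0.90, 1.20, 2.30, 0.75, 0.80],
})
result = longest_true_run_mean(df)
empty = longest_true_run_mean(
    pd.DataFrame({"cond": [False, False], "data": [1.0, 2.0]})
)
print(result)
```

On the example data, the rising edges are at positions 0, 2 and 6 and the falling edges at 1, 5 and 7, so the selected slice is rows 2 to 4.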

Runtime test

The approaches as functions -

def pandas_based(df): # @ayhan's soln
    res = df['data'].groupby((df['cond'] != df['cond'].shift()).\
                                cumsum()).agg(['count', 'mean'])
    return res[res['count'] == res['count'].max()]

def numpy_based(df):
    arr = np.concatenate(([False],df.cond.values,[False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    maxID = (stop - start).argmax()
    return df.data.iloc[start[maxID]:stop[maxID]].mean()

Timings -

In [208]: # Setup dataframe
     ...: N = 1000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [209]: %timeit pandas_based(df)
100 loops, best of 3: 2.61 ms per loop

In [210]: %timeit numpy_based(df)
1000 loops, best of 3: 215 µs per loop

In [211]: # Setup dataframe
     ...: N = 10000  # Datasize
     ...: df = pd.DataFrame(np.random.rand(N),columns=['data'])
     ...: df['cond'] = np.random.rand(N)>0.3 # To have 70% True values
     ...: 

In [212]: %timeit pandas_based(df)
100 loops, best of 3: 4.12 ms per loop

In [213]: %timeit numpy_based(df)
1000 loops, best of 3: 331 µs per loop
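
To repeat the comparison outside IPython, the following sketch uses the standard `timeit` module and first checks that both approaches return the same mean. It is not the code that produced the timings above: `pandas_true_runs` and `numpy_true_runs` are hypothetical names, and the pandas variant keeps only True rows so that both functions answer the same question.

```python
import timeit

import numpy as np
import pandas as pd

def pandas_true_runs(df):
    cond = df["cond"]
    run_id = (cond != cond.shift()).cumsum()
    # Keep only True rows so that a long run of False values cannot win.
    stats = df.loc[cond, "data"].groupby(run_id[cond]).agg(["count", "mean"])
    return stats.loc[stats["count"].idxmax(), "mean"]  # first longest run

def numpy_true_runs(df):
    arr = np.concatenate(([False], df["cond"].to_numpy(), [False]))
    start = np.nonzero(arr[1:] > arr[:-1])[0]
    stop = np.nonzero(arr[1:] < arr[:-1])[0]
    k = (stop - start).argmax()  # first longest run
    return df["data"].iloc[start[k]:stop[k]].mean()

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
N = 10_000
df = pd.DataFrame({"data": rng.random(N), "cond": rng.random(N) > 0.3})

a, b = pandas_true_runs(df), numpy_true_runs(df)
assert np.isclose(a, b)

for fn in (pandas_true_runs, numpy_true_runs):
    best = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```

Absolute numbers depend on the machine and on the pandas and NumPy versions, so they should not be expected to match the figures quoted above.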

Determine the mean of 'data' for the longest run of CONTINUOUS cond = True
