Question

I need to estimate the mean of a pandas DataFrameGroupBy by considering only the values within a given percentile range.

For example, given the snippet

import numpy as np
import pandas as pd
a = np.matrix('1 1; 1 2; 1 4; 2 1; 2 2; 2 4')
data = pd.DataFrame(a)
groupby = data.groupby(0)
m1 = groupby.mean()

the result is

m1 =            1
      0          
      1  2.333333
      2  2.333333

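However, if the percentile range is chosen so that the maximum and minimum values are excluded, the result should be 2 for each group (only the middle value of 1, 2, 4 remains).

How can I filter, for each group, the values that lie within an arbitrary percentile range before estimating the mean? For example, considering only the values between the 20th and 80th percentiles.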

Answer 1

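You can use a custom function with either np.percentile or pd.Series.quantile. The performance difference is marginal. The idea is to include only values above the 20th percentile and below the 80th percentile when computing the grouped mean.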
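A minimal sketch of that idea, using the sample data from the question. The helper names `trimmed_mean_np` and `trimmed_mean_pd` are hypothetical, and the strict inequalities are an assumption taken from the "above the 20th, below the 80th" wording; this is not necessarily the code the answer originally used.

```python
import numpy as np
import pandas as pd

# Sample data from the question: column 0 is the group key, column 1 the value
data = pd.DataFrame([[1, 1], [1, 2], [1, 4], [2, 1], [2, 2], [2, 4]])

def trimmed_mean_np(x, low_pct, high_pct):
    # Hypothetical helper: percentile bounds via NumPy (percent scale, 0-100)
    low, high = np.percentile(x, [low_pct, high_pct])
    return x[(x > low) & (x < high)].mean()

def trimmed_mean_pd(x, low_pct, high_pct):
    # Hypothetical helper: same bounds via Series.quantile (fraction scale, 0-1)
    low, high = x.quantile([low_pct / 100, high_pct / 100]).to_numpy()
    return x[(x > low) & (x < high)].mean()

# Extra positional arguments to apply are forwarded to the helper
m_np = data.groupby(0)[1].apply(trimmed_mean_np, 20, 80)
m_pd = data.groupby(0)[1].apply(trimmed_mean_pd, 20, 80)
print(m_np)
```

With the default linear interpolation, each group's bounds are 1.4 and 3.2, so only the value 2 survives and both group means come out as 2.0.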

Answer 2

You can define a function that computes the mean for a data frame and then use the apply method, like this:

def mean_percent(df, per1, per2):
    # Write meaningful code here: return the mean of the values
    # that lie between the per1-th and per2-th percentiles
    pass

data = pd.DataFrame(a)
groupby = data.groupby(0)
m1 = groupby.apply(lambda df: mean_percent(df,20,80))

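This computes the mean of each group using only the values in the 20-80% range. If you need help writing the first function, feel free to ask in the comments and I will edit this answer.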
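One possible body for `mean_percent` is sketched below. It is an assumption rather than the answerer's implementation: it treats every column of the group frame as a value column, uses pandas' default linear interpolation for the quantiles, and keeps only values strictly between the two bounds. Selecting `[[1]]` before `apply` is also a choice made here, so that the group key is not passed to the function.

```python
import pandas as pd

def mean_percent(df, per1, per2):
    # Per-column bounds for this group (quantile expects fractions, not percents)
    low = df.quantile(per1 / 100)
    high = df.quantile(per2 / 100)
    # Mask out values outside the open interval, then average what is left
    return df[(df > low) & (df < high)].mean()

data = pd.DataFrame([[1, 1], [1, 2], [1, 4], [2, 1], [2, 2], [2, 4]])
m1 = data.groupby(0)[[1]].apply(lambda df: mean_percent(df, 20, 80))
print(m1)
```

For the sample data only the value 2 falls inside each group's bounds, so both groups get a mean of 2.0.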

Answer 3

Try

data.sort_values(by=1).groupby(by=0).agg(['first','last']).mean()

OR

data.sort_values(by=1).groupby(by=0).agg(['min','max']).mean()

Answer 4

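One approach is to filter the data frame before using groupby. You can sort the data frame by the column of interest and then exclude the first and last rows.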

data = data.sort_values(1).iloc[1:-1,:]
groupby = data.groupby(0)
m1 = groupby.mean()
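Note that the slice above trims the first and last rows of the whole sorted frame, not of each group. If the intent is to drop the smallest and largest value within every group, a per-group variant could look like the following sketch (an assumption about the intent, not the answerer's code):

```python
import pandas as pd

data = pd.DataFrame([[1, 1], [1, 2], [1, 4], [2, 1], [2, 2], [2, 4]])

# For each group: sort the values, drop the first and last, average the rest
m1 = data.groupby(0)[1].apply(lambda s: s.sort_values().iloc[1:-1].mean())
print(m1)
```

Groups with fewer than three rows have nothing left after the slice and yield NaN.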

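One more note: it is best practice not to use variable names that match common method names such as "groupby". If you can change it to something else, that is strongly recommended.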

Answer 5

Use np.percentile or np.quantile together with groupby + apply:

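import numpy as np
import pandas as pd

a = np.matrix('1 1 2; 1 2 3; 1 4 4; 2 1 6; 2 2 8; 2 4 16;7 8 45;9 10 9;11 12 3')
df = pd.DataFrame(a, columns=['a','b','c'])
# drop the grouping key column, then take the 20th and 80th percentiles of each remaining column
min_val, max_val = np.percentile(df.drop(columns='a').values, [20, 80], axis=0)
# alternative: np.quantile(df.drop(columns='a').values, [0.2, 0.8], axis=0)
# keep only values strictly between the bounds, then average per group
df1 = df.groupby('a')[['b', 'c']].apply(lambda x: x[(x < max_val) & (x > min_val)].mean())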

print(df1)
      b    c
a           
1   3.0  4.0
2   3.0  7.0
7   8.0  NaN
9   NaN  9.0
11  NaN  NaN
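The bounds in this answer are computed once over each whole column, so every group is filtered against the same limits. If the limits should instead be computed within each group, as the question asks, one way is to broadcast per-group quantiles with `transform`. This is a sketch under that assumption, with strict inequalities as above:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 2], [1, 2, 3], [1, 4, 4], [2, 1, 6], [2, 2, 8],
     [2, 4, 16], [7, 8, 45], [9, 10, 9], [11, 12, 3]],
    columns=['a', 'b', 'c'],
)

vals = df[['b', 'c']]
grouped = df.groupby('a')[['b', 'c']]
# Per-group bounds, repeated on every row of the group
low = grouped.transform(lambda s: s.quantile(0.2))
high = grouped.transform(lambda s: s.quantile(0.8))

# Values outside their own group's bounds become NaN and are skipped by mean
per_group = vals[(vals > low) & (vals < high)].groupby(df['a']).mean()
print(per_group)
```

Single-row groups (7, 9 and 11 here) end up as NaN, because the only value equals both bounds and the comparison is strict.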

Estimating the mean of a DataFrameGroupBy using only values within a percentile range

5 answers: