Question

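I am trying to write a function for a dataframe, and it has turned out to be too complicated for me. I have a dataframe that looks like this: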

df1

     hour    production ....      
0     1          10
0     2          20
0     1          30
0     3          40
0     1          40
0     4          30
0     1          20
0     4          10

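I am trying to create a function that does the following:

Group the data by the distinct values of hour
Compute the 90% confidence interval of production for each hour
If a row's production value falls outside the 90% confidence interval for its hour, mark it as unusual by creating a new column

Below are the steps I currently run for each hour individually:

Compute the confidence interval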

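from numpy import mean
from scipy.stats import sem, t

confidence = 0.90
data = df1['production']
n = len(data)
m = mean(data)
std_err = sem(data)
# half-width of the interval, from the t distribution with n - 1 degrees of freedom
h = std_err * t.ppf((1 + confidence) / 2, n - 1)
lower_interval = m - h
upper_interval = m + h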

Then:

def confidence_interval(x):
    # return 1 when the row's production lies outside the interval, else 0
    if x['production'] > upper_interval:
        return 1
    if x['production'] < lower_interval:
        return 1
    return 0

df1['unusual'] = df1.apply(confidence_interval, axis=1)

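I do this separately for each hour value and then have to merge all the results back into the original dataframe.

Can someone help me create a function that does all of the above in one go? I have tried, but I really cannot get it to work.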

Answer 1

Create a custom function and use GroupBy.transform together with Series.between, inverting the mask with ~:

from numpy import mean  # scipy.mean is no longer available in recent SciPy releases
from scipy.stats import sem, t

def confidence_interval(data):
    confidence = 0.90
    n = len(data)
    m = mean(data)
    std_err = sem(data)
    h = std_err * t.ppf((1 + confidence) / 2, n - 1)
    lower_interval = m - h
    upper_interval = m + h
    #print (lower_interval ,upper_interval)
    # True where the value is NOT strictly inside the interval
    # (pandas >= 1.3 takes inclusive='neither'; older versions used inclusive=False)
    return ~data.between(lower_interval, upper_interval, inclusive='neither')

df1['new'] = df1.groupby('hour')['production'].transform(confidence_interval).astype(int)
print(df1)
   hour  production  new
0     1          10    0
0     2          20    1
0     1          30    0
0     3          40    1
0     1          40    0
0     4          30    0
0     1          20    0
0     4          10    0
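
Note that hours 2 and 3 are flagged in the output above only because each has a single row: with n = 1 the standard error is NaN, so between returns False and the inverted mask becomes True. If that is not what you want, here is a minimal sketch of an alternative that keeps the bounds as columns so they can be inspected. The function name flag_unusual, its parameters, and the lower/upper column names are my own choices for illustration, not part of the code above. In this variant, groups with a single observation get NaN bounds and are left unflagged.

```python
import numpy as np
import pandas as pd
from scipy.stats import t

def flag_unusual(df, group_col="hour", value_col="production", confidence=0.90):
    """Return a copy of df with per-group interval bounds and an 'unusual' flag."""
    grouped = df.groupby(group_col)[value_col]
    n = grouped.transform("count")        # group size, repeated on every row
    m = grouped.transform("mean")         # group mean
    std_err = grouped.transform("std") / np.sqrt(n)  # NaN for single-row groups
    # half-width of the interval from the t distribution (NaN when n - 1 == 0)
    h = std_err * t.ppf((1 + confidence) / 2, (n - 1).to_numpy())

    out = df.copy()
    out["lower"] = m - h
    out["upper"] = m + h
    # Comparisons against NaN are False, so single-row groups stay 0 here
    outside = (out[value_col] < out["lower"]) | (out[value_col] > out["upper"])
    out["unusual"] = outside.astype(int)
    return out

df1 = pd.DataFrame({
    "hour":       [1, 2, 1, 3, 1, 4, 1, 4],
    "production": [10, 20, 30, 40, 40, 30, 20, 10],
})
result = flag_unusual(df1)
print(result)
```

Because everything is computed with transform, the result stays aligned with the original rows and no merge step is needed. Pass a different confidence value (for example 0.95) to widen the interval.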
