Pandas groupby agg: applying a function that takes multiple arguments

Time: 2020-08-18 00:13:51

Tags: pandas pandas-groupby aggregate apply

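How can I apply a function to a pandas groupby when the function takes several columns of the grouped DataFrame as arguments and returns two scalar values?

A reproducible example is below. The last line computes the f_value and p_value for one group.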

import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
import plotly.express as px

n=100
df = pd.DataFrame({
    'c': np.random.choice(['CATS', 'DOGS'], n),
    'x': np.random.choice(list('ABCDE'), n),
    'y': np.random.normal(5, 1, n)
})

signal = np.where(df['c'].eq('CATS') & df['x'].eq('A'), 1.1, 0)
df['y'] = df['y'] + signal

def get_ols_fp(df, x, y):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return (f_value, p_value)

# getting f_value and p_value works on a single subset of the DataFrame.
get_ols_fp(df[df['c'].eq('CATS')], 'x', 'y')

The code above works and returns the f_value and p_value. The following, however, does not work.

# how could we run the get_ols with a groupby().agg() 
df.groupby('c').agg(get_ols_fp('x', 'y'))

In this case, the desired output is a DataFrame with one row per level of the 'c' variable ('CATS' and 'DOGS'), one column for p_value and another for f_value.

2 answers:

Answer 0: (score: 1)

I would do it a bit differently. I don't know if this is the simplest way, but it works.

Example:

import pandas as pd
import numpy as np
from statsmodels.formula.api import ols

n=100
df = pd.DataFrame({
    'c': np.random.choice(['CATS', 'DOGS'], n),
    'x': np.random.choice(list('ABCDE'), n),
    'y': np.random.normal(5, 1, n)
})

signal = np.where(df['c'].eq('CATS') & df['x'].eq('A'), 1.1, 0)
df['y'] = df['y'] + signal

def get_ols_fp(df, x, y):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return (f_value, p_value)

# getting f_value and p_value works on a single subset of the DataFrame.
# get_ols_fp(df[df['c'].eq('CATS')], 'x', 'y')

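# Collect one row per group: the group key followed by (f_value, p_value).
df_result = pd.DataFrame([], columns=["c", "f_value", "p_value"])
# Group by the column name, not a one-element list: with a list,
# newer pandas versions yield the key as a tuple such as ('CATS',).
for c, dd in df.groupby('c'):
    v = get_ols_fp(dd, 'x', 'y')
    df_result.loc[len(df_result)] = [c, *v]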

df_result
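
As a possible variation on the loop above, the per-group results can be collected in a dict and turned into a DataFrame in a single step, which avoids growing the frame row by row. This is only a sketch: `two_stats` is a hypothetical stand-in for `get_ols_fp` (it uses both columns and returns two scalars) so that the example runs without statsmodels, and the column names `max_mean` and `min_mean` are made up for the illustration.

```python
import numpy as np
import pandas as pd

def two_stats(df, x, y):
    # Hypothetical stand-in for get_ols_fp: needs both columns, returns two scalars.
    level_means = df.groupby(x)[y].mean()
    return (level_means.max(), level_means.min())

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    'c': rng.choice(['CATS', 'DOGS'], n),
    'x': rng.choice(list('ABCDE'), n),
    'y': rng.normal(5, 1, n),
})

# One dict entry per group, then a single DataFrame construction.
rows = {c: two_stats(dd, 'x', 'y') for c, dd in df.groupby('c')}
df_result = pd.DataFrame.from_dict(
    rows, orient='index', columns=['max_mean', 'min_mean']
)
df_result.index.name = 'c'
print(df_result)
```

Replacing `two_stats` with `get_ols_fp` and the column names with `f_value` and `p_value` should give the table asked for in the question, with 'c' as the index.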


Answer 1: (score: 1)

This works:

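def get_ols_fp(df, x=None, y=None):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    # Returning a labelled Series makes apply produce one column per label.
    return pd.Series([f_value, p_value], index=['f_value', 'p_value'])

# apply passes each group's DataFrame as the first argument and forwards x and y.
df.groupby('c').apply(get_ols_fp, x='x', y='y')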
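
For context on why the original attempt failed: `get_ols_fp('x', 'y')` is evaluated before `agg` ever runs, so it raises a TypeError for the missing third argument, and `agg` applies a function to each column separately, so it would never see 'x' and 'y' together anyway. `apply` instead hands the function the whole sub-DataFrame for each group. The sketch below illustrates that mechanism without statsmodels: `f_and_levels` is a hypothetical stand-in that computes a one-way F statistic by hand (no p-value), and selecting `[['x', 'y']]` before `apply` is an optional choice made here to keep the grouping column out of what the function receives.

```python
import numpy as np
import pandas as pd

def f_and_levels(group, x=None, y=None):
    # Hypothetical stand-in for get_ols_fp: one-way F statistic computed by hand.
    levels = group.groupby(x)[y]
    k = levels.ngroups          # number of levels of x in this group
    n = len(group)              # number of rows in this group
    grand_mean = group[y].mean()
    ss_between = (levels.size() * (levels.mean() - grand_mean) ** 2).sum()
    ss_within = ((group[y] - levels.transform('mean')) ** 2).sum()
    f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
    return pd.Series({'f_value': f_value, 'n_levels': k})

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    'c': rng.choice(['CATS', 'DOGS'], n),
    'x': rng.choice(list('ABCDE'), n),
    'y': rng.normal(5, 1, n),
})

# Each group's sub-DataFrame is the first argument; x and y are forwarded as keywords.
result = df.groupby('c')[['x', 'y']].apply(f_and_levels, x='x', y='y')
print(result)
```

The result has one row per value of 'c' and one column per label in the returned Series, which is the shape requested in the question.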