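How do I apply a function to a pandas groupby when the function needs several columns of each group's DataFrame as arguments and returns two scalar values?
Below is a reproducible example. The last line gets the f_value and p_value.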
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
import plotly.express as px
n=100
df = pd.DataFrame({
'c': np.random.choice(['CATS', 'DOGS'], n),
'x': np.random.choice(list('ABCDE'), n),
'y': np.random.normal(5, 1, n)
})
signal = np.where(df['c'].eq('CATS') & df['x'].eq('A'), 1.1, 0)
df['y'] = df['y'] + signal
def get_ols_fp(df, x, y):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return (f_value, p_value)
# getting f_value and p_value works with a single series.
get_ols_fp(df[df['c'].eq('CATS')], 'x', 'y')
The code above works and returns the f_value and p_value. However, the following does not work.
# how could we run the get_ols with a groupby().agg()
df.groupby('c').agg(get_ols_fp('x', 'y'))
In this case, the desired output is a DataFrame with one row for each level of the 'c' variable ('CATS' and 'DOGS'), one column for p_value and another column for f_value.
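For context, the failing line seems to break for two separate reasons: `get_ols_fp('x', 'y')` is evaluated immediately, before `agg` ever sees it, so the function is called with one argument too few; and even with a proper callable, `agg` hands the function one column at a time, so it never sees 'x' and 'y' together. The sketch below illustrates both points on a tiny hand-made frame. The stand-in function body and the `probe` helper are hypothetical and exist only for this demonstration.

```python
import pandas as pd

# Tiny hand-made frame (illustrative data, not the question's random sample).
df = pd.DataFrame({
    'c': ['CATS', 'CATS', 'DOGS', 'DOGS'],
    'x': list('ABAB'),
    'y': [1.0, 2.0, 3.0, 4.0],
})

def get_ols_fp(df, x, y):
    """Stand-in with the same signature as in the question; the body is irrelevant here."""
    return (0.0, 1.0)

# Point 1: the call runs right away with 'x' bound to df and 'y' bound to x,
# so Python raises a TypeError before agg is even invoked.
try:
    get_ols_fp('x', 'y')
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Point 2: agg passes each column of each group separately, as a Series.
seen_types = []

def probe(s):
    # Hypothetical helper: record what agg actually passes in.
    seen_types.append(type(s).__name__)
    return len(s)

df.groupby('c').agg(probe)
print(set(seen_types))
```

Because the model needs both columns at once, something that receives the whole group (a loop over the groups, or `apply`) is a better fit than `agg`.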
Answer 0 (score: 1)
I would do it a bit differently. I don't know if this is the simplest way, but it works.
Example:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
n=100
df = pd.DataFrame({
'c': np.random.choice(['CATS', 'DOGS'], n),
'x': np.random.choice(list('ABCDE'), n),
'y': np.random.normal(5, 1, n)
})
signal = np.where(df['c'].eq('CATS') & df['x'].eq('A'), 1.1, 0)
df['y'] = df['y'] + signal
def get_ols_fp(df, x, y):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return (f_value, p_value)
# getting f_value and p_value works with a single series.
# get_ols_fp(df[df['c'].eq('CATS')], 'x', 'y')
df_result = pd.DataFrame([], columns=["c", "f_value", "p_value"])
# Group by the plain label 'c' (not the list ['c']) so each key is a scalar rather than a 1-tuple.
for c, dd in df.groupby('c'):
    v = get_ols_fp(dd, 'x', 'y')
    df_result.loc[len(df_result)] = [c, *v]
df_result
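A possible variation on the same loop idea, sketched here rather than taken from the answer: collect the per-group tuples in a dict and build the result frame in one step, instead of appending rows one at a time. The fixed random seed and the `results` name are assumptions added for repeatability and illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatable output
n = 100
df = pd.DataFrame({
    'c': rng.choice(['CATS', 'DOGS'], n),
    'x': rng.choice(list('ABCDE'), n),
    'y': rng.normal(5, 1, n),
})

def get_ols_fp(df, x, y):
    # Fit y ~ x on the given frame and return the F statistic and its p-value.
    model = ols(y + '~' + x, df).fit()
    return (model.fvalue, model.f_pvalue)

# One (f_value, p_value) tuple per group, keyed by the group label.
results = {c: get_ols_fp(dd, 'x', 'y') for c, dd in df.groupby('c')}

# Turn the dict into a frame: one row per group, one column per statistic.
df_result = pd.DataFrame.from_dict(
    results, orient='index', columns=['f_value', 'p_value']
).rename_axis('c')
print(df_result)
```

Here the group label ends up in the index rather than in a 'c' column; `df_result.reset_index()` would move it back into a column if that layout is preferred.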
Answer 1 (score: 1)
This works:
def get_ols_fp(df, x=None, y=None):
    formula = y + '~' + x
    model = ols(formula, df).fit()
    f_value = model.fvalue
    p_value = model.f_pvalue
    return pd.Series([f_value, p_value], index=['f_value', 'p_value'])
df.groupby('c').apply(get_ols_fp, x='x', y = 'y')
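As far as I know, recent pandas releases warn when `apply` operates on the grouping column itself. A small variation that should sidestep this, shown as a self-contained sketch, is to select the columns the model needs before calling `apply`. The fixed seed and the `result` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatable output
n = 100
df = pd.DataFrame({
    'c': rng.choice(['CATS', 'DOGS'], n),
    'x': rng.choice(list('ABCDE'), n),
    'y': rng.normal(5, 1, n),
})

def get_ols_fp(df, x=None, y=None):
    # Returning a labelled Series lets apply() expand it into named columns.
    model = ols(y + '~' + x, df).fit()
    return pd.Series([model.fvalue, model.f_pvalue], index=['f_value', 'p_value'])

# Select only the model columns so the grouping column 'c' is not passed to the function.
result = df.groupby('c')[['x', 'y']].apply(get_ols_fp, x='x', y='y')
print(result)
```

The result has one row per level of 'c' and the two columns 'f_value' and 'p_value', which matches the layout asked for in the question.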