from patsy import *
from pandas import *
dta = DataFrame([["lo", 1],["hi", 2.4],["lo", 1.2],["lo", 1.4],["very_high",1.8]], columns=["carbs", "score"])
dmatrix("carbs + score", dta)
DesignMatrix with shape (5, 4)
Intercept carbs[T.lo] carbs[T.very_high] score
1 1 0 1.0
1 0 0 2.4
1 1 0 1.2
1 1 0 1.4
1 0 1 1.8
Terms:
'Intercept' (column 0), 'carbs' (columns 1:3), 'score' (column 3)
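Question: Instead of specifying the DesignInfo column names by hand (which makes my code much less reusable), can I read the names that this DesignMatrix produces, so that I can pass them to a DataFrame later without knowing in advance what the reference level (control group) is?
I.e., when I do dmatrix("C(carbs, Treatment(reference='lo')) + score", dta)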
"""
# How can I get something like this with dmatrix's output without hardcoding ?
names = obtained from dmatrix's output above
This should give names = ['Intercept' ,'carbs[T.lo]', 'carbs[T.very_high]', 'score']
"""
g=DataFrame(dmatrix("carbs + score", dta),columns=names)
Intercept carbs[T.lo] carbs[T.very_high] score
0 1 2 3
0 1 1 0 1.0
1 1 0 0 2.4
2 1 1 0 1.2
3 1 1 0 1.4
4 1 0 1 1.8
type(g)=<class 'pandas.core.frame.DataFrame'>
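So g would be the transformed DataFrame on which I can do logistic modeling, without having to keep a record of (or hardcode) the column names and their reference levels.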
Answer 0: (score: 3)
I think the information you are looking for is in design_info.column_names:
>>> dm = dmatrix("carbs + score", dta)
>>> dm.design_info
DesignInfo(['Intercept', 'carbs[T.lo]', 'carbs[T.very_high]', 'score'],
term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('carbs')]), slice(1, 3, None)), (Term([EvalFactor('score')]), slice(3, 4, None))]),
builder=<patsy.build.DesignMatrixBuilder at 0xb03f8cc>)
>>> dm.design_info.column_names
['Intercept', 'carbs[T.lo]', 'carbs[T.very_high]', 'score']
So:
>>> DataFrame(dm, columns=dm.design_info.column_names)
Intercept carbs[T.lo] carbs[T.very_high] score
0 1 1 0 1.0
1 1 0 0 2.4
2 1 1 0 1.2
3 1 1 0 1.4
4 1 0 1 1.8
[5 rows x 4 columns]
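To make this reusable for any formula, the approach above could be wrapped in a small helper. The following is only a sketch: `design_frame` is a hypothetical name, not part of patsy, and it assumes that no rows are dropped for missing values. It also shows `return_type="dataframe"`, a `dmatrix` option that returns a labelled DataFrame directly.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

dta = pd.DataFrame(
    [["lo", 1], ["hi", 2.4], ["lo", 1.2], ["lo", 1.4], ["very_high", 1.8]],
    columns=["carbs", "score"],
)


def design_frame(formula, data):
    """Hypothetical helper: build a design matrix and label its columns
    with the names patsy generated, so nothing is hardcoded."""
    dm = dmatrix(formula, data)
    # column_names reflects whatever reference level the formula selected.
    return pd.DataFrame(np.asarray(dm), columns=dm.design_info.column_names)


g = design_frame("carbs + score", dta)
print(list(g.columns))

# Changing the reference level changes the names, with no code changes here.
g_lo = design_frame("C(carbs, Treatment(reference='lo')) + score", dta)
print(list(g_lo.columns))

# Alternative: ask patsy for a DataFrame directly.
g_df = dmatrix("carbs + score", dta, return_type="dataframe")
print(list(g_df.columns))
```

With `return_type="dataframe"` the result should still carry a `design_info` attribute, so the term-to-column mapping remains available if it is needed later.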