Get a list of column names and coefficients from ols.params

Time: 2018-09-03 08:20:02

Tags: python pandas linear-regression least-squares

I am running OLS against two data frames:

gab = ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit() 

where

only_volume = data_p.iloc[:,0] # Only the first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] # All but the first column

When I try to extract something (for example the parameters or the p-values), I get output like this:

In [3]: gab.params
Out[3]: 
Intercept             2.687598e+06
all_but_volume[0]     5.500544e+01
all_but_volume[1]     2.696902e+02
all_but_volume[2]     3.389568e+04
all_but_volume[3]    -2.385838e+04
all_but_volume[4]     5.419860e+02
all_but_volume[5]     3.815161e+02
all_but_volume[6]    -2.281344e+04
all_but_volume[7]     1.794128e+04
...
all_but_volume[22]    1.374321e+00

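Since gab.params gives 23 values on the LHS and all_but_volume has 23 columns, I was hoping there is a way to get a list/zip of the parameters labelled with the column names instead of all_but_volume[i].

Something like: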

TMC     9.801195e+01
TAC     2.214464e+02
...

What I tried: dropping all_but_volume and using data_p.iloc[:, 1:data_p.shape[1]] directly in the formula.

That does not work:

...
data_p.iloc[:, 1:data_p.shape[1]][21]    2.918531e+04
data_p.iloc[:, 1:data_p.shape[1]][22]    1.395342e+00

Edit: sample data:

data_p.iloc[1:5,:]
Out[31]: 
          Volume             A              B                  C  \
1  569886.171878    759.089217     272.446022           4.163908   
2  561695.886128    701.165406     330.301260           4.136530   
3  627221.486089    377.746089     656.838394           4.130720   
4  625181.750625    361.489041     670.575110           4.134467   

                          D         E        F      G      H     I  \
1                  1.000842  12993.06  3371.28  236.90  4.92  6.13   
2                  0.981514  13005.44  3378.69  236.94  4.92  6.13   
3                  0.836920  13017.22  3384.47  236.98  4.93  6.13   
4                  0.810541  13028.56  3388.85  237.01  4.94  6.13   

                          J               K       L       M           N  \
1      ...                0               0       0        0          0   
2      ...                0               0       0        0          0   
3      ...                0               0       0        0          0   
4      ...                0               0       0        0          0   

           O             P     Q             R   S  
1          0             0     0             1   9202.171648  
2          0             0     0             0   4381.373520  
3          0             0     0             0 -13982.443554  
4          0             0     0             0 -22878.843149

only_volume is the first column, "Volume"; all_but_volume is every column except "Volume".

1 Answer:

Answer 0: (score: 2)

You can use the DataFrame constructor or rename, because gab.params is a Series.

Sample:

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

np.random.seed(2018)  # fixed seed so the sample data is reproducible
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])
print (data_p)
     Volume         A         B         C         D
0  0.882349  0.104328  0.907009  0.306399  0.446409
1  0.589985  0.837111  0.697801  0.802803  0.107215
2  0.757093  0.999671  0.725931  0.141448  0.356721
3  0.942704  0.610162  0.227577  0.668732  0.692905
4  0.416863  0.171810  0.976891  0.330224  0.629044
5  0.160611  0.089953  0.970822  0.816578  0.571366
6  0.345853  0.403744  0.137383  0.900934  0.933936
7  0.047377  0.671507  0.034832  0.252691  0.557125
8  0.525823  0.352968  0.092983  0.304509  0.862430
9  0.716937  0.964071  0.539702  0.950540  0.667982

only_volume = data_p.iloc[:,0] # Only the first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] # All but the first column
gab = sm.ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit() 
print (gab.params)
Intercept            0.077570
all_but_volume[0]    0.395072
all_but_volume[1]    0.313150
all_but_volume[2]   -0.100752
all_but_volume[3]    0.247532
dtype: float64

print (type(gab.params))
<class 'pandas.core.series.Series'>

df = pd.DataFrame({'cols':data_p.columns[1:], 'par': gab.params.values[1:]})
print (df)
  cols       par
0    A  0.395072
1    B  0.313150
2    C -0.100752
3    D  0.247532
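
The question also mentions p-values. As a minimal sketch, the same positional relabelling can be applied to gab.pvalues, which is a Series with the same index as gab.params. This assumes the intercept comes first and the remaining terms follow the column order of data_p, as in the output above. The names `labels` and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(2018)
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume', 'A', 'B', 'C', 'D'])

only_volume = data_p.iloc[:, 0]      # response: first column
all_but_volume = data_p.iloc[:, 1:]  # predictors: every other column
gab = smf.ols(formula='only_volume ~ all_but_volume', data=data_p).fit()

# Skip position 0 (the intercept) and label the rest by column name.
# Assumption: term order matches the column order of data_p.
labels = data_p.columns[1:]
summary = pd.DataFrame(
    {'coef': gab.params.values[1:], 'pvalue': gab.pvalues.values[1:]},
    index=labels,
    columns=['coef', 'pvalue'],
)
print(summary)
```

Keeping both values in one frame indexed by column name makes it easy to filter, for example `summary[summary['pvalue'] < 0.05]`.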

If you want a Series instead (note that this relabels Intercept as Volume, the first column name):

s = gab.params.rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
Volume    0.077570
A         0.395072
B         0.313150
C        -0.100752
D         0.247532
dtype: float64

Series without the first value (the intercept):

s = gab.params.iloc[1:].rename(dict(zip(gab.params.index, data_p.columns)))
print (s)

A    0.395072
B    0.313150
C   -0.100752
D    0.247532
dtype: float64
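
An alternative that avoids renaming altogether, sketched here under an assumption the question does not state: if every column name is a valid Python identifier (as Volume, A, B, C, D are), the formula can name each column explicitly, and the fitted parameters are then labelled by column name. The `smf` alias and the `response`, `predictors`, and `formula` names are illustrative choices.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(2018)
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume', 'A', 'B', 'C', 'D'])

response = data_p.columns[0]     # 'Volume'
predictors = data_p.columns[1:]  # 'A', 'B', 'C', 'D'

# Build 'Volume ~ A + B + C + D' from the column names.
formula = '{} ~ {}'.format(response, ' + '.join(predictors))
gab = smf.ols(formula=formula, data=data_p).fit()

named_params = gab.params  # index: Intercept, then one entry per column name
print(named_params)
```

Column names that contain spaces or other characters not allowed in identifiers would need patsy's `Q("...")` quoting inside the formula.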