Question

I am using statsmodels to create some regression output:

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
import numpy as np 
import pandas as pd 

x1 = pd.Series(np.random.randn(2000))
x2 = pd.Series(np.random.randn(2000))
aa_milne_arr = ['a', 'b', 'c', 'd', "e", "f", "g", "h", "i"]
dummy = pd.Series(np.random.choice(aa_milne_arr, 2000,))
depen = pd.Series(np.random.randn(2000))
df = pd.DataFrame({"y": depen, "x1": x1, "x2": x2, "dummy": dummy})
df['const'] = 1
df['xsqr'] = df['x1']**2  
mod = smf.ols('y ~ x1 + x2 + dummy', data=df)
mod2 = smf.ols('y ~ x1 + x2 + xsqr + dummy', data=df)
res = mod.fit()
res2 = mod2.fit()

print (summary_col([res,res2],stars=True,float_format='%0.3f',
                  model_names=['one\n(0)','two\n(1)'],
                  info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.2f}".format(x.rsquared)}))

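It works fine, but I have a large dataset with many dummy variables (far more than in the example). I would therefore like to exclude the dummy variables from the summary output, but not from the regression itself. Is that possible somehow?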

Answer 1

A quick and dirty way is to first find the indexes of the dummy rows in the final summary_col output, and then avoid printing them:

summary = summary_col(
    [res,res2],stars=True,float_format='%0.3f',
    model_names=['one\n(0)','two\n(1)'],
    info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
    'R2':lambda x: "{:.2f}".format(x.rsquared)})

# As string
# summary_str = str(summary).split('\n')
# LaTeX format
summary_str = summary.as_latex().split('\n')

# Find dummy indexes
dummy_idx = []
for i, li in enumerate(summary_str):
    if li.startswith('dummy'):
        dummy_idx.append(i)
        dummy_idx.append(i + 1)

# Print summary avoiding dummy indexes
for i, li in enumerate(summary_str):
    if i not in dummy_idx:
        print(li)

It's not pretty, but it works. With the string format:

==========================
             one     two  
             (0)     (1)  
--------------------------
Intercept  0.029   -0.000 
           (0.065) (0.068)
x1         0.023   0.025  
           (0.022) (0.022)
x2         -0.014  -0.014 
           (0.022) (0.022)
xsqr               0.024  
                   (0.016)
N          2000    2000   
R2         0.00    0.00   
==========================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

With the LaTeX format:

\begin{table}
\caption{}
\begin{center}
\begin{tabular}{lcc}
\hline
           &   one   &   two    \\
           &   (0)   &   (1)    \\
\hline
\hline
\end{tabular}
\begin{tabular}{lll}
Intercept  & 0.070   & 0.067    \\
           & (0.069) & (0.071)  \\
x1         & 0.001   & 0.001    \\
           & (0.022) & (0.022)  \\
x2         & -0.024  & -0.025   \\
           & (0.022) & (0.022)  \\
xsqr       &         & 0.003    \\
           &         & (0.015)  \\
N          & 2000    & 2000     \\
R2         & 0.01    & 0.01     \\
\hline
\end{tabular}
\end{center}
\end{table}
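
The filtering loop above can also be wrapped in a small reusable function. This is only a sketch: `drop_rows` is a hypothetical helper name, not part of statsmodels, and it assumes each coefficient row is followed by exactly one standard-error row, as in the summary_col output shown here. The sample table and its numbers are made up for illustration.

```python
def drop_rows(table_text, prefix):
    """Return table_text without rows whose label starts with prefix,
    and without the standard-error row that follows each of them."""
    kept = []
    skip_next = False
    for line in table_text.split("\n"):
        if skip_next:
            # This is the standard-error row of a dropped coefficient.
            skip_next = False
            continue
        if line.startswith(prefix):
            # Drop the coefficient row and flag the next row for removal.
            skip_next = True
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up rows that mimic the layout of a summary_col table.
sample = "\n".join([
    "Intercept   0.029",
    "            (0.065)",
    "dummy[T.b]  0.011",
    "            (0.093)",
    "x1          0.023",
    "            (0.022)",
])

filtered = drop_rows(sample, "dummy")
print(filtered)
```

With a real summary, the same call would be `drop_rows(str(summary), 'dummy')` or `drop_rows(summary.as_latex(), 'dummy')`.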

Answer 2

I would use the regressor_order argument of summary_col, which lets you specify which regressors are shown first (and omits all other regressors entirely if you also specify drop_omitted=True).

Example:

drop_omitted=True
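
A fuller sketch of this approach might look like the following. It rebuilds a smaller version of the question's data with a fixed seed (the seed, the sample size, and the `keep` list are assumptions made for illustration) and passes both `regressor_order` and `drop_omitted=True` to summary_col. Note that `Intercept` has to be listed explicitly, since every regressor that is not named in `regressor_order` is dropped.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Small synthetic dataset in the spirit of the question (assumed values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "y": rng.standard_normal(n),
    "x1": rng.standard_normal(n),
    "x2": rng.standard_normal(n),
    "dummy": rng.choice(list("abc"), n),
})
df["xsqr"] = df["x1"] ** 2

res = smf.ols("y ~ x1 + x2 + dummy", data=df).fit()
res2 = smf.ols("y ~ x1 + x2 + xsqr + dummy", data=df).fit()

# Regressors to show, in display order; everything else is omitted.
keep = ["Intercept", "x1", "x2", "xsqr"]
summary = summary_col([res, res2], stars=True, float_format="%0.3f",
                      model_names=["one", "two"],
                      regressor_order=keep, drop_omitted=True)
print(summary)
```

The dummy variables remain in both fitted models; only the printed table leaves them out.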
