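I need help understanding how to run OLS (or any machine learning) in Python. I have installed all the relevant packages, i.e. pandas, numpy, statsmodels, scipy, etc.
Here is my basic example: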
df3= DataFrame({'revenue':[5,7,4,5,3,6,4,7,4,8,3,4],'cost':[2,4,4,3,6,7,5,4,7,23,4,7], 'overhead':[3,4,5,6,4,3,4,5,4,3,4,5]})
df3
df3.loc[0,'cost'] = 4
df3
df3.loc[12]=[1,5,8]
df3
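OK, now that my DataFrame has additional rows, I don't want to copy the independent and dependent variables by hand and pass them to my regression formula, which is this:
OLS regression formula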
df3= pd.DataFrame({"cost":[#Numbers would go here], "overhead":[#Numbers would go here], "revenue": [#Numbers would go here]})
reg = ols (y=df3["cost"], x=df3[["overhead","revenue"]])
reg
print(df3.to_csv(columns=['cost'], sep='\t', index=False))
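So I used this to_csv call to pull individual columns out of the DataFrame so that I could copy them into Excel and then copy them back into my regression formula. But what if I want to use only Python, without copying and pasting back and forth between it and other software?
Is there any way to reference my "cost", "overhead" and "revenue" data in the OLS regression formula without explicitly typing out each individual number?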
Answer 0 (score: 0)
So, to answer my own question, here is what I did. I used the as_matrix function to convert my DataFrame to an array:
numpyMatrix = df3.as_matrix()
numpyMatrix
kk = np.array(numpyMatrix)
kk
Then I printed each column of data so I could copy and paste it into my own OLS formula:
kk[:,0]
kk[:,1]
kk[:,2]
df3 = pd.DataFrame({"cost":[4,4,4,3,6,7,5,4,7,23,4,7,1], "overhead":[3,4,5,6,4,3,4,5,4,3,4,5,5], "revenue":[5,7,4,5,3,6,4,7,4,8,3,4,8]})
reg = ols(y=df3["cost"], x=df3[["overhead","revenue"]])
reg
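On current pandas releases the same column extraction can be sketched without `as_matrix()`, which was deprecated and later removed; `DataFrame.to_numpy()` is its replacement. The snippet below is only a minimal sketch, assuming pandas 0.24 or newer. It selects columns by name rather than by position, because the column order of a DataFrame built from a dict differs between older and newer pandas versions.

```python
import pandas as pd

df3 = pd.DataFrame({
    "cost":     [4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5],
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8],
})

# Whole frame as a 2-D NumPy array (stands in for df3.as_matrix()).
kk = df3.to_numpy()

# One column at a time, selected by name instead of by position.
cost = df3["cost"].to_numpy()
overhead = df3["overhead"].to_numpy()
revenue = df3["revenue"].to_numpy()

print(kk.shape)
print(cost)
```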
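You can solve this by simply copying and pasting the array output into the OLS formula. I don't think this is the best way to solve the problem by a long shot, but at least I don't need to use any other software.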
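For what it's worth, the columns can also be referenced by name, so nothing has to be retyped or pasted. The sketch below swaps in the statsmodels formula interface (`statsmodels.formula.api.ols`) in place of `pandas.stats.api.ols`, which is no longer available in pandas since 0.20. That swap is an assumption about what is installed, not the method used above.

```python
import pandas as pd
import statsmodels.formula.api as smf

df3 = pd.DataFrame({
    "cost":     [4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5],
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8],
})

# The formula names DataFrame columns directly; statsmodels adds the
# intercept term automatically.
model = smf.ols("cost ~ overhead + revenue", data=df3).fit()

print(model.params)    # indexed by Intercept, overhead, revenue
print(model.rsquared)
```

Because the model reads from `df3`, any row added with `df3.loc[...]` is picked up simply by refitting, and `model.summary()` prints a full report.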
Thanks again, Stack Overflow users. I'd like to hear your opinions.
Here is the full printout. Sorry for any confusion, and thank you for your time.
import pandas as pd
from pandas import DataFrame
print ('Pandas Version:' + pd.__version__)
from pandas.stats.api import ols
import numpy as np
print ('numpy Version:' + np.__version__)
from numpy import array
from numpy import mean
from numpy import median
from numpy import std
from numpy import var
from numpy import amin
from numpy import amax
from numpy import nanmin
from numpy import nanmax
from numpy import ptp
from numpy import percentile
from numpy import average
from numpy import nanmean
from numpy import nanstd
from numpy import nanvar
from numpy import corrcoef
from numpy import correlate
from numpy import cov
from numpy import histogram
from numpy import histogram2d
from numpy import histogramdd
from numpy import bincount
from numpy import digitize
import collections
import math
import scipy.stats
import statsmodels.api as sm
from scipy import stats
import statsmodels as sm
import pylab as pl
from numpy.random import rand
from numpy.random import randn
from numpy.random import randint
from numpy.random import random_integers
from numpy.random import random_sample
from numpy.random import random
from numpy.random import ranf
from numpy.random import sample
from numpy.random import choice
from numpy.random import bytes
from numpy.random import shuffle
from numpy.random import permutation
from numpy.random import beta
from numpy.random import binomial
from numpy.random import chisquare
from numpy.random import dirichlet
from numpy.random import exponential
from numpy.random import f
from numpy.random import gamma
from numpy.random import geometric
from numpy.random import gumbel
from numpy.random import hypergeometric
from numpy.random import laplace
from numpy.random import logistic
from numpy.random import lognormal
from numpy.random import logseries
from numpy.random import multinomial
from numpy.random import multivariate_normal
from numpy.random import negative_binomial
from numpy.random import noncentral_chisquare
from numpy.random import noncentral_f
from numpy.random import normal
from numpy.random import pareto
from numpy.random import poisson
from numpy.random import power
from numpy.random import rayleigh
from numpy.random import standard_cauchy
from numpy.random import standard_exponential
from numpy.random import standard_gamma
from numpy.random import standard_normal
from numpy.random import standard_t
from numpy.random import triangular
from numpy.random import uniform
from numpy.random import vonmises
from numpy.random import wald
from numpy.random import weibull
from numpy.random import zipf
from numpy.random import RandomState
from numpy.random import seed
from numpy.random import get_state
from numpy.random import set_state
from __future__ import print_function
import numpy as np
import statsmodels.api as sm
from scipy import stats
from matplotlib import pyplot as plt
import statsmodels.api as sm
from numpy import array
from numpy import mean
from numpy import median
import collections
import math
from pandas.stats.api import ols
df3= DataFrame({'revenue':[5,7,4,5,3,6,4,7,4,8,3,4],'cost':[2,4,4,3,6,7,5,4,7,23,4,7], 'overhead':[3,4,5,6,4,3,4,5,4,3,4,5]})
df3
Out[62]:
cost overhead revenue
0 2 3 5
1 4 4 7
2 4 5 4
3 3 6 5
4 6 4 3
5 7 3 6
6 5 4 4
7 4 5 7
8 7 4 4
9 23 3 8
10 4 4 3
11 7 5 4
In [63]:
df3.loc[0,'cost'] = 4
df3
Out[63]:
cost overhead revenue
0 4 3 5
1 4 4 7
2 4 5 4
3 3 6 5
4 6 4 3
5 7 3 6
6 5 4 4
7 4 5 7
8 7 4 4
9 23 3 8
10 4 4 3
11 7 5 4
In [64]:
df3.loc[12]=[1,5,8]
df3
Out[64]:
cost overhead revenue
0 4 3 5
1 4 4 7
2 4 5 4
3 3 6 5
4 6 4 3
5 7 3 6
6 5 4 4
7 4 5 7
8 7 4 4
9 23 3 8
10 4 4 3
11 7 5 4
12 1 5 8
In [30]:
df3.iloc[:,[0]]
df3.iloc[:,[1]]
df3.iloc[:,[2]]
Out[30]:
revenue
0 5
1 7
2 4
3 5
4 3
5 6
6 4
7 7
8 4
9 8
10 3
11 4
12 8
In [72]:
In [70]:
jk=df3.iloc[:,[0]]
df3.ix[:,1]
Out[70]:
0 3
1 4
2 5
3 6
4 4
5 3
6 4
7 5
8 4
9 3
10 4
11 5
12 5
Name: overhead, dtype: int64
In [80]:
numpyMatrix = df3.as_matrix()
numpyMatrix
Out[80]:
array([[ 4, 3, 5],
[ 4, 4, 7],
[ 4, 5, 4],
[ 3, 6, 5],
[ 6, 4, 3],
[ 7, 3, 6],
[ 5, 4, 4],
[ 4, 5, 7],
[ 7, 4, 4],
[23, 3, 8],
[ 4, 4, 3],
[ 7, 5, 4],
[ 1, 5, 8]], dtype=int64)
In [75]:
print (df3.to_csv(columns=['cost'], sep='\t', index=False))
cost
4
4
4
3
6
7
5
4
7
23
4
7
1
In [94]:
kk=np.array(numpyMatrix)
kk
Out[94]:
array([[ 4, 3, 5],
[ 4, 4, 7],
[ 4, 5, 4],
[ 3, 6, 5],
[ 6, 4, 3],
[ 7, 3, 6],
[ 5, 4, 4],
[ 4, 5, 7],
[ 7, 4, 4],
[23, 3, 8],
[ 4, 4, 3],
[ 7, 5, 4],
[ 1, 5, 8]], dtype=int64)
In [100]:
kk[:,0]
Out[100]:
array([ 4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1], dtype=int64)
In [101]:
kk[:,1]
Out[101]:
array([3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5], dtype=int64)
In [102]:
kk[:,2]
Out[102]:
array([5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8], dtype=int64)
In [103]:
df3= pd.DataFrame({"cost":[4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1], "overhead":[3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5], "revenue": [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8]})
reg = ols (y=df3["cost"], x=df3[["overhead","revenue"]])
reg
Out[103]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <overhead> + <revenue> + <intercept>
Number of Observations: 13
Number of Degrees of Freedom: 3
R-squared: 0.3185
Adj R-squared: 0.1822
Rmse: 4.8625
F-stat (2, 10): 2.3363, p-value: 0.1470
Degrees of Freedom: model 2, resid 10
-----------------------Summary of Estimated Coefficients------------------------
Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
overhead    -2.8085     1.5201      -1.85     0.0944    -5.7878     0.1709
revenue      0.7575     0.7885       0.96     0.3594    -0.7880     2.3029
intercept   13.9969     8.0440       1.74     0.1125    -1.7694    29.7631
---------------------------------End of Summary---------------------------------
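As a cross-check that does not depend on any regression package, the coefficients in the summary above can be reproduced with plain least squares in NumPy. This is a sketch, assuming pandas 0.24 or newer for `to_numpy()`; the explicit column of ones plays the role of the intercept term.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "cost":     [4, 4, 4, 3, 6, 7, 5, 4, 7, 23, 4, 7, 1],
    "overhead": [3, 4, 5, 6, 4, 3, 4, 5, 4, 3, 4, 5, 5],
    "revenue":  [5, 7, 4, 5, 3, 6, 4, 7, 4, 8, 3, 4, 8],
})

# Dependent variable and the two regressors, taken straight from the frame.
y = df3["cost"].to_numpy(dtype=float)
X = df3[["overhead", "revenue"]].to_numpy(dtype=float)

# Append a column of ones so that the last coefficient is the intercept.
X = np.column_stack([X, np.ones(len(df3))])

# Solve min ||X b - y||^2; lstsq returns (coefficients, residuals, rank, singular values).
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(["overhead", "revenue", "intercept"], coef):
    print(f"{name:10s}{value:10.4f}")
```

The printed values should agree with the Coef column of the summary above to four decimal places.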