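Unable to make predictions from an OLS model

Date: 2020-08-19 08:35:27

Tags: python pandas dataframe linear-regression statsmodels

I am building an OLS model, but I cannot make any predictions with it.

Can you explain what I am doing wrong?

Building the model: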

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

Prediction:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')

df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

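It then shows: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)

What am I doing wrong?

3 Answers:

Answer 0 (score: 1)

Here is the prediction part of the code, fixed as described in my comment: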

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

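The main problem is that the training dataset X1 and the new dataset x_new contain a different number of dummy columns. Below, I add the missing dummy columns and fill them with zeros: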

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

Now x_new has the same number of columns as the training dataset X1:

               const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
Client Number                                 ...                            
11                 0       0       0       0  ...            0     1   0    1
12                 0       0       0       0  ...            0     1   0    1
13                 0       1       0       0  ...            0     1   1    0

[3 rows x 11 columns]

Finally, use the previously trained model reg to predict on the new dataset x_new:

reg.predict(x_new)

Result:

Client Number
11     35.956284
12     35.956284
13    135.956284
dtype: float64
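A note on the const column in the table above: reindex creates it with fill_value=0, whereas sm.add_constant set it to 1 during training, so the fitted intercept does not contribute to these predictions. The following is only a sketch, assuming you want the intercept applied to the new rows as well. train_columns is a hypothetical stand-in for X1.columns, and dtype=float is my addition so the dummies are numeric regardless of the pandas version (since pandas 2.0, get_dummies returns boolean columns by default).

```python
import pandas as pd

# Hypothetical stand-in for X1.columns from the training step
train_columns = ['const', 'Lisbon', 'London', 'Madrid', 'New York', 'Tokyo',
                 'Bitcoin', 'Master Card', 'Visa', 'No', 'Yes']

# New observations, without the placeholder 'Total' column
new = pd.DataFrame({'City': ['Tokyo', 'Tokyo', 'Lisbon'],
                    'Card': ['Visa', 'Visa', 'Visa'],
                    'Colateral': ['Yes', 'Yes', 'No']},
                   index=pd.Index([11, 12, 13], name='Client Number'))

# One-hot encode, then align to the training columns (missing dummies become 0)
x_new = pd.get_dummies(new, prefix='', prefix_sep='', dtype=float)
x_new = x_new.reindex(columns=train_columns, fill_value=0.0)

# Assumption: the intercept column should be 1 for every row, as in training
x_new['const'] = 1.0

print(x_new)
```

Passing this x_new to reg.predict would then include the fitted intercept, so the numbers may differ from the result shown above.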

APPENDIX

As requested, I attach below fully reproducible code to test the training and prediction tasks:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

reg.predict(x_new)

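Answer 1 (score: 0)

The biggest problem is that you are not applying the same dummy transformation: some category values are missing from df1. You can add the missing columns with the following code (from a linked source):

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape)  # Shape is 3x6 but it has to be 3x11

# Columns present in the training set but missing from the new data
missing_cols = set( df.columns ) - set( df1.columns )
# Add each missing column to the new data with a default value of 0
for c in missing_cols:
    df1[c] = 0
# Ensure the columns of the new data are in the same order as in the training set
df1 = df1[df.columns]
print(df1.shape)  # Shape is 3x11

In addition, you mixed up x_new and y_new. So it should be:

# has_constant='add' forces the intercept column so that x_new lines up
# with reg.params, which includes 'const' (editorial fix, not in the original answer)
x_new = sm.add_constant(df1.drop(['Total'], axis=1), has_constant='add').values
y_new = df1['Total'].values
mod = sm.OLS(y_new, x_new)
mod.predict(reg.params)

Note that I used df1.drop(['Total'], axis=1) instead of selecting the columns by name, because it is more convenient: (1) it is less prone to typing errors and (2) it takes less code.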
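To see why the swap produces the error reported in the question: there, the exog passed to sm.OLS was the single Total column (3 rows, 1 column), while reg.params holds 11 coefficients, and a prediction is essentially a dot product of the two. A minimal numpy sketch of that mismatch, using zero-filled placeholder arrays rather than the real data:

```python
import numpy as np

# Placeholder arrays with the same shapes as in the question (values are dummies)
params = np.zeros(11)             # 11 fitted coefficients: const + 10 dummies
exog_swapped = np.zeros((3, 1))   # what the question passed: only 'Total'
exog_aligned = np.zeros((3, 11))  # what prediction needs: one column per coefficient

try:
    np.dot(exog_swapped, params)
    error_message = None
except ValueError as exc:
    # numpy refuses because the inner dimensions (1 and 11) differ
    error_message = str(exc)

predictions = np.dot(exog_aligned, params)

print(error_message)
print(predictions.shape)
```

Once the design matrix has one column per coefficient, the product yields one prediction per row.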

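Answer 2 (score: 0)

First, you need to either string-index all the words or one-hot encode the values. ML models do not accept words, only numbers. Next, you want X and y to be: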

X = d.iloc[:,:-1]
y = d.iloc[:,-1]

This way, X has shape [11,3] and y has shape [11,], which are the appropriate shapes.
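The two steps above can be sketched together as follows. This is only an illustration using the data from the question; X_encoded is a name I introduce here, and dtype=float is an assumption added so the encoded columns are numeric (recent pandas versions return booleans by default).

```python
import pandas as pd

d = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'New York',
             'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
    'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
             'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
    'Colateral': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Total': [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
}).set_index('Client Number')

# Split features and target by position: every column except the last, then the last
X = d.iloc[:, :-1]
y = d.iloc[:, -1]

# X still holds strings, so one-hot encode it before fitting a model
X_encoded = pd.get_dummies(X, prefix='', prefix_sep='', dtype=float)

print(X.shape, X_encoded.shape, y.shape)  # (11, 3) (11, 10) (11,)
```

Note that the 3 raw columns expand to 10 numeric columns after encoding (5 cities, 3 cards, 2 collateral values), and it is the encoded frame that a model can consume.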