Question

I am building an OLS model, but I cannot get any predictions out of it.

Could you explain what I'm doing wrong?

Building the model:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

Prediction:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')

df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

It then shows: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)

What am I doing wrong?

Answer 1

Here is the prediction part of the code, fixed as described in my comment:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

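The main problem is that the training set X1 and x_new contain different numbers of dummy columns. Below, I add the missing dummy columns and fill them with zeros: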

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

Now x_new has the same number of columns as the training dataset X1:

               const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
Client Number                                 ...                            
11                 0       0       0       0  ...            0     1   0    1
12                 0       0       0       0  ...            0     1   0    1
13                 0       1       0       0  ...            0     1   1    0

[3 rows x 11 columns]

Finally, use the previously trained model reg to predict on the new dataset x_new:

reg.predict(x_new)

Result:

Client Number
11     35.956284
12     35.956284
13    135.956284
dtype: float64

APPENDIX

As requested, I attach fully reproducible code below to test the training and prediction tasks:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

###
d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
x_new = df1.drop(columns='Total')

x_new = x_new.reindex(columns = X1.columns, fill_value=0)

reg.predict(x_new)

Answer 2

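The biggest problem is that you are not applying the same dummy transformation: some of the values present in the training data are missing from df1. You can add the missing values/columns with the following code (from here):

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')
df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
print(df1.shape)  # Shape is 3x6 but it has to be 3x11

# Get the columns of the training set that are missing from the test set
missing_cols = set(df.columns) - set(df1.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    df1[c] = 0
# Ensure the columns of the test set are in the same order as in the training set
df1 = df1[df.columns]
print(df1.shape)  # Shape is 3x11

In addition, you have mixed up x_new and y_new. So it should be:

# reg was fitted on X1 (const + 10 dummies), so the constant column is added here as well;
# has_constant='add' is needed because the all-ones Visa column would otherwise be taken for a constant
x_new = sm.add_constant(df1.drop(['Total'], axis=1), has_constant='add').values
y_new = df1['Total'].values
mod = sm.OLS(y_new, x_new)
mod.predict(reg.params)

Note that I used df1.drop(['Total'], axis=1) instead of listing the columns by name, because it is more convenient: 1) it is less prone to (typing) errors and 2) it is less code.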
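
To see why the swap produces the error reported in the question: predicting from a linear model roughly amounts to a matrix product of the design matrix with the parameter vector. With the Total column (3 rows, 1 column) passed as the design matrix and 11 fitted parameters, the inner dimensions cannot match. The sketch below reproduces this with plain NumPy; the arrays are stand-ins chosen for illustration, not the actual data.

```python
import numpy as np

exog = np.zeros((3, 1))   # stand-in for df1['Total'] passed as the design matrix
params = np.ones(11)      # stand-in for reg.params (const + 10 dummies)

try:
    np.dot(exog, params)  # inner dimensions 1 and 11 do not match
    raised = False
except ValueError as err:
    raised = True
    print(err)

# A design matrix with one column per fitted parameter works.
aligned = np.zeros((3, 11))
print(np.dot(aligned, params).shape)
```

So the design matrix used for prediction needs exactly as many columns as there are fitted parameters, in the same order.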

Answer 3

First, you need to either string-index all the words or one-hot encode the values. ML models do not accept words, only numbers. Next, you want X and y to be:

X = d.iloc[:,:-1]
y = d.iloc[:,-1]

This way X has shape [11,3] and y has shape [11,], which are the appropriate shapes.
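
As a rough illustration of the two steps combined, the sketch below splits the question's data frame as described and then one-hot encodes X. The name X_encoded and the dtype=float argument are additions made for this example; note that the encoded matrix has one column per category rather than 3.

```python
import pandas as pd

d = pd.DataFrame({
    'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'New York',
             'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
    'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
             'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
    'Colateral': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes',
                  'No', 'Yes'],
    'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Total': [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
}).set_index('Client Number')

X = d.iloc[:, :-1]   # City, Card, Colateral (still strings at this point)
y = d.iloc[:, -1]    # Total
print(X.shape, y.shape)

# One-hot encode the string columns so the model receives only numbers.
X_encoded = pd.get_dummies(X, prefix='', prefix_sep='', dtype=float)
print(X_encoded.shape)
```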
