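Merging the results of model.predict() with the original pandas DataFrame?

Time: 2016-11-21 20:52:51

Tags: python pandas scikit-learn

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.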

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.asarray(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I tried this:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

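I know I could split df into train_df and test_df and the problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the corresponding rows in my df, given that the y_hats array is zero-indexed and all information about which rows ended up in X_test and y_test seems to be lost? Or am I relegated to splitting the dataframe into train and test first and then building the feature matrices? I would like to simply fill the rows that were in the training set with np.nan values in the dataframe.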

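8 answers:

Answer 0: (score: 12)

Your y_hats has only the length of the test data (20%) because you predicted on X_test. Once your model is validated and you are happy with the test predictions (by checking the model's accuracy on the X_test predictions against the y_test true values), you should rerun the prediction on the full dataset (X). Add these two lines at the bottom: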

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
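
The length mismatch behind the error can be reproduced in isolation. This is only a small sketch with made-up data: the ten-row frame and the two-element array are stand-ins for df and for predictions made on a 20% test split, not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a 10-row frame and predictions for only 2 rows
df = pd.DataFrame({'feature': np.arange(10.0)})
y_hats = np.zeros(2)

try:
    # Fails: pandas cannot align 2 values with a 10-row index
    df['y_hats'] = y_hats
except ValueError as err:
    message = str(err)
    print(message)

# Predictions for every row have the same length as df, so this succeeds
df['y_hats'] = np.zeros(len(df))
```

Assigning a bare array is purely positional, which is why its length has to equal the number of rows in the frame.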
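
Edit: based on your comment, here is an updated version that returns the dataset with the predictions attached to the rows that were in the test set.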

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
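
If X and y have to remain plain NumPy arrays, as in the question, one possible variant is to pass the DataFrame's index through train_test_split as an extra array, so the row labels of the test split survive. The following is a minimal sketch under that assumption; the names idx_train and idx_test are illustrative and not from the original post, and random_state is fixed only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Plain NumPy arrays, as in the question
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# train_test_split accepts any number of equal-length arrays and applies
# the same shuffle to all of them, so the row labels travel with the rows.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_hats = model.predict(X_test)

# Write predictions only into the test rows; training rows stay NaN
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = y_hats
```

This keeps any whole-matrix preprocessing (such as normalizing before the split) untouched, since only the bookkeeping of row labels is added.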

Answer 1: (score: 0)

You could create a new dataframe and add the test data together with the predicted values to it:

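# build a new frame from the test features, then attach the predictions
data = pd.DataFrame(X_test)
data['y_hats'] = y_hats
data.to_csv('data1.csv')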

Answer 2: (score: 0)

I had almost the same problem.

I solved it this way:

...
.
.
.
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

y_hats  = pd.DataFrame(y_hats)

df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]


# y_hats is now a DataFrame with a fresh 0..n-1 index, so assign by position
y_test['preds'] = y_hats[0].to_numpy()

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

Answer 3: (score: 0)

You can create a y_hats dataframe that copies the index from X_test, then merge it with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

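Note that the left join will include the training-data rows. Omitting the "how" parameter (which defaults to an inner join) will produce only the test-data rows.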
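
The difference between the two join types can be seen on a toy example. This is only a sketch with hypothetical data: the five-row df and the two predictions at index labels 3 and 1 are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the original frame and the test-set predictions
df = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0]})
y_hats_df = pd.DataFrame({'y_hats': [0, 1]}, index=[3, 1])  # test rows 3 and 1

# Left join: every row of df is kept; rows outside the test set get NaN
left = pd.merge(df, y_hats_df, how='left', left_index=True, right_index=True)

# Default (inner) join: only rows whose index appears in both frames remain
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)

print(left)
print(inner)
```

Because the merge matches on index labels rather than on position, the shuffled order of the test rows does not matter.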

Answer 4: (score: 0)

Try this:


Answer 5: (score: 0)

predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'], 
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True, 
                 right_index=True)

Answer 6: (score: 0)

This worked well for me. It maintains the index positions.


Answer 7: (score: -1)

You can also use

y_hats = model.predict(X)

# model.predict returns a NumPy array with one value per row of X, so assign it directly
df['y_hats'] = y_hats