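How do I predict on a single row of a DataFrame with an xgboost model?

Time: 2020-09-17 15:37:19

Tags: python pandas xgboost

I am fitting an xgboost model to some data stored in a DataFrame. After fitting, I want to run the classifier's/regressor's .predict method on a single row of the DataFrame.

Below is a minimal example that predicts fine on the whole DataFrame but crashes when run on only the second row.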

from sklearn.datasets import load_iris
import xgboost

# Load iris data such that X is a dataframe
X, y = load_iris(return_X_y=True, as_frame=True)

clf = xgboost.XGBClassifier()
clf.fit(X, y)

# Predict for all rows - works fine
y_pred = clf.predict(X)

# Predict for single row. Crashes.
# Error: '('Expecting 2 dimensional numpy.ndarray, got: ', (4,))'
secondrow = X.iloc[1]
secondpred = clf.predict(secondrow)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-a06c6820c458> in <module>
     11 # Error: '('Expecting 2 dimensional numpy.ndarray, got: ', (4,))'
     12 secondrow = X.iloc[1]
---> 13 secondpred = clf.predict(secondrow)

e:\Anaconda3\envs\py37\lib\site-packages\xgboost\sklearn.py in predict(self, data, output_margin, ntree_limit, validate_features)
    789                                                  output_margin=output_margin,
    790                                                  ntree_limit=ntree_limit,
--> 791                                                  validate_features=validate_features)
    792         if output_margin:
    793             # If output_margin is active, simply return the scores

e:\Anaconda3\envs\py37\lib\site-packages\xgboost\core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
   1282 
   1283         if validate_features:
-> 1284             self._validate_features(data)
   1285 
   1286         length = c_bst_ulong()

e:\Anaconda3\envs\py37\lib\site-packages\xgboost\core.py in _validate_features(self, data)
   1688 
   1689                 raise ValueError(msg.format(self.feature_names,
-> 1690                                             data.feature_names))
   1691 
   1692     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] ['f0', 'f1', 'f2', 'f3']
expected petal length (cm), petal width (cm), sepal length (cm), sepal width (cm) in input data
training data did not have the following fields: f1, f3, f0, f2

1 answer:

Answer 0 (score: 1)

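  • predict expects an array whose shape matches the data the model was fit on.
  • The problem is that secondrow is a 1-D pandas.Series, which does not match the 2-D shape the model expects.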
X.iloc[1]

sepal length (cm)    4.9
sepal width (cm)     3.0
petal length (cm)    1.4
petal width (cm)     0.2
Name: 1, dtype: float64

# look at the array
X.iloc[1].values

array([4.9, 3. , 1.4, 0.2])  # note this is a 1-d array

# look at the shape
secondrow.values.shape

(4,)
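  • If you pass data of the correct shape (a 2-D array), you can predict on a single row.
  • Convert the Series selection to a DataFrame and transpose it to get the right shape for .predict.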
import pandas as pd  # needed for pd.DataFrame; not imported in the question's code
secondrow = pd.DataFrame(X.iloc[1]).T

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
1                4.9               3.0                1.4               0.2

# look at secondrow as an array
secondrow.values

array([[4.9, 3. , 1.4, 0.2]])  # note this is a 2-d array

# look at the shape
secondrow.values.shape

(1, 4)

# predict
secondpred = clf.predict(secondrow)

# result
array([0])
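
As a possible alternative to the transpose, pandas can return a one-row DataFrame directly when you index with a list or a slice instead of a scalar. The sketch below only compares shapes. The small X built here is a hypothetical stand-in for the iris frame from the question, so that the snippet runs without scikit-learn or xgboost.

```python
import pandas as pd

# Hypothetical stand-in for the iris feature frame used in the question.
X = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9, 4.7],
        "sepal width (cm)": [3.5, 3.0, 3.2],
        "petal length (cm)": [1.4, 1.4, 1.3],
        "petal width (cm)": [0.2, 0.2, 0.2],
    }
)

# A scalar position returns a 1-D Series.
as_series = X.iloc[1]

# A list of positions (or a slice) returns a DataFrame with one row,
# keeping the column names and column dtypes.
row_by_list = X.iloc[[1]]
row_by_slice = X.iloc[1:2]

# Series.to_frame().T gives the same result as pd.DataFrame(series).T.
row_by_transpose = as_series.to_frame().T

print(as_series.shape)         # (4,)
print(row_by_list.shape)       # (1, 4)
print(row_by_slice.shape)      # (1, 4)
print(row_by_transpose.shape)  # (1, 4)
```

With the fitted clf from the question, clf.predict(X.iloc[[1]]) should then accept the row the same way it accepts the transposed frame, since both are 2-D and carry the training column names. One difference to be aware of: if the columns had mixed dtypes, the Series-then-transpose route would produce object columns, whereas list or slice indexing keeps each column's dtype.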