Question

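I am working on producing predicted response probabilities (binary: yes or no (1, 0)) for 60,000 dispute claims, keyed by their unique reference IDs. Using the first 3/4 of the data as the training set (X_train, y_train) and logistic regression as the classifier to predict the response probability for the last 1/4 as the test set (X_test), I would like the output to be a 60,000-entry indexed Series that looks like this: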

reference_id
   184932    0.531842
   185362    0.401958
   185361    0.105928
   185338    0.018572
             ...
   276499    0.208567
   276500    0.818759
   269851    0.018528
   Name: response, dtype: float32

I implemented the following Python code:

y_score_lr = LogisticRegression(C=10).fit(X_train, y_train).predict_proba(X_test)[:,1]
y_proba = y_score_lr

The result looks like this:

array([ 0.05225495,  0.00522493,  0.07369773, ...,  0.06994582, 0.06995239,  0.12659022])

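This is a plain NumPy array. I am not sure whether it actually lines up with the corresponding reference_id values in the original X_test DataFrame, and I have not figured out how to convert it into an indexed "Series" like the one shown at the start of this post.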

If anyone could point me to a useful shortcut for achieving this, I would be very grateful.

I have also tried using

y_score_lr = LogisticRegression(C=10).fit(X_train, y_train).predict_proba(X_test)[:,1]
y_proba = y_score_lr.tolist()

to convert the array into a list, but I still could not turn it into the desired Series-type output with 'reference_id' as the index.

Thank you.

Best regards

Answer 1

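First, yes, the values match: the first value in the y_proba array corresponds to the first row of X_test, the second value to the second row, and so on.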
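The ordering claim above can be checked empirically. The following is a minimal sketch on synthetic data (the arrays, sizes, and seed are made up for illustration and are not the asker's data): it compares the probabilities from one batch call to predict_proba with those obtained by scoring each test row on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# First 3/4 for training, last 1/4 for testing, as in the question.
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

clf = LogisticRegression(C=10).fit(X_train, y_train)

# One call on the whole test set ...
batch = clf.predict_proba(X_test)[:, 1]

# ... versus scoring each row separately, in the same order.
row_by_row = np.array(
    [clf.predict_proba(X_test[i:i + 1])[0, 1] for i in range(len(X_test))]
)

# True when position i of the batch output belongs to row i of X_test.
order_preserved = np.allclose(batch, row_by_row)
print(order_preserved)
```

If the two arrays agree, position i of the output belongs to row i of X_test, so it is safe to attach the test set's identifiers to the array positionally.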

Second, there are several ways to solve this.

Assuming you need a pandas.Series, one possible solution is the following:

import pandas as pd
import numpy as np

y_proba_indexed = pd.Series(
    data=y_proba,
    index=X_test['reference_id'],
    name='response',
    dtype=np.float32)
print(y_proba_indexed)

This will give you something like this:

184932    0.531842
185362    0.401958
185361    0.105928
185338    0.018572
      ....
276499    0.208567
276500    0.818759
269851    0.018528
Name: response, dtype: float32

To access the probability for a particular reference, for example reference_id = 185338, you can type:

y_proba_indexed.loc[[185338]]

and the output will be:

185338    0.018572
Name: response, dtype: float32
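For readers who want to try the whole flow, here is a self-contained sketch of the approach described above. It runs on a small synthetic DataFrame: the column names feature_a and feature_b, the ID range, and the seed are hypothetical, and leaving reference_id out of the feature columns is a choice made for this sketch, not something stated in the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the claims data (hypothetical columns and IDs).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "reference_id": rng.permutation(np.arange(100_000, 100_000 + n)),
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
noise = rng.normal(scale=0.5, size=n)
df["response"] = (df["feature_a"] - df["feature_b"] + noise > 0).astype(int)

# First 3/4 of the rows for training, last 1/4 for testing.
split = n * 3 // 4
train, test = df.iloc[:split], df.iloc[split:]
feature_cols = ["feature_a", "feature_b"]  # reference_id is not a feature here

clf = LogisticRegression(C=10).fit(train[feature_cols], train["response"])
y_proba = clf.predict_proba(test[feature_cols])[:, 1]

# y_proba is a plain ndarray, so the values are attached by position:
# element i gets the reference_id of test row i.
y_proba_indexed = pd.Series(
    data=y_proba,
    index=test["reference_id"],
    name="response",
    dtype=np.float32,
)
print(y_proba_indexed.head())

# Look up a single claim by its reference ID.
some_id = test["reference_id"].iloc[0]
print(y_proba_indexed.loc[[some_id]])
```

If reference_id is the DataFrame's index rather than a column, passing index=X_test.index should give the same kind of result.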
