Question

I have generated a DataFrame of probabilities from a scikit-learn classifier, like this:

def preprocess_category_series(series, key):
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s<0] = mode
        return s
    else:
        return pd.get_dummies(series, drop_first=True, prefix=key)

data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])

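I now want to attach these probabilities back to my original DataFrame. However, the predictions DataFrame generated above preserves the row order of data but loses data's index. I assumed I would be able to do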

pd.concat([data, predictions], axis=1, ignore_index=True)

but this raises an error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

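I have seen that this can sometimes happen when column names are duplicated, but that is not the case here. What does this error mean, and what is the best way to put these DataFrames back together?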

data.head():

                year serial  hwtfinl                       region statefip  \
cpsid                                                                        
20121000000100  2012      1  3796.85  East South Central Division  Alabama   
20121000000100  2012      1  3796.85  East South Central Division  Alabama   
20121000000100  2012      1  3796.85  East South Central Division  Alabama   
20120800000500  2012      6  2814.24  East South Central Division  Alabama   
20120800000600  2012      7  2828.42  East South Central Division  Alabama   

                county  month  pernum          cpsidp     wtsupp   ...    \
cpsid                                                              ...     
20121000000100       0     11       1  20121000000101  3208.1213   ...     
20121000000100       0     11       2  20121000000102  3796.8506   ...     
20121000000100       0     11       3  20121000000103  3386.4305   ...     
20120800000500       0     11       1  20120800000501  2814.2417   ...     
20120800000600    1097     11       1  20120800000601  2828.4193   ...     

                 race        hispan educ           votereg  \
cpsid                                                        
20121000000100  White  Not Hispanic  111             Voted   
20121000000100  White  Not Hispanic  111  Did not register   
20121000000100  White  Not Hispanic  111             Voted   
20120800000500  White  Not Hispanic   92             Voted   
20120800000600  White  Not Hispanic   73  Did not register   

                                         educ_parsed      age4         educ4  \
cpsid                                                                          
20121000000100                     Bachelor's degree       65+  College grad   
20121000000100                     Bachelor's degree       65+  College grad   
20121000000100                     Bachelor's degree  Under 30  College grad   
20120800000500  Associate's degree, academic program     45-64  College grad   
20120800000600     High school diploma or equivalent       65+    HS or less   

                race4 region4  gender  
cpsid                                  
20121000000100  White   South    Male  
20121000000100  White   South  Female  
20121000000100  White   South  Female  
20120800000500  White   South  Female  
20120800000600  White   South  Female

predictions.head():

          a         b         c         d         e         f
0  0.119534  0.336761  0.188023  0.136651  0.095342  0.123689
1  0.148409  0.346429  0.134852  0.169661  0.087556  0.113093
2  0.389586  0.195802  0.101738  0.085705  0.114612  0.112557
3  0.277783  0.262079  0.180037  0.102030  0.071171  0.106900
4  0.158404  0.396487  0.088064  0.079058  0.171540  0.106447

Just for fun, I also tried this with only the first few rows:

pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)

The same error occurs.
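For context, here is a minimal sketch of what appears to trigger the error. The data.head() output shows the same cpsid label repeating in data's index, while predictions has a default 0..n-1 index. With axis=1, pd.concat aligns rows by index label, and ignore_index=True only replaces the labels along the concatenation axis (here, the columns); it does not skip the row alignment. The two small frames below are made-up stand-ins, not the original data.

```python
import pandas as pd

# Made-up stand-in for `data`: note the repeated cpsid labels in the index.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012], "pernum": [1, 2, 1]},
    index=pd.Index([20121000000100, 20121000000100, 20120800000500], name="cpsid"),
)

# Made-up stand-in for `predictions`: same row order, default RangeIndex.
predictions = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [0.9, 0.8, 0.7]})

print(data.index.is_unique)  # False: the index contains duplicate labels

try:
    pd.concat([data, predictions], axis=1, ignore_index=True)
    error = None
except Exception as exc:  # the exact exception class can vary across pandas versions
    error = exc

print(type(error).__name__, error)
```

Because the two indexes differ, pandas has to reindex data onto the combined labels, and that step is not defined for an index with duplicates.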

Answer 1

I'm on 0.18.0 as well. Here is what I tried, and it works. Is this what you are doing?

import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X,Y)
import pandas as pd
data = pd.DataFrame(X)
data['y']=Y
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)])
pd.concat([data, predictions], axis=1, ignore_index=True)
   0  1  2             3             4
0 -1 -1  1  1.000000e+00  1.522998e-08
1 -2 -1  1  1.000000e+00  3.775135e-11
2 -3 -2  1  1.000000e+00  5.749523e-19
3  1  1  2  1.522998e-08  1.000000e+00
4  2  1  2  3.775135e-11  1.000000e+00
5  3  2  2  5.749523e-19  1.000000e+00
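One plausible reason this example does not reproduce the error: both frames here get the same default RangeIndex, so pd.concat has nothing to realign. A quick check along these lines can show whether two frames are in that situation. The helper name describe_alignment and the tiny frames are hypothetical, chosen only for illustration.

```python
import pandas as pd

def describe_alignment(left, right):
    """Report the index properties that matter for pd.concat(..., axis=1)."""
    return {
        "left_unique": left.index.is_unique,
        "right_unique": right.index.is_unique,
        "same_index": left.index.equals(right.index),
    }

# Frames shaped like the ones in this answer: both get a default RangeIndex.
data = pd.DataFrame({"x0": [-1, -2, -3], "x1": [-1, -1, -2], "y": [1, 1, 1]})
predictions = pd.DataFrame({1: [1.0, 1.0, 1.0], 2: [0.0, 0.0, 0.0]})

report = describe_alignment(data, predictions)
print(report)

combined = pd.concat([data, predictions], axis=1, ignore_index=True)
print(combined.columns.tolist())  # column labels are reset to 0..4
```

If same_index is False and either index has duplicates, the concat in the question is likely to fail in the same way.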

Answer 2

It turns out there is a relatively simple solution:

predictions.index = data.index
pd.concat([data, predictions], axis=1)

Now it works flawlessly. I'm not sure why it didn't work the way I originally tried.
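A likely explanation, offered as a sketch rather than a definitive diagnosis: once both frames carry the same index, pd.concat no longer needs to reindex either one, so the duplicate cpsid labels stop being a problem. Two variants of the same idea are shown below: attach the index when predictions is built, or align purely by position with reset_index. The proba array and the class labels are invented stand-ins for clf.predict_proba(factors) and clf.classes_.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"year": [2012, 2012, 2012]},
    index=pd.Index([100, 100, 500], name="cpsid"),  # duplicate labels on purpose
)

# Stand-ins for clf.predict_proba(factors) and clf.classes_ (both invented here).
proba = np.array([[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]])
classes = ["a", "b"]

# Variant 1: attach data's index when the frame is created.
predictions = pd.DataFrame(proba, columns=classes, index=data.index)
combined = pd.concat([data, predictions], axis=1)

# Variant 2: align purely by position, keeping cpsid as an ordinary column.
by_position = pd.concat(
    [data.reset_index(), pd.DataFrame(proba, columns=classes)], axis=1
)

print(combined)
print(by_position)
```

Both variants rely on predictions having the same row order as data, which the question states is the case.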

Combining DataFrames with different indexes in pandas
