I have generated a DataFrame of probabilities from a scikit-learn classifier, like this:
import pandas as pd

def preprocess_category_series(series, key):
    # Non-categorical columns pass through unchanged.
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        # Ordered categories: use the integer codes, replacing missing (-1) with the mode.
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s < 0] = mode
        return s
    else:
        # Unordered categories: one-hot encode, dropping the first level.
        return pd.get_dummies(series, drop_first=True, prefix=key)

data = df[df.year == 2012]
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1)
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)])
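I now want to attach these probabilities back to my original DataFrame. However, the predictions DataFrame generated above, while it preserves the order of the rows in data, has lost data's index. I assumed I would be able to do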
pd.concat([data, predictions], axis=1, ignore_index=True)
but this raises an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
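I have seen that this can sometimes happen when column names are duplicated, but in this case there are none. What does this error mean, and what is the best way to put these DataFrames back together?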
data.head():
year serial hwtfinl region statefip \
cpsid
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20121000000100 2012 1 3796.85 East South Central Division Alabama
20120800000500 2012 6 2814.24 East South Central Division Alabama
20120800000600 2012 7 2828.42 East South Central Division Alabama
county month pernum cpsidp wtsupp ... \
cpsid ...
20121000000100 0 11 1 20121000000101 3208.1213 ...
20121000000100 0 11 2 20121000000102 3796.8506 ...
20121000000100 0 11 3 20121000000103 3386.4305 ...
20120800000500 0 11 1 20120800000501 2814.2417 ...
20120800000600 1097 11 1 20120800000601 2828.4193 ...
race hispan educ votereg \
cpsid
20121000000100 White Not Hispanic 111 Voted
20121000000100 White Not Hispanic 111 Did not register
20121000000100 White Not Hispanic 111 Voted
20120800000500 White Not Hispanic 92 Voted
20120800000600 White Not Hispanic 73 Did not register
educ_parsed age4 educ4 \
cpsid
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree 65+ College grad
20121000000100 Bachelor's degree Under 30 College grad
20120800000500 Associate's degree, academic program 45-64 College grad
20120800000600 High school diploma or equivalent 65+ HS or less
race4 region4 gender
cpsid
20121000000100 White South Male
20121000000100 White South Female
20121000000100 White South Female
20120800000500 White South Female
20120800000600 White South Female
predictions.head():
a b c d e f
0 0.119534 0.336761 0.188023 0.136651 0.095342 0.123689
1 0.148409 0.346429 0.134852 0.169661 0.087556 0.113093
2 0.389586 0.195802 0.101738 0.085705 0.114612 0.112557
3 0.277783 0.262079 0.180037 0.102030 0.071171 0.106900
4 0.158404 0.396487 0.088064 0.079058 0.171540 0.106447
Just for fun, I also tried this with only the head rows:
pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True)
The same error occurs.
Answer 0 (score: 0)
I am on 0.18.0 as well. Here is what I tried, and it works. Is this what you are doing?
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X,Y)
import pandas as pd
data = pd.DataFrame(X)
data['y']=Y
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)])
pd.concat([data, predictions], axis=1, ignore_index=True)
0 1 2 3 4
0 -1 -1 1 1.000000e+00 1.522998e-08
1 -2 -1 1 1.000000e+00 3.775135e-11
2 -3 -2 1 1.000000e+00 5.749523e-19
3 1 1 2 1.522998e-08 1.000000e+00
4 2 1 2 3.775135e-11 1.000000e+00
5 3 2 2 5.749523e-19 1.000000e+00
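One likely reason the toy example above does not reproduce the error is that both of its frames carry the default 0..n-1 index, whereas the question's data is indexed by cpsid, which repeats in the head() output. The following is a minimal sketch for checking that, using made-up frames; `describe_alignment` is a hypothetical helper written for illustration, not part of pandas.

```python
import pandas as pd

def describe_alignment(left, right):
    """Report the index properties that matter when concat(axis=1) aligns rows."""
    return {
        "left_unique": left.index.is_unique,
        "right_unique": right.index.is_unique,
        "indexes_equal": left.index.equals(right.index),
    }

# Like the toy example: both frames use the default 0..n-1 index.
toy = pd.DataFrame({"x": [1, 2, 3]})
toy_pred = pd.DataFrame({"p": [0.1, 0.2, 0.3]})
toy_report = describe_alignment(toy, toy_pred)

# Like the question's data: the index contains a repeated label.
dup = pd.DataFrame({"x": [1, 2, 3]}, index=[100, 100, 500])
dup_report = describe_alignment(dup, toy_pred)

print(toy_report)
print(dup_report)
```

If the indexes are equal, no reindexing is needed and the concat goes through. If they differ and one of them is not unique, pandas has to reindex a non-unique index, which is what the error message complains about.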
Answer 1 (score: 0)
It turns out there is a relatively simple solution:
predictions.index = data.index
pd.concat([data, predictions], axis=1)
Now it works perfectly. I am not sure why it did not work the way I originally tried it.
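A probable explanation for the original failure: pd.concat with axis=1 aligns rows on their index labels, and ignore_index=True only relabels the concatenation axis, which here is the columns (this is why the output in Answer 0 has columns 0 to 4). The cpsid index of data contains repeated values, so aligning it against the 0..n-1 index of predictions requires reindexing a non-unique index. The sketch below uses small made-up frames as stand-ins for data and predictions; the values are illustrative only, and the exact exception type may differ between pandas versions.

```python
import pandas as pd

# Stand-in for `data`: the index has a repeated label, like the repeated cpsid values.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012]},
    index=pd.Index([100, 100, 500], name="cpsid"),
)
# Stand-in for `predictions`: default 0..n-1 index, rows in the same order as `data`.
predictions = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [0.9, 0.8, 0.7]})

# The original attempt: ignore_index=True does not stop row alignment on axis=1.
try:
    pd.concat([data, predictions], axis=1, ignore_index=True)
except Exception as exc:
    print(type(exc).__name__, exc)

# Fix 1 (as in the answer above): give predictions the same index, then concat.
aligned = predictions.copy()
aligned.index = data.index
combined = pd.concat([data, aligned], axis=1)

# Fix 2: move cpsid into a column so both frames share the default index.
combined_positional = pd.concat([data.reset_index(), predictions], axis=1)

print(combined)
print(combined_positional)
```

Both fixes rely on predictions having the same row order as data, which the question states is the case. Fix 1 keeps cpsid as the index, while Fix 2 keeps it as an ordinary column.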