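I think the problem lies with my variable 'info.venue'. It holds string values, which I encode with LabelEncoder and OneHotEncoder. When I try to fit a decision tree, I get an error. With only the other two variables it works like a charm, but as soon as I include the one-hot encoded 'info.venue', it fails with the following error:
ValueError: setting an array element with a sequence.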
info.toss.decision  info.toss.winner  info.venue
field               Australia         Shere Bangla National Stadium
field               Australia         Adelaide Oval
field               Australia         Melbourne Cricket Ground
bat                 Australia         Brabourne Stadium
bat                 Australia         Melbourne Cricket Ground
bat                 Australia         Sydney Cricket Ground
bat                 Australia         Punjab Cricket Association
field               India             Kensington Oval, Bridgetown
field               India             Stadium Australia
field               India             Saurashtra Cricket Association Stadium
bat                 India             Kingsmead
bat                 India             Melbourne Cricket Ground
bat                 India             R Premadasa Stadium
The code is as follows:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
onehotencoder = OneHotEncoder()
df['info.toss.decision'] = labelencoder.fit_transform(df['info.toss.decision'])
df['info.toss.winner'] = labelencoder.fit_transform(df['info.toss.winner'])
df['info.outcome.winner'] = labelencoder.fit_transform(df['info.outcome.winner'])
df['info.venue'] = labelencoder.fit_transform(df['info.venue'])
df['info.venue'] = onehotencoder.fit_transform(df[['info.venue']])
X = df[['info.venue', 'info.toss.decision', 'info.toss.winner']]
Y = df[['info.outcome.winner']]
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='gini', random_state=0)
classifier.fit(X_train, y_train)
The 'info.venue' column looks like this:
info.venue
Kingsmead
Melbourne Cricket Ground
Brabourne Stadium
Kensington Oval, Bridgetown
Stadium Australia
Melbourne Cricket Ground
R Premadasa Stadium
Saurashtra Cricket Association Stadium
Shere Bangla National Stadium
Adelaide Oval
Melbourne Cricket Ground
Sydney Cricket Ground
Punjab Cricket Association IS Bindra Stadium, Mohali
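Answer 0 (score: 1)
This error occurs because you are trying to assign a 2D array to a single pandas column.
By default, OneHotEncoder returns a sparse matrix, which pandas treats as a single object. As a result, pandas takes the complete 2D object and broadcasts it to every row of the DataFrame. The error is then thrown when the DecisionTreeClassifier is fitted.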
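To make the shape mismatch concrete, here is a minimal sketch that only inspects what the encoder returns. The four-row DataFrame is a made-up sample built from venue names in the question, and it assumes a scikit-learn version whose OneHotEncoder accepts string columns directly.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical four-row sample using venue names from the question.
df = pd.DataFrame({
    "info.venue": ["Kingsmead", "Adelaide Oval", "Kingsmead", "Stadium Australia"]
})

encoded = OneHotEncoder().fit_transform(df[["info.venue"]])

is_sparse = sparse.issparse(encoded)  # the default output is a SciPy sparse matrix
encoded_shape = encoded.shape         # (rows, number of distinct venues)

print(is_sparse, encoded_shape)
```

The result has one column per distinct venue, so it cannot be stored in the single 'info.venue' column.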
So you need to change it to:
ohe_data = onehotencoder.fit_transform(df[['info.venue']]).toarray()
# Add one new column per category; the column count is taken from the encoded array itself
for i in range(ohe_data.shape[1]):
    df['infovenue_one_coded_' + str(i)] = ohe_data[:, i]
Then drop the original column from the DataFrame:
new_df = df.drop('info.venue', axis=1)
Then pass this new_df to the DecisionTree.
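Put together, the approach could look like the sketch below. The sample rows are invented for illustration (the question does not show 'info.outcome.winner' values), the 'venue_' column prefix is my own naming, and the `categories_` attribute assumes a reasonably recent scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample; the outcome values are made up for this sketch.
df = pd.DataFrame({
    "info.toss.decision": ["field", "bat", "bat", "field"],
    "info.venue": ["Adelaide Oval", "Kingsmead", "Adelaide Oval", "Stadium Australia"],
    "info.outcome.winner": ["Australia", "India", "Australia", "India"],
})

# Integer-encode the two-valued toss decision, as in the question.
df["info.toss.decision"] = LabelEncoder().fit_transform(df["info.toss.decision"])

# One-hot encode the venue and convert the sparse result to a dense array.
encoder = OneHotEncoder()
ohe_data = encoder.fit_transform(df[["info.venue"]]).toarray()

# One new column per venue, named after the category it encodes.
venue_cols = ["venue_" + str(name) for name in encoder.categories_[0]]
ohe_df = pd.DataFrame(ohe_data, columns=venue_cols, index=df.index)

# Replace the original string column with the encoded columns.
new_df = pd.concat([df.drop("info.venue", axis=1), ohe_df], axis=1)

X = new_df.drop("info.outcome.winner", axis=1)
y = new_df["info.outcome.winner"]

classifier = DecisionTreeClassifier(criterion="gini", random_state=0)
classifier.fit(X, y)
print(X.shape)
```

Every feature column now holds one scalar per row, which is what the tree expects.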
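Update:
Since you convert to one-hot encoded data first and only then split into training and test sets, I suggest using pd.get_dummies(), which replaces both LabelEncoder and OneHotEncoder in your code.
Replace these lines: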
df['info.venue']=labelencoder.fit_transform(df['info.venue'])
df['info.venue']=onehotencoder.fit_transform(df[['info.venue']])
with
new_df = pd.concat([df, pd.get_dummies(df['info.venue'])], axis=1)
new_df = new_df.drop('info.venue', axis=1)
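As a rough end-to-end sketch of the get_dummies route: the variation below encodes all three feature columns in one call instead of concatenating, which is my own simplification rather than the exact code above. The eight sample rows and their outcome values are invented. It imports train_test_split from sklearn.model_selection, since the sklearn.cross_validation module used in the question is no longer available in newer scikit-learn releases.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample; the outcome column is made up for this sketch.
df = pd.DataFrame({
    "info.toss.decision": ["field", "field", "bat", "bat", "bat", "field", "bat", "field"],
    "info.toss.winner": ["Australia", "Australia", "India", "Australia",
                         "Australia", "India", "India", "India"],
    "info.venue": ["Adelaide Oval", "Melbourne Cricket Ground", "Kingsmead",
                   "Melbourne Cricket Ground", "Sydney Cricket Ground", "Kingsmead",
                   "Adelaide Oval", "Sydney Cricket Ground"],
    "info.outcome.winner": ["Australia", "India", "India", "Australia",
                            "Australia", "India", "Australia", "India"],
})

# get_dummies expands every string column into indicator columns.
features = pd.get_dummies(df[["info.venue", "info.toss.decision", "info.toss.winner"]])
target = df["info.outcome.winner"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)

classifier = DecisionTreeClassifier(criterion="gini", random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(features.shape, len(predictions))
```

The tree accepts string labels as the target, so 'info.outcome.winner' does not need a LabelEncoder here.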
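Answer 1 (score: 0)
The error occurs because the values of X look something like [[0,0,1],0,2] rather than proper 2D data, which raises "setting an array element with a sequence". As an alternative to scikit-learn's OneHotEncoder, you can use get_dummies from pandas and concatenate the result with the DataFrame, i.e.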
dummies = df['info.venue'].str.get_dummies()
ndf = pd.concat([df.drop(['info.venue'], axis=1), dummies], axis=1)
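As an aside, the ragged shape mentioned at the start of this answer can be reproduced with NumPy alone. This is only a small illustration of where the message comes from, not the exact code path scikit-learn takes.

```python
import numpy as np

# The first entry is a sequence while the others are scalars, so the data is not 2D.
ragged = [[0, 0, 1], 0, 2]

try:
    np.array(ragged, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```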
Later you can split ndf into X and Y, i.e.
mask = ndf.columns.isin(['info.outcome.winner'])
# We use isin here because get_dummies generates a large number of columns, so listing them by hand is impractical.
X = ndf[ndf.columns[~mask]].values  # every column except the target
Y = ndf[ndf.columns[mask]].values   # the target column only
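For a quick check that the mask separates features from target the right way round, here is a small sketch. The four rows are invented, and the toss decision and outcome are assumed to be integer-encoded already.

```python
import pandas as pd

# Hypothetical sample with already integer-encoded decision and outcome columns.
df = pd.DataFrame({
    "info.venue": ["Kingsmead", "Adelaide Oval", "Kingsmead", "Stadium Australia"],
    "info.toss.decision": [0, 1, 1, 0],
    "info.outcome.winner": [0, 1, 0, 1],
})

dummies = df["info.venue"].str.get_dummies()
ndf = pd.concat([df.drop(["info.venue"], axis=1), dummies], axis=1)

# True only for the target column.
mask = ndf.columns.isin(["info.outcome.winner"])

X = ndf.loc[:, ~mask].to_numpy(dtype=float)  # features: decision plus one column per venue
Y = ndf.loc[:, mask].to_numpy().ravel()      # target flattened to one dimension

print(X.shape, Y.shape)
```

Flattening Y with ravel() is optional; it just gives a one-dimensional target array.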