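ValueError: setting an array element with a sequence. Decision Tree

Time: 2018-01-17 15:56:53

Tags: python pandas scikit-learn

I think the problem is with my variable 'info.venue'. It actually holds string values, which I encode with LabelEncoder and OneHotEncoder. But when I try to fit a decision tree, it gives me an error. With only the other 2 variables it works like a charm, but when I include the one-hot encoded 'info.venue', I get the following error.

The error is "ValueError: setting an array element with a sequence."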

info.toss.decision  info.toss.winner  info.venue
field               Australia         Shere Bangla National Stadium
field               Australia         Adelaide Oval
field               Australia         Melbourne Cricket Ground
bat                 Australia         Brabourne Stadium
bat                 Australia         Melbourne Cricket Ground
bat                 Australia         Sydney Cricket Ground
bat                 Australia         Punjab Cricket Association
field               India             Kensington Oval, Bridgetown
field               India             Stadium Australia
field               India             Saurashtra Cricket Association Stadium
bat                 India             Kingsmead
bat                 India             Melbourne Cricket Ground
bat                 India             R Premadasa Stadium

The code is as follows:

Encoding the data with LabelEncoder and OneHotEncoder

> from sklearn.preprocessing import LabelEncoder,OneHotEncoder
> labelencoder=LabelEncoder()
> onehotencoder=OneHotEncoder()
> df['info.toss.decision']=labelencoder.fit_transform(df['info.toss.decision'])
> df['info.toss.winner']=labelencoder.fit_transform(df['info.toss.winner'])
> df['info.outcome.winner']=labelencoder.fit_transform(df['info.outcome.winner'])
> df['info.venue']=labelencoder.fit_transform(df['info.venue'])
> df['info.venue']=onehotencoder.fit_transform(df[['info.venue']])

Selecting specific columns from the data frame

X = df[['info.venue','info.toss.decision','info.toss.winner']]
Y = df[['info.outcome.winner']]

Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25)

Fitting Decision Tree Classification to the training set

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)

The 'info.venue' column looks like this:

info.venue

Kingsmead
Melbourne Cricket Ground
Brabourne Stadium
Kensington Oval, Bridgetown
Stadium Australia
Melbourne Cricket Ground
R Premadasa Stadium
Saurashtra Cricket Association Stadium
Shere Bangla National Stadium
Adelaide Oval
Melbourne Cricket Ground
Sydney Cricket Ground
Punjab Cricket Association IS Bindra Stadium, Mohali

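2 answers:

Answer 0: (score: 1)

This error occurs because you are trying to assign a 2D array to a single column in pandas.

By default, OneHotEncoder returns a sparse matrix, which pandas treats as an object array. So what happens is that pandas accepts it and broadcasts the complete 2D object to every row of the data frame. Then, during the fit of the DecisionTree, the error is thrown.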
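To make the shape mismatch concrete, here is a minimal sketch on a made-up four-row version of the 'info.venue' column (the data is hypothetical, not the asker's). It assumes a scikit-learn version in which OneHotEncoder accepts string input directly, and it only inspects the encoder output rather than assigning it to a column.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the question's 'info.venue' column
df = pd.DataFrame({'info.venue': ['Kingsmead', 'Adelaide Oval',
                                  'Kingsmead', 'Stadium Australia']})

encoded = OneHotEncoder().fit_transform(df[['info.venue']])

print(sparse.issparse(encoded))  # the default output is a SciPy sparse matrix
print(encoded.shape)             # (number of rows, number of distinct venues)

# A dense copy has one column per venue, so it cannot fit into one DataFrame column
dense = encoded.toarray()
```

Each row of the result holds a single 1 in the column of its venue, which is why the output has to be spread over several DataFrame columns.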

So you need to change it to:

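# Convert the sparse one-hot result to a dense 2D array
ohe_data = onehotencoder.fit_transform(df[['info.venue']]).toarray()
# Add one column per category (n_values_ is not available in newer scikit-learn)
for i in range(ohe_data.shape[1]):
    df['infovenue_one_coded_'+str(i)]=ohe_data[:,i]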

Then drop the original column from the data frame:

new_df = df.drop('info.venue', axis=1)

Then pass this new_df to the DecisionTree.
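Put together, the approach in this answer might look like the sketch below. The toy data frame is hypothetical, the two non-venue columns are assumed to be label-encoded already, and the tree is fitted on the full toy data without a train/test split to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names follow the question
df = pd.DataFrame({
    'info.venue': ['Kingsmead', 'Adelaide Oval', 'Kingsmead', 'Stadium Australia'],
    'info.toss.decision': [0, 1, 1, 0],   # assumed already label-encoded
    'info.outcome.winner': [1, 0, 1, 0],  # assumed already label-encoded
})

# One-hot encode the venue and add one column per category
onehotencoder = OneHotEncoder()
ohe_data = onehotencoder.fit_transform(df[['info.venue']]).toarray()
for i in range(ohe_data.shape[1]):
    df['infovenue_one_coded_' + str(i)] = ohe_data[:, i]

# Drop the original string column, then separate features and target
new_df = df.drop('info.venue', axis=1)
X = new_df.drop('info.outcome.winner', axis=1)
y = new_df['info.outcome.winner']

classifier = DecisionTreeClassifier(criterion='gini', random_state=0)
classifier.fit(X, y)
predictions = classifier.predict(X)
```

Every feature column is now numeric and X is a proper 2D table, so the fit goes through without the ValueError.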

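Update

Since you convert to one-hot encoded data first and only then split into train and test, I suggest using pd.get_dummies(), which replaces both LabelEncoder and OneHotEncoder in your code.

Replace these lines: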

df['info.venue']=labelencoder.fit_transform(df['info.venue'])
df['info.venue']=onehotencoder.fit_transform(df[['info.venue']])

with:

# Append one indicator column per venue, then drop the original column
new_df = pd.concat([df, pd.get_dummies(df['info.venue'])], axis=1)
new_df = new_df.drop('info.venue', axis=1)
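As a small illustration of the pd.get_dummies() route, the sketch below uses a hypothetical three-row frame. The prefix 'venue' is an assumption added here so the generated columns are easy to recognise; it is not part of the answer's code.

```python
import pandas as pd

# Hypothetical toy data; the venue column still holds the raw strings
df = pd.DataFrame({
    'info.venue': ['Kingsmead', 'Adelaide Oval', 'Kingsmead'],
    'info.toss.decision': ['bat', 'field', 'bat'],
})

# get_dummies works on the raw strings, so no LabelEncoder step is needed
dummies = pd.get_dummies(df['info.venue'], prefix='venue')

# Replace the original column with its indicator columns
new_df = pd.concat([df.drop('info.venue', axis=1), dummies], axis=1)
print(new_df.columns.tolist())
```

The remaining string column ('info.toss.decision' here) would still need encoding before it is passed to the classifier.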

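Answer 1: (score: 0)

Because the X values end up looking like [[0,0,1],0,2] rather than proper 2D data, which leads to "Setting an array element with a sequence". As an alternative to scikit-learn's OneHotEncoder, you can use get_dummies from pandas and concatenate the result to the dataframe, i.e.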

dummies = df['info.venue'].str.get_dummies()
ndf = pd.concat([df.drop(['info.venue'], axis=1), dummies], axis=1)

Later you can split ndf into X and Y, i.e.

mask = ndf.columns.isin(['info.outcome.winner'])
# We are using isin here because get_dummies generates a large number of columns.
X = ndf[ndf.columns[~mask]].values  # every column except the target
Y = ndf[ndf.columns[mask]].values   # only the target column
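The ragged structure this answer describes can be reproduced with NumPy alone. The sketch below is only an illustration of the error, using the literal values from the answer; the flattened row is a hypothetical example of what the same numbers look like as proper 2D data.

```python
import numpy as np

# One "cell" holds a whole sequence while the others hold scalars
ragged = [[0, 0, 1], 0, 2]

try:
    np.array(ragged, dtype=float)
    error = None
except ValueError as exc:
    error = exc  # NumPy cannot fit a sequence into a single float slot

print(type(error).__name__)

# The same numbers as proper 2D data: one row with five scalar features
flat = np.array([[0, 0, 1, 0, 2]], dtype=float)
print(flat.shape)
```

Spreading the one-hot values over separate columns, as get_dummies does, gives the estimator the flat layout rather than the ragged one.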