ValueError: could not convert string to float: 'n'

Time: 2018-09-20 23:05:58

Tags: python dataframe machine-learning sklearn-pandas

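Hello, I am following along with a video course on Udemy. We are trying to apply a random forest classifier. Before doing so, we convert one of the columns in the dataframe to strings. The "Cabin" column holds values such as "4C", but to reduce the number of unique values we want to keep only the first character and map it into a new column, "Cabin_mapped".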


data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
    data['Cabin_mapped'].unique(),0)}

data.loc[:,'Cabin_mapped'] =  data.loc[:,'Cabin_mapped'].map(cabin_dict)

data[['Cabin_mapped', 'Cabin']].head() 
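To make the mapping step concrete, here is a minimal pure-Python sketch of what the snippet above does. The cabin values are made up for illustration and are not taken from the dataset. It also shows that a missing value stored as a float NaN stringifies to "nan", so its first character is the letter 'n'; in the pandas versions current in 2018, `astype(str)` behaved the same way for missing entries.

```python
# Hypothetical cabin values; float("nan") stands in for a missing entry
cabins = ["C85", float("nan"), "E46", "C123", float("nan")]

# Same idea as .astype(str).str[0]: stringify, then keep the first character
first_chars = [str(c)[0] for c in cabins]

# Same idea as enumerate(unique(), 0): number the distinct values
# in order of first appearance, starting at 0
cabin_dict = {}
for ch in first_chars:
    cabin_dict.setdefault(ch, len(cabin_dict))

# Same idea as .map(cabin_dict): replace each character by its integer code
cabin_mapped = [cabin_dict[ch] for ch in first_chars]

print(first_chars)   # ['C', 'n', 'E', 'C', 'n']
print(cabin_dict)    # {'C': 0, 'n': 1, 'E': 2}
print(cabin_mapped)  # [0, 1, 2, 0, 1]
```

So a literal 'n' in an error message could plausibly come from an unmapped first-character column, although that is only one possible source.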

The part below just splits the data into training and test sets. The parameters are not important for the problem.

X_train_less_cat, X_test_less_cat, y_train, y_test = \
    train_test_split(data[use_cols].fillna(0), data.Survived, 
                     test_size = 0.3, random_state=0) 

When I fit the model, I get an error here saying that a string could not be converted to float.

rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)

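It seems I need to convert one of the inputs back to float in order to use the random forest algorithm, even though this error does not show up in the tutorial video. If anyone can help, that would be great.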
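One way to narrow down which input triggers the conversion error is to list the columns that are still non-numeric before calling `fit`. The sketch below uses a small made-up frame; the column names "Age" and "Sex" and all values are illustrative assumptions, not the tutorial's actual `use_cols`.

```python
import pandas as pd

# Hypothetical stand-in for data[use_cols]; the real columns may differ
X = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Sex": ["male", "female", "female"],
    "Cabin_mapped": [0, 1, 0],
})

# Columns whose dtype is not numeric cannot be cast to float by the estimator
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['Sex']
```

Note that `fillna(0)` only fills missing values; it does not turn existing strings into numbers, so any column listed here still needs an explicit encoding step.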

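1 answer:

Answer 0: (score: 1)

Here is a fully working example. I have highlighted the bit you are missing: you need to convert every column to numbers, not just "Cabin".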

!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv

import pandas as pd

data = pd.read_csv("train.csv")




data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k:i for i, k in enumerate(
    data['Cabin_mapped'].unique(),0)}

data.loc[:,'Cabin_mapped'] =  data.loc[:,'Cabin_mapped'].map(cabin_dict)

data[['Cabin_mapped', 'Cabin']].head()


from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split


## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n,v in data.items():
    if v.dtype == "object":
        data[n] = v.factorize()[0]
## END of the bit you're missing

use_cols = data.drop("Survived",axis=1).columns

X_train_less_cat, X_test_less_cat, y_train, y_test = \
    train_test_split(data[use_cols].fillna(0), data.Survived, 
                    test_size = 0.3, random_state=0) 


rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
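For readers unfamiliar with `factorize()`, which the loop above relies on, here is a small sketch of what it returns. The series is a made-up example, not data from train.csv.

```python
import pandas as pd

# Hypothetical string column with one missing entry
sex = pd.Series(["male", "female", "male", None])

# factorize() returns integer codes plus the distinct values they refer to
codes, uniques = sex.factorize()

print(codes.tolist())  # [0, 1, 0, -1]
print(list(uniques))   # ['male', 'female']
```

Codes are assigned in order of first appearance, and missing values receive -1 by default. In the answer's loop this means missing entries in string columns are already encoded as -1 before `fillna(0)` runs, so `fillna(0)` only affects the columns that were numeric to begin with.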