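Hello, I'm following a video course on Udemy in which we try to apply a random forest classifier. Before doing so, we convert one of the columns in the dataframe to a string. The "Cabin" column holds values such as "4C", but to reduce the number of unique values we want to keep only the first character and map it to a new column, "Cabin_mapped".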
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k: i for i, k in enumerate(
    data['Cabin_mapped'].unique(), 0)}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
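To make the mapping concrete, here is a minimal sketch of the same steps on a made-up four-row frame. The cabin values are invented for illustration and are not rows from the course data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 'Cabin' column
toy = pd.DataFrame({"Cabin": ["C85", "E46", "C123", "B42"]})

# Keep only the first character of each cabin string
toy["Cabin_mapped"] = toy["Cabin"].astype(str).str[0]

# Assign each distinct character an integer, in order of first appearance
cabin_dict = {k: i for i, k in enumerate(toy["Cabin_mapped"].unique(), 0)}
toy["Cabin_mapped"] = toy["Cabin_mapped"].map(cabin_dict)

print(cabin_dict)
print(toy[["Cabin_mapped", "Cabin"]])
```

After this, "Cabin_mapped" holds integers, but the original "Cabin" column is still made of strings.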
The part below just splits the data into training and test sets. The parameters are not important to the problem.
X_train_less_cat, X_test_less_cat, y_train, y_test = train_test_split(
    data[use_cols].fillna(0), data.Survived,
    test_size=0.3, random_state=0)
After fitting, I get an error here saying that a string could not be converted to float.
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
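It seems I need to convert one of the inputs back to float in order to use the random forest algorithm. The error does not appear in the tutorial video, yet I still get it. It would be great if someone could help me.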
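One way to find the offending inputs is to list the columns that are not numeric before calling `fit`. The sketch below uses a made-up frame as a stand-in for `data[use_cols]`; the column names mirror the Titanic data, but the values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for data[use_cols]
X = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin_mapped": [0, 1, 0],
})

# Any column listed here still contains non-numeric values and is a
# likely cause of a string-to-float conversion error during fit
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Every column that shows up in this list needs to be encoded as numbers or left out of `use_cols`.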
Answer 0 (score: 1)
Here is a fully working example; I have marked the bit you were missing. You need to convert every column to numbers, not just "Cabin".
!wget https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv
import pandas as pd
data = pd.read_csv("train.csv")
data['Cabin_mapped'] = data['Cabin'].astype(str).str[0]
# this transforms the letters into numbers
cabin_dict = {k: i for i, k in enumerate(
    data['Cabin_mapped'].unique(), 0)}
data.loc[:,'Cabin_mapped'] = data.loc[:,'Cabin_mapped'].map(cabin_dict)
data[['Cabin_mapped', 'Cabin']].head()
from sklearn.ensemble import RandomForestClassifier
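# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split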
## YOU ARE MISSING THIS BIT, some of your columns are still strings
## they need to be converted to numbers (ints OR floats)
for n, v in data.items():
    if v.dtype == "object":
        data[n] = v.factorize()[0]
## END of the bit you're missing
use_cols = data.drop("Survived",axis=1).columns
X_train_less_cat, X_test_less_cat, y_train, y_test = train_test_split(
    data[use_cols].fillna(0), data.Survived,
    test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=39)
rf.fit(X_train_less_cat, y_train)
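To illustrate what the conversion loop does, here is a small sketch on invented values. `factorize` replaces each distinct string with an integer code and, by default, marks missing values with -1. As an assumption of this sketch, it checks `pd.api.types.is_numeric_dtype` rather than comparing the dtype to "object", so that columns stored with a dedicated string dtype are caught as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Replace every non-numeric column with integer codes
for name in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[name]):
        toy[name] = toy[name].factorize()[0]

print(toy)
```

Because missing strings become -1 here, the later `fillna(0)` only affects columns that were numeric to begin with.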