"Could not convert string to float" in Python, and how to train a model with this dataset

Time: 2018-08-17 12:44:40

Tags: pandas machine-learning scikit-learn

I have a dataset with the columns age (float), sex (str), region (str), and charges (float).

I want to predict charges using age, sex, and region as features. How can I do this in scikit-learn?

I tried the code below, but it fails with "ValueError: could not convert string to float: 'northwest'".

import pandas as pd
import numpy as np
df = pd.read_csv('Desktop/insurance.csv')
X = df.loc[:,['age','sex','region']].values
y = df.loc[:,['charges']].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn import svm
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)

1 answer:

Answer 0 (score: 1)

region contains strings (as does sex). The SVM classifier expects numeric feature vectors, so it cannot use these columns as they are.
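To see where the message comes from, here is a minimal sketch that reproduces the same kind of failure without scikit-learn. The two-row array is made-up sample data, and the explicit astype(float) call only approximates the conversion scikit-learn performs when it validates the input.

```python
import numpy as np

# Hypothetical two-row sample with the same column layout as the question.
X = np.array([[19, 'female', 'northwest'],
              [33, 'male', 'southeast']], dtype=object)

try:
    # Roughly what happens when the feature matrix is coerced to floats.
    X.astype(float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Any string cell triggers the error, so every string column has to be encoded, not only region.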

So you have to turn this column into something the SVM can use. Here is an example that converts region to integer category codes:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age':[20,30,40,50],
              'sex':['male','female','female','male'],
              'region':['northwest','southwest','northeast','southeast'],
              'charges':[1000,1000,2000,2000]})
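# Encode sex as a boolean: True for 'female', False for 'male'
df.sex = (df.sex == 'female')
# Replace each region string with its integer category code
df.region = pd.Categorical(df.region)
df.region = df.region.cat.codes
X = df.loc[:,['age','sex','region']]
# Use a 1-D Series as the target to avoid scikit-learn's column-vector warning
y = df['charges']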

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
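It can help to check which integer each region received. The sketch below rebuilds only the region column from the example above (the variable names are mine) and prints the mapping that pd.Categorical produces.

```python
import pandas as pd

regions = pd.Series(['northwest', 'southwest', 'northeast', 'southeast'])
cat = pd.Categorical(regions)

# Categories are sorted alphabetically; a value's code is its index in that list.
mapping = dict(enumerate(cat.categories))
codes = cat.codes.tolist()

print(mapping)
print(codes)
```

Note that the codes carry an ordering (northeast < northwest < southeast < southwest) that has no real meaning for regions. A distance-based model such as an RBF SVM will still treat them as numbers, which is one reason one-hot encoding is often preferred for unordered categories.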

Another way to solve this is one-hot encoding:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age':[20,30,40,50],
              'sex':['male','female','female','male'],
              'region':['northwest','southwest','northeast','southeast'],
              'charges':[1000,1000,2000,2000]})
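# Encode sex as a boolean: True for 'female', False for 'male'
df.sex = (df.sex == 'female')
# One-hot encode region, then drop the original string column
# (drop(columns=...) replaces the positional axis argument, which pandas 2.0 removed)
df = pd.concat([df, pd.get_dummies(df.region)], axis=1).drop(columns='region')
X = df.drop(columns='charges')
y = df.charges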
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
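The same one-hot idea can also be expressed inside scikit-learn, so that the encoding is learned on the training data and reapplied at prediction time. The following is only a sketch under a few assumptions of mine: the toy rows are invented, the StandardScaler step for age is an addition, and I have swapped SVC for SVR. SVC is a classifier and rejects continuous float targets with an "Unknown label type" error, whereas SVR is the regression counterpart and seems closer to what predicting charges calls for.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Hypothetical toy data mirroring the columns in the question.
df = pd.DataFrame({
    'age': [20.0, 30.0, 40.0, 50.0, 25.0, 45.0],
    'sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'region': ['northwest', 'southwest', 'northeast',
               'southeast', 'northwest', 'southeast'],
    'charges': [1000.5, 1200.0, 2100.75, 2500.0, 1100.25, 2300.0],
})
X = df[['age', 'sex', 'region']]
y = df['charges']

# Scale the numeric column and one-hot encode the string columns.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'region']),
])

# SVR (regression) is used here instead of the SVC from the answer above.
model = make_pipeline(preprocess, SVR(kernel='rbf', C=1.0))
model.fit(X, y)

predictions = model.predict(X.head(2))
print(predictions)
```

With handle_unknown='ignore', a region that never appeared during fitting is encoded as all zeros at prediction time instead of raising an error.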

Yet another approach is label encoding:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.region = le.fit_transform(df.region)
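For reference, here is a self-contained version of that snippet which also shows how to get the original strings back. The region_code column name is my own choice. Keep in mind that the scikit-learn documentation describes LabelEncoder as intended for target labels; for feature columns, OrdinalEncoder or OneHotEncoder are the documented alternatives.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'region': ['northwest', 'southwest', 'northeast', 'southeast']})

le = LabelEncoder()
# Store the codes in a new (hypothetically named) column so the strings are kept.
df['region_code'] = le.fit_transform(df['region'])

# classes_ holds the sorted labels; a label's code is its position in this array.
classes = list(le.classes_)
# inverse_transform maps the integer codes back to the original strings.
decoded = list(le.inverse_transform(df['region_code']))

print(df)
print(classes)
```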

This list of methods is of course not exhaustive, and the right choice depends on your problem.

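Working with non-numeric data is not trivial and requires some knowledge of the existing techniques (I encourage you to search the Kaggle forums, where you can find valuable information).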