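I'm new to scikit-learn and I'm trying to use the package to make predictions on income data. This may be a duplicate, since I've seen another post on the topic, but I'm looking for a simple example that shows what scikit-learn estimators expect as input.
The data I have has the following structure, where many of the features are categorical (e.g. workclass, education, ...):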
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Example records:
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
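I'm having a hard time handling the categorical features, since most models in scikit-learn expect all features to be numeric. The library provides some classes to transform/encode such features (like OneHotEncoder and DictVectorizer), but I can't find a way to use them on my data. I know there are quite a few steps involved before everything is fully encoded as numbers, but I'm wondering whether anyone knows a simpler and more efficient way (since there are so many such features) that can be understood through an example. I vaguely know that DictVectorizer is the way to go, but I need help with how to proceed from here.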
Answer 0 (score: 6)
Here is some example code using DictVectorizer. First, let's set up some data in the Python shell. I leave reading the file up to you.
>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
... "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"]
>>> input_text = """38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
... 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
... 30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
... """
Now, parse this into a list of dicts (samples) and a list of labels (y):
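>>> samples = []   # one dict of feature name -> value per record
>>> y = []         # one class label per record
>>> for ln in input_text.splitlines():
...     values = ln.split()
...     y.append(values[-1])                    # last field is the label
...     d = dict(zip(features, values[:-1]))    # remaining fields are features
...     samples.append(d)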
What do we have now? Let's take a look:
>>> from pprint import pprint
>>> pprint(samples[0])
{'age': '38',
 'capital-gain': '0',
 'capital-loss': '0',
 'education': 'HS-grad',
 'education-num': '9',
 'fnlwgt': '215646',
 'hours-per-week': '40',
 'marital-status': 'Divorced',
 'native-country': 'United-States',
 'occupation': 'Handlers-cleaners',
 'race': 'White',
 'relationship': 'Not-in-family',
 'sex': 'Male',
 'workclass': 'Private'}
>>> print(y)
['<=50K', '<=50K', '>50K']
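The printout above shows that every value is still a string, including continuous ones such as age. DictVectorizer one-hot encodes string values and passes numeric values through unchanged, so a continuous column left as text is expanded into one indicator column per distinct value. If that is not what you want, the continuous columns can be converted while parsing. A minimal sketch, where CONTINUOUS and parse_line are hypothetical helpers introduced here for illustration and are not part of the answer's code:

```python
FEATURES = ["age", "workclass", "fnlwgt", "education", "education-num",
            "marital-status", "occupation", "relationship", "race", "sex",
            "capital-gain", "capital-loss", "hours-per-week", "native-country"]

# Columns the question describes as "continuous" (hypothetical helper set).
CONTINUOUS = {"age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"}


def parse_line(line):
    """Split one whitespace-separated record into (feature dict, label)."""
    values = line.split()
    sample = {}
    for name, value in zip(FEATURES, values[:-1]):
        # Continuous columns become floats; categorical ones stay strings.
        sample[name] = float(value) if name in CONTINUOUS else value
    return sample, values[-1]


sample, label = parse_line(
    "38 Private 215646 HS-grad 9 Divorced Handlers-cleaners "
    "Not-in-family White Male 0 0 40 United-States <=50K")
print(sample["age"], sample["workclass"], label)
```

The answer's own code keeps all values as strings, which accounts for the 29 columns in the matrix further down: one column per distinct value of every feature, numeric or not.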
These samples are now in the form that DictVectorizer expects, so pass them in:
>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> X = dv.fit_transform(samples)
>>> X
<3x29 sparse matrix of type '<type 'numpy.float64'>'
	with 42 stored elements in Compressed Sparse Row format>
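You can now pass X and y to an estimator, provided it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor to get a dense array instead.)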
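To make this last step concrete, here is a minimal sketch that fits a classifier on vectorized samples. LogisticRegression is only an assumption, chosen because it accepts sparse input; the answer does not name a particular estimator. The three toy records use a shortened subset of the features, with age already converted to a number.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Shortened toy records (a subset of the real features, for brevity).
samples = [
    {"age": 38.0, "workclass": "Private", "education": "HS-grad"},
    {"age": 53.0, "workclass": "Private", "education": "11th"},
    {"age": 30.0, "workclass": "State-gov", "education": "Bachelors"},
]
y = ["<=50K", "<=50K", ">50K"]

dv = DictVectorizer()            # sparse output by default
X = dv.fit_transform(samples)    # SciPy sparse matrix, one row per sample

clf = LogisticRegression()       # assumed estimator; accepts sparse X
clf.fit(X, y)                    # string class labels are fine

# One column for the numeric feature, one per categorical value.
print(dv.feature_names_)
print(clf.predict(X))
```

Three training rows are far too few for a meaningful model; the point is only the shape of the calls. On real data you would typically also scale the continuous columns before fitting a linear model.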
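Test samples can be passed to DictVectorizer.transform in the same way. If the test set contains feature/value combinations that did not occur in the training set, these are simply ignored (the learned model could not do anything sensible with them anyway).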
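A small sketch of that behaviour, using made-up two-feature records rather than the full data set: the category "Local-gov" is absent at fit time, so it gets no column and the test row is all zeros for workclass.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy training records (illustrative only).
train = [{"workclass": "Private", "age": 38.0},
         {"workclass": "State-gov", "age": 30.0}]

dv = DictVectorizer(sparse=False)   # dense output, easier to read here
dv.fit(train)

# "Local-gov" never appeared during fit, so it is silently dropped.
test = [{"workclass": "Local-gov", "age": 45.0}]
X_test = dv.transform(test)

print(dv.feature_names_)   # ['age', 'workclass=Private', 'workclass=State-gov']
print(X_test)              # [[45.  0.  0.]]
```

Note that the vectorizer is fitted once on the training data and only transform is called on the test data, so both matrices share the same columns.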