DictVectorizer problem: a different number of features is created for different inputs

Asked: 2017-01-12 22:05:41

Tags: python scikit-learn random-forest

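I am trying to write a machine learning algorithm that predicts whether the output is +50000 or -50000. To do this I use a random forest classifier on 11 string features. Because RandomForestClassifier needs its input as floats/numbers, I use DictVectorizer to convert the string features to floats/numbers. However, DictVectorizer creates a different number of features (240-260) for different rows of the data, which causes an error when the model predicts the output. A sample input row is: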

{'detailed household summary in household': ' Spouse of householder',
 'tax filer stat': ' Joint both under 65',
 'weeks worked in year': ' 52',
 'age': '32', 
 'sex': ' Female',
 'marital status': ' Married-civilian spouse present',
 'full or part time employment stat': ' Full-time schedules',
 'detailed household and family stat': ' Spouse of householder', 
 'education': ' Bachelors degree(BA AB BS)',
 'num persons worked for employer': ' 3',
 'major occupation code': ' Adm support including clerical'}

Is there some way to transform the input so that I can use a random forest classifier to predict the output?

Edit: the code I am using to do this is:

    X,Y=[],[]
    features=[0,4,7,9,12,15,19,22,23,30,39]
    with open("census_income_learn.csv","r") as fl:
        reader=csv.reader(fl)
        for row in reader:
            data={}
            for i in features:
                data[columnNames[i]]=str(row[i])
            X.append(data)
            Y.append(str(row[41]))

    X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)

    vec = DictVectorizer()
    X_train=vec.fit_transform(X_train).toarray()
    X_validate=vec.fit_transform(X_validate).toarray()
    print("data ready")

    forest = RandomForestClassifier(n_estimators = 100)
    forest = forest.fit( X_train, Y_train )
    print("model created")

    Y_predicted=forest.predict(X_validate)
    print(Y_predicted)

So if I print the first element of the training set and of the validation set, I get 252 features in X_train[0] but only 249 features in X_validate[0].
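The mismatch can be reproduced on a small example. This is a minimal sketch, assuming made-up records (they are illustrative, not rows from census_income_learn.csv): calling `fit_transform` on each split makes DictVectorizer learn a separate vocabulary for each split, so the column counts depend on which category values happen to occur in it.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records, made up for illustration only.
train = [
    {"sex": " Female", "education": " Bachelors"},
    {"sex": " Male", "education": " Masters"},
    {"sex": " Male", "education": " High school"},
]
validate = [
    {"sex": " Female", "education": " Bachelors"},
]

# Each fit_transform call learns its own "key=value" vocabulary,
# so the two splits end up with different numbers of columns.
n_train = DictVectorizer().fit_transform(train).shape[1]
n_validate = DictVectorizer().fit_transform(validate).shape[1]
print(n_train, n_validate)
```

Here the training split yields one column per distinct `key=value` pair it contains, while the validation split only yields columns for the values it happens to contain.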

1 answer:

Answer 0 (score: 2)

Try this:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Positions of the 11 feature columns in the CSV, plus the target (41)
    cols = [0,4,7,9,12,15,19,22,23,30,39,  41]
    names = [
     'detailed household summary in household',
     'sex',
     'full or part time employment stat',
     'age',
     'detailed household and family stat',
     'weeks worked in year',
     'num persons worked for employer',
     'major occupation code',
     'tax filer stat',
     'education',
     'marital status',
     'TARGET'
    ]

    fn = r'D:\temp\.data\census_income_learn.csv'
    data = pd.read_csv(fn, header=None, usecols=cols, names=names)

    # Encode every column as integer codes, one encoder fit per column; see
    # http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
    df = data.apply(LabelEncoder().fit_transform)

    # The first 11 columns are the features, the last one is the target
    X = df.iloc[:, :11]
    Y = df.iloc[:, 11]
    X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)

    forest = RandomForestClassifier(n_estimators = 100)
    forest = forest.fit( X_train, Y_train )

    Y_predicted=forest.predict(X_validate)
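If you would rather keep DictVectorizer, the usual pattern is to fit it on the training split only and then call `transform` (not `fit_transform`) on the validation split, so both matrices share one vocabulary. The sketch below assumes made-up toy records and labels (nothing here is read from the census file), and it passes `age` as a float so that it stays a single numeric column instead of being one-hot encoded.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Toy data, made up for illustration; the labels mimic the +50000 / -50000 target.
train = [
    {"sex": " Female", "education": " Bachelors", "age": 32.0},
    {"sex": " Male", "education": " Masters", "age": 45.0},
    {"sex": " Male", "education": " High school", "age": 23.0},
    {"sex": " Female", "education": " Masters", "age": 51.0},
]
y_train = ["+50000", "+50000", "-50000", "+50000"]
validate = [{"sex": " Female", "education": " Doctorate", "age": 39.0}]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train)   # learn the vocabulary on the training split only
X_validate = vec.transform(validate) # reuse it; values not seen during fit are ignored

forest = RandomForestClassifier(n_estimators=10, random_state=32)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_validate)
print(X_train.shape, X_validate.shape, y_pred)
```

With this pattern the validation matrix always has the same number of columns as the training matrix. A category that appears only in the validation data (such as ` Doctorate` above) simply contributes no column.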