Filling missing values (NaN) by regression on other columns

Time: 2019-10-17 14:47:28

Tags: python machine-learning regression nan feature-selection

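I have a dataset with many missing values (NaN). I want to fill in all of the missing values in Python using linear or multiple linear regression. You can find the dataset here: Dataset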

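I used f_regression(X_train, Y_train) to choose which features to use. First I converted df['country'] to dummy variables, then kept the significant features and fitted a regression on them, but the results were poor.

I defined the following function to select features: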

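import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.model_selection import train_test_split

def select_features(target, df):
    '''Take a dataset and a target column and print the most significant features.'''
    # One-hot encode categorical columns (e.g. country), then keep complete rows only
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies.dropna()

    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)

    # Univariate F-test of each feature against the target
    f, pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)  # ascending p-value: most significant first
    results = pd.DataFrame(np.vstack((f[inds], pval[inds])), columns=X_train.columns[inds], index=['f_values', 'p_values']).iloc[:, :15]
    print(results)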
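
To apply a p-value cutoff afterwards, it can help to return the p-values instead of only printing them. The following is a minimal sketch on synthetic data, not the dataset from the question: the columns `x1`, `x2`, `country`, `target` and the helper `significant_features` are hypothetical names made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in for the real dataset (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "country": rng.choice(["A", "B", "C"], size=n),
})
demo["target"] = 3.0 * demo["x1"] + rng.normal(scale=0.5, size=n)
demo.loc[demo.index[:20], "target"] = np.nan  # some targets are missing

def significant_features(target, df, alpha=0.05):
    """Return the p-values of features whose univariate F-test p-value is below alpha."""
    dummies = pd.get_dummies(df, drop_first=True, dtype=float)
    complete = dummies.dropna()            # f_regression cannot handle NaN
    X = complete.drop(columns=[target])
    y = complete[target]
    _, pval = f_regression(X, y)
    pvals = pd.Series(pval, index=X.columns).sort_values()
    return pvals[pvals < alpha]

selected = significant_features("target", demo)
print(selected)
```

The index of `selected` could then be passed on as the feature list for the regression step. Note that f_regression scores each feature on its own, so correlated features may all appear significant.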

I defined the following function to predict the missing values:

import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def train(target, features, df, deg=1):
    '''Take a dataset, a target and features, and predict the NaN values in the target column.'''

    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies[[*features, target]].dropna()

    X = df_nonan[features]
    Y = df_nonan[target]

    # 60% train, 20% test, 20% validation
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)

    # Fit the polynomial expansion on the training set only, then reuse it
    pol = PolynomialFeatures(degree=deg)
    X_train_n = pol.fit_transform(X_train)
    X_test_n = pol.transform(X_test)
    X_val_n = pol.transform(X_val)

    reg = linear_model.Lasso()
    reg.fit(X_train_n, Y_train)

    print('train', r2_score(Y_train, reg.predict(X_train_n)))
    print('test', r2_score(Y_test, reg.predict(X_test_n)))
    print('val', r2_score(Y_val, reg.predict(X_val_n)))

    # Rows where the target is missing but every selected feature is present
    X_names = X.columns.values
    X_new = df_dummies.loc[df_dummies[target].isna(), X_names].dropna()
    X_new_n = pol.transform(X_new)

    Y_new = pd.Series(reg.predict(X_new_n), index=X_new.index)
    return Y_new, X_names, X_new.index
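
One possible way to keep the polynomial expansion and the regressor together, so that both are fitted once on the known rows and then reused for prediction, is a scikit-learn pipeline. This is only a sketch on synthetic data: the columns `a`, `b`, `y`, the degree and the Lasso `alpha` are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a known linear relationship (hypothetical columns).
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
toy["y"] = 2.0 * toy["a"] - toy["b"] + rng.normal(scale=0.1, size=n)
toy.loc[toy.index[:30], "y"] = np.nan    # targets to fill (rows 0-29)
toy.loc[toy.index[25:40], "b"] = np.nan  # predictors can be missing too (rows 25-39)

features, target = ["a", "b"], "y"

# Fit on rows where both the target and all features are known.
known = toy.dropna(subset=features + [target])
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Lasso(alpha=0.01),
)
model.fit(known[features], known[target])

# Only rows with a missing target AND complete features can be predicted.
fillable = toy[target].isna() & toy[features].notna().all(axis=1)
filled = toy.copy()
filled.loc[fillable, target] = model.predict(toy.loc[fillable, features])

print("filled:", int(fillable.sum()),
      "still missing:", int(filled[target].isna().sum()))
# filled: 25 still missing: 5
```

The five rows that stay empty are the ones where the predictor `b` is missing as well, which mirrors the situation where some targets cannot be predicted from the selected features.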

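I then use these functions to fill the NaN values from the features with p-values < 0.05. However, I am not sure this is a good approach: many missing values still cannot be predicted this way.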
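
Rows are left unfilled whenever one of the selected features is itself NaN, because those rows are dropped before prediction. One option that might be worth comparing against, assuming a scikit-learn version that ships the experimental IterativeImputer, is to let each column be regressed on all the others in turn. The sketch below uses synthetic numeric columns `p`, `q`, `r` (hypothetical names); categorical columns would still need to be encoded first.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the next import
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Three correlated numeric columns (hypothetical data).
rng = np.random.default_rng(2)
n = 200
base = rng.normal(size=n)
sample = pd.DataFrame({
    "p": base + rng.normal(scale=0.1, size=n),
    "q": 2 * base + rng.normal(scale=0.1, size=n),
    "r": -base + rng.normal(scale=0.1, size=n),
})

# Knock out roughly 15% of the cells at random.
mask = rng.random(sample.shape) < 0.15
sample = sample.mask(mask)

# Each column with gaps is modelled from the other columns, round-robin.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
completed = pd.DataFrame(
    imputer.fit_transform(sample),
    columns=sample.columns,
    index=sample.index,
)

print(int(sample.isna().sum().sum()), "missing cells before,",
      int(completed.isna().sum().sum()), "after")
```

Observed values are left unchanged and only the NaN cells are replaced. Whether the imputed values are good enough would still need checking, for example by hiding some known values and comparing them with the predictions.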

0 answers:

No answers