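ML model fails to impute values

Time: 2020-10-26 14:33:09

Tags: python pandas scikit-learn data-science valueerror

I am trying to build an ML model to make some predictions, but I keep hitting a stumbling block. Namely, the code seems to ignore the imputation instructions I give it, which results in the following error: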

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here is my code:

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

data = pd.read_csv("data.csv",index_col=("Unnamed: 0"))
y = data.Installs
x = data.drop("Installs",axis=1)


strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])

# Set up the scaler
sc = StandardScaler()

# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder(sparse=True)


# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))


cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]

# First Pipeline
imp = make_pipeline((num_imp))
enc_cb = make_pipeline((obj_imp),(cb))
enc_oh = make_pipeline((obj_imp),(oh))

# Col Transformation
col = make_column_transformer((imp,num),
                              (sc,num),
                              (enc_oh,oh_col),
                              (enc_cb,cb_col))
model = AdaBoostRegressor(random_state=(0))

run = make_pipeline((col),(model))
run.fit(x,y)

Here is a link to the data used in the code, for reproduction purposes. Can you tell what the problem is? Thank you for your time.

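2 answers:

Answer 0 (score: 0)

If you inspect the dataset, some fields (for example the "Rating" field) contain NaN values. That explains the input error. How you handle the missing data is up to you, and there are many ways to do it. The pandas documentation on working with missing data can help you here.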
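As a minimal sketch of that check, assuming a small made-up frame in place of the real data.csv (the values and the "Category" column are hypothetical; only "Rating" and "Installs" are named in this thread), you could count the missing entries per column like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9],
    "Category": ["GAME", None, "TOOLS"],
    "Installs": [1000, 500, 100],
})

# Number of missing values in each column.
nan_counts = toy.isna().sum()
print(nan_counts)

# Names of the columns that contain at least one missing value.
cols_with_nan = nan_counts[nan_counts > 0].index.tolist()
print(cols_with_nan)
```

Running the same two lines on `x` from the question should show which columns need imputing before the model sees them.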

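Answer 1 (score: 0)

Your numeric scaling transformer is probably the culprit: you have not imputed before applying StandardScaler. A column transformer applies each of its transformers to the original columns independently and concatenates the results, so the (sc, num) branch never sees the values imputed by the (imp, num) branch. You probably want something like this: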

imp_sc = make_pipeline((num_imp),(sc))

# Col Transformation
col = make_column_transformer((imp_sc,num),
                              (enc_oh,oh_col),
                              (enc_cb,cb_col))
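
To see the difference, here is a rough, self-contained sketch on a made-up frame (the data and the "Reviews" column are hypothetical, not taken from the question's CSV). It compares the parallel layout from the question with the chained layout above and checks whether NaN survives the transformation:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing numeric value.
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Reviews": [10.0, 20.0, 30.0, 40.0],
})
num_cols = ["Rating", "Reviews"]

# Parallel branches, as in the question: each branch sees the raw columns.
parallel = make_column_transformer(
    (SimpleImputer(strategy="mean"), num_cols),
    (StandardScaler(), num_cols),
)
out_parallel = parallel.fit_transform(toy)

# Chained steps, as in this answer: impute first, then scale the imputed values.
chained = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), num_cols),
)
out_chained = chained.fit_transform(toy)

parallel_has_nan = bool(np.isnan(out_parallel).any())
chained_has_nan = bool(np.isnan(out_chained).any())
print(out_parallel.shape, parallel_has_nan)
print(out_chained.shape, chained_has_nan)
```

On this toy frame the parallel version still carries NaN in its scaled columns, because StandardScaler ignores NaN when fitting and passes it through when transforming. The chained version produces no NaN, and it also avoids emitting the numeric columns twice.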