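I'm trying to figure out why I keep getting the message given in the title of this question. I thought I had cleaned the data and removed the NaNs. Can anyone help?
I'm working with a data set of about 11K rows and trying to train a model that predicts the level of student dropout. I'm on an ordinary Windows laptop and am also trying to get better at data analysis.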
# divide the data set into categorical and non-categorical features and apply models to get insight into the data
print("\nDEFINING CATEGORICAL AND NUMERICAL FEATURES")
categorical_features = X.select_dtypes(include=['object']).columns
print(categorical_features)
numerical_features = X.select_dtypes(exclude = ["object"]).columns
print(numerical_features)
print("\nDIVIDE THE DATA SET INTO CATEGORIAL AND NON CATEGORIAL FEATURES AND APPLY MODELS TO GET THE INSIGHT OF THE DATA")
print("Numerical features : " + str(len(numerical_features)))
print("Categorical features : " + str(len(categorical_features)))
print("\nFILLING THE MISSING VALUE OF TEST WITH THEIR MEAN VALUE, FOR BETTER ACCURACY")
# np.object was removed in NumPy 1.24; the string "object" selects the same dtype
test = test.select_dtypes(exclude=["object"])
test.info()
test = test.fillna(test.mean(), inplace=True)
print("\nAPPLYING MODEL RANDOM FOREST REGRESSOR")
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
# pull data into target (y) and predictors (X)
predictor_cols = ['F18 ECTS på kurser med beståede talkarakter']
# -------------------------------------------
# Create training predictors data
train_X = X[predictor_cols]
my_model = RandomForestRegressor()
my_model.fit(train_X, y)
my_model.score(train_X, y)
print(predictor_cols)
print(my_model.score(train_X, y))
test = pd.read_csv("…_test.csv")
# -------------------------------------------
print("\nPRINT PREDICTED FACTORS")
test_X = test[predictor_cols]
# model to make predictions
predicted_factor = my_model.predict(test_X)
# look at the predicted values to ensure they are sensible
print(predicted_factor)
Most of my code runs fine, except for:
APPLYING MODEL RANDOM FOREST REGRESSOR
Traceback (most recent call last):
File "C:/Users/jcst/PycharmProjects/Frafaldsanalyse/DefiningCatAndNumFeatures_4_new.py", line 142, in <module>
my_model.fit(train_X, y)
File "C:\Users\jcst\PycharmProjects\Frafaldsanalyse\venv\lib\site-packages\sklearn\ensemble\forest.py", line 250, in fit
X = check_array(X, accept_sparse="csc", dtype=DTYPE)
File "C:\Users\jcst\PycharmProjects\Frafaldsanalyse\venv\lib\site-packages\sklearn\utils\validation.py", line 573, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Users\jcst\PycharmProjects\Frafaldsanalyse\venv\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Process finished with exit code 1
Answer 0 (score: 1)
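As the error message says, your data set train_X or y must contain NaN values. Check again to see where they come from. They are usually caused by division by zero or by a math domain error (such as taking the log of a negative value).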
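One way to find where the offending values come from is to count them per column before fitting. The following is a minimal sketch, not the asker's actual pipeline: the column name `ects` and the toy values are hypothetical stand-ins for the real train_X and y. It also checks for infinite values, which the error message mentions and which `isna()` does not flag by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real train_X and y come from the asker's data set.
train_X = pd.DataFrame({"ects": [30.0, np.nan, 15.0, np.inf]})
y = pd.Series([0.0, 1.0, np.nan, 1.0])

# Count missing values in each predictor column and in the target.
nan_per_column = train_X.isna().sum()
nan_in_target = int(y.isna().sum())

# Count infinite values separately; isna() does not report them by default.
inf_per_column = np.isinf(train_X).sum()

# Keep only the rows where every predictor and the target are finite.
mask = np.isfinite(train_X).all(axis=1) & np.isfinite(y)
clean_X, clean_y = train_X[mask], y[mask]

print(nan_per_column)
print(nan_in_target)
print(inf_per_column)
print(len(clean_X))
```

Dropping rows is only one option; filling with a mean or median is another. Either way the check should be run on exactly the objects that are passed to fit.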
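You may run into something else afterwards. You are using
test = test.fillna(test.mean(), inplace=True)
You should use either
test = test.fillna(test.mean())
or
test.fillna(test.mean(), inplace=True)
When inplace=True is specified, the function returns None, so test ends up being None.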
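The difference between the two forms can be seen on a toy frame. This is a small sketch with hypothetical data (the column `a` is not from the question). The return value of the in-place call is printed rather than assumed, since it is what the answer describes as None and could differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the asker's `test` data.
test = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# In-place form: the frame itself is modified. Do not assign the return value
# back to `test`; the answer above describes it as None.
result = test.fillna(test.mean(), inplace=True)
print(type(result))
print(test["a"].tolist())

# Reassignment form: fillna returns a new, filled frame.
other = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
other = other.fillna(other.mean())
print(other["a"].tolist())
```

The reassignment form is usually the easier one to reason about, because the variable always refers to a DataFrame.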
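Since you later overwrite test by reading the DataFrame from the CSV file, you get away with it here and no error is raised. It may still not be the behaviour you intended.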
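Because test is read from the CSV file again after the fill step in the question, whatever was filled earlier no longer applies to the data used for prediction. A minimal sketch of filling after the final read is shown below, assuming an in-memory CSV and a hypothetical predictor column `ects` in place of the real file and column name.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the asker's test file.
csv_text = "ects,name\n30.0,a\n,b\n10.0,c\n"
predictor_cols = ["ects"]  # hypothetical column name

test = pd.read_csv(io.StringIO(csv_text))

# Fill missing values after the final read, so the cleaned values are the
# ones that would be passed to predict().
test_X = test[predictor_cols]
test_X = test_X.fillna(test_X.mean())

print(test_X["ects"].tolist())
```

Whether the test set should be filled with its own mean or with the mean learned from the training data is a modelling choice that the original thread does not settle.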