cs-training.csv looks like this:
+----+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
+----+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
| 1 | 1 | 0.766126609 | 45 | 2 | 0.802982129 | 9120 | 13 | 0 | 6 | 0 | 2 |
| 2 | 0 | 0.957151019 | 40 | 0 | 0.121876201 | 2600 | 4 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0.65818014 | 38 | 1 | 0.085113375 | 3042 | 2 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0.233809776 | 30 | 0 | 0.036049682 | 3300 | 5 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0.9072394 | 49 | 1 | 0.024925695 | 63588 | 7 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0.213178682 | 74 | 0 | 0.375606969 | 3500 | 3 | 0 | 1 | 0 | 1 |
| 7 | 0 | 0.305682465 | 57 | 0 | 5710 | NA | 8 | 0 | 3 | 0 | 0 |
| 8 | 0 | 0.754463648 | 39 | 0 | 0.209940017 | 3500 | 8 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0.116950644 | 27 | 0 | 46 | NA | 2 | 0 | 0 | 0 | NA |
| 10 | 0 | 0.189169052 | 57 | 0 | 0.606290901 | 23684 | 9 | 0 | 4 | 0 | 2 |
| 11 | 0 | 0.644225962 | 30 | 0 | 0.30947621 | 2500 | 5 | 0 | 0 | 0 | 0 |
| 12 | 0 | 0.01879812 | 51 | 0 | 0.53152876 | 6501 | 7 | 0 | 2 | 0 | 2 |
| 13 | 0 | 0.010351857 | 46 | 0 | 0.298354075 | 12454 | 13 | 0 | 2 | 0 | 2 |
| 14 | 1 | 0.964672555 | 40 | 3 | 0.382964747 | 13700 | 9 | 3 | 1 | 1 | 2 |
| 15 | 0 | 0.019656581 | 76 | 0 | 477 | 0 | 6 | 0 | 1 | 0 | 0 |
| 16 | 0 | 0.548458062 | 64 | 0 | 0.209891754 | 11362 | 7 | 0 | 1 | 0 | 2 |
| 17 | 0 | 0.061086118 | 78 | 0 | 2058 | NA | 10 | 0 | 2 | 0 | 0 |
| 18 | 0 | 0.166284079 | 53 | 0 | 0.18827406 | 8800 | 7 | 0 | 0 | 0 | 0 |
| 19 | 0 | 0.221812771 | 43 | 0 | 0.527887839 | 3280 | 7 | 0 | 1 | 0 | 2 |
| 20 | 0 | 0.602794411 | 25 | 0 | 0.065868263 | 333 | 2 | 0 | 0 | 0 | 0 |
| 21 | 0 | 0.200923382 | 43 | 0 | 0.430046338 | 12300 | 10 | 0 | 2 | 0 | 0 |
+----+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# using RF to predict and fill nulls
def set_missing(df):
    process_df = df.ix[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    known = process_df[process_df.MonthlyIncome.notnull()].as_matrix()
    unknown = process_df[process_df.MonthlyIncome.isnull()].as_matrix()
    X = known[:, 1:]
    y = known[:, 0]
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # fill nulls, and this line goes wrong
    df.loc[(df.MonthlyIncome.isnull()), 'MonthlyIncome'] = predicted
    return df

if __name__ == '__main__':
    data = pd.read_csv('cs-training.csv')
    data.describe().to_csv('DataDescribe.csv')
    data = set_missing(data)
    data = data.dropna()
    data = data.drop_duplicates()
    data.to_csv('MissingData.csv', index=False)
    data.describe().to_csv('MissingDataDescribe.csv')
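I have checked the pages about "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')", but my case seems to be different. Does anyone know why this happens and how to fix it? Any help would be appreciated. Thanks!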
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in <module>()
----> 1 data = set_missing(data)

in set_missing(df)
     13     rfr.fit(X,y)
     14
---> 15     predicted = rfr.predict(unknown[:, 1:]).round(0)
     16     print(predicted)
     17

D:\Program Files (x86)\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in predict(self, X)
    683         """
    684         # Check data
--> 685         X = self._validate_X_predict(X)
    686
    687         # Assign chunk of trees to jobs

D:\Program Files (x86)\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in _validate_X_predict(self, X)
    353                                  "call `fit` before exploiting the model.")
    354
--> 355         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    356
    357     @property

D:\Program Files (x86)\Anaconda3\lib\site-packages\sklearn\tree\tree.py in _validate_X_predict(self, X, check_input)
    363
    364         if check_input:
--> 365             X = check_array(X, dtype=DTYPE, accept_sparse="csr")
    366             if issparse(X) and (X.indices.dtype != np.intc or
    367                                 X.indptr.dtype != np.intc):

D:\Program Files (x86)\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    405                              % (array.ndim, estimator_name))
    406         if force_all_finite:
--> 407             _assert_all_finite(array)
    408
    409     shape_repr = _shape_repr(array.shape)

D:\Program Files (x86)\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59
     60

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
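
One possible reading of the traceback, offered as a sketch and not a confirmed diagnosis: the error is raised inside rfr.predict, so the NaN would be in the feature matrix unknown[:, 1:], not in the assignment line. The table shows an unnamed leading id column. If pd.read_csv keeps it as an ordinary column, every position in [5,0,1,...,9] shifts by one, so DebtRatio becomes the target and MonthlyIncome (NaN for exactly the rows being predicted) ends up among the features. The snippet below checks this on a few rows copied from the table. The CSV layout (header starting with an empty field), the helper unknown_features, and the use of iloc/to_numpy in place of ix/as_matrix are assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd

# A few rows from the table above, written as CSV with an unnamed leading
# id column (assumed layout of cs-training.csv).
CSV_TEXT = """,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766126609,45,2,0.802982129,9120,13,0,6,0,2
2,0,0.957151019,40,0,0.121876201,2600,4,0,0,0,1
6,0,0.213178682,74,0,0.375606969,3500,3,0,1,0,1
7,0,0.305682465,57,0,5710,NA,8,0,3,0,0
9,0,0.116950644,27,0,46,NA,2,0,0,0,NA
"""

# Same column positions as in set_missing.
POSITIONS = [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]


def unknown_features(df):
    """Hypothetical helper: return the name of the column used as target and
    the feature matrix for the rows whose MonthlyIncome is missing."""
    process_df = df.iloc[:, POSITIONS]
    unknown = process_df[df["MonthlyIncome"].isnull()].to_numpy(dtype=float)
    return process_df.columns[0], unknown[:, 1:]


# Default read: the id column becomes column 0, shifting every position by one.
shifted = pd.read_csv(io.StringIO(CSV_TEXT))
target_shifted, X_shifted = unknown_features(shifted)

# Reading the first column as the index restores the intended positions.
indexed = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)
target_indexed, X_indexed = unknown_features(indexed)

print(target_shifted, bool(np.isnan(X_shifted).any()))
print(target_indexed, bool(np.isnan(X_indexed).any()))
```

If the real file behaves like this sample, reading it with pd.read_csv('cs-training.csv', index_col=0) would put MonthlyIncome in the target position and leave NumberOfDependents, the other column with NA values, out of the features.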