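I am practicing on the Loan Prediction practice problem and trying to fill in the missing values in my data. To work through the problem I am following a tutorial.
You can find the entire code I am using (file name model.py) and the data on GitHub.
The DataFrame looks like this: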
df[['Loan_ID', 'Self_Employed', 'Education', 'LoanAmount']].head(10)
Out:
Loan_ID Self_Employed Education LoanAmount
0 LP001002 No Graduate NaN
1 LP001003 No Graduate 128.0
2 LP001005 Yes Graduate 66.0
3 LP001006 No Not Graduate 120.0
4 LP001008 No Graduate 141.0
5 LP001011 Yes Graduate 267.0
6 LP001013 No Not Graduate 95.0
7 LP001014 No Graduate 158.0
8 LP001018 No Graduate 168.0
9 LP001020 No Graduate 349.0
After executing the last line below (which corresponds to line 60 of model.py):
url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url)
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)
table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
I get this error:
ValueError Traceback (most recent call last)
<ipython-input-40-5146e49c2460> in <module>()
----> 1 df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
2368 axis=axis, inplace=inplace,
2369 limit=limit, downcast=downcast,
-> 2370 **kwargs)
2371
2372 @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in fillna(self, value, method, axis, inplace, limit, downcast)
3264 else:
3265 raise ValueError("invalid fill value with a %s" %
-> 3266 type(value))
3267
3268 new_data = self._data.fillna(value=value, limit=limit,
ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
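How can I fill in the missing values without getting this error?
Answer 0 (score: 1)
It seems the author of the tutorial wants to replace NaN with the values from table. But first you need unstack and set_index to create a Series that aligns with the data.
First, remove the step that replaces NaN with the mean: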
url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas
#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)
table = df.pivot_table(values='LoanAmount',
index='Self_Employed',
columns='Education',
aggfunc=np.median)
print (table.unstack())
Education Self_Employed
Graduate No 130.0
Yes 157.5
Not Graduate No 113.0
Yes 130.0
dtype: float64
#check all values with NaN in LoanAmount column
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate NaN
35 No Graduate NaN
63 No Graduate NaN
81 Yes Graduate NaN
95 No Graduate NaN
102 No Graduate NaN
103 No Graduate NaN
113 Yes Graduate NaN
127 No Graduate NaN
202 No Not Graduate NaN
284 No Graduate NaN
305 No Not Graduate NaN
322 No Not Graduate NaN
338 No Not Graduate NaN
387 No Not Graduate NaN
435 No Graduate NaN
437 No Graduate NaN
479 No Graduate NaN
524 No Graduate NaN
550 Yes Graduate NaN
551 No Not Graduate NaN
605 No Not Graduate NaN
#for checking later: get the index labels of all rows where LoanAmount is NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([ 0, 35, 63, 81, 95, 102, 103, 113, 127, 202, 284, 305, 322,
338, 387, 435, 437, 479, 524, 550, 551, 605],
dtype='int64')
# Replace missing values
df = df.set_index(['Education','Self_Employed'])
df['LoanAmount'].fillna(table.unstack(), inplace=True)
df = df.reset_index()
#check output - filter only indexes where NaNs before
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate 130.0
35 No Graduate 130.0
63 No Graduate 130.0
81 Yes Graduate 157.5
95 No Graduate 130.0
102 No Graduate 130.0
103 No Graduate 130.0
113 Yes Graduate 157.5
127 No Graduate 130.0
202 No Not Graduate 113.0
284 No Graduate 130.0
305 No Not Graduate 113.0
322 No Not Graduate 113.0
338 No Not Graduate 113.0
387 No Not Graduate 113.0
435 No Graduate 130.0
437 No Graduate 130.0
479 No Graduate 130.0
524 No Graduate 130.0
550 Yes Graduate 157.5
551 No Not Graduate 113.0
605 No Not Graduate 113.0
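To see the alignment step in isolation, here is a minimal sketch on a toy DataFrame (the values are invented, not taken from train.csv). It assumes a reasonably recent pandas, so it passes aggfunc='median' as a string and assigns the result of fillna back instead of using inplace=True on a column selection.

```python
import numpy as np
import pandas as pd

# Toy data with the same three columns as the question; values are invented.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "Yes", "Yes", "No", "No"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 200.0, np.nan, 50.0, np.nan],
})

# Median LoanAmount per (Self_Employed, Education) combination.
table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

# unstack() turns the table into a Series keyed by (Education, Self_Employed).
medians = table.unstack()

# Give the data the same two-level index so that fillna can align on it.
indexed = df.set_index(["Education", "Self_Employed"])
indexed["LoanAmount"] = indexed["LoanAmount"].fillna(medians)
filled = indexed.reset_index()

print(filled)
```

Note that the order of the keys in set_index has to match the level order of table.unstack(), which is why Education comes first.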
Edit:
A better solution is groupby with apply, where NaN is replaced by the median of each group:
url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas
#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate NaN
35 No Graduate NaN
63 No Graduate NaN
81 Yes Graduate NaN
95 No Graduate NaN
102 No Graduate NaN
103 No Graduate NaN
113 Yes Graduate NaN
127 No Graduate NaN
202 No Not Graduate NaN
284 No Graduate NaN
305 No Not Graduate NaN
322 No Not Graduate NaN
338 No Not Graduate NaN
387 No Not Graduate NaN
435 No Graduate NaN
437 No Graduate NaN
479 No Graduate NaN
524 No Graduate NaN
550 Yes Graduate NaN
551 No Not Graduate NaN
605 No Not Graduate NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([ 0, 35, 63, 81, 95, 102, 103, 113, 127, 202, 284, 305, 322,
338, 387, 435, 437, 479, 524, 550, 551, 605],
dtype='int64')
# Replace missing values
# parentheses let the expression continue onto the next line
df['LoanAmount'] = (df.groupby(['Education','Self_Employed'])['LoanAmount']
                      .apply(lambda x: x.fillna(x.median())))
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate 130.0
35 No Graduate 130.0
63 No Graduate 130.0
81 Yes Graduate 157.5
95 No Graduate 130.0
102 No Graduate 130.0
103 No Graduate 130.0
113 Yes Graduate 157.5
127 No Graduate 130.0
202 No Not Graduate 113.0
284 No Graduate 130.0
305 No Not Graduate 113.0
322 No Not Graduate 113.0
338 No Not Graduate 113.0
387 No Not Graduate 113.0
435 No Graduate 130.0
437 No Graduate 130.0
479 No Graduate 130.0
524 No Graduate 130.0
550 Yes Graduate 157.5
551 No Not Graduate 113.0
605 No Not Graduate 113.0
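A variant of the same group-wise idea, sketched on invented toy data, uses transform('median') instead of apply. This is a swapped-in technique rather than the code from the answer; transform always returns a result with the same index as the original column, so it can be passed straight to fillna.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "Yes", "Yes", "No", "No"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 200.0, np.nan, 50.0, np.nan],
})

# One median per row, taken from that row's (Education, Self_Employed) group.
group_median = (df.groupby(["Education", "Self_Employed"])["LoanAmount"]
                  .transform("median"))

# Only the NaN positions are replaced; existing values are left alone.
df["LoanAmount"] = df["LoanAmount"].fillna(group_median)

print(df)
```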
Edit:
There is another problem:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The solution is to replace the NaNs:
df['Loan_Status'].fillna('No',inplace=True)
df['Credit_History'].fillna(0,inplace=True)
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var,outcome_var)
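Before fitting, it can help to count the NaNs in exactly the columns that go into the model. The sketch below uses an invented toy frame and only pandas; classification_model from the tutorial is not called here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are invented.
df = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, 1.0],
    "LoanAmount": [128.0, 66.0, np.nan, 120.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

predictor_var = ["Credit_History"]
outcome_var = "Loan_Status"
model_columns = predictor_var + [outcome_var]

# NaN count per column that will be handed to the model.
nan_counts = df[model_columns].isnull().sum()
print(nan_counts)

# Fill the predictor the same way as above (0 for a missing credit history).
df["Credit_History"] = df["Credit_History"].fillna(0)

# Total NaNs left in the model columns after filling.
remaining = int(df[model_columns].isnull().sum().sum())
print(remaining)
```

LoanAmount still contains a NaN in this toy frame, but it is not among the model columns, so it does not affect the check.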
Answer 1 (score: 1)
This seems to work:
df = pd.read_csv('01_scratch_train.csv') # work with original data #
df['Self_Employed'].fillna('No', inplace=True)
table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']]
def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']] # rechecking all values with NaN in LoanAmount column. No missing values.
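Answer 2 (score: 1)
I ran into the same problem. Here is the solution that worked for me. The problem is that you are trying to fill from an empty selection, because you have already run: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
So when you select the rows with df['LoanAmount'].isnull(), the selection comes back empty. That is why this line does not work: df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
Try putting a # in front of this line: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True). The code should then work.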
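The two situations can be compared side by side in a small sketch on invented toy data. On the pandas versions I have seen, apply with axis=1 on an empty selection returns an empty DataFrame rather than a Series, which would match the error message in the question; treat that detail as an assumption and check the printed type on your own version.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "Yes", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate"],
    "LoanAmount": [100.0, np.nan, 200.0, np.nan],
})

table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def fage(row):
    # Look up the group median for this row's combination.
    return table.loc[row["Self_Employed"], row["Education"]]

# Case 1: NaNs are still present, so apply() yields one value per NaN row.
missing = df[df["LoanAmount"].isnull()]
fill_values = missing.apply(fage, axis=1)
by_group = df["LoanAmount"].fillna(fill_values)

# Case 2: the mean fill has run first, so nothing is left to select.
by_mean = df["LoanAmount"].fillna(df["LoanAmount"].mean())
nothing_missing = df[by_mean.isnull()]
empty_result = nothing_missing.apply(fage, axis=1)

print(by_group.tolist())
print(type(empty_result).__name__, len(empty_result))
```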