Question

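I am working through the Loan Prediction practice problem and trying to fill in the missing values in my data. I got the data from Java demo, and I am following this tutorial to work through the problem.

You can find the complete code I am using (file name model.py) and the data on GitHub.

The DataFrame looks like this: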

df[['Loan_ID', 'Self_Employed', 'Education', 'LoanAmount']].head(10)
Out: 
    Loan_ID Self_Employed     Education  LoanAmount
0  LP001002            No      Graduate         NaN
1  LP001003            No      Graduate       128.0
2  LP001005           Yes      Graduate        66.0
3  LP001006            No  Not Graduate       120.0
4  LP001008            No      Graduate       141.0
5  LP001011           Yes      Graduate       267.0
6  LP001013            No  Not Graduate        95.0
7  LP001014            No      Graduate       158.0
8  LP001018            No      Graduate       168.0
9  LP001020            No      Graduate       349.0

After executing the last line below (which corresponds to line 60 of model.py):

url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url) 
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)

table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
 return table.loc[x['Self_Employed'],x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-40-5146e49c2460> in <module>()
----> 1 df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   2368                                           axis=axis, inplace=inplace,
   2369                                           limit=limit, downcast=downcast,
-> 2370                                           **kwargs)
   2371 
   2372     @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in fillna(self, value, method, axis, inplace, limit, downcast)
   3264                 else:
   3265                     raise ValueError("invalid fill value with a %s" %
-> 3266                                      type(value))
   3267 
   3268                 new_data = self._data.fillna(value=value, limit=limit,

ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>

How can I fill in the missing values without getting this error?

Answer 1

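It seems the tutorial's author wants to replace the NaN values with the values from table.

To do that, though, you first need unstack, to turn table into a Series, and set_index on df, so that the two align on the same index.

First, remove the line that replaces NaN with the mean: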

import numpy as np
import pandas as pd

url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'

df = pd.read_csv(url)  # read the dataset into a DataFrame

# removed: filling with the mean first would leave no NaN for the pivot-table values to replace
#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

df['Self_Employed'] = df['Self_Employed'].fillna('No')

table = df.pivot_table(values='LoanAmount', 
                       index='Self_Employed', 
                       columns='Education', 
                       aggfunc=np.median)

print (table.unstack())
Education     Self_Employed
Graduate      No               130.0
              Yes              157.5
Not Graduate  No               113.0
              Yes              130.0
dtype: float64

#check all values with NaN in LoanAmount column
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate         NaN
35             No      Graduate         NaN
63             No      Graduate         NaN
81            Yes      Graduate         NaN
95             No      Graduate         NaN
102            No      Graduate         NaN
103            No      Graduate         NaN
113           Yes      Graduate         NaN
127            No      Graduate         NaN
202            No  Not Graduate         NaN
284            No      Graduate         NaN
305            No  Not Graduate         NaN
322            No  Not Graduate         NaN
338            No  Not Graduate         NaN
387            No  Not Graduate         NaN
435            No      Graduate         NaN
437            No      Graduate         NaN
479            No      Graduate         NaN
524            No      Graduate         NaN
550           Yes      Graduate         NaN
551            No  Not Graduate         NaN
605            No  Not Graduate         NaN

# for checking: collect the index labels of the rows where LoanAmount is NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([  0,  35,  63,  81,  95, 102, 103, 113, 127, 202, 284, 305, 322,
            338, 387, 435, 437, 479, 524, 550, 551, 605],
           dtype='int64')

# Replace missing values: index df by the same keys as table.unstack() so fillna can align on them
df = df.set_index(['Education','Self_Employed'])
df['LoanAmount'] = df['LoanAmount'].fillna(table.unstack())
df = df.reset_index()

# check output: show only the rows that held NaN before
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate       130.0
35             No      Graduate       130.0
63             No      Graduate       130.0
81            Yes      Graduate       157.5
95             No      Graduate       130.0
102            No      Graduate       130.0
103            No      Graduate       130.0
113           Yes      Graduate       157.5
127            No      Graduate       130.0
202            No  Not Graduate       113.0
284            No      Graduate       130.0
305            No  Not Graduate       113.0
322            No  Not Graduate       113.0
338            No  Not Graduate       113.0
387            No  Not Graduate       113.0
435            No      Graduate       130.0
437            No      Graduate       130.0
479            No      Graduate       130.0
524            No      Graduate       130.0
550           Yes      Graduate       157.5
551            No  Not Graduate       113.0
605            No  Not Graduate       113.0
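
The same lookup can also be done without changing the DataFrame's index. The following is a minimal sketch on made-up data (the toy rows and the names medians, keys and lookup are illustrative assumptions, not part of the tutorial): it reads each row's group median from the unstacked pivot table and fills only the missing entries.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, 140.0, np.nan, 200.0, np.nan, 90.0, np.nan, 150.0],
})

# Median LoanAmount per (Self_Employed, Education) combination.
table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

# unstack() turns the table into a Series indexed by (Education, Self_Employed).
medians = table.unstack()

# Build one (Education, Self_Employed) key per row and look up its group median.
keys = pd.MultiIndex.from_frame(df[["Education", "Self_Employed"]])
lookup = pd.Series(medians.reindex(keys).to_numpy(), index=df.index)

# fillna aligns on the row index, so only the NaN rows take a value from lookup.
filled = df["LoanAmount"].fillna(lookup)
print(filled)
```

Because lookup carries the same index as df, no set_index/reset_index round trip is needed; the trade-off is one extra intermediate Series.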

Edit:

A better solution is groupby with apply, which replaces the NaN values in each group with that group's median:

import pandas as pd

url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'

df = pd.read_csv(url)  # read the dataset into a DataFrame

# removed: filling with the mean first would leave no NaN for the group medians to replace
#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

df['Self_Employed'] = df['Self_Employed'].fillna('No')


print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate         NaN
35             No      Graduate         NaN
63             No      Graduate         NaN
81            Yes      Graduate         NaN
95             No      Graduate         NaN
102            No      Graduate         NaN
103            No      Graduate         NaN
113           Yes      Graduate         NaN
127            No      Graduate         NaN
202            No  Not Graduate         NaN
284            No      Graduate         NaN
305            No  Not Graduate         NaN
322            No  Not Graduate         NaN
338            No  Not Graduate         NaN
387            No  Not Graduate         NaN
435            No      Graduate         NaN
437            No      Graduate         NaN
479            No      Graduate         NaN
524            No      Graduate         NaN
550           Yes      Graduate         NaN
551            No  Not Graduate         NaN
605            No  Not Graduate         NaN

idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([  0,  35,  63,  81,  95, 102, 103, 113, 127, 202, 284, 305, 322,
            338, 387, 435, 437, 479, 524, 550, 551, 605],
           dtype='int64')

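# Replace missing values with the median of each (Education, Self_Employed) group.
# group_keys=False keeps the original row index, so the result assigns straight back.
df['LoanAmount'] = (df.groupby(['Education','Self_Employed'], group_keys=False)['LoanAmount']
                      .apply(lambda x: x.fillna(x.median())))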

print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate       130.0
35             No      Graduate       130.0
63             No      Graduate       130.0
81            Yes      Graduate       157.5
95             No      Graduate       130.0
102            No      Graduate       130.0
103            No      Graduate       130.0
113           Yes      Graduate       157.5
127            No      Graduate       130.0
202            No  Not Graduate       113.0
284            No      Graduate       130.0
305            No  Not Graduate       113.0
322            No  Not Graduate       113.0
338            No  Not Graduate       113.0
387            No  Not Graduate       113.0
435            No      Graduate       130.0
437            No      Graduate       130.0
479            No      Graduate       130.0
524            No      Graduate       130.0
550           Yes      Graduate       157.5
551            No  Not Graduate       113.0
605            No  Not Graduate       113.0
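
A closely related option is transform, which returns a result indexed like the original column, so it can be assigned back directly. This is a minimal sketch on made-up data (the toy rows are an assumption for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "Education": ["Graduate"] * 6,
    "LoanAmount": [100.0, 120.0, np.nan, 200.0, 300.0, np.nan],
})

# For each (Education, Self_Employed) group, fill NaN with that group's median.
filled = (df.groupby(["Education", "Self_Employed"])["LoanAmount"]
            .transform(lambda s: s.fillna(s.median())))

df["LoanAmount"] = filled
print(df)
```

Each group's median is computed from its non-missing values only, so the fill values here are 110.0 for the "No" group and 250.0 for the "Yes" group.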

Edit:

There is one more problem:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The solution is to replace the NaNs:

df['Loan_Status'].fillna('No',inplace=True)
df['Credit_History'].fillna(0,inplace=True) 

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']

classification_model(model, df, predictor_var,outcome_var)
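
Before fitting, it can help to list which of the columns passed to the model still contain NaN, since that is what triggers the error above. A minimal sketch on a small made-up frame; the helper name columns_with_nan is hypothetical and not part of the tutorial:

```python
import numpy as np
import pandas as pd

def columns_with_nan(frame, columns):
    """Return the names in `columns` whose data still contains NaN."""
    return [col for col in columns if frame[col].isnull().any()]

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0],
    "LoanAmount": [128.0, 66.0, 120.0],
})

before = columns_with_nan(df, ["Credit_History", "LoanAmount"])
df["Credit_History"] = df["Credit_History"].fillna(0)
after = columns_with_nan(df, ["Credit_History", "LoanAmount"])

print(before, after)
```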

Answer 2

This seems to work:

df = pd.read_csv('01_scratch_train.csv') # work with original data #

df['Self_Employed'].fillna('No', inplace=True)

table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)

df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']]

def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]


df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']] # rechecking all values with NaN in LoanAmount column. No missing values.

Answer 3

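I ran into the same problem, and this is the solution that worked for me. The problem is that you are trying to fill an empty selection, because you have already run: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

So when you select df[df['LoanAmount'].isnull()], the selection comes back empty. That is why this line does not work: df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

Try putting a # in front of this line: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True). The code should then run without the error.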
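
The empty-selection behaviour can be reproduced on a few made-up rows. This is a minimal sketch (the toy data is an assumption for illustration): once the mean has been filled in, the NaN selection has no rows, and apply on it hands back an empty DataFrame, which matches the type named in the question's traceback.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "Yes", "No"],
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate"],
    "LoanAmount": [128.0, np.nan, 66.0, 95.0],
})

table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def fage(row):
    # Look up the median for this row's (Self_Employed, Education) group.
    return table.loc[row["Self_Employed"], row["Education"]]

# Filling with the overall mean first removes every NaN ...
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# ... so the selection of rows with a missing LoanAmount is now empty.
missing = df[df["LoanAmount"].isnull()]
result = missing.apply(fage, axis=1)

print(missing.empty)            # the selection has no rows
print(type(result).__name__)    # the type that fillna later rejects
```

Skipping the mean fill, or guarding the second fill with a check such as `if not missing.empty:`, avoids passing that empty DataFrame to fillna.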
