I am trying to clean the NaN values out of an open dataset.
I am using Python 3, Jupyter, and pandas.
import shutil
import urllib.request
import zipfile

import pandas as pd

url = 'https://resources.lendingclub.com/LoanStats3c.csv.zip'
file_name = 'LoanStats3c.csv.zip'

# download the zip archive to disk
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# extract the CSV into the current directory
with zipfile.ZipFile(file_name) as zf:
    zf.extractall()

loan = pd.read_csv('LoanStats3c.csv', skiprows=1, parse_dates=True, index_col='id')
loan.describe()

# remove all columns with all NAs
loan = loan.dropna(axis=1, how='all')
loan.describe()

# remove all rows with any NAs
loan = loan.dropna(axis=0)
loan.describe()
However, the result is that every column is all NA:
       loan_amnt  funded_amnt  funded_amnt_inv  installment  annual_inc  dti  \
count        0.0          0.0              0.0          0.0         0.0  0.0
mean         NaN          NaN              NaN          NaN         NaN  NaN
std          NaN          NaN              NaN          NaN         NaN  NaN
min          NaN          NaN              NaN          NaN         NaN  NaN
25%          NaN          NaN              NaN          NaN         NaN  NaN
50%          NaN          NaN              NaN          NaN         NaN  NaN
75%          NaN          NaN              NaN          NaN         NaN  NaN
max          NaN          NaN              NaN          NaN         NaN  NaN
Why did all the rows with valid values disappear, leaving only NA columns?
Thanks.
Answer 0: (score: 0)
When you call .dropna() with its defaults, every row that contains at least one NaN value is removed from the DataFrame.
loan.dropna(axis=1, how = 'all')
drops the columns in which every value is NaN.
loan.dropna(axis = 0)
drops the rows that have at least one NaN value.
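To make the difference between the two calls concrete, here is a minimal sketch on a made-up toy DataFrame (the column names `a`, `b`, `c` and the values are hypothetical, not taken from the LendingClub file). It only assumes the documented behaviour of `DataFrame.dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every row has at least one NaN,
# and column "c" is entirely NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

# Drops only column "c", the one column whose values are all NaN.
no_empty_cols = toy.dropna(axis=1, how="all")

# Drops every row that has at least one NaN; here that is all of them.
no_nan_rows = no_empty_cols.dropna(axis=0)

print(list(no_empty_cols.columns))
print(len(no_nan_rows))
```

After the second call the frame still has its columns but no rows, which is the same situation as in the question.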
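I looked at the file, and I'm fairly sure that every row has a NaN in at least one column, so the second call removes every row.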
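If you want to check that for yourself, and then keep the rows that matter to you, something along these lines might help. This is only a sketch on a hypothetical stand-in frame: the column names and values are made up for illustration, and which columns you require via `subset`, or what `thresh` you pick, depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data; names and values are illustrative.
df = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, np.nan, 4000.0],
    "annual_inc": [50000.0, np.nan, 60000.0, 70000.0],
    "desc": [np.nan, np.nan, np.nan, "text"],
})

# Count the NaN values in each row to see how many rows are fully populated.
nan_per_row = df.isna().sum(axis=1)
complete_rows = int((nan_per_row == 0).sum())

# Option 1: only require the columns you actually need to be present.
by_subset = df.dropna(subset=["loan_amnt", "annual_inc"])

# Option 2: keep rows that have at least 2 non-NaN values.
by_thresh = df.dropna(thresh=2)

print(complete_rows, len(by_subset), len(by_thresh))
```

On the real data, `loan.isna().sum(axis=1)` would tell you whether any row is completely free of NaN before you drop anything.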
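Finally, by the time you call .describe() the DataFrame is empty, and the values shown are just the descriptive statistics of that empty data. If you want to see the actual DataFrame, use print(df), or in Jupyter simply put the variable on the last line of a code cell: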
# ... some code ...
# ... some code ...
# ... some code ...
variable = pd.DataFrame([])
# print(variable)
variable
This will show you the value of the variable.
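As a rough illustration of why the summary is all zeros and NaN, the sketch below builds an empty DataFrame with two float columns (the names are borrowed from the output in the question, the frame itself is hypothetical) and inspects it before describing it.

```python
import pandas as pd

# Hypothetical empty frame with float columns, similar to the state
# of the data after every row has been dropped.
empty = pd.DataFrame({
    "loan_amnt": pd.Series(dtype="float64"),
    "dti": pd.Series(dtype="float64"),
})

# shape and .empty reveal the problem more directly than describe().
print(empty.shape)
print(empty.empty)

# describe() still returns a table: the counts are 0 and the rest is NaN.
summary = empty.describe()
print(summary)
```

Checking `loan.shape` after each dropna call is a quick way to spot the step at which the rows disappear.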