Question

I am trying to clean the NA values out of an open dataset.

I am using Python 3, Jupyter, and pandas.

 import shutil
 import urllib.request
 import zipfile

 import pandas as pd

 url = 'https://resources.lendingclub.com/LoanStats3c.csv.zip'
 file_name = 'LoanStats3c.csv.zip'

 # download the archive to disk
 with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
     shutil.copyfileobj(response, out_file)

 # extract only after the downloaded file has been written and closed
 with zipfile.ZipFile(file_name) as zf:
     zf.extractall()

 # skip the first line of the file, which precedes the header row
 loan = pd.read_csv('LoanStats3c.csv', skiprows=1, parse_dates=True, index_col='id')
 loan.describe()

 # remove all columns with all NAs
 loan = loan.dropna(axis=1, how = 'all')
 loan.describe()

 # remove all rows with any NAs
 loan = loan.dropna(axis = 0)

 loan.describe()

However, the result is that every column contains nothing but NA:

         loan_amnt  funded_amnt  funded_amnt_inv  installment  annual_inc  dti  \
  count        0.0          0.0              0.0          0.0         0.0  0.0   
  mean         NaN          NaN              NaN          NaN         NaN  NaN    
  std          NaN          NaN              NaN          NaN         NaN  NaN   
  min          NaN          NaN              NaN          NaN         NaN  NaN   
  25%          NaN          NaN              NaN          NaN         NaN  NaN   
  50%          NaN          NaN              NaN          NaN         NaN  NaN   
  75%          NaN          NaN              NaN          NaN         NaN  NaN   
  max          NaN          NaN              NaN          NaN         NaN  NaN

Why did all the rows with valid values disappear, leaving only NA columns?

Thanks

Answer 1

When you call .dropna(), the rows or columns (depending on axis) that contain NaN values are removed from the DataFrame.

loan.dropna(axis=1, how = 'all')

will drop the columns in which every value is NaN.

loan.dropna(axis = 0)

will drop the rows that contain at least one NaN value.
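To make the difference between the two calls concrete, here is a minimal sketch on a toy DataFrame. The column names `a`, `b`, and `c` are made up for illustration and are not from the LendingClub file.

```python
import numpy as np
import pandas as pd

# Toy data: column "c" is entirely NaN, and only row 1 is complete in "a" and "b".
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [np.nan, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

# how='all' drops a column only if every value in it is NaN (here: "c").
no_empty_cols = df.dropna(axis=1, how="all")

# The default how='any' drops a row as soon as it has a single NaN.
complete_rows = no_empty_cols.dropna(axis=0)

print(no_empty_cols.columns.tolist())
print(complete_rows)
```

Only row 1 survives the second call, because rows 0 and 2 each have one missing value.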

I looked at the file, and I am fairly sure that every row has NaN in at least one column.
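One way to check this, and to drop rows less aggressively, is sketched below. It uses a small stand-in frame rather than the real file, and the column names (including `notes`) are assumptions chosen for illustration. `subset` and `thresh` are existing `dropna` parameters, but which columns or threshold make sense for the real data is up to you.

```python
import numpy as np
import pandas as pd

# Stand-in for the loan data: no column is all NaN, but every row has a gap.
loan = pd.DataFrame({
    "loan_amnt": [1000.0, 2500.0, np.nan],
    "annual_inc": [40000.0, 52000.0, 61000.0],
    "notes": [np.nan, np.nan, "late"],  # hypothetical sparse column
})

# Diagnose: how many values are missing per column, and which rows have any NaN?
missing_per_column = loan.isna().sum()
rows_with_na = loan.isna().any(axis=1)
print(missing_per_column)
print(rows_with_na.all())  # True means dropna(axis=0) would remove every row

# Alternative 1: only require the columns you actually care about.
by_subset = loan.dropna(subset=["loan_amnt", "annual_inc"])

# Alternative 2: keep rows that have at least 2 non-NA values.
by_thresh = loan.dropna(thresh=2)
```

If `rows_with_na.all()` is `True` on the real data, that confirms why a plain `dropna(axis=0)` leaves nothing behind.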

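Finally, by the time you call .describe(), the DataFrame is empty, and the values shown are the descriptive statistics of that empty data, not the data itself. If you want to see the actual DataFrame, use print(df), or in Jupyter simply put the variable on the last line of the code cell: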

some code
some code
some code 
variable = pd.DataFrame([])

#print(variable)
variable

This will show you the value of the variable.
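As a rough illustration of why the output in the question is a count of 0 with NaN everywhere, the sketch below empties a toy frame with dropna and then describes it. The column names `x` and `y` are made up.

```python
import numpy as np
import pandas as pd

# Every row has one NaN, so dropping rows with any NaN leaves nothing.
df = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 2.0]})
emptied = df.dropna(axis=0)

# Check the frame itself before looking at summary statistics.
print(emptied.empty)
print(emptied.shape)

# describe() still returns a table: the count is 0 and the other statistics are NaN.
stats = emptied.describe()
print(stats)
```

Checking `.empty` or `.shape` first makes it obvious that the NaN values in the describe() table come from having no rows left, not from NaN columns surviving.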
