Question

I'm new to data science. I want to apply preprocessing to a dataset in a Jupyter Notebook. Here is what I've done so far:

import pandas as pd
import numpy as np
from sklearn import preprocessing

country = pd.read_csv('data.csv', encoding='utf_8')

But it gives me this error:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-19-80e6ff7ff11c> in <module>()
----> 1 country = pd.read_csv('data.csv', encoding='utf_8')

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    453 
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068 
-> 1069         ret = self._engine.read(nrows)
   1070 
   1071         if self.options.get('as_recarray'):

/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 63

I have also tried some other encodings, such as latin1 and iso-8859-1, among others.

Link to CSV

Answer 1

The problem is that the first 4 lines of the file need to be skipped, which you can do with the skiprows parameter of read_csv:

df = pd.read_csv('data.csv', skiprows=4)
print(df.head())

  Country Name Country Code     Indicator Name Indicator Code       1960  \
0        Aruba          ABW  Population, total    SP.POP.TOTL    54211.0   
1  Afghanistan          AFG  Population, total    SP.POP.TOTL  8996351.0   
2       Angola          AGO  Population, total    SP.POP.TOTL  5643182.0   
3      Albania          ALB  Population, total    SP.POP.TOTL  1608800.0   
4      Andorra          AND  Population, total    SP.POP.TOTL    13411.0   

        1961       1962       1963       1964       1965     ...       \
0    55438.0    56225.0    56695.0    57032.0    57360.0     ...        
1  9166764.0  9345868.0  9533954.0  9731361.0  9938414.0     ...        
2  5753024.0  5866061.0  5980417.0  6093321.0  6203299.0     ...        
3  1659800.0  1711319.0  1762621.0  1814135.0  1864791.0     ...        
4    14375.0    15370.0    16412.0    17469.0    18549.0     ...        

         2009        2010        2011        2012        2013        2014  \
0    101453.0    101669.0    102053.0    102577.0    103187.0    103795.0   
1  28004331.0  28803167.0  29708599.0  30696958.0  31731688.0  32758020.0   
2  22549547.0  23369131.0  24218565.0  25096150.0  25998340.0  26920466.0   
3   2927519.0   2913021.0   2905195.0   2900401.0   2895092.0   2889104.0   
4     84462.0     84449.0     83751.0     82431.0     80788.0     79223.0   

         2015        2016  2017  Unnamed: 62  
0    104341.0    104822.0   NaN          NaN  
1  33736494.0  34656032.0   NaN          NaN  
2  27859305.0  28813463.0   NaN          NaN  
3   2880703.0   2876101.0   NaN          NaN  
4     78014.0     77281.0   NaN          NaN  

[5 rows x 63 columns]
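If you are unsure how many lines to skip, you can count the fields on each raw line before calling read_csv. The following is only a minimal sketch: the `raw` string is a hypothetical stand-in for data.csv (a few metadata lines followed by the real table), and `field_counts` is a helper made up for illustration, not part of pandas. It assumes the header is the first line with the largest number of fields.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for data.csv: metadata lines first, then the real table.
raw = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2018-01-01",\n'
    '\n'
    '"Country Name","Country Code","1960","1961",\n'
    '"Aruba","ABW","54211","55438",\n'
)


def field_counts(text):
    """Return the number of comma-separated fields on each line (0 for blank lines)."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


counts = field_counts(raw)
print(counts)

# Assumption: the header is the first line with the widest field count,
# so its zero-based position is the number of lines to skip.
header_line = counts.index(max(counts))

df = pd.read_csv(io.StringIO(raw), skiprows=header_line)
print(df.columns.tolist())
```

With a real file, you would read the first few lines from disk instead of from a string. The trailing comma on each line is also why an extra `Unnamed` column shows up at the end.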

If you also want to remove the columns that contain only NaN values, add dropna:

print(df.dropna(how='all', axis=1).head())
  Country Name Country Code     Indicator Name Indicator Code       1960  \
0        Aruba          ABW  Population, total    SP.POP.TOTL    54211.0   
1  Afghanistan          AFG  Population, total    SP.POP.TOTL  8996351.0   
2       Angola          AGO  Population, total    SP.POP.TOTL  5643182.0   
3      Albania          ALB  Population, total    SP.POP.TOTL  1608800.0   
4      Andorra          AND  Population, total    SP.POP.TOTL    13411.0   

        1961       1962       1963       1964       1965     ...      \
0    55438.0    56225.0    56695.0    57032.0    57360.0     ...       
1  9166764.0  9345868.0  9533954.0  9731361.0  9938414.0     ...       
2  5753024.0  5866061.0  5980417.0  6093321.0  6203299.0     ...       
3  1659800.0  1711319.0  1762621.0  1814135.0  1864791.0     ...       
4    14375.0    15370.0    16412.0    17469.0    18549.0     ...       

         2007        2008        2009        2010        2011        2012  \
0    101220.0    101353.0    101453.0    101669.0    102053.0    102577.0   
1  26616792.0  27294031.0  28004331.0  28803167.0  29708599.0  30696958.0   
2  20997687.0  21759420.0  22549547.0  23369131.0  24218565.0  25096150.0   
3   2970017.0   2947314.0   2927519.0   2913021.0   2905195.0   2900401.0   
4     82683.0     83861.0     84462.0     84449.0     83751.0     82431.0   

         2013        2014        2015        2016  
0    103187.0    103795.0    104341.0    104822.0  
1  31731688.0  32758020.0  33736494.0  34656032.0  
2  25998340.0  26920466.0  27859305.0  28813463.0  
3   2895092.0   2889104.0   2880703.0   2876101.0  
4     80788.0     79223.0     78014.0     77281.0  

[5 rows x 61 columns]
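Note that dropna returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it. A minimal sketch on a small hand-made frame (the values are taken from the output above; the name `cleaned` is just for illustration):

```python
import numpy as np
import pandas as pd

# Small frame that mimics the tail of the real data: two columns are entirely NaN.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Andorra"],
    "2016": [104822.0, 77281.0],
    "2017": [np.nan, np.nan],
    "Unnamed: 62": [np.nan, np.nan],
})

# how="all" drops a column only if every value in it is NaN;
# axis=1 applies the check to columns rather than rows.
cleaned = df.dropna(how="all", axis=1)

print(cleaned.columns.tolist())
print(df.shape, cleaned.shape)
```

Using how="all" rather than the default how="any" matters here, because a column with even one valid value is kept.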
