I am new to data science. I want to apply preprocessing to a dataset in a Jupyter Notebook. Here is what I have done so far:
import pandas as pd
import numpy as np
from sklearn import preprocessing
country = pd.read_csv('data.csv', encoding='utf_8')
But it gives me this error:
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-19-80e6ff7ff11c> in <module>()
----> 1 country = pd.read_csv('data.csv', encoding='utf_8')
/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
707 skip_blank_lines=skip_blank_lines)
708
--> 709 return _read(filepath_or_buffer, kwds)
710
711 parser_f.__name__ = name
/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
453
454 try:
--> 455 data = parser.read(nrows)
456 finally:
457 parser.close()
/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
1067 raise ValueError('skipfooter not supported for iteration')
1068
-> 1069 ret = self._engine.read(nrows)
1070
1071 if self.options.get('as_recarray'):
/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
1837 def read(self, nrows=None):
1838 try:
-> 1839 data = self._reader.read(nrows)
1840 except StopIteration:
1841 if self._first_chunk:
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 63
I also tried some other encodings, such as latin1, iso-8859-1 and more, with the same result.
Answer 0 (score: 1)
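The problem is not the encoding. The parser expects 3 fields based on the start of the file and then finds 63 fields on line 5, which is where the real table begins. Skip the first 4 lines with the skiprows parameter of read_csv: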
df = pd.read_csv('data.csv', skiprows=4)
print (df.head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW Population, total SP.POP.TOTL 54211.0
1 Afghanistan AFG Population, total SP.POP.TOTL 8996351.0
2 Angola AGO Population, total SP.POP.TOTL 5643182.0
3 Albania ALB Population, total SP.POP.TOTL 1608800.0
4 Andorra AND Population, total SP.POP.TOTL 13411.0
1961 1962 1963 1964 1965 ... \
0 55438.0 56225.0 56695.0 57032.0 57360.0 ...
1 9166764.0 9345868.0 9533954.0 9731361.0 9938414.0 ...
2 5753024.0 5866061.0 5980417.0 6093321.0 6203299.0 ...
3 1659800.0 1711319.0 1762621.0 1814135.0 1864791.0 ...
4 14375.0 15370.0 16412.0 17469.0 18549.0 ...
2009 2010 2011 2012 2013 2014 \
0 101453.0 101669.0 102053.0 102577.0 103187.0 103795.0
1 28004331.0 28803167.0 29708599.0 30696958.0 31731688.0 32758020.0
2 22549547.0 23369131.0 24218565.0 25096150.0 25998340.0 26920466.0
3 2927519.0 2913021.0 2905195.0 2900401.0 2895092.0 2889104.0
4 84462.0 84449.0 83751.0 82431.0 80788.0 79223.0
2015 2016 2017 Unnamed: 62
0 104341.0 104822.0 NaN NaN
1 33736494.0 34656032.0 NaN NaN
2 27859305.0 28813463.0 NaN NaN
3 2880703.0 2876101.0 NaN NaN
4 78014.0 77281.0 NaN NaN
[5 rows x 63 columns]
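If it is not obvious how many lines to skip, counting the fields on each line usually shows where the table starts. The following is only a sketch: RAW is a made-up stand-in for data.csv (the metadata lines and the date are invented, the population figures are taken from the output above), field_counts is a hypothetical helper, and treating the first widest line as the header is an assumption that suits files with this layout, not a general rule.

```python
import csv
import io

import pandas as pd

# Made-up stand-in for data.csv: two metadata lines separated by blank
# lines, then the real table. The trailing commas mimic rows that end
# with an empty field, which is what produces an "Unnamed" column.
RAW = (
    "Data Source,Example,\n"
    "\n"
    "Last Updated Date,2018-01-01,\n"
    "\n"
    "Country Name,Country Code,1960,1961,\n"
    "Aruba,ABW,54211,55438,\n"
    "Andorra,AND,13411,14375,\n"
)


def field_counts(text):
    """Return the number of comma-separated fields on each line."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


counts = field_counts(RAW)
print(counts)

# Assumption: the header is the first line with the largest field count.
header_line = counts.index(max(counts))

# skiprows counts physical lines, blank ones included.
df = pd.read_csv(io.StringIO(RAW), skiprows=header_line)
print(df.head())
```

For the real file, the same helper can be fed the first few lines read with open('data.csv', encoding='utf-8') instead of the in-memory string.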
If you want to remove the columns that contain only NaN, add dropna:
print (df.dropna(how='all', axis=1).head())
Country Name Country Code Indicator Name Indicator Code 1960 \
0 Aruba ABW Population, total SP.POP.TOTL 54211.0
1 Afghanistan AFG Population, total SP.POP.TOTL 8996351.0
2 Angola AGO Population, total SP.POP.TOTL 5643182.0
3 Albania ALB Population, total SP.POP.TOTL 1608800.0
4 Andorra AND Population, total SP.POP.TOTL 13411.0
1961 1962 1963 1964 1965 ... \
0 55438.0 56225.0 56695.0 57032.0 57360.0 ...
1 9166764.0 9345868.0 9533954.0 9731361.0 9938414.0 ...
2 5753024.0 5866061.0 5980417.0 6093321.0 6203299.0 ...
3 1659800.0 1711319.0 1762621.0 1814135.0 1864791.0 ...
4 14375.0 15370.0 16412.0 17469.0 18549.0 ...
2007 2008 2009 2010 2011 2012 \
0 101220.0 101353.0 101453.0 101669.0 102053.0 102577.0
1 26616792.0 27294031.0 28004331.0 28803167.0 29708599.0 30696958.0
2 20997687.0 21759420.0 22549547.0 23369131.0 24218565.0 25096150.0
3 2970017.0 2947314.0 2927519.0 2913021.0 2905195.0 2900401.0
4 82683.0 83861.0 84462.0 84449.0 83751.0 82431.0
2013 2014 2015 2016
0 103187.0 103795.0 104341.0 104822.0
1 31731688.0 32758020.0 33736494.0 34656032.0
2 25998340.0 26920466.0 27859305.0 28813463.0
3 2895092.0 2889104.0 2880703.0 2876101.0
4 80788.0 79223.0 78014.0 77281.0
[5 rows x 61 columns]
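To see what dropna(how='all', axis=1) does in isolation, here is a minimal sketch on a small hand-built frame. The frame is an assumption made for illustration; its values are copied from the output above, and it keeps only a few of the 63 columns.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: one populated year, plus two columns that
# hold nothing but NaN (as 2017 and "Unnamed: 62" do above).
df = pd.DataFrame(
    {
        "Country Code": ["ABW", "AND"],
        "2016": [104822.0, 77281.0],
        "2017": [np.nan, np.nan],
        "Unnamed: 62": [np.nan, np.nan],
    }
)

# how='all' drops a column only when every value in it is NaN;
# axis=1 makes dropna work on columns instead of rows.
cleaned = df.dropna(how="all", axis=1)
print(cleaned.columns.tolist())
```

Note that dropna returns a new DataFrame, so assign the result back (df = df.dropna(how='all', axis=1)) if the change should persist.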