I'm trying out dask and found what looks like a bug in dask.dataframe.read_csv.
import os

import dask.dataframe as dd
types = {'id': 'int16', 'Semana': 'uint8', 'Agencia_ID': 'uint16', 'Canal_ID': 'uint8',
'Ruta_SAK': 'uint16' ,'Cliente_ID': 'float32', 'Producto_ID': 'float32'}
name_map = {'Semana': 'week', 'Agencia_ID': 'agency', 'Canal_ID': 'channel',
'Ruta_SAK': 'route', 'Cliente_ID': 'client', 'Producto_ID': 'prod'}
test = dd.read_csv(os.path.join(datadir, 'test.csv'), usecols=types.keys(), dtype=types)
test = test.rename(columns=name_map)
This raises:
ValueError: Integer column has NA values in column 1
However, the equivalent pandas read_csv call completes normally and produces no NA values:
import pandas as pd
types = {'id': 'int16', 'Semana': 'uint8', 'Agencia_ID': 'uint16', 'Canal_ID': 'uint8',
'Ruta_SAK': 'uint16' ,'Cliente_ID': 'float32', 'Producto_ID': 'float32'}
name_map = {'Semana': 'week', 'Agencia_ID': 'agency', 'Canal_ID': 'channel',
'Ruta_SAK': 'route', 'Cliente_ID': 'client', 'Producto_ID': 'prod'}
test = pd.read_csv(os.path.join(datadir, 'test.csv'), usecols=types.keys(), dtype=types)
test = test.rename(columns=name_map)
test.isnull().any()
id         False
week       False
agency     False
channel    False
route      False
client     False
prod       False
dtype: bool
Should I treat this as a confirmed bug and file a JIRA for it?
Full traceback:
ValueError                                Traceback (most recent call last)
 in <module>()
      4             'Ruta_SAK': 'route', 'Cliente_ID': 'client', 'Producto_ID': 'prod'}
      5
----> 6 test = dd.read_csv(os.path.join(datadir, 'test.csv'), usecols=types.keys(), dtype=types)
      7 test = test.rename(columns=name_map)

D:\PROGLANG\Anaconda2\lib\site-packages\dask\dataframe\csv.pyc in read_csv(filename, blocksize, chunkbytes, collection, lineterminator, compression, sample, enforce, storage_options, **kwargs)
    195     else:
    196         header = sample.split(b_lineterminator)[0] + b_lineterminator
--> 197     head = pd.read_csv(BytesIO(sample), **kwargs)
    198
    199     df = read_csv_from_bytes(values, header, head, kwargs,

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561
--> 562         return _read(filepath_or_buffer, kwds)
    563
    564     parser_f.__name__ = name

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    323         return parser
    324
--> 325     return parser.read()
    326
    327 _parser_defaults = {

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    813                 raise ValueError('skip_footer not supported for iteration')
    814
--> 815         ret = self._engine.read(nrows)
    816
    817         if self.options.get('as_recarray'):

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1312     def read(self, nrows=None):
   1313         try:
-> 1314             data = self._reader.read(nrows)
   1315         except StopIteration:
   1316             if self._first_chunk:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:8748)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:10022)()
pandas\parser.pyx in pandas.parser.TextReader._convert_column_data (pandas\parser.c:11397)()
pandas\parser.pyx in pandas.parser.TextReader._convert_tokens (pandas\parser.c:12093)()
pandas\parser.pyx in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:13057)()

ValueError: Integer column has NA values in column 1
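
One thing the traceback does show is that the error comes from the pd.read_csv(BytesIO(sample), **kwargs) call, so it is raised while parsing a leading byte sample rather than the whole file. My guess, which I have not confirmed, is that the sample ends in the middle of a row, leaving a last row with missing fields that cannot be stored in an integer column. Here is a minimal standard-library sketch of that effect. The CSV contents and the cut-off position are made up for illustration, and this is not dask's actual sampling code:

```python
import csv
import io

# Hypothetical CSV contents; the real test.csv is not shown here.
raw = b"id,Semana,Agencia_ID\n0,11,4037\n1,11,2237\n2,10,2045\n"


def parse_rows(data):
    """Parse CSV bytes into a list of rows, each a list of strings."""
    return list(csv.reader(io.StringIO(data.decode("ascii"))))


# Take a fixed-size byte prefix, as a sampling reader might (assumed cut-off).
sample = raw[:35]

header, *rows = parse_rows(sample)

# Rows with fewer fields than the header are where a typed parser would see NA.
short_rows = [row for row in rows if len(row) < len(header)]
print(short_rows)
```

If that is what is happening, the full pandas read would succeed because it never sees a truncated row, which matches the behaviour above.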