Dask: ValueError: Integer column has NA values

Time: 2016-08-18 09:14:30

Tags: dask

I am trying out dask and have found what looks like a bug in dask.dataframe.read_csv.

import os

import dask.dataframe as dd

# datadir is the directory that contains test.csv (defined elsewhere)
types = {'id': 'int16', 'Semana': 'uint8', 'Agencia_ID': 'uint16', 'Canal_ID': 'uint8',
         'Ruta_SAK': 'uint16', 'Cliente_ID': 'float32', 'Producto_ID': 'float32'}
name_map = {'Semana': 'week', 'Agencia_ID': 'agency', 'Canal_ID': 'channel',
            'Ruta_SAK': 'route', 'Cliente_ID': 'client', 'Producto_ID': 'prod'}

test = dd.read_csv(os.path.join(datadir, 'test.csv'), usecols=types.keys(), dtype=types)
test = test.rename(columns=name_map)

This gives:

ValueError: Integer column has NA values in column 1

However, the same read_csv call in pandas completes normally and does not produce any NA values:

import pandas as pd

types = {'id': 'int16', 'Semana': 'uint8', 'Agencia_ID': 'uint16', 'Canal_ID': 'uint8',
         'Ruta_SAK': 'uint16', 'Cliente_ID': 'float32', 'Producto_ID': 'float32'}
name_map = {'Semana': 'week', 'Agencia_ID': 'agency', 'Canal_ID': 'channel',
            'Ruta_SAK': 'route', 'Cliente_ID': 'client', 'Producto_ID': 'prod'}

test = pd.read_csv(os.path.join(datadir, 'test.csv'), usecols=types.keys(), dtype=types)
test = test.rename(columns=name_map)

test.isnull().any()

id         False
week       False
agency     False
channel    False
route      False
client     False
prod       False
dtype: bool

Should I treat this as a confirmed bug and raise a JIRA for it?

Full traceback:

ValueError                                Traceback (most recent call last)
 in ()
      4             'Ruta_SAK': 'route', 'Cliente_ID': 'client', 'Producto_ID': 'prod'}
      5
----> 6 test = dd.read_csv(os.path.join(datadir, 'test.csv'), usecols=types.keys(), dtype=types)
      7 test = test.rename(columns=name_map)

D:\PROGLANG\Anaconda2\lib\site-packages\dask\dataframe\csv.pyc in read_csv(filename, blocksize, chunkbytes, collection, lineterminator, compression, sample, enforce, storage_options, **kwargs)
    195     else:
    196         header = sample.split(b_lineterminator)[0] + b_lineterminator
--> 197     head = pd.read_csv(BytesIO(sample), **kwargs)
    198
    199     df = read_csv_from_bytes(values, header, head, kwargs,

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561
--> 562         return _read(filepath_or_buffer, kwds)
    563
    564     parser_f.__name__ = name

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    323         return parser
    324
--> 325     return parser.read()
    326
    327 _parser_defaults = {

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    813                 raise ValueError('skip_footer not supported for iteration')
    814
--> 815         ret = self._engine.read(nrows)
    816
    817         if self.options.get('as_recarray'):

D:\PROGLANG\Anaconda2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1312     def read(self, nrows=None):
   1313         try:
-> 1314             data = self._reader.read(nrows)
   1315         except StopIteration:
   1316             if self._first_chunk:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:8748)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:10022)()

pandas\parser.pyx in pandas.parser.TextReader._convert_column_data (pandas\parser.c:11397)()

pandas\parser.pyx in pandas.parser.TextReader._convert_tokens (pandas\parser.c:12093)()

pandas\parser.pyx in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:13057)()

ValueError: Integer column has NA values in column 1
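
The traceback shows that the error is raised while dask's read_csv parses an in-memory byte sample of the file with pd.read_csv (csv.py, line 197), not while reading the whole file. One thing that could be checked, as a rough sketch using only the standard library, is whether a leading byte sample of the CSV contains rows with missing or empty fields, for example because the sample is cut off in the middle of a row. The helper name find_incomplete_rows, the sample size, and the tiny inline data are hypothetical and not part of dask or pandas, and whether a truncated sample is the actual cause here is an assumption that the traceback does not confirm.

```python
import csv
import io


def find_incomplete_rows(raw_bytes, sample_size, encoding="utf-8"):
    """Return (line_number, row) pairs from the first `sample_size` bytes
    whose rows have fewer fields than the header or contain empty fields.

    Hypothetical helper for illustration; not part of dask or pandas.
    """
    sample = raw_bytes[:sample_size].decode(encoding, errors="replace")
    reader = csv.reader(io.StringIO(sample, newline=""))
    header = next(reader)
    problems = []
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip completely blank lines
        # pandas would read a short row or an empty field as NA.
        if len(row) < len(header) or any(field == "" for field in row):
            problems.append((line_number, row))
    return problems


# Made-up data using three of the column names from the question.
data = b"id,Semana,Agencia_ID\n0,11,4037\n1,11,2237\n2,10,2045\n"

# A sample that stops in the middle of the last row leaves it incomplete.
print(find_incomplete_rows(data, sample_size=len(data) - 6))

# A sample that ends on a row boundary reports nothing.
print(find_incomplete_rows(data, sample_size=len(data)))
```

With a real file, the same check could be run on the first bytes read from test.csv in binary mode; if it reports rows, parsing that sample with integer dtypes would be expected to fail in the same way even though the complete file has no NA values.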

0 answers:

No answers yet