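Dask DataFrame read_csv error: Could not serialize object of type tuple

Time: 2019-07-29 19:11:17

Tags: dask

I did not get a solution on the Dask GitHub page, so I am asking here.

GitHub issue link: https://github.com/dask/dask/issues/5156

Problem:

I get the error below when reading this CSV (https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/data/allyears2k.csv).

Code: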

from dask.distributed import Client
import dask.dataframe as dd
client = Client()
file = "allyears2k.csv"
df = dd.read_csv(file, encoding='latin-1', blocksize=None)
df.head()

Error:

TypeError: ('Could not serialize object of type tuple.', "(,(,(,(,,[.parser_f at 0x7f5e922f46a8>,(,,0,None,b'\n'),b'Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,......

The following pandas code works fine:

import pandas as pd
datafile = "allyears2k.csv"
df=pd.read_csv(datafile, encoding='latin-1', dtype='object')

Version details:

Python 3
Pandas 0.25.0
OS:
sh-4.2$ cat /etc/release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"

dask and dask distributed: 
2.1.0
2.1.0

I have also added a screenshot of the (same) error, which occurs even when dtype=object is specified.

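1 answer:

Answer 0: (score: 0)

In pandas you needed dtype="object", but you did not use it for Dask. On my system, when no dtype is specified, I get a helpful message telling me which dtypes I might want to specify to account for the different partitions. If I use those, or indeed just "object", the file loads fine: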

In [23]: df = dd.read_csv(file, encoding='latin-1', blocksize=None, dtype='object')
    ...: df.head()
Out[23]:
   Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime  ... CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay IsArrDelayed IsDepDelayed
0  1987    10         14         3     741        730     912  ...          NaN          NaN      NaN           NaN               NaN          YES          YES
1  1987    10         15         4     729        730     903  ...          NaN          NaN      NaN           NaN               NaN          YES           NO
2  1987    10         17         6     741        730     918  ...          NaN          NaN      NaN           NaN               NaN          YES          YES
3  1987    10         18         7     729        730     847  ...          NaN          NaN      NaN           NaN               NaN           NO           NO
4  1987    10         19         1     749        730     922  ...          NaN          NaN      NaN           NaN               NaN          YES          YES

Dask 2.1.0 (master), pandas 0.25.0, Python 3.7.3
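To illustrate why a dtype may need to be specified, here is a minimal standard-library sketch of how types inferred from one chunk of a CSV can disagree with a later chunk. It is not Dask's actual inference code. The miniature data, the helper infer_column_types, and the two-way split are all assumptions made for illustration. The resulting mapping has the shape that could be passed as the dtype= argument of dd.read_csv.

```python
import csv
import io


def infer_column_types(rows, header):
    """Crude inference: 'int' if every non-empty value parses as int, else 'object'."""
    types = {}
    for i, name in enumerate(header):
        values = [row[i] for row in rows if row[i] != ""]
        try:
            for value in values:
                int(value)
            types[name] = "int"
        except ValueError:
            types[name] = "object"
    return types


# Hypothetical miniature data: TailNum looks numeric at first, then turns textual.
raw = "Year,TailNum\n1987,123\n1987,456\n1988,N7298U\n1988,NA\n"
reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = list(reader)

# Treat the first and second halves as two separate "partitions".
types_first = infer_column_types(rows[:2], header)
types_second = infer_column_types(rows[2:], header)

# Columns whose inferred type differs between the two partitions.
mismatched = [col for col in header if types_first[col] != types_second[col]]

# An explicit mapping that forces the ambiguous columns to 'object'.
dtype = {col: "object" for col in mismatched}
print(mismatched, dtype)
```

Passing dtype='object' for every column, as in the answer above, is the blunt version of the same idea: no inference is needed, at the cost of every column being loaded as strings.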