I did not get a solution on the Dask GitHub page, so I am asking here.
GitHub issue link: https://github.com/dask/dask/issues/5156
Problem:
Reading this CSV (https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/data/allyears2k.csv) raises the error below.
Code:
from dask.distributed import Client
import dask.dataframe as dd
client = Client()
file = "allyears2k.csv"
df = dd.read_csv(file, encoding='latin-1', blocksize=None)
df.head()
Error:
TypeError: ('Could not serialize object of type tuple.', "(,(,(,(,,[.parser_f at 0x7f5e922f46a8>,(,,0,None,b'\n'),b'Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,......
The following pandas code works fine:
import pandas as pd
datafile = "allyears2k.csv"
df=pd.read_csv(datafile, encoding='latin-1', dtype='object')
Version details:
Python 3
Pandas 0.25.0
OS:
sh-4.2$ cat /etc/release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
dask and dask distributed:
2.1.0
2.1.0
I have also added a screenshot of the (same) error, which occurs even when dtype=object is specified.
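Answer 0 (score: 0)
In pandas you needed dtype="object", but you did not pass it to Dask. On my system, when the dtype is not specified, I get a helpful message telling me which dtypes I may want to specify because they were inferred differently for different partitions. If I use those, or indeed just "object", the file loads fine: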
In [23]: df = dd.read_csv(file, encoding='latin-1', blocksize=None, dtype='object')
...: df.head()
Out[23]:
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime ... CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay IsArrDelayed IsDepDelayed
0 1987 10 14 3 741 730 912 ... NaN NaN NaN NaN NaN YES YES
1 1987 10 15 4 729 730 903 ... NaN NaN NaN NaN NaN YES NO
2 1987 10 17 6 741 730 918 ... NaN NaN NaN NaN NaN YES YES
3 1987 10 18 7 729 730 847 ... NaN NaN NaN NaN NaN NO NO
4 1987 10 19 1 749 730 922 ... NaN NaN NaN NaN NaN YES YES
Dask 2.1.0 (master), pandas 0.25.0, Python 3.7.3