Question

I did not get a solution on the Dask GitHub page, so I am asking here.

GitHub issue link: https://github.com/dask/dask/issues/5156

Problem:

I get the error below when reading this CSV (https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/data/allyears2k.csv).

Code:

from dask.distributed import Client
import dask.dataframe as dd
client = Client()
file = "allyears2k.csv"
df = dd.read_csv(file, encoding='latin-1', blocksize=None)
df.head()

Error:

TypeError: ('Could not serialize object of type tuple.', "(, (, (, (, , [.parser_f at 0x7f5e922f46a8>, (, , 0, None, b'\n'), b'Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,......

The following pandas code works fine:

import pandas as pd
datafile = "allyears2k.csv"
df=pd.read_csv(datafile, encoding='latin-1', dtype='object')

Version details:

Python 3
Pandas 0.25.0
OS:
sh-4.2$ cat /etc/release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"

dask and dask distributed: 
2.1.0
2.1.0

I have also added a screenshot showing the (same) error even when dtype=object is specified.

Answer 1

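In pandas you needed dtype="object", but you are not using it for Dask. On my system, when dtype is not specified, I get a helpful message listing the dtypes I probably want to specify, because they differ between partitions. If I use those, or indeed just "object", the file loads fine: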

In [23]: df = dd.read_csv(file, encoding='latin-1', blocksize=None, dtype='object')
    ...: df.head()
Out[23]:
   Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime  ... CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay IsArrDelayed IsDepDelayed
0  1987    10         14         3     741        730     912  ...          NaN          NaN      NaN           NaN               NaN          YES          YES
1  1987    10         15         4     729        730     903  ...          NaN          NaN      NaN           NaN               NaN          YES           NO
2  1987    10         17         6     741        730     918  ...          NaN          NaN      NaN           NaN               NaN          YES          YES
3  1987    10         18         7     729        730     847  ...          NaN          NaN      NaN           NaN               NaN           NO           NO
4  1987    10         19         1     749        730     922  ...          NaN          NaN      NaN           NaN               NaN          YES          YES

Dask 2.1.0 (master), pandas 0.25.0, Python 3.7.3
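
To illustrate why a fixed dtype helps, here is a minimal sketch using only pandas. It is not the Dask code path itself: it reads a tiny, made-up CSV in chunks (standing in for partitions) and shows that the dtype inferred for a column can differ from chunk to chunk unless it is given up front. The sample data, the helper name chunk_dtypes, and the chunk size are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: TailNum looks numeric in the first rows
# and contains text in later rows.
csv_text = "Year,TailNum\n1987,1\n1987,2\n2008,N763SW\n2008,N428WN\n"


def chunk_dtypes(dtype=None):
    """Read the CSV in 2-row chunks and return the TailNum dtype of each chunk."""
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=dtype)
    return [str(chunk["TailNum"].dtype) for chunk in reader]


# dtype inferred separately for every chunk: the chunks disagree
inferred = chunk_dtypes()
# dtype fixed up front: every chunk gets the same dtype
forced = chunk_dtypes("object")

print("inferred per chunk:", inferred)
print("forced per chunk:  ", forced)
```

Passing dtype='object' to dd.read_csv, as in the session above, plays the same role: every partition is parsed with the same column types. Individual columns can then be converted afterwards where a numeric type is wanted.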
