Reading a CSV with a custom separator in Python Dask

Asked: 2015-12-14 11:45:59

Tags: python csv separator dask

I am trying to create a DataFrame by reading a CSV file whose fields are separated by '#####' (five hashes).

The code is:

import dask.dataframe as dd
# raw string so the '\t' in the Windows path is not read as a tab escape
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
res = df.compute()

The error is:

dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

Traceback
 ---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)

So how do I get rid of this error?

If I follow the error message, I would have to supply a dtype for every column, which is no use when I have 100+ columns.

If I read the file without specifying the separator, everything works fine, but '#####' ends up everywhere. Is there a way to strip it out after computing the result into a pandas DataFrame?

Please help.

2 answers:

Answer 0 (score: 4)

Read the entire file with dtype=object, meaning all columns will be interpreted as type object. This should read in correctly and get rid of the ##### in every row. From there you can turn it into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to do it by hand.

import dask.dataframe as dd
# raw string so the '\t' in the Windows path is not read as a tab escape
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
res = df.infer_objects()
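A minimal pandas-only sketch of the dtype='object' idea (assuming a recent pandas; the column names and data are invented stand-ins for the real file). One subtlety worth checking: values read as object come back as Python strings, which infer_objects() will not parse, so pd.to_numeric is a common follow-up for numeric columns:

```python
import io
import pandas as pd

# In-memory stand-in for the '#####'-separated file on disk
data = io.StringIO("col1#####col2\nfoo#####1\nbar#####2\n")

# A multi-character separator uses the python engine; dtype='object'
# skips numeric inference, so no dtype guessing happens at all
df = pd.read_csv(data, sep='#####', engine='python', dtype='object')

# Values arrive as strings; convert numeric columns explicitly
df['col2'] = pd.to_numeric(df['col2'])
```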

Answer 1 (score: 0)

If you want to keep the whole file as a dask dataframe, I have had some success on datasets with a large number of columns by simply increasing the number of bytes sampled in read_csv.

例如:

import dask.dataframe as dd
# raw string so the '\t' in the Windows path is not read as a tab escape
df = dd.read_csv(r'D:\temp.csv', sep='#####', sample=1000000)  # increase to 1e6 bytes
df.head()

This can solve some type-inference problems, although unlike Benjamin Cohen's answer, you will need to find the right value to choose for sample.