pandas: dtype inconsistency when read_json reads in chunks

Date: 2018-02-09 11:12:53

Tags: python json pandas parquet dask

TL;DR
What can be done to force pd.read_json to respect the specified dtypes when reading in chunks?

Background
I need to read a large dataset, about 3 million rows, currently stored as line-delimited JSON. I am trying to split it into small Parquet files so that I can stream the full dataset with dask.

My basic idea is:

import pandas as pd
from dask import dataframe

# Read the line-delimited JSON in chunks and write each chunk to its own Parquet file
_chunks = pd.read_json('data.json', lines=True, chunksize=5000)
i = 0
for c in _chunks:
    c.to_parquet('parquet/data.%s.pqt' % i)
    i = i + 1

# Load all the Parquet files back as a single dask dataframe
ddf = dataframe.read_parquet('parquet/*', index='_id')
ddf.compute()

But I get errors about inconsistent dtypes, and only for some partitions:

>>> ddf.get_partition(8).compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/base.py", line 135, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/base.py", line 333, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/dask/dataframe/io/parquet.py", line 335, in _read_parquet_row_group
    open=open, assign=views, scheme=scheme)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 284, in read_row_group_file
    scheme=scheme)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 334, in read_row_group
    cats, selfmade, assign=assign)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 311, in read_row_group_arrays
    catdef=out[name+'-catdef'] if use else None)
  File "/home/jai/usr/vendors/anaconda3/lib/python3.5/site-packages/fastparquet/core.py", line 266, in read_col
    piece[:] = dic[val]
ValueError: invalid literal for int() with base 10: ''

So my conclusion was to force the dtypes when reading the JSON, before converting to Parquet, which is what I did.

EDIT: what I mean is forcing a float type instead of the automatically inferred int, since NaN is a valid value for float but not for int.
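To illustrate the inference behaviour behind this edit, here is a minimal sketch on hypothetical two-row inputs: the same column comes back as int64 or float64 depending on whether that particular chunk happens to contain a null.

import io
import pandas as pd

# Hypothetical two-row line-delimited JSON inputs
clean = io.StringIO('{"a": 1}\n{"a": 2}\n')
dirty = io.StringIO('{"a": 1}\n{"a": null}\n')

# The clean input is inferred as int64, the one with a null as float64
print(pd.read_json(clean, lines=True).dtypes)  # a    int64
print(pd.read_json(dirty, lines=True).dtypes)  # a    float64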

I followed this tutorial (the section on choosing types when reading the data) to build a proper dtype dictionary:

# Build a dtype dictionary keyed on the existing column names
_index = ddf.dtypes.index

obj = 'object'
f64 = 'float64'
f16 = 'float16'
f32 = 'float32'
i64 = 'int64'
i16 = 'int16'
i32 = 'int32'
cat = 'category'

_new_types = [obj, obj, f64, f64, obj, cat, i32, cat, cat, cat, cat, cat, cat, cat, obj, cat, f16, obj, f16, f64, f64, f64, f64, cat, f64, cat]
_column_types = dict(zip(_index, _new_types))

# Re-read the JSON, asking read_json to apply these dtypes to every chunk
_chunks = pd.read_json('data.json', lines=True, chunksize=5000, dtype=_column_types)

The problem is that when I inspect the chunks, they do not all have the same dtypes!

for c in _chunks:
    c.dtypes
    # prints some columns as bool, int64 or object depending on the chunk
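A sketch (reusing _column_types from above; the chunk iterator has to be recreated since it is consumed) to pinpoint exactly which columns deviate from the intended dtypes in each chunk:

_chunks = pd.read_json('data.json', lines=True, chunksize=5000, dtype=_column_types)
for n, c in enumerate(_chunks):
    # Compare each chunk's actual dtypes against the intended ones
    mismatched = {col: str(dt) for col, dt in c.dtypes.items()
                  if str(dt) != _column_types[col]}
    if mismatched:
        print('chunk %s: %s' % (n, mismatched))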

1 Answer:

Answer 0 (score: 1)

It seems to me that the simplest thing you can do is force the dtypes just before writing. Since this does not appear to work through the read_json function, you can apply the conversion yourself:

i = 0
for c in _chunks:
    c.astype(_column_types).to_parquet('parquet/data.%s.pqt' % i)
    i = i + 1

Note that I would consider 5000 records per Parquet file too small to take good advantage of the format; the typical size of each component Parquet file is usually >10MB.
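For example, assuming memory allows, a larger chunksize gets each file closer to that target. This is only a sketch; the right number depends on how wide your rows are:

_chunks = pd.read_json('data.json', lines=True, chunksize=200000)
for i, c in enumerate(_chunks):
    # Cast to the intended dtypes before writing, so every file shares one schema
    c.astype(_column_types).to_parquet('parquet/data.%s.pqt' % i)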