Dask - MemoryError when dropping duplicate index

Date: 2018-07-12 08:44:26

Tags: dask

I get a MemoryError when I try to drop duplicate timestamps on a large dataframe with the following code:

import dask.dataframe as dd

path = f's3://{container_name}/*'
ddf = dd.read_parquet(path, storage_options=opts, engine='fastparquet')
ddf = ddf.reset_index().drop_duplicates(subset='timestamp_utc').set_index('timestamp_utc')
...

Profiling shows that it uses about 14 GB of RAM on a dataset of roughly 40 million rows stored as 265 MB of gzip-compressed Parquet files.

Is there another way to drop duplicate indexes on the data in Dask that doesn't use so much memory?

The traceback is below:

Traceback (most recent call last):
  File "/anaconda/envs/surb/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/surb/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chengkai/surbana_lift/src/consolidate_data.py", line 62, in <module>
    consolidate_data()
  File "/home/chengkai/surbana_lift/src/consolidate_data.py", line 37, in consolidate_data
    ddf = ddf.reset_index().drop_duplicates(subset='timestamp_utc').set_index('timestamp_utc')
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/core.py", line 2524, in set_index
    divisions=divisions, **kwargs)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 64, in set_index
    divisions, sizes, mins, maxes = base.compute(divisions, sizes, mins, maxes)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/base.py", line 407, in compute
    results = get(dsk, keys, **kwargs)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 270, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 270, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 267, in _execute_task
    return [_execute_task(a, cache) for a in arg]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 267, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/core.py", line 69, in _concat
    return args[0] if not args2 else methods.concat(args2, uniform=True)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/methods.py", line 329, in concat
    out = pd.concat(dfs3, join=join)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 226, in concat
    return op.get_result()
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 423, in get_result
    copy=self.copy)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/internals.py", line 5418, in concatenate_block_managers
    [ju.block for ju in join_units], placement=placement)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/internals.py", line 2984, in concat_same_type
    axis=self.ndim - 1)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/dtypes/concat.py", line 461, in _concat_datetime
    return _concat_datetimetz(to_concat)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/dtypes/concat.py", line 506, in _concat_datetimetz
    new_values = np.concatenate([x.asi8 for x in to_concat])
MemoryError

1 Answer:

Answer 0 (score: 2)

It is not surprising that the data becomes large in memory. Parquet is a very space-efficient format, especially with gzip compression, and the strings all become Python objects, which are expensive in memory.
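As a rough illustration of that cost (a standalone sketch, not part of the original answer), pandas can report both the shallow footprint of a string column (just the 8-byte object pointers) and the deep footprint including the Python string objects themselves:

```python
import pandas as pd

# 100,000 short strings held as Python objects. memory_usage() counts
# only the pointer array; memory_usage(deep=True) also counts the
# string objects, which dominate the total.
s = pd.Series(['sensor_reading_%06d' % i for i in range(100_000)], dtype=object)
shallow = s.memory_usage()
deep = s.memory_usage(deep=True)
print(shallow, deep)  # deep is several times larger than shallow
```

The same multiplier applies when a compact on-disk Parquet file is expanded into object-dtype columns in memory.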

In addition, you have many worker threads operating on parts of the overall dataframe. That involves copying data, intermediate results, and concatenation of the results; the last of these is quite inefficient in pandas.

One suggestion: you can skip a step by passing index=False to read_parquet instead of calling reset_index.

Next suggestion: limit the number of threads you use to fewer than the default, which is probably your number of CPU cores. The easiest way to do that is to use the distributed client in-process:

from dask.distributed import Client
c = Client(processes=False, threads_per_worker=4)

It may be better to set the index first and then do the drop_duplicates with map_partitions, to minimize cross-partition communication.