Dynamically repartitioning a Dask DataFrame

Date: 2018-07-09 10:47:38

Tags: dask dask-distributed

This is a follow-up to the following question:

...the answer to which is that there is no built-in way to dynamically repartition a DataFrame in dask.
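
(For context, dask's built-in repartitioning is static: you pick the target layout up front. A minimal sketch, assuming a pandas DataFrame `df` already exists:

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4)
# a fixed number of partitions is chosen ahead of time; there is no way
# to ask for partitions of exactly N rows as results stream in
ddf = ddf.repartition(npartitions=10)

Hence the need for something hand-rolled.)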

Below I propose a possible solution. My questions are: is this a good solution? Are there obvious problems with it? Is there a better way to do this?

from functools import partial

from dask.distributed import as_completed


def repartition(*futs, size, client):
    """Repartitions the DataFrames held in `futs` into frames of size `size`.

    Parameters
    ----------
    futs : Future
        One or more futures, each holding a DataFrame with the same
        structure but potentially a different length.

    size : int
        The target number of rows in each new DataFrame

    client : dask.distributed.Client
        The Client instance used to submit the partition jobs

    Returns
    -------
    Iterator[Future]
       As the underlying futures complete, new futures are yielded which
       hold the repartitioned DataFrames

    """
    start = 0
    # The cumulative size of the frames in the partition
    cumsize = []  # List[int]
    # The futures which make up the new partition
    frames = []  # List[Future]
    for fut in as_completed(futs):
        frames.append(fut)
        # blocking round trip: ask the cluster for the completed frame's length
        length = client.submit(len, fut).result()
        # stash the length on the future for the offset arithmetic below
        fut.length = length
        total_length = length if not cumsize else length + cumsize[-1]
        cumsize.append(total_length)
        if total_length < size:
            continue
        while cumsize[-1] >= size:
            # yield chunks until the data is exhausted
            yield client.submit(
                partial(partition_frames, start=start, size=size),
                frames
            )
            res = list(zip(*(
                (old_size - size, fut)
                for old_size, fut
                in zip(cumsize, frames)
                if old_size > size
            )))
            if not res:
                cumsize, frames = [], []
                break
            # if there is data remaining carry it forward to the next chunk
            cumsize, frames = map(list, res)
            start = frames[0].length - cumsize[0]
        if not cumsize:
            start = 0
        else:
            start = length - cumsize[-1]
    if frames:
        yield client.submit(
            partial(partition_frames, start=start, size=size),
            frames
        )

def partition_frames(frames, *, start=0, size=None):
    """Concatenate `frames` and return `size` rows starting at `start`."""
    import pandas as pd  # imported here so the function ships cleanly to workers
    df = pd.concat(frames)
    end = None if size is None else start + size
    return df.iloc[start:end]
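
(To make the slicing semantics concrete, here is a minimal local, non-dask check of `partition_frames`; the frame sizes are made up for illustration:

import pandas as pd

a = pd.DataFrame({'A': range(0, 3)})   # 3 rows
b = pd.DataFrame({'A': range(3, 8)})   # 5 rows
# take 4 rows, starting 2 rows into the concatenation of `a` and `b`
chunk = partition_frames([a, b], start=2, size=4)
assert list(chunk['A']) == [2, 3, 4, 5]

)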

The code seems to work:

import numpy as np
import pandas as pd
from dask.distributed import Client

client = Client()  # assumes a local cluster; swap in your own scheduler

df = pd.DataFrame(np.arange(10000), columns=['A'])
partitions = np.random.rand(50).cumsum()
partitions /= partitions[-1]
partitions = (10000*partitions).astype(int)
partitions = np.append(0, partitions)
slices = list(map(slice, partitions[:-1], partitions[1:]))
futs = [
    client.submit(lambda s, df=df: df[s], s)
    for s in slices
]
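
(As a quick sanity check on the inputs; worth noting for later that the lengths can be fetched in a single batched round trip with `client.map`, rather than one blocking `submit(...).result()` per future:

sizes = client.gather(client.map(len, futs))
assert sum(sizes) == 10000  # 50 random-sized frames covering every row

)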

...but given that all it is doing is concatenating DataFrames and returning slices, it seems to take a while:

%%time
partitions = list(repartition(*futs, size=1000, client=client))
res = pd.concat(client.gather(partitions)).equals(df)
print(res)

True
Wall time: 1.21 s

This is not a problem for my application, but it does make me wonder whether there is a more efficient way to achieve the same thing?
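
One guess, which I have not verified, is that much of the wall time comes from the 50 serial `client.submit(len, fut).result()` round trips inside the loop rather than from the concatenation itself. A sketch of batching those queries instead:

# hypothetical variant: fetch every length in one batched round trip,
# then attach the lengths before driving the repartition loop
length_futs = client.map(len, futs)
for fut, length in zip(futs, client.gather(length_futs)):
    fut.length = length

If that is indeed the bottleneck, `repartition` could be changed to reuse a precomputed `fut.length` instead of blocking once per frame.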

0 Answers