This is a follow-up to the following question:
...the answer to which is that there is no built-in way to dynamically repartition a DataFrame in dask.
Below I propose a possible solution. My questions are: is this a good solution? Are there any obvious problems with it? Is there a better way to do this?
from functools import partial

from dask.distributed import as_completed


def repartition(*futs, size, client):
    """Repartitions the DataFrames held in `futs` into frames of size `size`.

    Parameters
    ----------
    futs : Iterable[Future] or Future
        Each future holds a DataFrame with the same structure but potentially
        of different sizes.
    size : int
        The size of each new DataFrame.
    client : dask.distributed.Client
        The Client instance used to submit the partition jobs.

    Returns
    -------
    Iterator[Future]
        As the underlying futures complete, new futures are yielded which
        hold the repartitioned DataFrames.
    """
    start = 0
    # The cumulative size of the frames in the partition
    cumsize = []  # List[int]
    # The futures which make up the new partition
    frames = []  # List[Future]
    for fut in as_completed(futs):
        frames.append(fut)
        length = client.submit(len, fut).result()
        fut.length = length
        total_length = length if not cumsize else length + cumsize[-1]
        cumsize.append(total_length)
        if total_length < size:
            continue
        while cumsize[-1] >= size:
            # yield chunks until the data is exhausted
            yield client.submit(
                partial(partition_frames, start=start, size=size),
                frames
            )
            res = list(zip(*(
                (old_size - size, fut)
                for old_size, fut
                in zip(cumsize, frames)
                if old_size > size
            )))
            if not res:
                cumsize, frames = [], []
                break
            # if there is data remaining, carry it forward to the next chunk
            cumsize, frames = map(list, res)
            start = frames[0].length - cumsize[0]
        if not cumsize:
            start = 0
        else:
            start = length - cumsize[-1]
    if frames:
        yield client.submit(
            partial(partition_frames, start=start, size=size),
            frames
        )


def partition_frames(frames, *, start=0, size=None):
    import pandas as pd

    df = pd.concat(frames)
    end = None if size is None else start + size
    return df.iloc[start:end]
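For reference, `partition_frames` can be exercised locally with plain pandas, no cluster required; the function is repeated here so the snippet is self-contained:

```python
import pandas as pd


def partition_frames(frames, *, start=0, size=None):
    # Concatenate the input frames, then return the slice [start, start + size).
    df = pd.concat(frames)
    end = None if size is None else start + size
    return df.iloc[start:end]


# Two frames of 3 and 4 rows; take a 4-row chunk starting at offset 2.
a = pd.DataFrame({'A': [0, 1, 2]})
b = pd.DataFrame({'A': [3, 4, 5, 6]})
chunk = partition_frames([a, b], start=2, size=4)
print(chunk['A'].tolist())  # → [2, 3, 4, 5]
```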
The code appears to work:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10000), columns=['A'])
partitions = np.random.rand(50).cumsum()
partitions /= partitions[-1]
partitions = (10000*partitions).astype(int)
partitions = np.append(0, partitions)
slices = list(map(slice, partitions[:-1], partitions[1:]))
futs = [
    client.submit(lambda s, df=df: df[s], s)
    for s in slices
]
...but it seems to take a while, considering that all it actually does is concatenate DataFrames and return slices of them:
%%time
partitions = list(repartition(*futs, size=1000, client=client))
res = pd.concat(client.gather(partitions)).equals(df)
print(res)
True
Wall time: 1.21 s
This is not a problem for my application, but it does make me wonder whether there is a more efficient way to achieve the same result.
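One direction that might help (a sketch only, not dask-specific): once the frame lengths are known, the chunk boundaries can be computed up front with plain arithmetic, so each output chunk only needs to touch the frames it actually overlaps, instead of concatenating the whole carried-forward list for every chunk. The helper name `plan_chunks` is hypothetical:

```python
import itertools


def plan_chunks(lengths, size):
    """For frames with the given lengths, return for each output chunk of
    `size` rows the list of (frame_index, local_start, local_stop) slices
    that make it up.  Pure arithmetic; no data is moved."""
    # Global start offset of each frame.
    offsets = [0, *itertools.accumulate(lengths)]
    total = offsets[-1]
    plans = []
    for chunk_start in range(0, total, size):
        chunk_stop = min(chunk_start + size, total)
        pieces = []
        for i, length in enumerate(lengths):
            lo = max(chunk_start, offsets[i])
            hi = min(chunk_stop, offsets[i] + length)
            if lo < hi:  # this frame overlaps the chunk
                pieces.append((i, lo - offsets[i], hi - offsets[i]))
        plans.append(pieces)
    return plans


plans = plan_chunks([3, 4, 5], size=5)
# chunk 0 = frame 0 rows 0:3 + frame 1 rows 0:2
# chunk 1 = frame 1 rows 2:4 + frame 2 rows 0:3
# chunk 2 = frame 2 rows 3:5
```

Each plan entry could then be submitted as a single task that slices and concatenates only those frames, avoiding the repeated concatenation of frames carried forward across chunks.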