When loading data from CSV, some of the CSVs fail to load, leaving their partitions empty. I would like to drop all empty partitions, since some methods don't seem to work well with them. I have tried repartitioning; for example, repartition(npartitions=10)
works, but values larger than that can still leave empty partitions.
What is the best way to achieve this? Thanks.
Answer 0 (score: 3)
I've found that filtering a Dask dataframe (e.g. by date) often leaves empty partitions behind. If you're running into trouble working with a dataframe that has empty partitions, here is a function, based on MRocklin's guidance, to cull them:
import dask.dataframe as dd

def cull_empty_partitions(df):
    # Length of every partition.
    ll = list(df.map_partitions(len).compute())
    df_delayed = df.to_delayed()
    df_delayed_new = list()
    pempty = None
    for ix, n in enumerate(ll):
        if 0 == n:
            # Remember one empty partition to use as meta.
            pempty = df.get_partition(ix)
        else:
            df_delayed_new.append(df_delayed[ix])
    if pempty is not None:
        df = dd.from_delayed(df_delayed_new, meta=pempty)
    return df
Answer 1 (score: 1)
For anyone working with Bags (rather than DataFrames), this function does the trick:
import dask.bag

def cull_empty_partitions(bag):
    """
    When bags are created by filtering or grouping from a different bag,
    it retains the original bag's partition count, even if a lot of the
    partitions become empty.
    Those extra partitions add overhead, so it's nice to discard them.
    This function drops the empty partitions.
    """
    bag = bag.persist()
    def get_len(partition):
        # If the bag is the result of bag.filter(),
        # then each partition is actually a 'filter' object,
        # which has no __len__.
        # In that case, we must convert it to a list first.
        if hasattr(partition, '__len__'):
            return len(partition)
        return len(list(partition))
    partition_lengths = bag.map_partitions(get_len).compute()
    # Convert bag partitions into a list of 'delayed' objects
    lengths_and_partitions = zip(partition_lengths, bag.to_delayed())
    # Drop the ones with empty partitions
    partitions = (p for l, p in lengths_and_partitions if l > 0)
    # Convert from list of delayed objects back into a Bag.
    return dask.bag.from_delayed(partitions)
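The same length/to_delayed/from_delayed round-trip, written inline against an assumed toy bag, looks like this:

```python
import dask.bag as db

# Toy bag: 5 partitions, then a filter that leaves only one non-empty.
bag = db.from_sequence(range(10), npartitions=5).filter(lambda x: x >= 8)

# Filtered partitions may be lazy objects, so materialize before counting.
lengths = bag.map_partitions(lambda part: len(list(part))).compute()

# Keep only the delayed partitions that still contain elements.
kept = [p for n, p in zip(lengths, bag.to_delayed()) if n > 0]
culled = db.from_delayed(kept)
print(culled.npartitions)  # → 1
```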
Answer 2 (score: 1)
Here is my attempt at removing empty partitions:
import numpy as np
import dask.dataframe as dd

def remove_empty_partitions(ddf):
    """ remove empty partitions """
    partition_lens = ddf.map_partitions(len).compute()
    # np.where returns a tuple; take the array of indices from it.
    ids_of_empty_partitions = np.where(partition_lens == 0)[0]
    if len(ids_of_empty_partitions) == len(partition_lens):
        # All partitions are empty: keep one so a valid (empty) frame is returned.
        ddf_nonzero = ddf.partitions[0]
    elif len(ids_of_empty_partitions) > 0:
        ddf_nonzero = dd.concat([
            ddf.get_partition(n) for n in range(ddf.npartitions)
            if n not in ids_of_empty_partitions
        ])
    else:
        # No empty partitions: return the dataframe unchanged.
        ddf_nonzero = ddf
    return ddf_nonzero
FWIW, @tpegbert's answer seems more efficient in terms of the number of tasks needed to produce the filtered dataframe.
Answer 3 (score: 0)
There is no simple API to do this. You can call df.map_partitions(len)
to determine which partitions are empty, and then remove them explicitly, probably by using df.to_delayed()
and dask.dataframe.from_delayed(...).
In the future, if you find a function that doesn't handle empty partitions well, a bug report would be appreciated: https://github.com/dask/dask/issues/new