
Strategy for partitioning dask dataframes efficiently

Asked: 2017-06-20 15:48:24

Tags: python optimization dataframe dask

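The Dask documentation discusses repartitioning to reduce overhead.

However, it seems to assume you know in advance what your dataframe will look like (for example, that only 1/100th of the data will remain).

Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force=True to expand partitions when necessary. This one-size-fits-all approach works, but it is definitely suboptimal because my datasets vary in size.

The data is time series data, but unfortunately not at regular intervals. I have repartitioned by time frequency in the past, but that would be suboptimal here because the data is so irregular (sometimes nothing for minutes, then thousands of records within seconds).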

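3 answers:

Answer 0 (score: 5)

After discussion with mrocklin, a decent partitioning strategy is to aim for partition sizes of about 100MB, guided by df.memory_usage().sum().compute(). For datasets that fit in RAM, the extra work this may involve can be mitigated by calling df.persist() at relevant points.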

Answer 1 (score: 4)

As of Dask 2.0.0 you can call .repartition(partition_size="100MB").

This method takes object columns into account (.memory_usage(deep=True)) when working out partition sizes. It will join smaller partitions, or split partitions that have grown too large.

Dask's documentation also outlines the usage.

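Answer 2 (score: 2)

Just to add to the answers above:

By default, memory_usage() ignores the memory consumed by the contents of object dtype columns. For the datasets I have been working with recently, this underestimated memory usage by roughly a factor of 10.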
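The gap can be seen on a small pandas frame. This is only a sketch with made-up data; the actual ratio depends on how long the strings are.

```python
import pandas as pd

# Hypothetical data: one numeric column and one object dtype column of strings.
pdf = pd.DataFrame({
    "value": range(1000),
    "label": pd.Series(["some fairly long string value"] * 1000, dtype=object),
})

# Shallow estimate: object columns are counted as pointers only.
shallow = int(pdf.memory_usage().sum())

# Deep estimate: the Python objects behind the pointers are measured too.
deep = int(pdf.memory_usage(deep=True).sum())

print(shallow, deep)
```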

Unless you are sure there are no object dtype columns, I would suggest specifying deep=True, that is, repartitioning with:

df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)

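Here n is the target partition size in bytes. Adding 1 ensures that the number of partitions is always at least 1 (// performs floor division).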
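The arithmetic can be checked in isolation. The helper name partitions_for below is hypothetical and exists only to illustrate the 1 + total // n rule.

```python
def partitions_for(total_bytes: int, target_bytes: int) -> int:
    """Partition count under the 1 + total // n rule (hypothetical helper)."""
    return 1 + total_bytes // target_bytes


target = 100 * 1024**2  # assumed target of about 100MB, in bytes

print(partitions_for(0, target))                # 1
print(partitions_for(target - 1, target))       # 1
print(partitions_for(target, target))           # 2
print(partitions_for(10 * target + 5, target))  # 11
```

Note that the rule yields one extra partition when the total is an exact multiple of the target, which is harmless in practice.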