Question

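A lot of what I do in Pandas has a computation time that is convex in the amount of data I have (e.g. 1 row takes 1 second, 2 rows take 2.2 seconds, 4 rows take 6 seconds, etc.).

Why doesn't the computational cost grow linearly with the amount of data I have? For example, this function I wrote: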

def fractrips1brand(trip): 
    # Count the distinct inside-good items bought by THIS specific consumer
    art = trip[trip.Item_Id.isin(insidegood)].Item_Id.nunique()
    output = pd.Series({'numinsidegoods': art })
    return output


gr_TILPS = TILPStemp.groupby('uniqueid')
output = gr_TILPS.apply(fractrips1brand)

seems to exhibit such a cost.

Why isn't it O(n)?

Answer 1

It is very common for functions to have greater-than-linear time complexity. Sorting, for example, has O(n log n) complexity.

gr_TILPS = TILPStemp.groupby('uniqueid')

groupby sorts the keys by default, so this call has a complexity of at least O(n log n). You can turn off sorting with:

gr_TILPS = TILPStemp.groupby('uniqueid', sort=False)
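As a small illustration of what sort=False changes, here is a minimal sketch on made-up data (the DataFrame df and its values are hypothetical; only the column names follow the question). With the default, group keys come back sorted; with sort=False they come back in order of first appearance, so the sorting step is skipped.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the question.
df = pd.DataFrame({
    "uniqueid": ["c", "a", "b", "a", "c"],
    "Item_Id": [10, 11, 10, 12, 10],
})

# Default behaviour: group keys are sorted.
sorted_order = df.groupby("uniqueid").size().index.tolist()

# sort=False: group keys keep the order in which they first appear.
unsorted_order = df.groupby("uniqueid", sort=False).size().index.tolist()

print(sorted_order)
print(unsorted_order)
```

The per-group results are the same either way; only the order of the groups in the output differs.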

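In Pandas 0.15 and earlier, Series.nunique calls Series.value_counts, which also sorts the values by default. So this is another function call with O(n log n) complexity. Since it happens inside fractrips1brand, the total complexity of gr_TILPS.apply(fractrips1brand) is at least O(m n log n), where m is the number of groups.

Update: In the next release of Pandas (version 0.16.0), Series.nunique should be significantly faster.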
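One possible way to sidestep the per-group Python call altogether, which is not part of the answer above, is to filter once and then use the built-in grouped nunique. The sketch below assumes small made-up stand-ins for TILPStemp and insidegood; whether it is actually faster on your data and Pandas version is something to measure rather than take for granted.

```python
import pandas as pd

# Hypothetical stand-ins for the question's objects.
insidegood = [10, 12]
TILPStemp = pd.DataFrame({
    "uniqueid": ["a", "a", "a", "b", "b", "c"],
    "Item_Id": [10, 10, 12, 11, 10, 11],
})

# Filter to inside goods once, instead of once per group inside apply.
inside = TILPStemp[TILPStemp["Item_Id"].isin(insidegood)]

# Count distinct inside-good items per consumer without key sorting.
# Consumers with no inside goods vanish after the filter, so reindex
# over all ids and fill those with 0 to match the apply-based result.
counts = (
    inside.groupby("uniqueid", sort=False)["Item_Id"]
    .nunique()
    .reindex(TILPStemp["uniqueid"].unique(), fill_value=0)
    .rename("numinsidegoods")
)

print(counts)
```

Unlike the apply version, this returns a Series rather than a one-column DataFrame; counts.to_frame() would convert it if the DataFrame shape is needed.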

