Improving pandas performance when using the apply method

Time: 2017-03-30 12:31:13

Tags: python pandas numpy scikit-learn

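I am using pandas for high-performance computing. On 50,000 rows, the function below times at: 1 loop, best of 5: 7.24 s per loop.

I have to scale it to 1 million rows.

How can I vectorize the function so that it is applied to all rows at once, and would that improve overall performance?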


1 answer:

Answer 0 (score: 3)

I think you can drop apply and use vectorized functions instead:

import pandas as pd

# Convert the date columns to datetime64 so they support vectorized arithmetic
mutatedCashFlow['startDate'] = pd.to_datetime(mutatedCashFlow['startDate'])
mutatedCashFlow['EndDate'] = pd.to_datetime(mutatedCashFlow['EndDate'])
mutatedCashFlow['tradeDate'] = pd.to_datetime(mutatedCashFlow['tradeDate'])

# Absolute day differences, computed for all rows at once
diffTradeAndEnd = (mutatedCashFlow['EndDate'] - mutatedCashFlow['tradeDate']).dt.days.abs()
diffStartAndEnd = (mutatedCashFlow['EndDate'] - mutatedCashFlow['startDate']).dt.days.abs()

mutatedCashFlow['flow'] = (mutatedCashFlow['tradeAmount'] * diffTradeAndEnd) / diffStartAndEnd

Alternative, using the sub, mul and div methods:

mutatedCashFlow['startDate'] = pd.to_datetime(mutatedCashFlow['startDate'])
mutatedCashFlow['EndDate'] = pd.to_datetime(mutatedCashFlow['EndDate'])
mutatedCashFlow['tradeDate'] = pd.to_datetime(mutatedCashFlow['tradeDate'])

diffTradeAndEnd = mutatedCashFlow['EndDate'].sub(mutatedCashFlow['tradeDate']).dt.days.abs()
diffStartAndEnd = mutatedCashFlow['EndDate'].sub(mutatedCashFlow['startDate']).dt.days.abs()

# Parentheses are required for the method chain to continue onto the next line
mutatedCashFlow['flow'] = (mutatedCashFlow['tradeAmount'].mul(diffTradeAndEnd)
                                                        .div(diffStartAndEnd))
print(mutatedCashFlow)
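
To check that the vectorized version gives the same numbers as a row-wise apply, here is a minimal sketch on made-up data. The sample values and the helpers flow_row and flow_vectorized are hypothetical. In particular, flow_row is only a guess at what the original row-wise function might have looked like, since the question's function is not shown here.

```python
import pandas as pd

# Hypothetical sample data; column names follow the answer above.
mutatedCashFlow = pd.DataFrame({
    "startDate": ["2017-01-01", "2017-01-01"],
    "EndDate": ["2017-01-11", "2017-01-31"],
    "tradeDate": ["2017-01-06", "2017-01-16"],
    "tradeAmount": [100.0, 300.0],
})

# Convert the date columns once, up front.
for col in ["startDate", "EndDate", "tradeDate"]:
    mutatedCashFlow[col] = pd.to_datetime(mutatedCashFlow[col])


def flow_row(row):
    # Assumed row-wise logic (the question's original function is not shown).
    diff_trade_end = abs((row["EndDate"] - row["tradeDate"]).days)
    diff_start_end = abs((row["EndDate"] - row["startDate"]).days)
    return row["tradeAmount"] * diff_trade_end / diff_start_end


def flow_vectorized(df):
    # Same arithmetic, done on whole columns instead of one row at a time.
    diff_trade_end = (df["EndDate"] - df["tradeDate"]).dt.days.abs()
    diff_start_end = (df["EndDate"] - df["startDate"]).dt.days.abs()
    return df["tradeAmount"] * diff_trade_end / diff_start_end


row_wise = mutatedCashFlow.apply(flow_row, axis=1)
vectorized = flow_vectorized(mutatedCashFlow)

print(vectorized.tolist())
print(row_wise.tolist() == vectorized.tolist())
```

Both versions should agree on this sample. Rows where startDate equals EndDate make the denominator zero, so the result is inf or NaN there rather than an exception; whether that matters depends on your data.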